Enabling Chinese, Arabic and Other High Unicode in WordPress Slugs
A while back someone contacted me about their WordPress blog borking out on some of their posts. After a bit of poking about it became apparent that this was because WordPress doesn't allow high Unicode characters in the URL. At first, I thought this would just be a change to a line in .htaccess, but there are a couple of other things that need to be changed too.
Here are the instructions to allow your WordPress blog to have high Unicode like Han or Arabic in the URL:
Edit .htaccess
First of all open up your .htaccess. You will find a block something like this:
You will need to replace the fourth and fifth lines with one of the following:
The first solution will allow URLs containing high Unicode characters, while the second will allow all URLs to go through. Either one re-routes URLs containing those characters to index.php so WordPress can deal with them. However, there are a couple of other things you need to do...
Edit Post Slugs
- Open up the file wp-includes/query.php
- Search for the line:
$q['name'] = sanitize_title($q['name']);
- and replace this with:
$q['name'] = addslashes(strip_tags($q['name']));
This stops WordPress from converting post slugs to UTF-8 character codes, so your posts will now be loaded from the database.
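To see why the encoding matters, here is a minimal standalone sketch (plain PHP, no WordPress loaded). It assumes that the encoded form WordPress produces resembles a lowercased percent-encoding of the slug's UTF-8 bytes; rawurlencode() only stands in for that step here, and the sample slug is made up for illustration.

```php
<?php
// Hypothetical slug: the two Han characters U+4E2D U+6587, written as escapes
// so the example does not depend on this file's own encoding (PHP 7+).
$slug = "\u{4E2D}\u{6587}";

// Stand-in for the encoding step (an assumption, not WordPress's own code):
// each UTF-8 byte becomes a %xx escape.
$encoded = strtolower(rawurlencode($slug));

// The replacement line from the instructions leaves the characters alone;
// it only strips tags and escapes quotes and backslashes.
$raw = addslashes(strip_tags($slug));

echo $encoded, "\n"; // %e4%b8%ad%e6%96%87
echo $raw === $slug ? "raw slug unchanged\n" : "raw slug altered\n";
```

A lookup keyed on the encoded string and one keyed on the raw string are looking for two different values, which is roughly why the line in query.php has to change.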
Edit Page Slugs
- Open wp-includes/post.php
- Search for the function called get_page_by_path
- Copy the following to the top of the function:
This will allow any pages that contain high Unicode characters in their slugs to be loaded.
Edit Category Slugs
- Open wp-includes/category.php
- Search for the function called get_category_by_path
- Comment out this line:
$category_path = rawurlencode(urldecode($category_path));
(you do this by adding a hash at the beginning of the line)
- After the following line:
$full_path .= ( $pathdir != '' ? '/' : '' ) . sanitize_title( $pathdir );
paste this:
This will allow any categories that contain high Unicode characters in their slugs to be loaded.
Comments
I've always wondered about this, not really because I care about speaking Chinese or Arabic, but showing translated text is important for me. Thanks!
Great Post!
Actually I am developing a project with Symfony, and I have a problem with my slugify class: it does not support any high Unicode.
Here is the function code:
static public function slugify($text)
{
    // replace all non letters or digits by -
    $text = preg_replace('/\W+/', '-', $text);
    // trim and lowercase
    $text = strtolower(trim($text, '-'));
    return $text;
}
Any idea of how to make it support high unicode?
Hi
I found a solution for this by checking whether the text is ASCII:
static public function slugify($text)
{
    // replace all non letters or digits by -
    $text = preg_replace('/\W+/', '-', $text);
    // trim and lowercase
    if( mb_detect_encoding($text, 'ASCII') ){
        $text = strtolower(trim($text, '-'));
    }
    return $text;
}
But I still can't get the high Unicode in the URL. My .htaccess rewrite rules look like this:
RewriteRule ^$ index.html [QSA]
RewriteRule ^([^.]+)$ $1.html [QSA]
Hi, this is the function I use to convert string to "search engine friendly URLs":
It has a couple of extra features: it uses multi-byte case conversion, which changes the case of accented characters, and you can also use transliteration to remove accents if you just want ASCII URL slugs.
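The function itself is not shown here, so what follows is not the commenter's code, just a rough sketch of the two features described. The name slugify_unicode and its parameters are made up for illustration, and the transliteration branch depends on the platform's iconv and the active locale, so its output is not guaranteed to be the same everywhere.

```php
<?php
/**
 * Hypothetical helper (not the function from the comment above).
 * Lowercases with the multi-byte string functions and optionally
 * transliterates to ASCII before building the slug.
 */
function slugify_unicode(string $text, bool $transliterate = false): string
{
    if ($transliterate) {
        // Results vary by platform and locale; set one with setlocale() first.
        $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $text);
        if ($ascii !== false) {
            $text = $ascii;
        }
    }

    // Multi-byte lowercase: handles accented letters that strtolower()
    // may leave untouched.
    $text = mb_strtolower($text, 'UTF-8');

    // Collapse every run of characters that are not letters, combining marks
    // or digits into one hyphen. The /u flag makes PCRE read the input as UTF-8.
    $text = preg_replace('/[^\p{L}\p{M}\p{N}]+/u', '-', $text) ?? '';

    return trim($text, '-');
}

echo slugify_unicode('Hello, World!'), "\n";
echo slugify_unicode("\u{4E2D}\u{6587} \u{6807}\u{9898}"), "\n";
```

Using Unicode property classes instead of \W keeps Han, Arabic and accented Latin letters in the slug while still stripping punctuation and spaces.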
I think the reason the above mod_rewrite rules don't work is because they are for WordPress. Try one of these instead:
RewriteRule ^(.+)$ $1.html [QSA]
RewriteRule ^([a-zA-Z0-9_\-\x7f-\xff]+)$ $1.html [QSA]
If you use that transliteration, ensure you have a locale defined. Otherwise some characters won't come out properly.
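To illustrate why a byte range like \x7f-\xff is what lets such slugs through, here is a small sketch in PHP rather than mod_rewrite. It assumes the request path reaches the pattern as raw UTF-8 bytes; the pattern is a tidied variant of the second rule above with the hyphen escaped, and the sample word is arbitrary.

```php
<?php
// Arbitrary sample: an Arabic word written with escapes (PHP 7+).
$slug = "\u{0645}\u{0631}\u{062D}\u{0628}\u{0627}";

// Every byte of a non-ASCII character in UTF-8 has its high bit set,
// so each one falls inside \x80-\xff.
$allHigh = true;
foreach (str_split($slug) as $byte) {
    if (ord($byte) < 0x80) {
        $allHigh = false;
    }
}

// No /u modifier, so the patterns are matched byte by byte.
$matches   = preg_match('/^[a-zA-Z0-9_\-\x7f-\xff]+$/', $slug) === 1;
$asciiOnly = preg_match('/^[a-zA-Z0-9_\-]+$/', $slug) === 1;

var_dump($allHigh, $matches, $asciiOnly);
```

The ASCII-only class rejects the slug, while the class extended with the high byte range accepts it.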
Hi, thank you for the reply!
Actually, my class did better than yours for non-ASCII text, as yours turned the strings into squares, after I modified mine to:
if (!(mb_ereg("[^\w\s\.\-]", $text))) {
    $text = preg_replace('/\W+/', '-', $text);
    $text = strtolower(trim($text, '-'));
}
else
{
    $text = str_replace(" ", "-", $text);
    $text = trim($text, '-');
}
return $text;
But I still need to add the pattern that cleans punctuation, like the one you have.
I'll try to apply them to my class
as for .htaccess..
I added this
RewriteRule ^(.+/)$ $1.html [QSA]
and tried this also
RewriteRule ^$ index.html [QSA]
which I took from your post. They work OK, but the issue is that the URL is written in raw UTF, which is really ugly.
Thanks a lot
The squares you are talking about are probably due to either you not having set a locale, e.g. setlocale( LC_CTYPE, 'en_GB.utf8' );
Or it may be that the font you are using doesn't have those characters available. By raw UTF in the URL, do you mean something like %7F, or literal accented characters?
Hi,
I am trying to use this solution on WPMU 2.7.1 but my PHP files look different from what you have written here. Can you please let me know where I should make changes in my installation?
Thanks and Regards,
Hemant
Could you provide more specific details about what you are having trouble with? Try using the find facility in a text editor to search the files for the relevant parts.
Hoping someone can help. I run a blog and use Chinese as the main language. I've found that I can't create a "page" and use the name of the page (Chinese text) as the URL for the page. I have no problem with categories being in Chinese, but creating a "page" and using a Chinese URL doesn't work. I assume this is a problem that is mentioned above? I read through the fix but I am a little cautious as to doing it. Wondering if anyone can shed some light?
You could try this. I think it does the same thing.
Could this hack be written into a WP plugin? I don't mind fixing .htaccess, but changing these every time on WP upgrades is kinda...hard work. I really wonder why the WordPress team never seems to care about international users. I really hope this is added to core.
I know...the fact that the WordPress team isn't aware of this is a real pain for international users. Now, let me try your code. Thanks so much because nobody seems to care about this except semlabs!!
The big problem is how you know whether the user is typing Thai, Chinese or Japanese. If you don't know, iconv will fail.
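One hedged observation on this: if the text arrives as UTF-8, the script it is written in should not matter to iconv, only the source encoding does. A rough sketch of checking for UTF-8 first and falling back to a guess is below. The function name to_utf8 and the candidate list are assumptions, and mb_detect_encoding is a heuristic, so the fallback can guess wrong.

```php
<?php
/**
 * Hypothetical helper: return $text as UTF-8, guessing the source
 * encoding from a caller-supplied list only when it is not valid UTF-8.
 */
function to_utf8(string $text, array $candidates = ['ISO-8859-1']): string
{
    // Valid UTF-8 covers Thai, Chinese, Japanese and so on without
    // needing to know which language was typed.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }

    // Strict detection against the candidate encodings (a best guess only).
    $guess = mb_detect_encoding($text, $candidates, true);
    if ($guess === false) {
        return $text; // nothing matched, so leave the input untouched
    }

    return mb_convert_encoding($text, 'UTF-8', $guess);
}

var_dump(to_utf8("\u{4E2D}\u{6587}") === "\u{4E2D}\u{6587}");
var_dump(to_utf8("\xE9") === "\u{00E9}");
```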
I tried it to the letter, but did not work for me. I am using wordpress MU 3.0.1
In IIS with URL Rewrite, there is a permalink problem with non-ASCII characters so your pages will always show "Nothing Found". The solution that worked for me is here: http://ruslany.net/2009/05/iis-7-url-rewrite-module-support-in-wordpress-28/#comment-1707
In the wp-config.php file add this code:
if (isset($_SERVER['UNENCODED_URL'])) {
    $_SERVER['REQUEST_URI'] = $_SERVER['UNENCODED_URL'];
}
Thank you David, I appreciate your effort. It really helped me solve a problem when sharing my Arabic post links on Facebook.