SEM Labs

Handcrafted Pixels, Code & Title Tags

Enabling Chinese, Arabic and Other High Unicode in WordPress Slugs

A while back someone contacted me about their WordPress blog borking out on some of their posts. After a bit of poking about it became apparent that this was because WordPress doesn't allow high Unicode characters in the URL. At first, I thought this would just be a change to a line in .htaccess, but there are a couple of other things that need to be changed too.

Here are the instructions to allow your WordPress blog to have high Unicode like Han or Arabic in the URL:

Edit .htaccess

First of all open up your .htaccess. You will find a block something like this:

You will need to replace the fourth and fifth lines with one of the following:

Edit Post Slugs

The first solution will allow URLs containing high Unicode characters, while the second will allow all URLs to go through. Either way, URLs containing those characters are re-routed to index.php so WordPress can deal with them. However, there are a couple of other things you need to do:

  • Open up the file wp-includes/query.php
  • Search for the line: $q['name'] = sanitize_title($q['name']);
  • and replace this with: $q['name'] = addslashes(strip_tags($q['name']));

This will stop WordPress from converting post slugs to UTF-8 character codes (percent-encoded bytes such as %e4%bd%a0), so your posts will now be loaded from the database.
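To make the difference concrete, here is a small standalone sketch. It does not call WordPress: percent_encode_slug() is a hypothetical stand-in that only mimics the percent-encoding step of sanitize_title(), and the sample slug is made up.

```php
<?php
// Hypothetical stand-in for the encoding step only; the real sanitize_title()
// does more than this (it also strips accents and punctuation, for example).
function percent_encode_slug(string $slug): string
{
    // rawurlencode() works byte by byte, so each UTF-8 byte becomes %XX.
    return strtolower(rawurlencode($slug));
}

$requested = '你好';  // made-up slug, as it arrives once the URL is decoded

$encoded = percent_encode_slug($requested);
$raw     = addslashes(strip_tags($requested));  // the replacement line from the steps above

echo $encoded, "\n";  // %e4%bd%a0%e5%a5%bd
echo $raw, "\n";      // 你好
```

The replacement line leaves the characters as they were typed, and only removes tags and escapes quotes.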

Edit Page Slugs

  • Open wp-includes/post.php
  • Search for the function called get_page_by_path
  • Copy the following to the top of the function:

This will allow any pages that contain high Unicode characters in their slugs to be loaded.

Edit Category Slugs

  • Open wp-includes/category.php
  • Search for the function called get_category_by_path
  • Comment out this line: $category_path = rawurlencode(urldecode($category_path)); - you do this by adding a hash at the beginning of the line
  • After the following line: $full_path .= ( $pathdir != '' ? '/' : '' ) . sanitize_title( $pathdir ); paste this:

This will allow any categories that contain high Unicode characters in their slugs to be loaded.

Comments

webtech Replied at 5:25 AM on 12 Feb 2009

I've always wondered about this, not really because I care about speaking Chinese or Arabic, but showing translated text is important for me. Thanks!

Rima Replied at 12:10 PM on 27 Feb 2009

Great Post!

Actually I am developing a project with symfony, and I have a problem with my slugify class that it does not support any High Unicode..

Here is the function code:

static public function slugify($text)
{
    // replace all non letters or digits by -
    $text = preg_replace('/\W+/', '-', $text);

    // trim and lowercase
    $text = strtolower(trim($text, '-'));

    return $text;
}

Any idea of how to make it support high unicode?
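One possible direction, as a sketch rather than a drop-in fix: without the u flag, \W treats every byte of a multi-byte character as a non-word character and replaces it. The version below assumes the input is valid UTF-8 and that the mbstring extension is loaded; slugify_unicode() is a hypothetical name, not part of symfony.

```php
<?php
// A sketch only: assumes valid UTF-8 input and the mbstring extension.
function slugify_unicode(string $text): string
{
    // \p{L} = letters, \p{M} = combining marks, \p{N} = digits.
    // The u flag makes PCRE read the subject as UTF-8 rather than as bytes.
    $text = preg_replace('/[^\p{L}\p{M}\p{N}]+/u', '-', $text);

    // Trim stray separators, then lowercase with a multi-byte aware function.
    return mb_strtolower(trim($text, '-'), 'UTF-8');
}

echo slugify_unicode('Hello, World!'), "\n";  // hello-world
echo slugify_unicode('你好 世界'), "\n";        // 你好-世界
echo slugify_unicode('Über Café'), "\n";      // über-café
```

Scripts without letter case, such as Han or Arabic, pass through the lowercasing step unchanged.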

Rima Replied at 4:25 PM on 28 Feb 2009

Hi

I could find the solution for this by checking if the text is ASCII..

static public function slugify($text)
{
    // replace all non letters or digits by -
    $text = preg_replace('/\W+/', '-', $text);

    // trim and lowercase
    if( mb_detect_encoding($text, 'ASCII') ){
        $text = strtolower(trim($text, '-'));
    }

    return $text;
}

But I still can't get the high unicode in the URL.. .htaccess rewrite rule looks like this

RewriteRule ^$ index.html [QSA]

RewriteRule ^([^.]+)$ $1.html [QSA]

David Replied at 11:13 AM on 1 Mar 2009

Hi, this is the function I use to convert string to "search engine friendly URLs":


setlocale( LC_CTYPE, 'en_GB.utf8' );
header( 'Content-Type: text/html; charset=utf-8' );

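function translit( $str )
{
	$str_res = '';
	$len = mb_strlen( $str, 'UTF-8' );
	# Transliterate one character at a time, so a multi-character result
	# (e.g. "ss" for "ß") cannot push the two strings out of step
	for ( $i = 0; $i < $len; $i++ )
	{
		$chr = mb_substr( $str, $i, 1, 'UTF-8' );
		$chr_trans = iconv( 'UTF-8', 'ASCII//TRANSLIT', $chr );
		# Keep the original character where iconv has no ASCII equivalent
		$str_res .= ( $chr_trans === false || $chr_trans === '' || $chr_trans === '?' ) ? $chr : $chr_trans;
	}
	return $str_res;
}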

function strtourl( $str, $seperator = '-', $case = MB_CASE_LOWER, $translit = true ) {
	
	# Validate and sanitise input
	if( strlen( $str ) < 1 )
		trigger_error( 'First argument must be one or more characters', E_USER_ERROR );
	if( !in_array( $seperator, array( "-", "_", "+" ) ) )
		trigger_error( 'Second argument must be a hyphen, underscore or plus sign', E_USER_ERROR );
	$str = trim( $str );
	$str = strip_tags( $str );
	
	# Transliterate
	if( $translit == true )
	{
		if( function_exists( 'transliterate' ) )
			$str = transliterate( $str, array( 'han_transliterate', 'jamo_transliterate', 'greek_transliterate', 'hebrew_transliterate', 'cyrillic_transliterate' ), 'UTF-8', 'UTF-8' );
		if( function_exists( 'iconv' ) )
			$str = translit( $str );
	}
	
	# Clean HTML entities, whitespace and punctuation
	# (the u flag stops the patterns from cutting multi-byte characters apart)
	$patterns = array( "/&#?[a-zA-Z0-9]{2,7};/u", "/\s{2,}/u", "/[\t\r\n\v\f]+/u", "/[\"`¬!£$%^&*()={}\[\]#~'@;:.>,<|\/?\\\]+/u" );
	$replacements = array( "", " ", " ", "" );
	$str = preg_replace( $patterns, $replacements, $str );
	
	# Replace spaces with separators, then collapse repeated separators
	# (preg_quote is needed because "+" is a regex metacharacter)
	$str = str_replace( ' ', $seperator, $str );
	$str = preg_replace( '/' . preg_quote( $seperator, '/' ) . '{2,}/', $seperator, $str );
	
	# Convert case
	$str = mb_convert_case( $str, $case, 'UTF-8' );
	
	return $str;
	
}

echo strtourl( 'slug semlabs.co.uk Some text & and AmPersand', '-', MB_CASE_LOWER, true );
echo strtourl( 'slug semlabs.co.uk Some text & and AmPersand', '-', MB_CASE_TITLE, false );

It has a couple of extra features: it uses multi-byte case conversion, which changes the case of characters with accents, and you can also use transliteration to remove any accents if you just want ASCII URL slugs.
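As a quick illustration of the multi-byte case conversion, here is a minimal sketch. It assumes the mbstring extension is loaded and the input is UTF-8; the sample string is made up.

```php
<?php
// Assumes the mbstring extension is loaded and the input is UTF-8.
$title = 'ÉCOLE de ZÜRICH';

// MB_CASE_LOWER and MB_CASE_TITLE are the constants passed as $case to strtourl().
$lower = mb_convert_case($title, MB_CASE_LOWER, 'UTF-8');
$titled = mb_convert_case($title, MB_CASE_TITLE, 'UTF-8');

echo $lower, "\n";   // école de zürich
echo $titled, "\n";  // École De Zürich
```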

I think the reason the above mod_rewrite rules don't work is because they are for WordPress. Try one of these instead:

RewriteRule ^(.+)$ $1.html [QSA]

RewriteRule ^([a-zA-Z0-9_\x7f-\xff-]+)$ $1.html [QSA]

If you use that transliteration, ensure you have a locale defined. Otherwise some characters won't come out properly.
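To see why the \x7f-\xff range in the second rewrite rule lets high Unicode through, here is a small PHP sketch. It assumes PHP's preg functions behave like the regex engine behind mod_rewrite for this pattern (both are built on PCRE), and the sample strings are made up. Without the u flag the pattern is matched byte by byte, and every byte of a multi-byte UTF-8 character is 0x80 or above, so it falls inside the range.

```php
<?php
// Character class from the second rule, with the hyphen placed last so it
// is read as a literal hyphen rather than as part of a range.
// No u flag here, so the subject is matched one byte at a time.
$pattern = '/^([a-zA-Z0-9_\x7f-\xff-]+)$/';

$ascii   = 'hello-world';
$chinese = '你好';          // UTF-8 bytes: E4 BD A0 E5 A5 BD
$spaced  = 'hello world';  // the space is not in the class

var_dump(preg_match($pattern, $ascii));    // int(1)
var_dump(preg_match($pattern, $chinese));  // int(1)
var_dump(preg_match($pattern, $spaced));   // int(0)
```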

Rima Replied at 12:52 PM on 1 Mar 2009

Hi, thank you for the reply!

Actually my class did better than yours for non-ASCII text, as yours turns the strings into squares, after I modified it to:

if (!(mb_ereg("[^\w\s\.\-]", $text))) {
    $text = preg_replace('/\W+/', '-', $text);
    $text = strtolower(trim($text, '-'));
}
else
{
    $text = str_replace(" ", "-", $text);
    $text = trim($text, '-');
}

return $text;

But I need to add to it the pattern that cleans punctuation, like the one you have..

I'll try to apply them to my class

as for .htaccess..

I added this

RewriteRule ^(.+/)$ $1.html [QSA]

and tried this also

RewriteRule ^$ index.html [QSA]

which I took out from your post .. they work ok but the issue is that the url is written in raw utf.. which is really ugly..

Thanks a lot

David Replied at 1:48 PM on 2 Mar 2009

The squares you are talking about are probably due to either you not having set a locale, e.g. setlocale( LC_CTYPE, 'en_GB.utf8' );

Or it may be that the font you are using doesn't have those characters available. By raw UTF in the URL, do you mean percent-encoded codes like %7F, or literal accented characters?

Hemant Replied at 10:12 AM on 10 May 2009

Hi,

I am trying to use this solution on WPMU 2.7.1 but my php files look different from what you have written here. Can you please let me know where do I make changes in my installation?

Thanks and Regards,

Hemant

David Replied at 10:34 AM on 10 May 2009

Could you provide more specific details as to what you are having trouble with? Try using the find facility in a text editor to search the files for the relevant parts.

Jolon Replied at 9:14 AM on 5 Jun 2009

Hoping someone can help. I run a blog and use Chinese as the main language. I've found that I can't create a "page" and use the name of the page (Chinese text) as the URL for the page. I have no problem with categories being in Chinese, but creating a "page" and using a Chinese URL doesn't work. I assume this is a problem that is mentioned above? I read through the fix but I am a little cautious as to doing it. Wondering if anyone can shed some light?

David Replied at 6:48 PM on 5 Jun 2009

You could try this. I think it does the same thing.

socialpreneur Replied at 5:32 PM on 3 Oct 2009

Could this hack be written into a WP plugin? I don't mind fixing .htaccess, but changing these every time on WP upgrades is kinda... hard work. I really wonder why the WordPress team never seems to care about international users. I really hope this is added to core.

joe Replied at 11:22 AM on 23 Dec 2009

I know... the fact that the WordPress team isn't aware of this is a real pain for international users. Now, let me try your code. Thanks so much, because nobody seems to care about this except SEM Labs!!

chris Replied at 8:34 AM on 5 Jun 2010

The big problem is how you know whether the user is typing Thai, Chinese or Japanese.. if you don't know, the iconv will fail.

Enad Replied at 11:15 AM on 16 Oct 2010

I tried it to the letter, but it did not work for me. I am using WordPress MU 3.0.1

DT Replied at 12:24 PM on 18 Oct 2011

In IIS with URL Rewrite, there is a permalink problem with non-ASCII characters so your pages will always show "Nothing Found". The solution that worked for me is here: http://ruslany.net/2009/05/iis-7-url-rewrite-module-support-in-wordpress-28/#comment-1707

In the wp-config.php file add this code:

if (isset($_SERVER['UNENCODED_URL']))$_SERVER['REQUEST_URI'] = $_SERVER['UNENCODED_URL'];

muntam Replied at 2:27 PM on 31 Oct 2012

Thank you David, I appreciate your effort, it really helped me solve a problem when sharing my Arabic post links to Facebook
