SEM Labs

Handcrafted Pixels, Code & Title Tags

Enabling Chinese, Arabic and Other High Unicode in WordPress Slugs

A while back someone contacted me about their WordPress blog borking out on some of their posts. After a bit of poking about it became apparent that this was because WordPress doesn't allow high Unicode characters in the URL. At first, I thought this would just be a change to a line in .htaccess, but there are a couple of other things that need to be changed too.

Here are the instructions to allow your WordPress blog to have high Unicode like Han or Arabic in the URL:

Edit .htaccess

First of all open up your .htaccess. You will find a block something like this:

You will need to replace the fourth and fifth lines with one of the following:

Edit Post Slugs

The first solution allows URLs containing high Unicode characters, while the second allows all URLs to go through. Either way, URLs containing those characters are re-routed to index.php so WordPress can deal with them. However, there are a couple of other things you need to do...
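The rewrite rules themselves are not reproduced above, so the following is only a sketch of the idea behind the first solution, written in PHP rather than mod_rewrite syntax. It assumes the rule works by adding the byte range \x7f-\xff to the allowed characters. Both patterns and the helper name slug_pattern_matches are illustrative assumptions, not the original rules.

```php
<?php
// Sketch only: compares an ASCII-only path pattern with one that also
// accepts the bytes \x7f-\xff. Every byte of a multi-byte UTF-8 character
// falls inside that range. Both patterns are assumptions for illustration.

// Hypothetical helper: true if the whole path matches the pattern.
function slug_pattern_matches(string $pattern, string $path): bool
{
    return preg_match($pattern, $path) === 1;
}

$asciiOnly   = '/^[a-zA-Z0-9_\/-]+$/';
$withHighSet = '/^[a-zA-Z0-9_\/\x7f-\xff-]+$/';

$paths = ['hello-world', '中文/页面', 'مرحبا'];

foreach ($paths as $path) {
    printf(
        "%s ascii=%s high=%s\n",
        $path,
        slug_pattern_matches($asciiOnly, $path) ? 'yes' : 'no',
        slug_pattern_matches($withHighSet, $path) ? 'yes' : 'no'
    );
}
```

Note that the hyphen sits at the end of the character class, so it is read as a literal hyphen and not as part of a range.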

  • Open up the file wp-includes/query.php
  • Search for the line: $q['name'] = sanitize_title($q['name']);
  • and replace this with: $q['name'] = addslashes(strip_tags($q['name']));

This will stop WordPress from converting the requested post slug to UTF-8 character codes, so posts whose slugs contain those characters will now be loaded from the database.
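To see why the replacement line leaves high Unicode alone, here is a small standalone sketch in plain PHP with no WordPress loaded. It only exercises strip_tags and addslashes; the helper name clean_requested_slug and the sample slugs are made up for illustration.

```php
<?php
// Sketch: what addslashes(strip_tags(...)) does to a requested slug.
// clean_requested_slug is a hypothetical wrapper, not a WordPress function.

function clean_requested_slug(string $name): string
{
    return addslashes(strip_tags($name));
}

$requested = [
    '中文标题',       // high Unicode passes through unchanged
    "o'reilly",      // the single quote gets a backslash
    '<b>标题</b>',    // the tags are removed
];

foreach ($requested as $name) {
    echo $name, ' => ', clean_requested_slug($name), "\n";
}
```

Every byte of a multi-byte UTF-8 character is above 0x7F, so addslashes never inserts a backslash inside one.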

Edit Page Slugs

  • Open wp-includes/post.php
  • Search for the function called get_page_by_path
  • Copy the following to the top of the function:

This will allow any pages that contain high Unicode characters in their slugs to be loaded.

Edit Category Slugs

  • Open wp-includes/category.php
  • Search for the function called get_category_by_path
  • Comment out this line: $category_path = rawurlencode(urldecode($category_path)); - you do this by adding a hash at the beginning of the line
  • After the following line: $full_path .= ( $pathdir != '' ? '/' : '' ) . sanitize_title( $pathdir ); paste this:

This will allow any categories that contain high Unicode characters in their slugs to be loaded.
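For context, the line being commented out round-trips the path through urldecode and rawurlencode, which leaves it percent-encoded. A minimal standalone sketch of that effect in plain PHP; the sample path is an assumption:

```php
<?php
// Sketch: effect of rawurlencode(urldecode($path)) on a category path.
// The sample path is hypothetical.

$category_path = '新闻';

// Same expression as the line the article says to comment out.
$normalised = rawurlencode(urldecode($category_path));

echo $normalised, "\n";            // %E6%96%B0%E9%97%BB
echo urldecode($normalised), "\n"; // 新闻
```

With that line commented out, the path keeps its raw characters, which is presumably what lets it match a slug stored with high Unicode.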

Comments

webtech Replied at 5:25 AM on 12 Feb 2009

I've always wondered about this, not really because I care about speaking Chinese or Arabic, but showing translated text is important for me. Thanks!

Rima Replied at 12:10 PM on 27 Feb 2009

Great Post!

Actually I am developing a project with symfony, and I have a problem with my slugify class: it does not support any high Unicode..

Here is the function code:

static public function slugify($text)
{
    // replace all non letters or digits by -
    $text = preg_replace('/\W+/', '-', $text);

    // trim and lowercase
    $text = strtolower(trim($text, '-'));

    return $text;
}

Any idea of how to make it support high unicode?

Rima Replied at 4:25 PM on 28 Feb 2009

Hi

I could find the solution for this by checking if the text is ASCII..

static public function slugify($text)
{
    // replace all non letters or digits by -
    $text = preg_replace('/\W+/', '-', $text);

    // trim and lowercase
    if( mb_detect_encoding($text, 'ASCII') ){
        $text = strtolower(trim($text, '-'));
    }

    return $text;
}

But I still can't get the high unicode in the URL.. .htaccess rewrite rule looks like this

RewriteRule ^$ index.html [QSA]

RewriteRule ^([^.]+)$ $1.html [QSA]

David Replied at 11:13 AM on 1 Mar 2009

Hi, this is the function I use to convert string to "search engine friendly URLs":


setlocale( LC_CTYPE, 'en_GB.utf8' );
header( 'Content-Type: text/html; charset=utf-8' );

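function translit( $str )
{
	# Transliterate to ASCII; characters iconv cannot convert typically come back as '?'
	$str_trans = iconv( 'UTF-8', 'ASCII//TRANSLIT', $str );
	$str_res = '';
	for ( $i = 0; $i < strlen( $str_trans ); $i++ )
	{
		$chr1 = $str_trans[$i];
		# Fall back to the original character at the same index where the result is '?'
		$chr2 = mb_substr( $str, $i, 1, 'UTF-8' );
		$str_res .= ( $chr1 == '?' ) ? $chr2 : $chr1;
	}
	return $str_res;
}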

function strtourl( $str, $seperator = '-', $case = MB_CASE_LOWER, $translit = true ) {
	
	# Validate and sanitise input
	if( strlen ( $str ) < 1 )
		trigger_error( 'First argument must be one or more characters', E_USER_ERROR );
	if( !in_array( $seperator, array( "-", "_", "+" ) ) )
		trigger_error( 'Second argument must be a hyphen, underscore or plus sign', E_USER_ERROR );
	$str = trim( $str );
	$str = strip_tags( $str );
	$flags = ( int ) $case;
	
	# Transliterate
	if( $translit == true )
	{
		if( function_exists( 'transliterate' ) )
			$str = transliterate( $str, array( 'han_transliterate', 'jamo_transliterate', 'greek_transliterate', 'hebrew_transliterate', 'cyrillic_transliterate' ), 'UTF-8', 'UTF-8' );
		if( function_exists( 'iconv' ) )
			$str = translit( $str );
	}
	
	# Clean ampersands, whitespace and punctuation
	$patterns = array( "/[&|&#][a-zA-Z0-9]{2,7}[;]/", "/\s{2,}/", "/[\t\r\n\v\f]+/", "/[\"`¬!£$%^%&*()={}\[\]#~'@;:.>,<|\/?\\\]+/" );
	$replacements = array( "", " ", " ", "" );
	$str = preg_replace( $patterns, $replacements, $str );
	
	# Replace spaces with separators, then collapse repeated separators
	$str = str_replace( ' ', $seperator, $str );
	$str = preg_replace( '/' . preg_quote( $seperator, '/' ) . '{2,}/', $seperator, $str );
	
	# Convert case
	$str = mb_convert_case( $str, $case, 'UTF-8' );
	
	return $str;
		
}

echo strtourl( 'slug semlabs.co.uk Some text & and AmPersand', '-', MB_CASE_LOWER, true );
echo strtourl( 'slug semlabs.co.uk Some text & and AmPersand', '-', MB_CASE_TITLE, false );

It has a couple of extra features: it uses multi-byte case conversion, which changes the case of characters with accents, and it can also transliterate, which removes any accents if you just want ASCII URL slugs.

I think the reason the above mod_rewrite rules don't work is because they are for WordPress. Try one of these instead:

RewriteRule ^(.+)$ $1.html [QSA]

RewriteRule ^([a-zA-Z0-9_\x7f-\xff-]+)$ $1.html [QSA]

If you use that transliteration, ensure you have a locale defined. Otherwise some characters won't come out properly.

Rima Replied at 12:52 PM on 1 Mar 2009

Hi, thank you for the reply!

Actually my class did better than yours for non-ASCII, as yours turned the strings into squares.. after I modified it to:

if (!(mb_ereg("[^\w\s\.\-]", $text))) {
    $text = preg_replace('/\W+/', '-', $text);
    $text = strtolower(trim($text, '-'));
}
else
{
    $text = str_replace(" ", "-", $text);
    $text = trim($text, '-');
}

return $text;

But I need to add to it the pattern that will clean punctuation, like the one you have..

I'll try to apply them to my class

as for .htaccess..

I added this

RewriteRule ^(.+/)$ $1.html [QSA]

and tried this also

RewriteRule ^$ index.html [QSA]

which I took from your post.. they work OK, but the issue is that the URL is written in raw UTF.. which is really ugly..

Thanks a lot

David Replied at 1:48 PM on 2 Mar 2009

The squares you are talking about are probably due to either you not having set a locale, e.g. setlocale( LC_CTYPE, 'en_GB.utf8' );

Or it may be that the font you are using doesn't have those characters available. By raw UTF in the URL, do you mean something like %7F, or accented characters like ??

Hemant Replied at 10:12 AM on 10 May 2009

Hi,

I am trying to use this solution on WPMU 2.7.1 but my PHP files look different from what you have written here. Can you please let me know where to make the changes in my installation?

Thanks and Regards,

Hemant

David Replied at 10:34 AM on 10 May 2009

Could you provide more specific details as to what you are having trouble with? Try using the find facility in a text editor to search the files for the relevant parts.

Jolon Replied at 9:14 AM on 5 Jun 2009

Hoping someone can help. I run a blog and use Chinese as the main language. I've found that I can't create a "page" and use the name of the page (Chinese text) as the URL for the page. I have no problem with categories being in Chinese, but creating a "page" and using a Chinese URL doesn't work. I assume this is a problem that is mentioned above? I read through the fix but I am a little cautious as to doing it. Wondering if anyone can shed some light?

David Replied at 6:48 PM on 5 Jun 2009

You could try this. I think it does the same thing.

socialpreneur Replied at 5:32 PM on 3 Oct 2009

Could this hack be written into a WP plugin? I don't mind fixing .htaccess, but changing these every time on WP upgrades is kinda...hard work. I really wonder why the WordPress team never seems to care about international users. I really hope this is added to core.

joe Replied at 11:22 AM on 23 Dec 2009

I know...the fact that the WordPress team isn't aware of this is a real pain for international users. Now, let me try your code. Thanks so much because nobody seems to care about this except semlabs!!

chris Replied at 8:34 AM on 5 Jun 2010

The big problem is how you know if the user is typing Thai, Chinese or Japanese.. if you don't know, the iconv will fail.
