On how to sort an array of UTF-8 strings

September 24, 2008 at 12:16

This article is based on a question I asked on stackoverflow.com. It shows how I answered the question myself and, in the process, discovered a PHP bug on Windows.

Sorting an array of strings in PHP seems to be a no-brainer. There are plenty of sort functions, with sort() being the most common one. The problem arises when the strings in the array are multi-byte encoded, for example in UTF-8. The PHP comparison functions compare such strings byte by byte rather than character by character, so sorting does not work as expected. Furthermore, language-specific sorting rules are not taken into account when sort() is called with its default parameters. In Swedish, for example, Ä is sorted at the end of the alphabet, while in German Ä is normally treated as equivalent to A (when using the DIN 5007 sorting method).
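To make the byte-wise behaviour concrete, here is a minimal sketch that uses nothing but plain sort(). The umlauts are written as explicit UTF-8 byte escapes, so the result does not depend on how the file itself is saved.

```php
<?php
// UTF-8 encodes Ä, Ö, Ü and ü as two bytes each, the first one being 0xC3.
$words = array(
    "\xC3\x9Cbergabe",      // Übergabe
    'Ostfriesland',
    "\xC3\x84pfel",         // Äpfel
    "Unterf\xC3\xBChrung",  // Unterführung
    'Apfel',
    "\xC3\x96sterreich",    // Österreich
);

// SORT_STRING compares the raw bytes, so every word starting with 0xC3
// ends up after all words starting with an ASCII letter.
sort($words, SORT_STRING);

echo implode(', ', $words), "\n";
// Apfel, Ostfriesland, Unterführung, Äpfel, Österreich, Übergabe
```

Neither the Swedish nor the German convention is met: the umlauts are simply pushed behind everything else.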

Fortunately, PHP provides a function that copes with this problem: strcoll(). It can be used for array sorting by passing the function name as the second parameter to usort(). The sort() function also has a flag (SORT_LOCALE_STRING), which seems to do the same as usort() with a strcoll() callback.

In summary, sorting an array of UTF-8 strings in a language-aware manner is more or less a question of setting the correct locale. Let’s look at the following example, which uses German as the reference language and is saved with UTF-8 encoding.

$array=array('Übergabe', 'Ostfriesland', 'Äpfel', 'Unterführung', 'Apfel', 'Österreich');
$oldLocale=setlocale(LC_COLLATE, "0");
setlocale(LC_COLLATE, 'de_DE.utf8');
usort($array, 'strcoll'); // or equivalent sort($array, SORT_LOCALE_STRING);
setlocale(LC_COLLATE, $oldLocale);

This will result in an array of Apfel, Äpfel, Österreich, Ostfriesland, Übergabe, Unterführung (obviously we’re using DIN 5007 sorting here).

As sorting is now locale-dependent, we have to take the PHP environment into account: which machine is our script running on, Windows or *nix?

First of all, on a *nix machine the locale we want to use must be installed on the system. You can get a list of installed locales by running locale -a on the command line. Be sure to pick the variant of the locale whose encoding matches the encoding of your strings.
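Because setlocale() returns false when none of the requested locales is available, a missing locale can be detected before sorting. The following is only a sketch: the helper name sortWithLocale() and the candidate locale spellings are assumptions made for illustration, and the file is assumed to be saved as UTF-8.

```php
<?php
/**
 * Sorts $strings with strcoll() under the first available locale from
 * $candidates. Returns false if none of the locales can be set.
 * (Hypothetical helper for illustration.)
 */
function sortWithLocale(array $strings, array $candidates)
{
    $oldLocale = setlocale(LC_COLLATE, '0');   // "0" only queries the current setting
    if (setlocale(LC_COLLATE, $candidates) === false) {
        return false;                          // nothing was changed, nothing to restore
    }
    usort($strings, 'strcoll');
    setlocale(LC_COLLATE, $oldLocale);
    return $strings;
}

// The spellings below are assumptions; check `locale -a` for the names on your system.
$sorted = sortWithLocale(
    array('Unterführung', 'Apfel', 'Äpfel'),
    array('de_DE.utf8', 'de_DE.UTF-8')
);

if ($sorted === false) {
    echo "No German UTF-8 locale installed\n";
} else {
    echo implode(', ', $sorted), "\n";
}
```

Passing an array to setlocale() lets PHP try each name in turn, which helps when different systems spell the same locale differently.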

Things get more complicated on Windows machines, as locales are named differently. The default naming scheme is Language_Country.CodePage. Information on Windows locales can be found on MSDN: Language and Country/Region Strings, Language Strings, Country/Region Strings and Code Pages. Furthermore, encodings are not specified by name as on *nix machines but by code page number. As we’re using UTF-8 in our example, we have to use the Windows UTF-8 code page, which is 65001. Putting all this together, we arrive at the locale German_Germany.65001 for our example. For the sake of completeness: the usual code page for Western Europe is 1252.

This leads us to the following code snippet (UTF-8 encoded strings):

$array=array('Übergabe', 'Ostfriesland', 'Äpfel', 'Unterführung', 'Apfel', 'Österreich');
$oldLocale=setlocale(LC_COLLATE, "0");
setlocale(LC_COLLATE, 'German_Germany.65001');
usort($array, 'strcoll'); // or equivalent sort($array, SORT_LOCALE_STRING);
setlocale(LC_COLLATE, $oldLocale);

What the heck???? Übergabe, Apfel, Ostfriesland, Unterführung, Äpfel, Österreich?? That obviously doesn’t work… What’s the problem? Let’s try to use non UTF-8 strings (don’t forget to recode the file to ANSI, Windows-1252 or ISO-8859-1):

$array=array('Übergabe', 'Ostfriesland', 'Äpfel', 'Unterführung', 'Apfel', 'Österreich');
$oldLocale=setlocale(LC_COLLATE, "0");
setlocale(LC_COLLATE, 'German_Germany.1252');
usort($array, 'strcoll'); // or equivalent sort($array, SORT_LOCALE_STRING);
setlocale(LC_COLLATE, $oldLocale);

Now we get Apfel, Äpfel, Österreich, Ostfriesland, Übergabe, Unterführung. OK, non-UTF-8 is working correctly. Let’s dig in deeper. What does strcoll() do with my array? Let’s trace what’s going on (thanks to Huppie for the idea of tracing what strcoll() is doing):

function traceStrColl($a, $b) {
    $outValue=strcoll($a, $b);
    echo "$a $b $outValue\r\n";
    return $outValue;
}

$array=array('Übergabe', 'Ostfriesland', 'Äpfel', 'Unterführung', 'Apfel', 'Österreich');
$oldLocale=setlocale(LC_COLLATE, "0");
setlocale(LC_COLLATE, 'German_Germany.65001');
usort($array, 'traceStrColl');
setlocale(LC_COLLATE, $oldLocale);

The output is:

Äpfel Ostfriesland 2147483647
Äpfel Übergabe 2147483647
Äpfel Unterführung 2147483647
Äpfel Apfel 2147483647
Österreich Äpfel 2147483647
Ostfriesland Apfel 2147483647
Ostfriesland Übergabe 2147483647
Unterführung Ostfriesland 2147483647
Apfel Übergabe 2147483647

As you can see, strcoll() returns 2147483647 (the largest 32-bit signed integer) for every comparison. This is reproducible and occurs only on Windows machines (by the way, the PHP version does not seem to matter, as I tried the snippet on PHP 5.2.4, 5.2.5 and 5.2.6). This is what I’d classify as a bug, so I filed a bug report on bugs.php.net: Bug #46165 strcoll() does not work with UTF-8 strings on Windows

Summary: Currently it is not possible to sort UTF-8 strings on a Windows machine simply by using PHP-provided functions. A possible workaround (suggested by ΤΖΩΤΖΙΟΥ on stackoverflow.com) is to recode the strings to Windows-1252 or ISO-8859-1 (using mb_convert_encoding() or iconv()) and to sort the recoded array.
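The recoding workaround could look roughly like the sketch below. The helper name sortUtf8ViaCodePage() is hypothetical, and the approach assumes that every string can be represented in the target code page: iconv() returns false for a string it cannot convert, in which case the sketch simply gives up.

```php
<?php
/**
 * Sorts UTF-8 strings by comparing single-byte copies of them with strcoll().
 * Returns the sorted UTF-8 strings, or false if the locale is unavailable
 * or a string cannot be converted. (Hypothetical helper for illustration.)
 */
function sortUtf8ViaCodePage(array $strings, $locale, $codePage = 'CP1252')
{
    // Build single-byte sort keys, keeping the original array keys.
    $sortKeys = array();
    foreach ($strings as $key => $string) {
        $converted = iconv('UTF-8', $codePage, $string);
        if ($converted === false) {
            return false;
        }
        $sortKeys[$key] = $converted;
    }

    $oldLocale = setlocale(LC_COLLATE, '0');
    if (setlocale(LC_COLLATE, $locale) === false) {
        return false;
    }
    uasort($sortKeys, 'strcoll');   // sorts by value, preserves the key association
    setlocale(LC_COLLATE, $oldLocale);

    // Emit the original UTF-8 strings in the order of the sorted keys.
    $sorted = array();
    foreach (array_keys($sortKeys) as $key) {
        $sorted[] = $strings[$key];
    }
    return $sorted;
}

// On Windows, using the locale name from the article with the matching code page.
// On other systems this locale name will normally be unknown and $result is false.
$result = sortUtf8ViaCodePage(array('Übergabe', 'Apfel', 'Äpfel'), 'German_Germany.1252');
```

The original UTF-8 strings are never modified; only temporary copies are recoded, so the caller gets back the same data in a new order.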

Entry filed under: PHP.


8 Comments

  • 1. Update on strcoll() UTF-8 issue « <?php  |  December 8, 2008 at 11:45

    […] charset, e.g. German_Germany.65001, and therefore you cannot use strcoll() or similar functions for locale-aware string operations with these charsets. It simply is not possible due to a Windows CRT […]

  • 2. Christian  |  April 5, 2009 at 10:44

    Thanks! I was just wondering why sorting didn’t work in Windows, but now I know!

  • 3. Stefan Gehrig  |  April 5, 2009 at 13:40

    @Christian:

    Please make sure you also read https://sgehrig.wordpress.com/2008/12/08/update-on-strcoll-utf-8-issue/ which clarifies this issue as a non-PHP-related issue but rather a Windows problem.

  • 4. Alyssa Calderon  |  May 27, 2010 at 18:52

    Really great read! Honestly!

  • 5. Bettye Dobbs  |  May 30, 2010 at 23:35

    If only more than 13 people could hear this!

  • […] On how to sort an array of UTF-8 strings […]

  • 7. Eugenio  |  September 6, 2012 at 18:10

    Thanks a lot for the detailed and great explanation.
    I guess the problem is still there isn’t it? They said to use mbstring instead but…is there a mbstring version of sort or strcoll? I don’t think so…

  • 8. Konstantins  |  January 13, 2013 at 12:09

    thanks a lot this was helpful
