
On how to sort an array of UTF-8 strings

This article is based on a question I asked on stackoverflow.com. It shows how I ended up answering the question myself and, along the way, discovered a PHP bug on Windows.

Sorting an array of strings in PHP looks like a no-brainer. There are plenty of sort functions, with sort() being the most common one. The problem arises when the strings in the array are multi-byte encoded, for example as UTF-8. PHP’s comparison functions are not encoding-aware (they compare byte by byte), so sorting does not work as expected. Furthermore, language-specific sorting rules are not taken into account when calling sort() with its default parameters. In Swedish, for example, Ä is sorted at the end of the alphabet, while in German Ä is normally treated as equivalent to A (when using the DIN 5007 sorting method).
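To make the byte-wise behaviour concrete, here is a minimal sketch. The word list is made up for illustration, and the umlaut is written as an escape sequence so the snippet does not depend on how the file is saved.

```php
<?php
// "\xC3\x84" is the UTF-8 byte sequence for 'Ä' (U+00C4).
$words = array("\xC3\x84pfel", 'Zebra', 'Apfel');

// With default parameters, sort() compares the raw bytes of the strings.
sort($words);

// The lead byte 0xC3 is greater than any ASCII letter, so the word
// starting with 'Ä' ends up after 'Zebra'.
echo implode(', ', $words), "\n"; // Apfel, Zebra, Äpfel
```

No German speaker would expect Äpfel after Zebra, which is exactly the problem described above.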

Fortunately, PHP provides a function that copes with this problem: strcoll(), which compares two strings according to the current locale (the LC_COLLATE category). It can be used for array sorting by passing the function name as the second parameter to usort(). The sort() function also has a flag (SORT_LOCALE_STRING), which appears to do the same as usort() with a strcoll() callback.

In summary, sorting an array of UTF-8 strings in a language-aware manner is more or less just a question of setting the correct locale. Let’s look at the following example, which uses German as the reference language and is saved with UTF-8 encoding.

$array=array('Übergabe', 'Ostfriesland', 'Äpfel', 'Unterführung', 'Apfel', 'Österreich');
$oldLocale=setlocale(LC_COLLATE, "0"); // "0" only queries the current locale
setlocale(LC_COLLATE, 'de_DE.utf8'); // switch collation to German with UTF-8 encoding
usort($array, 'strcoll'); // or equivalent sort($array, SORT_LOCALE_STRING);
setlocale(LC_COLLATE, $oldLocale); // restore the previous locale

This will result in an array of Apfel, Äpfel, Österreich, Ostfriesland, Übergabe, Unterführung (obviously we’re using DIN 5007 sorting here).

Because sorting is now locale-dependent, we have to take the environment PHP runs in into account: is the script running on Windows or on *nix?

First of all, on a *nix machine the locale you use must be installed on the system. You can list the installed locales by running locale -a on the command line. Be sure to pick the variant of the locale whose encoding matches the encoding of your strings.

Things get more complicated on Windows machines, as locales are named differently. The default naming scheme is Language_Country.CodePage. Information on locales on Windows can be found on MSDN: Language and Country/Region Strings, Language Strings, Country/Region Strings and Code Pages. As the naming scheme suggests, encodings are not specified by name as on *nix machines but by code page number. As we’re using UTF-8 in our example, we have to use the Windows code page for UTF-8, which is 65001. Putting all this together, we arrive at the locale German_Germany.65001 for our example. For the sake of completeness: the usual code page for Western Europe is 1252.
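Since the locale name depends on the platform, a script that has to run on both needs to choose the name at run time. The following is only a sketch: the helper name pickCollationLocale() is made up, 'de_DE.UTF-8' is an assumed alternative spelling that some *nix systems use, and the Windows name uses code page 1252, so the strings would have to be in that encoding there.

```php
<?php
/**
 * Hypothetical helper: try each locale name in turn and report which one
 * (if any) the system accepted for collation.
 */
function pickCollationLocale(array $candidates)
{
    // setlocale() accepts an array and uses the first name that works;
    // it returns false when none of them is available.
    return setlocale(LC_COLLATE, $candidates);
}

// Remember the current setting so it can be restored afterwards.
$previous = setlocale(LC_COLLATE, '0');

// PHP_OS starts with "WIN" on Windows builds of PHP.
$isWindows = strtoupper(substr(PHP_OS, 0, 3)) === 'WIN';
$candidates = $isWindows
    ? array('German_Germany.1252')
    : array('de_DE.utf8', 'de_DE.UTF-8');

$chosen = pickCollationLocale($candidates);

if ($chosen === false) {
    echo "No German collation locale is available on this system\n";
} else {
    echo "Using collation locale: $chosen\n";
}

// Restore the previous locale.
setlocale(LC_COLLATE, $previous);
```

Checking the return value matters because setlocale() does not raise an error for an unknown locale; the script would simply keep sorting with whatever locale was active before.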

This leads us to the following code snippet (UTF-8 encoded strings):

$array=array('Übergabe', 'Ostfriesland', 'Äpfel', 'Unterführung', 'Apfel', 'Österreich');
$oldLocale=setlocale(LC_COLLATE, "0");
setlocale(LC_COLLATE, 'German_Germany.65001');
usort($array, 'strcoll'); // or equivalent sort($array, SORT_LOCALE_STRING);
setlocale(LC_COLLATE, $oldLocale);

What the heck? Übergabe, Apfel, Ostfriesland, Unterführung, Äpfel, Österreich? That obviously doesn’t work. What’s the problem? Let’s try non-UTF-8 strings instead (don’t forget to recode the file to ANSI, Windows-1252 or ISO-8859-1):

$array=array('Übergabe', 'Ostfriesland', 'Äpfel', 'Unterführung', 'Apfel', 'Österreich');
$oldLocale=setlocale(LC_COLLATE, "0");
setlocale(LC_COLLATE, 'German_Germany.1252');
usort($array, 'strcoll'); // or equivalent sort($array, SORT_LOCALE_STRING);
setlocale(LC_COLLATE, $oldLocale);

Now we get Apfel, Äpfel, Österreich, Ostfriesland, Übergabe, Unterführung. OK, non-UTF-8 is working correctly. Let’s dig in deeper. What does strcoll() do with my array? Let’s trace what’s going on (thanks to Huppie for the idea of tracing what strcoll() is doing):

function traceStrColl($a, $b) {
    $outValue=strcoll($a, $b);
    echo "$a $b $outValue\r\n";
    return $outValue;
}

$array=array('Übergabe', 'Ostfriesland', 'Äpfel', 'Unterführung', 'Apfel', 'Österreich');
$oldLocale=setlocale(LC_COLLATE, "0");
setlocale(LC_COLLATE, 'German_Germany.65001');
usort($array, 'traceStrColl');
setlocale(LC_COLLATE, $oldLocale);

The output is:

Äpfel Ostfriesland 2147483647
Äpfel Übergabe 2147483647
Äpfel Unterführung 2147483647
Äpfel Apfel 2147483647
Österreich Äpfel 2147483647
Ostfriesland Apfel 2147483647
Ostfriesland Übergabe 2147483647
Unterführung Ostfriesland 2147483647
Apfel Übergabe 2147483647

As you can see, strcoll() returns 2147483647 for every comparison (the largest 32-bit signed integer, which looks more like an error code than a genuine comparison result). This is reproducible and occurs only on Windows machines (by the way, the PHP version does not seem to matter, as I tried the snippet on PHP 5.2.4, 5.2.5 and 5.2.6). This is what I’d classify as a bug, so I filed a bug report on bugs.php.net: Bug #46165 strcoll() does not work with UTF-8 strings on Windows
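A script could guard against this situation with a quick consistency check: for two different strings, a working comparison must change sign when the arguments are swapped. The trace above shows even ASCII-only pairs such as Ostfriesland and Apfel getting the same positive value, so plain ASCII test strings should be enough. This is a minimal sketch with a made-up function name, not an official way to detect the bug.

```php
<?php
/**
 * Hypothetical sanity check: returns true when strcoll() reports a
 * positive result for both ($a, $b) and ($b, $a), which no valid
 * ordering can do.
 */
function strcollLooksBroken($a, $b)
{
    $forward  = strcoll($a, $b);
    $backward = strcoll($b, $a);

    return $forward > 0 && $backward > 0;
}

// Run this after setlocale() has been called with the locale to test.
if (strcollLooksBroken('Apfel', 'Zebra')) {
    echo "strcoll() is not usable with the current LC_COLLATE locale\n";
} else {
    echo "strcoll() gives a consistent answer\n";
}
```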

Summary: At the moment it is not possible to sort UTF-8 strings on a Windows machine using only PHP’s built-in locale-aware functions. A possible workaround (suggested by ΤΖΩΤΖΙΟΥ on stackoverflow.com) is to recode the strings to Windows-1252 or ISO-8859-1 using mb_convert_encoding() or iconv() and sort the recoded array.
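The workaround could look roughly like the sketch below. The function names and the $locale parameter are my own choices for illustration. Instead of converting the sorted array back, the sketch keeps each original UTF-8 string next to its recoded copy and compares only the copies. Characters that do not exist in Windows-1252 cannot be recoded faithfully, so this only helps for Western European text.

```php
<?php
// Comparison callback: compare the single-byte sort keys, not the originals.
function compareSortKeys($x, $y)
{
    return strcoll($x[0], $y[0]);
}

/**
 * Sketch of the recoding workaround (illustrative names). Returns the
 * original UTF-8 strings in sorted order, or false if the locale is
 * not available.
 */
function sortUtf8ViaCp1252(array $strings, $locale)
{
    $oldLocale = setlocale(LC_COLLATE, '0');
    if (setlocale(LC_COLLATE, $locale) === false) {
        return false; // locale unchanged, nothing to restore
    }

    // Pair every original string with a Windows-1252 copy used as sort key.
    $pairs = array();
    foreach ($strings as $s) {
        $key = function_exists('mb_convert_encoding')
            ? mb_convert_encoding($s, 'Windows-1252', 'UTF-8')
            : iconv('UTF-8', 'Windows-1252', $s);
        $pairs[] = array($key, $s);
    }

    usort($pairs, 'compareSortKeys');
    setlocale(LC_COLLATE, $oldLocale);

    // Keep only the untouched UTF-8 originals, now in sorted order.
    $sorted = array();
    foreach ($pairs as $pair) {
        $sorted[] = $pair[1];
    }
    return $sorted;
}

// This file must be saved as UTF-8 for the literals below to be UTF-8.
$array = array('Übergabe', 'Ostfriesland', 'Äpfel', 'Unterführung', 'Apfel', 'Österreich');
$sorted = sortUtf8ViaCp1252($array, 'German_Germany.1252');

if ($sorted === false) {
    echo "Locale German_Germany.1252 is not available here\n";
} else {
    echo implode(', ', $sorted), "\n";
}
```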

September 24, 2008 at 12:16

