Posts tagged ‘utf8’

Update on strcoll() UTF-8 issue

I just stumbled upon a comment in the Zend Framework issue tracker about a UTF-8 issue with PHP on Windows (the issue concerned a problem within Zend_Ldap). The comment pointed to an MSDN page about setlocale and _wsetlocale, which clearly states that the CRT function setlocale() does not work on Windows with code pages that need more than two bytes per character, such as UTF-8:

The set of available languages, country/region codes, and code pages includes all those supported by the Win32 NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8. If you provide a code page like UTF-7 or UTF-8, setlocale will fail, returning NULL. The set of language and country/region codes supported by setlocale is listed in Language and Country/Region Strings.

That means that setlocale() fails on Windows when given a locale with a UTF-8 code page, e.g. German_Germany.65001, and therefore you cannot use strcoll() or similar functions for locale-aware operations on UTF-8 strings. It simply is not possible due to a Windows CRT limitation.
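Since PHP's setlocale() returns false when the requested locale cannot be set, a script can at least notice the failure instead of sorting with a locale that was never applied. The following is only a minimal sketch of such a guard; the helper name sortWithLocale() is made up for this illustration and is not part of PHP.

```php
<?php
// Hypothetical helper (not a PHP built-in): sort with strcoll() only if
// the requested collation locale was actually accepted.
function sortWithLocale(array $strings, $locale)
{
    // "0" only queries the current setting, it does not change anything.
    $oldLocale = setlocale(LC_COLLATE, "0");

    // setlocale() returns false if the locale is rejected, which is what
    // the MSDN text describes for UTF-8 code pages on Windows.
    if (setlocale(LC_COLLATE, $locale) === false) {
        return false;
    }

    usort($strings, 'strcoll');

    // Restore the previous collation locale.
    setlocale(LC_COLLATE, $oldLocale);

    return $strings;
}
```

A caller would check the result with === false and fall back to another strategy, for example a different locale or a different string encoding.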


December 8, 2008 at 11:45 1 comment

On how to sort an array of UTF-8 strings

This article is based on a question I asked on stackoverflow.com. It shows how I answered the question myself and, along the way, discovered a PHP bug on Windows.

Sorting an array of strings in PHP seems to be a no-brainer. There are a lot of sort functions, with sort() being the most common one. The problem arises when the strings in the array use a multi-byte encoding such as UTF-8. Because PHP’s comparison functions compare strings byte by byte and know nothing about multi-byte characters, sorting does not work as expected either. Furthermore, language-specific sorting rules are not taken into account when sorting with sort() and its default parameters. In Swedish, for example, Ä is sorted at the end of the alphabet, while in German Ä is normally treated as equivalent to A (when using the DIN 5007 sorting method).
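To make the byte-by-byte behaviour concrete, here is a small sketch, assuming the script file itself is saved as UTF-8. The function name byteSort() is only a label for this illustration. In UTF-8 the umlauts start with the byte 0xC3, which is larger than any ASCII letter, so they all end up after the unaccented words.

```php
<?php
// Illustration only: plain sort() compares the raw bytes of each string.
function byteSort(array $strings)
{
    sort($strings);
    return $strings;
}

$words = array('Übergabe', 'Ostfriesland', 'Äpfel', 'Unterführung', 'Apfel', 'Österreich');

// In UTF-8, "Ä" is the two-byte sequence C3 84.
echo bin2hex('Ä'), PHP_EOL; // c384

echo implode(', ', byteSort($words)), PHP_EOL;
// Apfel, Ostfriesland, Unterführung, Äpfel, Österreich, Übergabe
```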

Fortunately, PHP provides a function which copes with this problem: strcoll(), which compares two strings according to the collation rules of the current locale. It can be used for array sorting by simply passing the function name as the second parameter to usort(). The sort() function also has a flag (SORT_LOCALE_STRING) which seems to do the same as usort() together with a strcoll() callback.

To summarize, sorting an array of UTF-8 strings in a language-aware manner is more or less just a question of setting the correct locale. Let’s look at the following example, which uses German as the reference language and is saved with UTF-8 encoding.

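// The file containing these string literals must be saved as UTF-8.
$array=array('Übergabe', 'Ostfriesland', 'Äpfel', 'Unterführung', 'Apfel', 'Österreich');
$oldLocale=setlocale(LC_COLLATE, "0"); // "0" only queries the current locale
setlocale(LC_COLLATE, 'de_DE.utf8'); // switch collation to German, UTF-8
usort($array, 'strcoll'); // or equivalent sort($array, SORT_LOCALE_STRING);
setlocale(LC_COLLATE, $oldLocale); // restore the previous locale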

This will result in an array of Apfel, Äpfel, Österreich, Ostfriesland, Übergabe, Unterführung (obviously we’re using DIN 5007 sorting here).

As sorting is now locale-dependent, we have to take the PHP environment into account: is the script running on Windows or on *nix?

First of all, on a *nix machine the locale you want to use must be installed on the system. You can get a list of installed locales by running the command locale -a on the command line. Be sure to pick the correct encoding for the desired locale: the locale’s encoding must match the encoding of the strings.

Things get more complicated on Windows machines, as locales are named differently. The default naming scheme is Language_Country.CodePage. Information on locales on Windows can be found on MSDN: Language and Country/Region Strings, Language Strings, Country/Region Strings and Code Pages. Furthermore, encodings are not named as they are on *nix machines but are specified as code pages. As we’re using UTF-8 in our example, we have to use the UTF-8 Windows code page, which is 65001. Putting all this together, we arrive at the locale German_Germany.65001 for our example. For the sake of completeness: the usual code page for Western Europe is 1252.
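Because the locale names differ between platforms, it can help that setlocale() also accepts an array of candidate names and returns the first one that could be set (or false if none works). The sketch below relies on that; the helper name setFirstAvailableCollation() and the particular candidate list are assumptions for illustration, not a fixed recipe.

```php
<?php
// Hypothetical helper: try several locale names in order and return the
// name that was actually set, or false if every candidate was rejected.
function setFirstAvailableCollation(array $candidates)
{
    // setlocale() tries each array element until one succeeds.
    return setlocale(LC_COLLATE, $candidates);
}

// Example candidate list (assumed spellings; check `locale -a` on *nix).
$germanCandidates = array(
    'de_DE.utf8',          // spelling used in this article for *nix
    'de_DE.UTF-8',         // alternative spelling found on some systems
    'German_Germany.1252', // Windows, single-byte code page 1252
);

// $chosen = setFirstAvailableCollation($germanCandidates);
// if ($chosen === false) { /* no German collation available */ }
```

The returned name matters: it tells you which encoding the strings must be in, since a 1252 locale only collates Windows-1252 strings correctly and a UTF-8 locale only UTF-8 strings.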

This leads us to the following code snippet (UTF-8 encoded strings):

$array=array('Übergabe', 'Ostfriesland', 'Äpfel', 'Unterführung', 'Apfel', 'Österreich');
$oldLocale=setlocale(LC_COLLATE, "0");
setlocale(LC_COLLATE, 'German_Germany.65001');
usort($array, 'strcoll'); // or equivalent sort($array, SORT_LOCALE_STRING);
setlocale(LC_COLLATE, $oldLocale);

What the heck? The result is Übergabe, Apfel, Ostfriesland, Unterführung, Äpfel, Österreich. That obviously doesn’t work… What’s the problem? Let’s try to use non-UTF-8 strings (don’t forget to recode the file to ANSI, Windows-1252 or ISO-8859-1):

$array=array('Übergabe', 'Ostfriesland', 'Äpfel', 'Unterführung', 'Apfel', 'Österreich');
$oldLocale=setlocale(LC_COLLATE, "0");
setlocale(LC_COLLATE, 'German_Germany.1252');
usort($array, 'strcoll'); // or equivalent sort($array, SORT_LOCALE_STRING);
setlocale(LC_COLLATE, $oldLocale);

Now we get Apfel, Äpfel, Österreich, Ostfriesland, Übergabe, Unterführung. OK, non-UTF-8 is working correctly. Let’s dig in deeper. What does strcoll() do with my array? Let’s trace what’s going on (thanks to Huppie for the idea of tracing what strcoll() is doing):

function traceStrColl($a, $b) {
    $outValue=strcoll($a, $b);
    echo "$a $b $outValue\r\n";
    return $outValue;
}

$array=array('Übergabe', 'Ostfriesland', 'Äpfel', 'Unterführung', 'Apfel', 'Österreich');
$oldLocale=setlocale(LC_COLLATE, "0");
setlocale(LC_COLLATE, 'German_Germany.65001');
usort($array, 'traceStrColl');
setlocale(LC_COLLATE, $oldLocale);

The output is:

Äpfel Ostfriesland 2147483647
Äpfel Übergabe 2147483647
Äpfel Unterführung 2147483647
Äpfel Apfel 2147483647
Österreich Äpfel 2147483647
Ostfriesland Apfel 2147483647
Ostfriesland Übergabe 2147483647
Unterführung Ostfriesland 2147483647
Apfel Übergabe 2147483647

As you can see, strcoll() returns 2147483647 (the largest 32-bit signed integer) on every comparison. This is reproducible and occurs only on Windows machines (the PHP version does not seem to matter, as I tried the snippet on PHP 5.2.4, 5.2.5 and 5.2.6). This is what I’d classify as a bug, so I filed a bug report on bugs.php.net: Bug #46165 strcoll() does not work with UTF-8 strings on Windows

Summary: Currently it is not possible to sort UTF-8 strings on a Windows machine using only PHP-provided functions. A possible workaround is to recode the strings to Windows-1252 or ISO-8859-1 (using mb_convert_encoding() or iconv()) and sort the recoded array (suggested by ΤΖΩΤΖΙΟΥ on stackoverflow.com).
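The recoding workaround could look roughly like the following sketch. It assumes the iconv extension is available and that every string can be represented in the single-byte target charset; the function name sortUtf8ViaSingleByte() is invented for this example.

```php
<?php
// Hypothetical helper: sort UTF-8 strings by converting them to a
// single-byte charset, sorting with strcoll(), and converting back.
function sortUtf8ViaSingleByte(array $utf8Strings, $locale, $charset)
{
    $oldLocale = setlocale(LC_COLLATE, "0");
    if (setlocale(LC_COLLATE, $locale) === false) {
        return false; // locale not available on this system
    }

    // Recode every string from UTF-8 to the single-byte charset.
    $recoded = array();
    foreach ($utf8Strings as $string) {
        $converted = iconv('UTF-8', $charset, $string);
        if ($converted === false) {
            setlocale(LC_COLLATE, $oldLocale);
            return false; // character not representable in $charset
        }
        $recoded[] = $converted;
    }

    usort($recoded, 'strcoll');
    setlocale(LC_COLLATE, $oldLocale);

    // Convert the sorted strings back to UTF-8.
    $sorted = array();
    foreach ($recoded as $string) {
        $sorted[] = iconv($charset, 'UTF-8', $string);
    }
    return $sorted;
}

// Usage on Windows, with the locale name used earlier in this article:
// $sorted = sortUtf8ViaSingleByte($array, 'German_Germany.1252', 'Windows-1252');
```

The obvious limitation is that this only works for text that fits into the chosen code page; strings with characters outside Windows-1252 make the conversion fail, and the sketch then returns false.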

September 24, 2008 at 12:16 8 comments

