Unicode support in log messages and file names

Andrey grooz-work at gorodok.net
Sat Nov 11 16:24:06 CST 2006


> Do you really want to handle characters internally as 32-bit quantities?
> In practice, this will quadruple string overhead and break almost all of
> the ordinary string handling routines?
>
> I've been through this in several systems now. Using UTF-8 encoding
> internally is the right answer.

Python's string handling routines work perfectly well with unicode strings, 
without noticeable performance overhead. Moreover, they usually DO NOT work 
on UTF-8 byte strings. For example, s[:3] or s.upper() won't give the right 
result for UTF-8 strings containing non-Latin (multibyte) characters.
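
To illustrate the point about s[:3] and s.upper(), here is a minimal sketch. 
It assumes Python 3 naming, where str is the unicode string type and bytes 
holds the UTF-8 encoding; the sample text and variable names are only 
illustrative.

```python
# Assumption: Python 3, where str holds code points and bytes holds UTF-8.
text = "Привет, мир"          # unicode string
raw = text.encode("utf-8")    # the same text as a UTF-8 byte string

# Slicing the unicode string yields the first three characters.
first_chars = text[:3]

# Slicing the byte string yields three bytes, which cut the second
# character in half (each Cyrillic letter takes two bytes in UTF-8).
first_bytes = raw[:3]
try:
    first_bytes.decode("utf-8")
    slice_is_valid = True
except UnicodeDecodeError:
    slice_is_valid = False

# upper() on the unicode string handles Cyrillic; on bytes it only
# touches ASCII letters, so the encoded text comes back unchanged.
upper_text = text.upper()
upper_raw = raw.upper()

print(first_chars, slice_is_valid, upper_text, upper_raw == raw)
```

The byte slice is not even valid UTF-8 any more, and upper() silently does 
nothing on the encoded form, which is why the routines need unicode strings 
rather than UTF-8 byte strings.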

> Note also: it is insufficient to say "UNICODE UTF-8". You also need to
> specify the normalization.
>
> The normalization that is almost universally adopted for UNICODE is
> normalization C.

I am not sure normalization is necessary for us if all we want is to have 
non-Latin log messages displayed correctly. :) Still, we can use 
unicodedata.normalize() for that.
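
A short sketch of what unicodedata.normalize() does with normalization form 
C (NFC), again assuming Python 3. The sample strings are only illustrative; 
they spell the same word once with a precomposed letter and once with a 
base letter plus a combining accent.

```python
import unicodedata

# Two spellings of the same visible text (illustrative values):
composed = "caf\u00e9"      # "é" as a single code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent

# Without normalization the two strings compare unequal.
equal_before = composed == decomposed

# NFC composes base letter + combining mark into the precomposed form.
normalized = unicodedata.normalize("NFC", decomposed)
equal_after = composed == normalized

print(equal_before, equal_after, len(decomposed), len(normalized))
```

Normalization matters for comparing or looking up strings; for simply 
displaying a log message, both spellings usually render the same.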


More information about the Mercurial-devel mailing list