Unicode support in log messages and file names

Sat Nov 11 14:58:02 CST 2006

On Sun, 2006-11-12 at 02:46 +0600, Andrey wrote:
> > I've tried exactly this one year ago when Mercurial was much smaller
> > and after talking to other people we (including Matt) decided that
> > the desired way is to immediately convert from local encoding to
> > UTF-8, like Vicent Seguí Pascual originally proposed.
> >
> > Unfortunately, at that time I wanted to do it exactly like you; the
> > result is that we have no unified log encoding yet.
> >
> > You can see his patches in the list archive from July 2005.
> >
> > Thomas
> 
> Well, I still believe that using unicode strings internally is the right way. 
> In fact Python-3000 is going to use them by default instead of bytestrings.

Do you really want to handle characters internally as 32-bit quantities?
In practice, this will quadruple string overhead for mostly-ASCII text
and break almost all of the ordinary string handling routines.
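
To make the overhead concrete, here is a rough sketch in Python that
compares the encoded size of the same text in UTF-8 and in a fixed-width
32-bit encoding. The helper name encoded_sizes and the sample strings
are made up for illustration; nothing here is taken from Mercurial.

```python
# Illustrative only: compare the storage cost of a string in UTF-8
# against a fixed-width 32-bit encoding (UTF-32, no byte-order mark).

def encoded_sizes(text):
    """Return (utf8_bytes, utf32_bytes) for the given text."""
    return len(text.encode("utf-8")), len(text.encode("utf-32-le"))

# Hypothetical sample strings: one pure ASCII, one with a single
# non-ASCII character (U+00ED, i with acute).
samples = {
    "ascii": "changeset 42: fix typo",
    "latin": "Vicent Segu\u00ed Pascual",
}

for name, text in samples.items():
    utf8_len, utf32_len = encoded_sizes(text)
    print(name, "utf-8:", utf8_len, "utf-32:", utf32_len)
```

For pure ASCII input the 32-bit form is exactly four times larger. The
ratio shrinks as the share of non-ASCII characters grows, but UTF-32
never comes out smaller than UTF-8.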

I've been through this in several systems now. Using UTF-8 encoding
internally is the right answer.

Note also: it is insufficient to say "UNICODE UTF-8". You also need to
specify the normalization form, because the same visible text can be
written as more than one sequence of code points.

The normalization that is almost universally adopted for UNICODE is
Normalization Form C (NFC).
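
As a minimal sketch of why this matters, the Python snippet below
builds the same name in precomposed and decomposed form and shows that
the UTF-8 bytes only agree after normalizing to NFC. The function name
to_nfc_utf8 is hypothetical; unicodedata.normalize is the standard
library call.

```python
import unicodedata

def to_nfc_utf8(text):
    """Normalize text to NFC, then encode it as UTF-8 bytes."""
    return unicodedata.normalize("NFC", text).encode("utf-8")

composed = "Segu\u00ed"      # precomposed i-acute (one code point)
decomposed = "Segui\u0301"   # plain 'i' followed by a combining acute

# Without normalization, the two spellings are different byte strings.
print(composed.encode("utf-8") == decomposed.encode("utf-8"))

# After NFC normalization, both spellings yield identical bytes.
print(to_nfc_utf8(composed) == to_nfc_utf8(decomposed))
```

Any tool that compares or hashes log messages or file names byte for
byte would treat the two unnormalized spellings as different strings,
so the normalization form has to be fixed along with the encoding.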

shap