Initial support of Unicode filenames

Victor Stinner victor.stinner at haypocalc.com
Thu Nov 3 10:13:23 CDT 2011


On Thursday 3 November 2011 13:19:04 you wrote:
> Also, I'll see "Sweet crêpe recipe.txt" on my Latin-1 system.

I'm not sure that you understood. If we store the filename as Unicode in 
Mercurial, a checkout will encode filenames to the locale encoding when 
creating files.

You have the Unicode string u"Sweet crêpe recipe.txt". If your locale encoding 
is latin1, Mercurial will create the file:

>>> u"Sweet crêpe recipe.txt".encode('latin1')
'Sweet cr\xeape recipe.txt'

If your locale encoding is UTF-8, it creates the file:

>>> u"Sweet crêpe recipe.txt".encode('UTF-8')
'Sweet cr\xc3\xaape recipe.txt'

If you list the directory content using the "ls" command, the filename is 
decoded from the locale encoding and you will get back your ê (U+00EA).
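
As a rough illustration of that round trip, here is a small sketch in Python 3 syntax (the examples above use Python 2). The helper name roundtrip is made up for this message; it only mimics what a checkout followed by a directory listing would do, assuming both use the same locale encoding.

```python
# Sketch only: encode a name the way a checkout would, then decode it the
# way a directory listing would. "roundtrip" is a hypothetical helper.
NAME = "Sweet cr\u00eape recipe.txt"  # \u00ea is the character ê


def roundtrip(name: str, locale_encoding: str) -> str:
    on_disk = name.encode(locale_encoding)   # bytes written to the filesystem
    return on_disk.decode(locale_encoding)   # text shown back to the user


for enc in ("latin1", "utf-8"):
    # The bytes differ per encoding, but the decoded text is the same.
    print(enc, repr(NAME.encode(enc)), roundtrip(NAME, enc) == NAME)
```

The bytes on disk differ between the two locales, but each user sees the same ê as long as the name is decoded with the encoding that was used to create it.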

> > If this issue does really matter, we may add workarounds like encoding
> > the unencodable characters to something encodable. E.g. replace "ê"
> > (U+00EA) by "%EA" (3 characters encodable to ASCII), Mac OS X and
> > Gnome use this trick somewhere (I am not sure).
> 
> We'll need to recognize the file again for 'hg status' purposes. So it's
> probably no good to encode the "ê" by "%EA" unless we also start
> decoding all "%EA" into "ê" characters.

If we replace a non-encodable ê character with %EA, we also have to convert 
%EA back to ê (U+00EA) when reading the filename.
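
To make the cost of such a workaround concrete, here is a minimal sketch in Python 3. Everything in it is an assumption for illustration: the helper names escape_unencodable and unescape, the choice to escape a literal "%" as "%25", and the restriction to code points up to U+00FF so that two hex digits are enough. It is not what Mercurial, Mac OS X or Gnome actually do.

```python
import re

# Hypothetical scheme: "%XX" where XX is the code point in uppercase hex.
_ESCAPE = re.compile(r"%([0-9A-F]{2})")


def escape_unencodable(name: str, encoding: str) -> bytes:
    """Encode name, replacing characters the encoding cannot represent."""
    out = []
    for ch in name:
        if ch == "%":
            out.append(b"%25")  # a literal "%" must be escaped too
            continue
        try:
            out.append(ch.encode(encoding))
        except UnicodeEncodeError:
            if ord(ch) > 0xFF:
                raise ValueError("sketch only handles code points <= U+00FF")
            out.append(("%%%02X" % ord(ch)).encode("ascii"))
    return b"".join(out)


def unescape(raw: bytes, encoding: str) -> str:
    """Reverse escape_unencodable: decode, then expand every %XX."""
    text = raw.decode(encoding)
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


escaped = escape_unencodable("Sweet cr\u00eape recipe.txt", "ascii")
print(escaped)
print(unescape(escaped, "ascii"))
```

Note that a file really named "100%EA.txt" also has to be rewritten (here to "100%25EA.txt"), so every tool reading the directory would need to know the scheme.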

I don't really like the idea of using a custom "encoding" scheme (UTF-7, 
base64, punycode or anything else) because it just moves the problem to 
somewhere else. For example, if another file refers to the "Sweet crêpe 
recipe.txt" file, it will fail to find it.

If your locale encoding is unable to encode all filenames: change your locale. 
If you cannot change the locale on your computer, use another computer.

> That would again be a serious change compared to what we do today.

Does the problem really exist? I'm not sure that people with ASCII locale 
encoding manipulate repositories with non-ASCII filenames.

Why do you focus on the worst case, whereas Mercurial fails completely on the 
most common case? The common case is two encodings that can both encode all 
the characters you are using, but to different bytes, and so you get 
mojibake.
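
A short sketch of that common case, in Python 3 syntax: the same name is written as bytes by one user and read by another user whose locale encoding differs. The variable names are only for this example.

```python
NAME = "Sweet cr\u00eape recipe.txt"  # \u00ea is ê

utf8_bytes = NAME.encode("utf-8")      # name created on a UTF-8 system
latin1_bytes = NAME.encode("latin1")   # name created on a Latin-1 system

# UTF-8 bytes read with a Latin-1 locale: every byte decodes, but ê turns
# into two unrelated characters (mojibake).
seen_on_latin1 = utf8_bytes.decode("latin1")
print(seen_on_latin1)

# Latin-1 bytes read with a UTF-8 locale: the byte 0xEA is not a valid
# UTF-8 sequence here, so strict decoding fails.
try:
    latin1_bytes.decode("utf-8")
    latin1_readable_as_utf8 = True
except UnicodeDecodeError:
    latin1_readable_as_utf8 = False
print(latin1_readable_as_utf8)
```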

Mojibake is already a big problem, because if a file refers to the "Sweet 
crêpe recipe.txt" file, it also fails to find the file.

Mercurial does have a problem today, and I don't see how moving to Unicode 
would make the situation worse.

> I would really like to see Mercurial do transcoding of filenames.

What are you calling "transcoding"? If the filename is stored as UTF-8, I 
consider that the filename type is Unicode. So you never *transcode* filenames. 
You *decode* filenames when you add a new file, you *encode* filenames when you 
do a checkout.
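
Here is how I picture the two operations, as a Python 3 sketch. The function names name_on_add and name_on_checkout and the constant STORAGE_ENCODING are invented for this message and are not Mercurial APIs; the only assumption taken from above is that the repository stores the name as UTF-8.

```python
STORAGE_ENCODING = "utf-8"  # assumption: names are stored as UTF-8


def name_on_add(raw_name: bytes, locale_encoding: str) -> bytes:
    """Decode the on-disk name to Unicode, then store it."""
    return raw_name.decode(locale_encoding).encode(STORAGE_ENCODING)


def name_on_checkout(stored_name: bytes, locale_encoding: str) -> bytes:
    """Load the stored Unicode name, then encode it for this system."""
    return stored_name.decode(STORAGE_ENCODING).encode(locale_encoding)


# A Latin-1 user adds the file, a UTF-8 user checks it out.
stored = name_on_add(b"Sweet cr\xeape recipe.txt", "latin1")
checked_out = name_on_checkout(stored, "utf-8")
print(stored, checked_out)
```

Each side only ever converts between its own locale encoding and Unicode, which is why I do not call this transcoding.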

Please see the Definitions chapter of my Unicode book to avoid confusion:
http://www.haypocalc.com/tmp/unicode-2011-07-20/html/definitions.html

> I've deployed Mercurial at Swiss customers
> and they immediately ran into problems with their umlauts.

Latin1 is able to encode latin letters with umlauts.
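
A quick check in Python 3 syntax (the sample letters are just an example):

```python
umlauts = "\u00e4\u00f6\u00fc"  # the letters ä, ö and ü

# Latin-1 has one byte for each of these letters.
latin1_bytes = umlauts.encode("latin1")
print(repr(latin1_bytes))

# ASCII has none, so strict encoding fails.
try:
    umlauts.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```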

Victor

