Initial support of Unicode filenames

Martin Geisler mg at lazybytes.net
Thu Nov 3 16:54:07 CDT 2011


Victor Stinner <victor.stinner at haypocalc.com> writes:

> On Thursday 3 November 2011 13:19:04 you wrote:
>> Also, I'll see "Sweet crêpe recipe.txt" on my Latin-1 system.
>
> I'm not sure that you understood.

We understand.

> If your locale encoding is unable to encode all filenames: change your
> locale. If you cannot change the locale on your computer, use another
> computer.
>
>> That would again be a serious change compared to what we do today.
>
> Does the problem really exist? I'm not sure that people with ASCII
> locale encoding manipulate repositories with non-ASCII filenames.

I'm also not sure of that, but it's a fact that I can check out a Russian
or Chinese project today and commit back to it. I won't be able to do
that if Mercurial aborts because of my puny Latin-1 locale.
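
A rough Python sketch of the failure I have in mind (this is not Mercurial
code; the filename and the to_locale helper are made up for illustration).
Encoding a Cyrillic name for a Latin-1 locale has to fail, whereas
byte-oriented handling never decodes and so has nothing to fail on:

```python
# Hypothetical filename from a Russian project, as written by a UTF-8 system.
name = "рецепт.txt"
raw = name.encode("utf-8")


def to_locale(filename: str, encoding: str) -> bytes:
    """Encode a Unicode filename for the local filesystem (strict mode)."""
    return filename.encode(encoding)


# A Unicode-aware checkout on a Latin-1 machine would have to do this:
try:
    to_locale(name, "latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False  # Cyrillic letters have no Latin-1 representation

# Byte-oriented handling just copies the bytes through unchanged.
passthrough = bytes(raw)
```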

> Why do you focus on the worst case, whereas Mercurial fails completely
> on the most common case? The common case is to have two encodings able
> to encode all characters that you are using, but using different
> bytes, and so you get mojibake.
>
> The mojibake is already a big problem, because if a file refers to the
> "Sweet crêpe recipe.txt" file, it also fails to find the file.

That depends on how the tool searches for the file: it just so happens
that make is also encoding-agnostic and so it works together with the
encoding-agnostic Mercurial.
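
To make that concrete, here is a small Python sketch (not Mercurial or make
code; the variable names are made up). A name written as UTF-8 looks wrong
when displayed as Latin-1, yet a tool that compares raw bytes still finds
the file:

```python
name = "Sweet crêpe recipe.txt"

# Bytes on disk, as written by a UTF-8 system.
stored = name.encode("utf-8")

# What a Latin-1 terminal displays for those same bytes: mojibake.
shown = stored.decode("latin-1")
print(shown)  # Sweet crÃªpe recipe.txt

# A reference inside another file, written on the same UTF-8 system.
referenced = name.encode("utf-8")

# A byte-for-byte comparison never decodes, so the lookup still matches.
found = referenced == stored
```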

> Mercurial does have a problem today, and I don't see how moving to
> Unicode would make the situation worse.
>
>> I would really like to see Mercurial do transcoding of filenames.
>
> What are you calling "transcoding"?

I call it transcoding if a filename moves from Latin-1 (my machine) to
UTF-8 (in Mercurial's manifest) to cp1251 (on some Windows machine).
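
As a sketch of what I mean, assuming a UTF-8 manifest as discussed in this
thread (the transcode helper and the filenames are hypothetical, not
anything Mercurial provides):

```python
def transcode(raw: bytes, src: str, dst: str) -> bytes:
    """Decode bytes from one encoding and re-encode them in another."""
    return raw.decode(src).encode(dst)


# "§" exists in both Latin-1 and cp1251, so this name survives the trip.
on_latin1 = "§1 notes.txt".encode("latin-1")
in_manifest = transcode(on_latin1, "latin-1", "utf-8")
on_windows = transcode(in_manifest, "utf-8", "cp1251")

# "ê" has no cp1251 representation, so the last hop fails for this name.
try:
    transcode("crêpe.txt".encode("utf-8"), "utf-8", "cp1251")
    crepe_fits = True
except UnicodeEncodeError:
    crepe_fits = False
```

The second case is the one that worries me: the last hop can fail even
though the first two were fine.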

> If the filename is stored as UTF-8, I consider that the filename type
> is Unicode. So you never *transcode* filenames. You *decode* filenames
> when you add a new file, you *encode* filenames when you do a
> checkout.
>
> Please see the Definitions chapter of my Unicode book to avoid
> confusion:
> http://www.haypocalc.com/tmp/unicode-2011-07-20/html/definitions.html
>
>> I've deployed Mercurial at Swiss customers and they immediately ran
>> into problems with their umlauts.
>
> Latin-1 is able to encode Latin letters with umlauts.

Yes, I know... I'm Danish and live in Switzerland so I've seen my share
of non-ASCII filenames :-) Their problem was that they use both Mac OS X
and Windows and so the UTF-8 filenames (decomposed, even!) from Mac OS X
don't look right when you see them in Windows.
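
For anyone who hasn't run into the decomposed form, a short Python sketch
(the filename is made up, and I'm using plain NFD here as an approximation
of what Mac OS X stores):

```python
import unicodedata

name = "M\u00fcller.txt"  # hypothetical filename containing "ü"

# Precomposed form: "ü" is the single code point U+00FC.
composed = unicodedata.normalize("NFC", name)

# Decomposed form: "u" followed by the combining diaeresis U+0308.
decomposed = unicodedata.normalize("NFD", name)

composed_bytes = composed.encode("utf-8")      # b'M\xc3\xbcller.txt'
decomposed_bytes = decomposed.encode("utf-8")  # b'Mu\xcc\x88ller.txt'

# Both are valid UTF-8 for the same visible name, but the bytes differ,
# so a byte-oriented tool treats them as two different filenames.
same_bytes = composed_bytes == decomposed_bytes
```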

Nothing new about this; it's a well-understood problem and I only
mentioned it to say that I agree with you that the current situation is
far from optimal.

-- 
Martin Geisler

Mercurial links: http://mercurial.ch/