Unicode support for non-unicode locales

Matt Mackall mpm at selenic.com
Mon Oct 8 12:49:28 CDT 2007


On Mon, Oct 08, 2007 at 09:43:20PM +0600, Densetsu no Ero-sennin wrote:
> On 8 October 2007 (Mon), Matt Mackall wrote:
> > Does it make the corresponding changes to your project's Makefile,
> > etc., as well? What happens if someone does a checkout in an
> > ASCII/latin-1 locale?
> >
> > Filenames, just like their contents, are the users' data. Our mandate
> > is to preserve that data exactly.
> 
> You are perfectly right here. The question is what exactly has to be 
> preserved. I mean, filenames are not just byte sequences, but rather 
> sequences of letters, digits and other symbols, which are encoded as 
> different byte sequences, depending on the locale. By not taking this into 
> account, Mercurial CORRUPTS the data and produces incorrect filenames in
> some cases.

Again, what happens if someone does a checkout in an ASCII/latin-1
locale? That's most of the computing world. The answer is: your
Russian characters are not just mangled, they're completely LOST. In
fact, you probably won't be able to check out your project at all
because filename "??????" will collide with filename "??????".
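
To make the collision concrete, here is a minimal Python sketch. It
assumes the transcoding step replaces every character the target locale
cannot represent with "?"; the helper to_ascii_locale and the two
filenames are hypothetical illustrations, not code from the patch under
discussion.

```python
# Two distinct Cyrillic filenames (transliterated: "fail.txt", "plan.txt").
names = ["\u0444\u0430\u0439\u043b.txt", "\u043f\u043b\u0430\u043d.txt"]


def to_ascii_locale(name):
    """Hypothetical lossy transcoding: unencodable characters become '?'."""
    return name.encode("ascii", errors="replace").decode("ascii")


# What a checkout would try to create in an ASCII-only locale.
checked_out = [to_ascii_locale(n) for n in names]

# Distinct names in the repository map to one name on disk.
collides = len(set(checked_out)) < len(names)
print(checked_out, collides)
```

Both names come out identical, so the second file cannot be written
without clobbering the first, and the original letters cannot be
recovered from the result.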

This fix might work fine for special cases like going from one Russian
or Japanese encoding to another, but in general, it makes a bad
problem worse. It's much better overall for data to be "corrupted" by
"passing it through untouched".

> As for Makefiles, it is make's fault, not Mercurial's. Make is simply not 
> designed to handle Unicode and therefore is subject to fail on non-ASCII 
> filenames.

The vast majority of toolchain programs that embed filenames in other
files will break. Make is simply the most obvious example. Similarly,
the vast majority of projects that people are managing in Mercurial
aren't prepared to do all their filename handling in Unicode
either. Again, trying to be clever here takes a bad problem and makes
it worse. Breaking makefiles is a complete non-starter.
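
A rough Python sketch of the Makefile case, under stated assumptions:
the file was committed from a KOI8-R locale, the Makefile refers to it
by the same bytes, and a hypothetical transcoding checkout re-encodes
only the filename to UTF-8. The names and the Makefile rule are made up
for illustration.

```python
# Filename as typed by the user (transliterated: "fail.c").
name = "\u0444\u0430\u0439\u043b.c"

# Bytes as committed from a KOI8-R locale, and a Makefile (file content,
# which is never transcoded) that refers to the file by those same bytes.
stored = name.encode("koi8_r")
makefile = b"prog: " + stored + b"\n\tcc -o prog " + stored + b"\n"

# Pass-through checkout: the name on disk is exactly the stored bytes.
untouched = stored

# Hypothetical transcoding checkout into a UTF-8 locale.
transcoded = stored.decode("koi8_r").encode("utf-8")

# Does the Makefile still mention the file that is actually on disk?
ok_untouched = untouched in makefile
ok_transcoded = transcoded in makefile
print(ok_untouched, ok_transcoded)
```

With the name left alone, the Makefile and the directory agree. With
the name transcoded, the Makefile points at a file that no longer
exists under that byte sequence.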

On the other hand, most of these programs work perfectly if you leave
the filenames alone - they're completely indifferent to what glyph a
particular byte sequence represents. And of course, most of them work just
fine in UTF-8.
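
As a small illustration of that indifference, the Python sketch below
runs ordinary byte-level path operations on the same name in three
encodings. The path is a made-up example; the point is only that
splitting on "/" and "." never requires knowing which letters the
high bytes stand for.

```python
import posixpath

# One hypothetical path, as it would be stored under three encodings.
path = "src/\u0444\u0430\u0439\u043b.txt"

results = {}
for codec in ("koi8_r", "cp1251", "utf-8"):
    raw = path.encode(codec)
    # Byte-level operations: no decoding, no knowledge of the locale.
    base = posixpath.basename(raw)
    stem, ext = posixpath.splitext(base)
    results[codec] = (base, ext)

# The directory part and extension are found correctly every time.
all_ok = all(
    base == path.encode(codec)[len(b"src/"):] and ext == b".txt"
    for codec, (base, ext) in results.items()
)
print(all_ok)
```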

-- 
Mathematics is the supreme nostalgia of our time.

