Unicode support for non-unicode locales

Matt Mackall mpm at selenic.com
Tue Oct 9 11:48:21 CDT 2007


On Tue, Oct 09, 2007 at 12:07:50PM +0600, Densetsu no Ero-sennin wrote:
> Moreover, most modern distributions offer UTF-8 by default. And most modern 
> file archivers, including GNU tar in POSIX mode, whose duty is to preserve 
> user's data exactly, are creating files in local encoding when unpacking 
> archives.

Oh really?

utf-8$ touch <japan>
utf-8$ tar --posix -c -f foo.tar <japan>
utf-8$ zip foo.zip <japan>

ascii$ tar --posix -x -v -f ../foo.tar 
\346\227\245\346\234\254\345\233\275
ascii$ ls
?????????
ascii$ rm *
ascii$ unzip ../foo.zip
Archive:  ../foo.zip
 extracting: <garbage>
ascii$ ls
?????????

And frankly, I think preserving the bytes as-is is the only sensible thing to do, because if I do:

utf-8:
$ hg init
$ touch <japanese> <korean> <russian> <french> english
$ echo "cat <japanese> <korean> <russian> <french> english | md5sum" > check
$ chmod +x check
$ hg ci -Am "test"

ASCII, latin-1, koi8, or basically any other encoding:
$ hg pull -u
$ ./check
d41d8cd98f00b204e9800998ecf8427e -

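...it works: the script finds every file, because the names inside it
and the names on disk are the same bytes.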
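
For anyone who wants to try this without a repository, here is a rough
sketch rather than the exact commands used above. It assumes a POSIX
shell and a filesystem that stores names as raw bytes (typical on
Linux). The octal escapes are the UTF-8 bytes from the tar listing
earlier; the names workdir, name and result are only illustrative, and
cksum stands in for md5sum because it is specified by POSIX. Running
the script under LC_ALL=C stands in for the pull on an ASCII machine.

```shell
#!/bin/sh
# Sketch: a script that names a file by its raw bytes keeps working
# in any locale, as long as nobody transcodes the name.
set -e
workdir=$(mktemp -d)
cd "$workdir"

# UTF-8 bytes of the name shown in the tar listing (octal escapes).
name=$(printf '\346\227\245\346\234\254\345\233\275')

# Create two empty files and a check script that refers to them by name.
: > "$name"
: > english
printf 'cat "%s" english | cksum\n' "$name" > check
chmod +x check

# Run the check under the C (ASCII) locale, as on the second machine.
result=$(LC_ALL=C sh ./check)
echo "$result"
```

The checksum itself is uninteresting since the files are empty; what
matters is that cat finds the file under a name the C locale cannot
even display.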

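If we start trying to transcode filenames, we will have to transcode
file contents as well, since any file that refers to another by name
(like the check script) would otherwise stop matching what is on disk,
and that problem is insoluble.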
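
To make the mismatch concrete, here is a small sketch of the
hypothetical transcoding case. Nothing in it is Mercurial behaviour:
the file name, the choice of Latin-1, and the variable names are all
assumptions for illustration. It only compares byte strings, so it
does not touch the filesystem.

```shell
#!/bin/sh
# Sketch: what a transcoding checkout would do to a script that
# mentions a file by name. Strings are treated as plain bytes.
LC_ALL=C; export LC_ALL

# The same name in two encodings: "e acute" is two bytes in UTF-8
# and a single byte in Latin-1.
utf8_name=$(printf 'caf\303\251')
latin1_name=$(printf 'caf\351')

# The script content was committed on the UTF-8 machine.
script="cat $utf8_name"

# A transcoding checkout on a Latin-1 machine would create latin1_name
# on disk, but the script still contains the UTF-8 bytes.
case $script in
  *"$latin1_name"*) outcome="script still finds the file" ;;
  *)                outcome="script refers to a name that is not on disk" ;;
esac
echo "$outcome"
```

Fixing this would mean rewriting the script's contents too, and there
is no general way to know which bytes in a file are names.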

-- 
Mathematics is the supreme nostalgia of our time.


More information about the Mercurial-devel mailing list