Unicode support for non-unicode locales

Densetsu no Ero-sennin densetsu.no.ero.sennin at gmail.com
Tue Oct 9 12:45:08 CDT 2007


On 9 October 2007 (Tue), Matt Mackall wrote:
> utf-8$ touch <japan>
> utf-8$ tar --posix -c -f foo.tar <japan>
> utf-8$ zip foo.zip <japan>
>
> ascii$ tar --posix -x -v -f ../foo.tar
> \346\227\245\346\234\254\345\233\275
> ascii$ ls
> ?????????
> ascii$ rm *
> ascii$ unzip ../foo.zip
> Archive:  ../foo.zip
>  extracting: <garbage>
> ascii$ ls
> ?????????

Things are not that simple. Here's another example (I assume you can see
Cyrillic).

$ echo $LANG
en_US.UTF-8
$ tar --version | head -n 1
tar (GNU tar) 1.18
$ touch проверка
$ tar --posix -c -f foo.tar проверка
$ tar -t -f foo.tar
проверка
$ rm проверка
$ LC_ALL=ru_RU.KOI8-R tar -t -f foo.tar | iconv -f KOI8-R
проверка
$ LC_ALL=ru_RU.KOI8-R tar -x -f foo.tar
$ ls # must output some garbage
п©я─п╬п╡п╣я─п╨п╟
$ ls | iconv -f KOI8-R
проверка
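
For reference, the string on the `ls` line above is, character for
character, what the UTF-8 bytes of the name look like when they are read as
KOI8-R. A minimal Python sketch of that reinterpretation (it is not part of
the session above, and the variable names are only illustrative):

```python
# The file name used in the session above.
name = "проверка"

# Its UTF-8 encoding: two bytes per Cyrillic letter, 16 bytes in total.
utf8_bytes = name.encode("utf-8")

# Reading those same bytes as KOI8-R gives one character per byte.
garbled = utf8_bytes.decode("koi8-r")
print(garbled)  # п©я─п╬п╡п╣я─п╨п╟

# The damage is reversible, because every byte value is defined in KOI8-R.
restored = garbled.encode("koi8-r").decode("utf-8")
print(restored)  # проверка
```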

Actually, when unpacking an archive in POSIX.1-2001 format, tar produces
correctly encoded filenames if the locale's encoding can represent them.
If it cannot, tar writes the filenames in UTF-8 regardless of the locale,
which shows up as lots of garbage characters. Personally, I'd prefer it to
fail with lots of error messages instead.
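
To make the two behaviours concrete, here is a rough Python model of the
choice described above. It is only a sketch of the behaviour as I describe
it, not tar's actual code; the function name `name_bytes_for_locale` and the
`strict` switch are hypothetical.

```python
def name_bytes_for_locale(name: str, locale_encoding: str,
                          strict: bool = False) -> bytes:
    """Return the bytes a file name would get on disk (hypothetical model).

    The name is encoded in the locale's encoding when that is possible.
    Otherwise it falls back to UTF-8 (the behaviour described above), or
    raises an error when strict=True (the behaviour I would prefer).
    """
    try:
        return name.encode(locale_encoding)
    except UnicodeEncodeError:
        if strict:
            raise
        return name.encode("utf-8")


# Cyrillic fits into KOI8-R, so the name is converted: one byte per letter.
print(name_bytes_for_locale("проверка", "koi8-r"))

# The Japanese name from the quoted example does not fit, so the UTF-8
# bytes are kept as they are (the same bytes tar printed in octal there).
print(name_bytes_for_locale("日本国", "koi8-r"))

# With strict=True the same name is rejected instead.
try:
    name_bytes_for_locale("日本国", "koi8-r", strict=True)
except UnicodeEncodeError as err:
    print("refused:", err.reason)
```

The fallback branch is what leaves names on disk that disagree with the
locale; the strict branch trades that for an error on every such name.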



More information about the Mercurial-devel mailing list