Consequences for use of hg for other applications than SCM (was Re: German umlauts in file names)

Matt Mackall mpm at selenic.com
Fri Jun 20 12:43:23 CDT 2008


On Fri, 2008-06-20 at 17:43 +0200, Marko Kaening wrote:
> Hi (Matt),
> 
> In case of umlaut-containing file names Mercurial or TortoiseHg does NOT 
> set a file name like SVN or TortoiseSVN would do, in case the file 
> originates from systems using different charsets. The file name is not 
> adapted to the current charset, as SVN would do it.
> 
> Having in mind what Matt wrote earlier in this thread it looks as if this
> behaviour is actually wanted behaviour and not a bug:
>                       ==============================
> <cite Matt>
> >
> > Mercurial by design does absolutely no encoding on filenames, as
> > filenames very often have to byte-for-byte agree with their
> > representation in other files such as makefiles, etc.
> >
> <cite/>
> 
> BUT, I believe that it is not what the user really wants in some cases.

Users want a lot of things they don't fully understand the implications
of.

> echo "umlauts added in utf-8 on linux box: öäü" > file-öäü.txt

This is a perfect example of one of the pitfalls of encoding. There are
no umlauts in the above in any standard encoding:

00001460: 2275 6d6c 6175 7473 2061 6464 6564 2069  "umlauts added i
00001470: 6e20 7574 662d 3820 6f6e 206c 696e 7578  n utf-8 on linux
00001480: 2062 6f78 3a20 c383 c2b6 c383 c2a4 c383   box: ..........
00001490: c2bc 2220 3e20 6669 6c65 2dc3 83c2 b6c3  .." > file-.....
000014a0: 83c2 a4c3 83c2 bc2e 7478 740a 0a73 766e  ........txt..svn

Because your editor and your mail client tried to be smart about
encoding and/or were misconfigured, the original -bytes- of your message
are now lost.
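
For anyone who wants to see how bytes like the ones in the dump can
come about, here is a minimal Python sketch. It assumes one particular
chain of events (UTF-8 text misread as Latin-1, then saved as UTF-8
again); that chain is a guess that happens to match the dump, not
something the message itself proves. cp1252 would give the same result
for these particular bytes.

```python
# Sketch: reproduce the doubled bytes seen in the hex dump.
# Assumption: the text was UTF-8, got misread as Latin-1 and was
# then written out as UTF-8 a second time.

original = "\u00f6\u00e4\u00fc"  # o-umlaut, a-umlaut, u-umlaut

# What a UTF-8 Linux box would have stored: two bytes per umlaut.
utf8_bytes = original.encode("utf-8")

# A tool that wrongly assumes Latin-1 sees six separate characters.
misread = utf8_bytes.decode("latin-1")

# Saving those six characters as UTF-8 doubles the byte count.
double_encoded = misread.encode("utf-8")

print(utf8_bytes.hex())       # c3b6c3a4c3bc
print(double_encoded.hex())   # c383c2b6c383c2a4c383c2bc
```

The second line of output is the c383 c2b6 c383 c2a4 c383 c2bc run
from the dump: four bytes per character where UTF-8 needs two.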

> As you can see, Mercurial or TortoiseHg does NOT set file name like 
> TortoiseSVN would do. The file name is not adapted to the current charset, 
> as SVN would do it.
> 
> That's what I mean here. I THINK THAT'S INCONSISTENT, BECAUSE NOT 
> PORTABLE.
> 
> Up to now I haven't figured out, which parameter for --encoding or 
> HGENCODING I should use to make it work. My console seems to be set to 
> cp850, the system might be set to cp1252, if I understood right that the 
> following regkey is the one to believe:

Neither of those will have any effect: Mercurial does not encode
filenames. What comes out is the same as what goes in.

You either need to set your Windows machine to use UTF-8 or set your
Linux machine to use something roughly cp850-compatible like Latin1.
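
To illustrate "what comes out is the same as what goes in", here is a
small Python sketch of how one fixed sequence of filename bytes would
be displayed by systems that assume different encodings. The file name
and the render() helper are made up for the example; this is not
Mercurial code, just the byte-level effect.

```python
def render(name_bytes, codec):
    """Show how a system assuming `codec` would display raw name bytes."""
    return name_bytes.decode(codec, errors="replace")

# Hypothetical file name, stored as the bytes a UTF-8 Linux box writes.
stored = "file-\u00f6\u00e4\u00fc.txt".encode("utf-8")

# The bytes never change; only the interpretation does.
for codec in ("utf-8", "latin-1", "cp1252", "cp850"):
    print(f"{codec:8} {render(stored, codec)!r}")

# The same o-umlaut is a different byte sequence in each encoding.
umlaut_bytes = {
    codec: "\u00f6".encode(codec)
    for codec in ("utf-8", "latin-1", "cp1252", "cp850")
}
print(umlaut_bytes)
```

Only the utf-8 line shows the umlauts as intended. Note that Latin-1
and cp1252 agree on the byte for each umlaut, while cp850 uses
different byte values, so two machines only see the same name if they
agree on a single encoding.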

-- 
Mathematics is the supreme nostalgia of our time.


