Consequences for use of hg for other applications than SCM (was Re: German umlauts in file names)

Hans Meine meine at informatik.uni-hamburg.de
Mon Jun 23 08:15:47 CDT 2008


On Monday, 23 June 2008 14:38:15, Alexander Belchenko wrote:
> Windows filesystem uses unicode for saving filenames.
> So if svn properly using unicode Win32 API there is absolutely no problems.

As Matt wrote, hg does *not* use the unicode API (which is also available in 
Python, see the link I posted above), but uses only 8-bit functions.  This 
way, unicode filenames cannot be preserved.  IMO this qualifies as a bug - 
OK, call it a documented, clean, but for certain users unexpected (and 
undesired) behavior which cannot be changed.

However, I think this should not be hard to fix for people like Marko.  Since 
backwards compatibility is definitely important, and people like Matt would 
probably be opposed to an unconditional switch to unicode (which would lead 
to problems for other people, as expressed in this thread), I think the 
easiest change would be to have a "unicode filename" switch that would treat 
all filenames in the repo as being UTF-8 encoded.  This way, the repo format 
stays the same (i.e. only 8-bit filenames are used), but whenever a filename 
is "applied to" or "fetched from" the local filesystem, it would need to be 
reencoded if sys.getfilesystemencoding() != "utf-8".
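
To make the idea concrete, here is a minimal sketch of the re-encoding step in Python. It is not Mercurial code: the names REPO_ENCODING, repo_to_local and local_to_repo are made up for illustration, and it assumes the switch is on, i.e. every filename in the repo is UTF-8 encoded.

```python
import codecs
import sys

# Assumption: with the proposed switch on, names stored in the repo are UTF-8.
REPO_ENCODING = "utf-8"


def _same_codec(a, b):
    # Compare normalized codec names so "UTF8" and "utf-8" count as equal.
    return codecs.lookup(a).name == codecs.lookup(b).name


def repo_to_local(repo_name, fs_encoding=None):
    """Re-encode a repo filename (UTF-8 bytes) for the local filesystem."""
    fs_encoding = fs_encoding or sys.getfilesystemencoding()
    if _same_codec(fs_encoding, REPO_ENCODING):
        return repo_name  # encodings match, nothing to convert
    return repo_name.decode(REPO_ENCODING).encode(fs_encoding)


def local_to_repo(local_name, fs_encoding=None):
    """Re-encode a local filename (filesystem-encoded bytes) to UTF-8."""
    fs_encoding = fs_encoding or sys.getfilesystemencoding()
    if _same_codec(fs_encoding, REPO_ENCODING):
        return local_name
    return local_name.decode(fs_encoding).encode(REPO_ENCODING)


if __name__ == "__main__":
    # Hypothetical example: a Latin-1 filesystem receiving a UTF-8 repo name.
    stored = "Grüße.txt".encode("utf-8")
    print(repo_to_local(stored, "latin-1"))
```

The repo format is untouched in this sketch; only the names handed to or read from the filesystem are converted, and the conversion is skipped entirely when the filesystem encoding is already UTF-8.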

IMHO the case that the local filesystem encoding is incompatible with a 
filename from the repo could simply throw an error - that would only occur if 
someone explicitly told Mercurial to convert filenames when this is 
impossible.  (The user could then clone the repo without converting the 
filenames, assuming the other tools in question can deal with UTF-8 
filenames, or change the filesystem / update the OS in case they can't.)
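
The error case could look roughly like the following sketch. Again, this is only an illustration under the assumption that repo names are UTF-8; FilenameEncodingError and to_local_name are hypothetical names, not anything that exists in Mercurial.

```python
class FilenameEncodingError(Exception):
    """Hypothetical error: a repo filename cannot be represented locally."""


def to_local_name(repo_name, fs_encoding):
    """Convert a UTF-8 repo filename to fs_encoding, or raise an error."""
    try:
        text = repo_name.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise FilenameEncodingError(
            "repo filename %r is not valid UTF-8" % (repo_name,)
        ) from exc
    try:
        return text.encode(fs_encoding)
    except UnicodeEncodeError as exc:
        # The local encoding has no representation for some character.
        raise FilenameEncodingError(
            "filename %r cannot be encoded as %s" % (text, fs_encoding)
        ) from exc


if __name__ == "__main__":
    # A Greek filename has no Latin-1 representation, so conversion fails.
    try:
        to_local_name("Ελληνικά.txt".encode("utf-8"), "latin-1")
    except FilenameEncodingError as err:
        print("abort:", err)
```

Failing loudly here seems preferable to silently substituting characters, since a mangled name could no longer be mapped back to the name stored in the repo.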

-- 
Ciao, /  /
     /--/
    /  / ANS

