Re: Consequences for use of hg for other applications than SCM was Re: German umlauts in file names

Marko Käning mk-lists at email.de
Sat Jun 21 06:24:16 CDT 2008


Hi Matt,

> One system (SVN) does name transcoding. It translates all filenames from
OK, that's what I thought myself in the first place.

> It also ignores issues like
> multiple encodings on a single filesystem, and the fact that for any
> pair of single byte character sets, there are characters that can't be
> transcoded.
Well, svn seems to cope quite well on that one.
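
To make the "can't be transcoded" point concrete, here is a minimal Python sketch. It is only an illustration: the helper name can_transcode and the choice of cp1252, latin-1 and KOI8-R as example charsets are my own assumptions and are not taken from SVN or Mercurial.

```python
# Illustration only: helper name and charsets are hypothetical examples.
def can_transcode(name_bytes, src, dst):
    """Return True if a file name stored as bytes in charset `src`
    can be represented in charset `dst` without loss."""
    try:
        name_bytes.decode(src).encode(dst)
        return True
    except UnicodeError:
        # Raised when a byte is invalid in `src` or a character is missing in `dst`.
        return False


umlaut_name = "\u00fc.h".encode("cp1252")  # the file name "ü.h" as cp1252 bytes

# Both Western European charsets contain "ü", so this pair works.
print(can_transcode(umlaut_name, "cp1252", "latin-1"))

# KOI8-R (Cyrillic) has no "ü", so the name cannot be carried over.
print(can_transcode(umlaut_name, "cp1252", "koi8_r"))
```

This matches my situation: as long as everybody in the workgroup uses charsets that contain the umlauts, transcoding never hits the failing case.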

> a) use only ASCII
> b) force everyone to use a specific single-byte character set for file names and contents
> c) use UTF-8 everywhere
Well, that's quite limiting for someone who doesn't care about tools, only about readable paths and file names in languages like German.

> The other system (Mercurial), stores precisely the bytes that the
> operating system reports and ignores encoding issues. This ensures that
> dumb tools are never confused, even if two users can't agree on
> encoding.
I got that. And I find it very valuable when I use a system like this purely for SCM, which I do as well.

> In other words, if we're doing things sanely, transcoding isn't even an
> issue!
OK. I see.

> When we're not doing things sanely and mixing encodings, we have to
> choose the lesser of two evils. In your case, because you had UTF-8 on
> one end, things -mostly- worked out with SVN. 
I have worked with svn for 4 years now and NEVER had a problem with the transcoding of my German file names!
(Different versions of SVN and TortoiseSVN were in use throughout that time.)

> If you had a ü.h file, it
> would have likely confused your compiler though. You'd have a file named
> "\xfc" on disk but referred to as "\xc3\xbc" everywhere. And someone
> without ü in their charset (Russia, Japan, etc.) using your repo would
> have other issues.
I don't care about users with other charsets, since the repo is solely for a small workgroup using one and the same charset.
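
The "\xfc" versus "\xc3\xbc" mismatch can be reproduced in a few lines of Python. This is a sketch under my own assumptions: the variable names and the naive byte comparison stand in for a "dumb tool" such as a compiler resolving an include, and do not describe any real tool's lookup code.

```python
# Illustration only: a byte-wise lookup as a stand-in for a "dumb tool".
name = "\u00fc.h"  # the file name "ü.h"

on_disk = name.encode("cp1252")     # what a cp1252 machine writes to disk
referenced = name.encode("utf-8")   # how UTF-8 file contents refer to it

print(on_disk)      # b'\xfc.h'
print(referenced)   # b'\xc3\xbc.h'

# A tool that compares raw bytes sees two different names.
directory_listing = [on_disk]
found = referenced in directory_listing
print(found)        # False
```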

> With mercurial, ü.h on the UTF-8 machine would have become Ã¼.h on a
> cp1252 machine. 
That's true.
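
For anyone following along, the effect is easy to reproduce. The snippet below is just a sketch of what happens when bytes are stored untouched and then displayed in another charset; the variable names are mine and nothing here is Mercurial code.

```python
# Illustration only: raw UTF-8 bytes shown on a cp1252 machine.
original = "\u00fc.h"                    # "ü.h" as typed on the UTF-8 machine
stored_bytes = original.encode("utf-8")  # bytes kept exactly as the OS reported them

# The cp1252 machine interprets each byte as one character.
shown_on_cp1252 = stored_bytes.decode("cp1252")
print(shown_on_cp1252)

# Nothing is lost: reversing the misreading gives the original name back.
recovered = shown_on_cp1252.encode("cp1252").decode("utf-8")
print(recovered == original)
```

So the bytes survive intact, but the name is unreadable for the user on the other machine, which is precisely my problem.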

> Ok, so which of these two cases should we prefer? My preference is to
> choose the strategy that's least likely to break tools, because tools
> are generally a lot stupider than people.
You are right. But in my case transcoding is exactly what I want, because I want readable path and file names for the users.
No tools are involved in processing the files we work on.

> Hopefully someone else can answer that.
;) Hope that my assumptions about the regkey were correct.

> > So, then the next question. How do I teach mercurial to use UTF-8 under Windows?
> First you have to set your system to operate in UTF-8 mode. I believe
> that's known as CP65001.
Thanks for the info! I'll give that a try. Wonder whether Windows can cope with that...
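
Before and after changing the code page, it may help to check which encodings a Python process actually sees. The following is a minimal sketch using modern Python 3 standard library calls; the function name report_encodings is hypothetical, and what these values mean for Mercurial's own file name handling is an assumption I have not verified.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings this Python process reports (illustration only)."""
    return {
        # Encoding Python uses to convert file names to and from the OS.
        "filesystem": sys.getfilesystemencoding(),
        # Encoding the locale settings suggest for text data.
        "preferred": locale.getpreferredencoding(False),
        # Encoding of the console or pipe that output is written to.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


for key, value in report_encodings().items():
    print(f"{key}: {value}")
```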

Sorry Matt if I seem to be a pain, but I do think that mercurial is a really cool tool!!!

So I am stuck here:

1) I would really like to use mercurial for my daily work, which is NOT only SCM.

2) I have finally more or less accepted that you have the stronger arguments for not going down the road of transcoding path and file names.

3) I would be able to use all my umlauts on Windows, as long as I don't access the repo from a machine with another encoding (such as my Linux box).

4) I can forget about converting my old SVN repo into mercurial format.

5) I should evaluate svk.


My first impression concerning point 5 is that this alternative would require much more disk space than keeping a mercurial clone of the main repo. And I expect it to be even slower than SVN on its own. Well, we'll see.

OK, THANKS FOR YOUR PATIENCE IN DISCUSSING THIS ISSUE WITH ME IN DEPTH!

Regards,
Marko


