Re: Consequences for use of hg for other applications than SCM was Re: German umlauts in file names

Sat Jun 21 06:24:16 CDT 2008

Hi Matt,

> One system (SVN) does name transcoding. It translates all filenames from
OK, that's what I thought myself in the first place.

> It also ignores issues like
> multiple encodings on a single filesystem, and the fact that for any
> pair of single byte character sets, there are characters that can't be
> transcoded.
Well, svn seems to cope quite well on that one.
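
To be fair to your point about pairs of single-byte character sets, here is a minimal Python sketch of what I understand you to mean. The choice of cp1252 and KOI8-R and the helper name transcode are only my assumptions for illustration, not anything svn or mercurial actually does:

```python
# Illustration only: cp1252 (Western European) and koi8_r (Russian) are
# assumed example code pages; transcode() is a hypothetical helper.
def transcode(name_bytes, src, dst):
    """Decode a file name from code page src and re-encode it in dst."""
    return name_bytes.decode(src).encode(dst)

german = "ü.h".encode("cp1252")  # the bytes a cp1252 system would report

try:
    transcode(german, "cp1252", "koi8_r")
    lossless = True
except UnicodeEncodeError:
    # KOI8-R has no code point for "ü", so the name cannot be transcoded.
    lossless = False
```

So a transcoding system has to decide what to do with such names. In my workgroup this never comes up because we all use the same charset.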

> a) use only ASCII
> b) force everyone to use a specific single-byte character set for file names and contents
> c) use UTF-8 everywhere
Well, that is quite limiting for someone who cares less about tools than about readable path and file names in languages like German.

> The other system (Mercurial), stores precisely the bytes that the
> operating system reports and ignores encoding issues. This ensures that
> dumb tools are never confused, even if two users can't agree on
> encoding.
I got that. And I find it very valuable when I use a system like this purely for SCM, which I do as well.

> In other words, if we're doing things sanely, transcoding isn't even an
> issue!
OK. I see.

> When we're not doing things sanely and mixing encodings, we have to
> choose the lesser of two evils. In your case, because you had UTF-8 on
> one end, things -mostly- worked out with SVN. 
I have now worked with svn for 4 years and NEVER had a problem with the transcoding of my German file names!
(Different versions of SVN and TortoiseSVN being in use throughout this time.)

> If you had a ü.h file, it
> would have likely confused your compiler though. You'd have a file named
> "\xfc" on disk but referred to as "\xc3\xbc" everywhere. And someone
> without ü in their charset (Russia, Japan, etc.) using your repo would
> have other issues.
I don't care about users with other charsets, since the repo is solely for a small workgroup using one and the same charset.
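
For my own understanding, the mismatch you describe can be sketched in Python like this. I am assuming a cp1252 working copy and UTF-8 everywhere else, and the variable names are made up for the example:

```python
# Sketch under assumptions: the working copy lives on a cp1252 system,
# while the repository and file contents refer to the name in UTF-8.
name = "ü.h"

on_disk = name.encode("cp1252")   # what the cp1252 file system stores
referenced = name.encode("utf-8")  # what a UTF-8 source file would contain

# A tool that compares raw bytes sees two different names.
bytes_match = on_disk == referenced

# Only a tool that knows both encodings can tell they are the same text.
text_match = on_disk.decode("cp1252") == referenced.decode("utf-8")
```

Here bytes_match is False while text_match is True, which is exactly the situation where a compiler looking for the bytes from the source file would not find the file on disk.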

> With mercurial, ü.h on the UTF-8 machine would have become Ã¼.h on a
> cp1252 machine. 
That's true.
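
Just to convince myself, a small Python sketch of the "store the bytes as they are" behaviour. This is only my illustration of the effect, not mercurial's actual code:

```python
# Illustration only: the bytes a UTF-8 machine reports for the file name
# are kept unchanged and later interpreted by a cp1252 machine.
stored = "ü.h".encode("utf-8")         # raw bytes kept in the repository

shown_on_cp1252 = stored.decode("cp1252")  # how a cp1252 system displays them

# The bytes themselves are untouched, so the UTF-8 machine still gets
# its original name back.
round_trip = shown_on_cp1252.encode("cp1252").decode("utf-8")
```

The two bytes 0xC3 0xBC read as cp1252 give "Ã" followed by "¼", so the name is ugly on Windows, but nothing is lost and round_trip is "ü.h" again.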

> Ok, so which of these two cases should we prefer? My preference is to
> choose the strategy that's least likely to break tools, because tools
> are generally a lot stupider than people.
You are right. But in my case transcoding is exactly what I want, because I want readable path and file names for the users.
No tools are involved in processing the files we work on.

> Hopefully someone else can answer that.
;) Hope that my assumptions about the regkey were correct.

> > So, then the next question. How do I teach mercurial to use UTF-8 under Windows?
> First you have to set your system to operate in UTF-8 mode. I believe
> that's known as CP65001.
Thanks for the info! I'll give that a try. Wonder whether Windows can cope with that...

Sorry Matt if I seem to be a pain, but I do think that mercurial is a really cool tool!!!

So I am stuck here:

1) I would really like to use mercurial for my daily work, which is NOT only SCM.

2) I have finally more or less accepted that you have the stronger arguments for not going down the road of transcoding path and file names.

3) I would be able to use all my umlauts on Windows, as long as I don't need to access the repo from a machine with a different encoding (such as my Linux box).

4) I can forget about converting my old SVN repo into mercurial format.

5) I should evaluate svk.

My first impression concerning point 5 is that this alternative would require much more disk space than keeping a mercurial clone of the main repo. And I expect it to be even slower than SVN on its own. Well, we'll see.

OK, THANKS FOR YOUR PATIENCE IN DISCUSSING THIS ISSUE WITH ME IN DEPTH!

Regards,
Marko