Consequences for use of hg for other applications than SCM was Re: German umlauts in file names

Marko Kaening mk362 at mch.osram.de
Fri Jun 20 10:43:28 CDT 2008


Hi (Matt),

For file names containing umlauts that originate from a system using a 
different charset, Mercurial (or TortoiseHg) does NOT adapt the name to the 
current charset the way SVN (or TortoiseSVN) does.

Bearing in mind what Matt wrote earlier in this thread, it looks as if this
behaviour is actually wanted behaviour and not a bug:
                      ==============================
<cite Matt>
>
> Mercurial by design does absolutely no encoding on filenames, as
> filenames very often have to byte-for-byte agree with their
> representation in other files such as makefiles, etc.
>
<cite/>

BUT I believe that this is not what the user really wants in some cases. I 
understand the byte-for-byte argument for makefiles to a degree, but from a 
directory-structure point of view (probably an svn p.o.v.) I think that 
readable file names are what one needs for certain applications. These need 
not be source code management; they could be, e.g., versioned media storage 
in a distributed manner.

I think that svn's behaviour is more consistent here.

If there is no way such a feature could be implemented in hg (perhaps as 
an option), I guess it is not usable for anything other than SCM, which 
would be really sad, since I had begun to like its concept, repo size and 
speed.

So there is probably only svk left as an alternative...

Regards,
Marko


------------------------------------------------------------------------------


What follows is a description of how I

1) set up an SVN repo with umlauts in file content and file name - OK
2) do a checkout directly on the server                          - OK
3) do a checkout to WXP machine using TSVN                       - OK
4) do a convert to mercurial format                              - OK
5) do a clone on the server                                      - OK
6) do a clone on the WXP machine using THg                       - NOT OK 


------------------------------------------------------------------------------
svn-fill-from-linux.bash (creates repo, inserts file)
------------------------------------------------------------------------------

#!/bin/bash

pushd .

svnadmin create /home/kaening/svntest

cd /home/kaening/temp

svn co file:///home/kaening/svntest wc

cd wc

echo "umlauts added in utf-8 on linux box: öäü" > file-öäü.txt

svn stat -v

svn add file-*

svn commit -m "commit file with umlauts 'öäü' in name and content"

svn stat -v

svn log file-*

hexdump -C file-*

popd

------------------------------------------------------------------------------

kaening at flmpc21:~> bash temp/svn-fill-from-linux.bash
~ ~
Ausgecheckt, Revision 0.
?                                       file-öäü.txt
                0        0  ?           .
A         file-öäü.txt
Hinzufügen     file-öäü.txt
Übertrage Daten .
Revision 1 übertragen.
                0        0  ?           .
                1        1 kaening      file-öäü.txt
------------------------------------------------------------------------
r1 | kaening | 2008-06-20 16:13:21 +0200 (Fr, 20 Jun 2008) | 1 line

commit file with umlauts 'öäü' in name and content
------------------------------------------------------------------------

00000000  75 6d 6c 61 75 74 73 20  61 64 64 65 64 20 69 6e  |umlauts added in|
00000010  20 75 74 66 2d 38 20 6f  6e 20 6c 69 6e 75 78 20  | utf-8 on linux |
00000020  62 6f 78 3a 20 c3 b6 c3  a4 c3 bc 0a              |box: .......|
~
kaening at flmpc21:~>
------------------------------------------------------------------------------

So, that means: on the server everything looks fine, as expected.
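
The same kind of check can be applied to the file name itself, not only to 
the content. A minimal sketch, assuming a POSIX shell with od and a Linux 
file system that stores names as raw bytes (some file systems normalize 
names, which would change the result). The scratch directory and the 
variable names are made up for illustration, and the name is written with 
octal escapes so that the example does not depend on the encoding of the 
script itself:

```shell
#!/bin/sh
# Sketch: dump the bytes of a file NAME (not its content).
# The scratch directory and variable names are illustrative only.

workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create "file-öäü.txt" with the name spelled out as UTF-8 bytes
# (ö = c3 b6, ä = c3 a4, ü = c3 bc), using octal escapes.
name=$(printf 'file-\303\266\303\244\303\274.txt')
: > "$name"

# Print the name's bytes in hex, squeezed onto one line.
name_hex=$(printf '%s' file-* | od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//')
echo "$name_hex"

# Clean up the scratch directory.
cd / && rm -rf "$workdir"
```

If the name is stored as UTF-8, the byte pairs c3 b6, c3 a4 and c3 bc 
should show up in the middle of the dump, just as they do in the file 
content above.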

------------------------------------------------------------------------------

If one now checks out a working copy of this repository via 
TortoiseSVN on WXP, one gets exactly the same file content as above. 
Surprisingly, NOTEPAD.EXE seems to understand that the file is 
UTF-8 encoded and displays the umlauts correctly.

TortoiseSVN also displays the commit message (containing umlauts as well) 
correctly.

The file name correctly contains the umlauts.
=============================================

So the Windows SVN client has no problem either!!!



------------------------------------------------------------------------------




4) do a convert to mercurial format                              - OK
==============================================================================


Now we'll import the repo into mercurial on my linux server:

------------------------------------------------------------------------------
hg-convert-svn.bash
------------------------------------------------------------------------------

#!/bin/bash

pushd .

cd /home/kaening/temp

mkdir hg

hg convert file:///home/kaening/temp/svntest hg

hg clone hg hg-wc

ls hg-wc

hexdump -C hg-wc/file-*

popd
------------------------------------------------------------------------------
kaening at flmpc21:~> bash hg-convert-svn.bash
~/temp ~/temp
initializing destination hg repository
scanning source...
sorting...
converting...
0 commit file with umlauts 'öäü' in name and content
updating working directory
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
file-öäü.txt
00000000  75 6d 6c 61 75 74 73 20  61 64 64 65 64 20 69 6e  |umlauts added in|
00000010  20 75 74 66 2d 38 20 6f  6e 20 6c 69 6e 75 78 20  | utf-8 on linux |
00000020  62 6f 78 3a 20 c3 b6 c3  a4 c3 bc 0a              |box: .......|
0000002c
~/temp
------------------------------------------------------------------------------

I.e. everything is still fine!!! File name correct, content unchanged.
================================

On the server!!!

------------------------------------------------------------------------------



F I N A L L Y :



6) do a clone on the WXP machine using THg                       - NOT OK
==============================================================================

Now we try to check out that repo with TortoiseHg on my WXP machine.

------------------------------------------------------------------------------
running ""C:\Programme\TortoiseHg\TortoisePlink.exe" -ssh -2 kaening at flmpc35 "hg -R temp/hg serve --stdio""
requesting all changes
adding changesets
adding manifests
adding file changes
added 1 changesets with 1 changes to 1 files
updating working directory
resolving manifests
getting file-öäü.txt
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
[command completed successfully]
------------------------------------------------------------------------------

Ha, even the file name gets displayed correctly in the log output.


Now let's have a look at the directory content:

------------------------------------------------------------------------------
C:\Dokumente und Einstellungen\M.Kaening\Eigene Dateien\hg-wc>dir
 Volume in Laufwerk C: hat keine Bezeichnung.
 Volumeseriennummer: 3855-15E8

 Verzeichnis von C:\Dokumente und Einstellungen\M.Kaening\Eigene Dateien\hg-wc

20.06.2008  16:56    <DIR>          .
20.06.2008  16:56    <DIR>          ..
20.06.2008  16:56    <DIR>          .hg
20.06.2008  16:55                44 file-öäü.txt
               1 Datei(en)             44 Bytes
               3 Verzeichnis(se), 38.812.647.424 Bytes frei

C:\Dokumente und Einstellungen\M.Kaening\Eigene Dateien\hg-wc>
------------------------------------------------------------------------------


As you can see, Mercurial (or TortoiseHg) does NOT set the file name the way 
TortoiseSVN does. The file name is not adapted to the current charset, as 
SVN would do.

That's what I mean here. I THINK THAT'S INCONSISTENT, BECAUSE IT IS NOT 
PORTABLE.
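
To make the effect concrete: the name is stored as the UTF-8 bytes seen on 
the server, and a system that reads those same bytes in a single-byte code 
page sees different characters. A rough sketch, assuming an iconv that 
knows the names CP1252 and CP850 (glibc's does; other implementations may 
spell them differently). The helper show_as is made up for this 
illustration and is not part of Mercurial or Subversion:

```shell
#!/bin/sh
# Sketch: reinterpret the UTF-8 bytes of "file-öäü.txt" as if they were
# text in a single-byte Windows code page.
# show_as is a hypothetical helper, not a Mercurial or SVN command.

# The file name as raw UTF-8 bytes (octal escapes keep this script ASCII).
name_bytes='file-\303\266\303\244\303\274.txt'

show_as() {
    # $1 = code page the bytes are (wrongly) assumed to be in.
    # The result is converted to UTF-8 so a UTF-8 terminal can display it.
    printf "$name_bytes" | iconv -f "$1" -t UTF-8
}

as_cp1252=$(show_as CP1252)   # the code page the system might be using
as_cp850=$(show_as CP850)     # the code page the console seems to use
as_utf8=$(show_as UTF-8)      # the intended reading

echo "read as CP1252: $as_cp1252"
echo "read as CP850:  $as_cp850"
echo "read as UTF-8:  $as_utf8"
```

On a UTF-8 terminal this should print file-Ã¶Ã¤Ã¼.txt for CP1252, 
file-├Â├ñ├╝.txt for CP850 and file-öäü.txt for UTF-8. A tool that converts 
names to the local charset avoids this; a tool that writes the stored bytes 
unchanged does not.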

So far I haven't figured out which value for --encoding or 
HGENCODING I should use to make it work. My console seems to be set to 
cp850; the system might be set to cp1252, if I understood correctly that the 
following registry key is the one to trust:

------------------------------------------------------------------------------
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\ACP
------------------------------------------------------------------------------

The content of the file is correct. No change, it stays UTF-8.
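
This can be cross-checked against the numbers already shown: the hexdump 
ends at offset 0x2c, which is 44, and dir reports 44 bytes. A small sketch 
that rebuilds the content from the hexdump and checks both its size and its 
validity as UTF-8, assuming an iconv that reports an error on invalid input 
(glibc's does). The variable names are only illustrative:

```shell
#!/bin/sh
# Sketch: rebuild the bytes from the hexdump and check them.
# Variable names are illustrative only.

# File content exactly as dumped; öäü written as UTF-8 octal escapes.
content='umlauts added in utf-8 on linux box: \303\266\303\244\303\274\n'

# Byte count; compare with the 0x2c (= 44) offset and the size in dir.
size=$(printf "$content" | wc -c | tr -d ' ')

# iconv rejects input that is not valid in the source encoding, so a
# UTF-8 to UTF-8 round trip acts as a validity check.
if printf "$content" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
    valid=yes
else
    valid=no
fi

echo "size=$size valid_utf8=$valid"
```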

