Consequences for use of hg for other applications than SCM was Re: German umlauts in file names
Marko Kaening
mk362 at mch.osram.de
Fri Jun 20 10:43:28 CDT 2008
Hi (Matt),
For file names containing umlauts, Mercurial (or TortoiseHg) does NOT set
the file name the way SVN or TortoiseSVN does when the file originates
from a system using a different charset. The file name is not adapted to
the current charset, as SVN would do.
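To illustrate what "adapted to the current charset" means at the byte
level, here is a minimal sketch. It assumes cp1252 as the Windows charset
and an iconv that knows the names UTF-8 and CP1252; the variable names are
made up for this example and nothing here is taken from SVN or hg itself.

```shell
#!/bin/bash
# UTF-8 bytes of "öäü", written as octal escapes so the script does not
# depend on the encoding of the script file itself.
umlauts='\303\266\303\244\303\274'

# Hex form of the bytes as they exist on the Linux box (UTF-8).
utf8_hex=$(printf "$umlauts" | od -An -tx1 | tr -d ' \n')

# The same characters transcoded to cp1252 (assumed Windows ANSI charset).
cp1252_hex=$(printf "$umlauts" | iconv -f UTF-8 -t CP1252 | od -An -tx1 | tr -d ' \n')

echo "utf-8:  $utf8_hex"    # c3b6c3a4c3bc
echo "cp1252: $cp1252_hex"  # f6e4fc
```

A client that adapts the name would write the second byte sequence on a
cp1252 system; a client that keeps the name byte-for-byte writes the first.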
Bearing in mind what Matt wrote earlier in this thread, it looks as if this
behaviour is actually intended and not a bug:
==============================
<cite Matt>
>
> Mercurial by design does absolutely no encoding on filenames, as
> filenames very often have to byte-for-byte agree with their
> representation in other files such as makefiles, etc.
>
<cite/>
BUT, I believe that this is not what the user really wants in some cases. I
understand the byte-for-byte argument for makefiles, but from a
directory-structure point of view (probably an svn p.o.v.) I think that
readable file names are what one needs for certain applications, which need
not necessarily be source code management, but e.g. versioned media storage
in a distributed manner.
I think that svn's behaviour is more consistent here.
If there is no way such a feature could be implemented in hg (perhaps as
an option), I guess it is not usable for anything other than SCM, which
would be really sad, since I began to like its concept, repo size and
speed.
So there is probably only svk left as an alternative...
Regards,
Marko
------------------------------------------------------------------------------
Here follows a description of how I
1) set up an SVN repo with umlauts in file content and file name - OK
2) do a checkout directly on the server - OK
3) do a checkout to WXP machine using TSVN - OK
4) do a convert to mercurial format - OK
5) do a clone on the server - OK
6) do a clone on the WXP machine using THg - NOT OK
------------------------------------------------------------------------------
svn-fill-from-linux.bash (creates repo, inserts file)
------------------------------------------------------------------------------
#!/bin/bash
pushd .
svnadmin create /home/kaening/svntest
cd /home/kaening/temp
svn co file:///home/kaening/svntest wc
cd wc
echo "umlauts added in utf-8 on linux box: öäü" > file-öäü.txt
svn stat -v
svn add file-*
svn commit -m "commit file with umlauts 'öäü' in name and content"
svn stat -v
svn log file-*
hexdump -C file-*
popd
------------------------------------------------------------------------------
kaening@flmpc21:~> bash temp/svn-fill-from-linux.bash
~ ~
Ausgecheckt, Revision 0.
? file-öäü.txt
0 0 ? .
A file-öäü.txt
Hinzufügen file-öäü.txt
Übertrage Daten .
Revision 1 übertragen.
0 0 ? .
1 1 kaening file-öäü.txt
------------------------------------------------------------------------
r1 | kaening | 2008-06-20 16:13:21 +0200 (Fr, 20 Jun 2008) | 1 line
commit file with umlauts 'öäü' in name and content
------------------------------------------------------------------------
00000000 75 6d 6c 61 75 74 73 20 61 64 64 65 64 20 69 6e |umlauts added in|
00000010 20 75 74 66 2d 38 20 6f 6e 20 6c 69 6e 75 78 20 | utf-8 on linux |
00000020 62 6f 78 3a 20 c3 b6 c3 a4 c3 bc 0a |box: .......|
~
kaening@flmpc21:~>
------------------------------------------------------------------------------
So, that means: on the server everything looks fine, as expected.
------------------------------------------------------------------------------
If one now checks out a working copy of this repository via
TortoiseSVN on WXP, one gets exactly the same file content as above.
Surprisingly, NOTEPAD.EXE seems to be able to understand that the file is
utf-8 encoded and displays the umlauts correctly.
TortoiseSVN also displays the commit message (containing umlauts as well)
correctly.
The file name correctly contains the umlauts.
=============================================
So, the Windows SVN client has no problem either!!!
------------------------------------------------------------------------------
4) do a convert to mercurial format - OK
==============================================================================
Now we'll import the repo into mercurial on my linux server:
------------------------------------------------------------------------------
hg-convert-svn.bash
------------------------------------------------------------------------------
#!/bin/bash
pushd .
cd /home/kaening/temp
mkdir hg
hg convert file:///home/kaening/temp/svntest hg
hg clone hg hg-wc
ls hg-wc
hexdump -C hg-wc/file-*
popd
------------------------------------------------------------------------------
kaening@flmpc21:~> bash hg-convert-svn.bash
~/temp ~/temp
initializing destination hg repository
scanning source...
sorting...
converting...
0 commit file with umlauts 'öäü' in name and content
updating working directory
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
file-öäü.txt
00000000 75 6d 6c 61 75 74 73 20 61 64 64 65 64 20 69 6e |umlauts added in|
00000010 20 75 74 66 2d 38 20 6f 6e 20 6c 69 6e 75 78 20 | utf-8 on linux |
00000020 62 6f 78 3a 20 c3 b6 c3 a4 c3 bc 0a |box: .......|
0000002c
~/temp
------------------------------------------------------------------------------
I.e. everything is still fine!!! File name correct, content unchanged.
================================
On the server!!!
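The hexdump above only covers the file content. A similar check for the
file name itself could look like the following sketch. The temporary
directory is only a stand-in for hg-wc, the variable names are made up for
illustration, and it assumes a file system that stores name bytes
unchanged (as is usual on Linux).

```shell
#!/bin/bash
# Stand-in for the hg-wc working copy (hypothetical, for illustration only).
workdir=$(mktemp -d)

# Create a file whose name is the UTF-8 byte sequence for "file-öäü.txt";
# octal escapes keep the script independent of its own encoding.
touch "$workdir/$(printf 'file-\303\266\303\244\303\274.txt')"

# Dump the directory entry as hex to see the bytes the file system holds.
name_hex=$(cd "$workdir" && printf '%s' file-* | od -An -tx1 | tr -d ' \n')
echo "$name_hex"

# Clean up the temporary directory.
rm -rf "$workdir"
```

If the name is stored as UTF-8, the umlauts show up as c3 b6 c3 a4 c3 bc,
the same bytes that appear in the content dump.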
------------------------------------------------------------------------------
F I N A L L Y :
6) do a clone on the WXP machine using THg - NOT OK
==============================================================================
Now we try to check out that repo with TortoiseHg on my WXP machine.
------------------------------------------------------------------------------
running ""C:\Programme\TortoiseHg\TortoisePlink.exe" -ssh -2 kaening@flmpc35 "hg -R temp/hg serve --stdio""
requesting all changes
adding changesets
adding manifests
adding file changes
added 1 changesets with 1 changes to 1 files
updating working directory
resolving manifests
getting file-öäü.txt
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
[command completed successfully]
------------------------------------------------------------------------------
Ha, even the file name gets displayed correctly in the log output.
Now let's have a look at the directory content:
------------------------------------------------------------------------------
C:\Dokumente und Einstellungen\M.Kaening\Eigene Dateien\hg-wc>dir
Volume in Laufwerk C: hat keine Bezeichnung.
Volumeseriennummer: 3855-15E8
Verzeichnis von C:\Dokumente und Einstellungen\M.Kaening\Eigene Dateien\hg-wc
20.06.2008 16:56 <DIR> .
20.06.2008 16:56 <DIR> ..
20.06.2008 16:56 <DIR> .hg
20.06.2008 16:55 44 file-öäü.txt
1 Datei(en) 44 Bytes
3 Verzeichnis(se), 38.812.647.424 Bytes frei
C:\Dokumente und Einstellungen\M.Kaening\Eigene Dateien\hg-wc>
------------------------------------------------------------------------------
As you can see, Mercurial or TortoiseHg does NOT set the file name the way
TortoiseSVN would. The file name is not adapted to the current charset, as
SVN would do.
That's what I mean here. I THINK THAT'S INCONSISTENT, BECAUSE IT IS NOT
PORTABLE.
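One way to picture the effect, as a sketch under assumptions: suppose the
UTF-8 name bytes arrive unchanged on the Windows side and are then
interpreted in a single-byte code page. cp1252 and cp850 are used here only
as plausible code pages for a German Windows XP; the iconv charset names
and the variable names are likewise assumptions of this example.

```shell
#!/bin/bash
# UTF-8 bytes of "file-öäü.txt" as octal escapes (independent of the
# encoding of this script file).
name_bytes='file-\303\266\303\244\303\274.txt'

# Interpret the unchanged bytes as cp1252 and convert to UTF-8 for display.
as_cp1252=$(printf "$name_bytes" | iconv -f CP1252 -t UTF-8)

# Interpret the same bytes as cp850 (a typical console code page).
as_cp850=$(printf "$name_bytes" | iconv -f CP850 -t UTF-8)

echo "read as cp1252: $as_cp1252"   # file-Ã¶Ã¤Ã¼.txt
echo "read as cp850:  $as_cp850"
```

Each umlaut turns into two unrelated characters, because every UTF-8
umlaut is two bytes and a single-byte code page maps each byte separately.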
Up to now I haven't figured out which value for --encoding or
HGENCODING I should use to make it work. My console seems to be set to
cp850, and the system might be set to cp1252, if I understand correctly
that the following regkey is the one to trust:
------------------------------------------------------------------------------
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\ACP
------------------------------------------------------------------------------
The content of the file is correct. No change, it stays UTF-8.