Managing multiple encodings in one repository

David Rushby davidrushby at gmail.com
Wed Apr 4 06:49:00 CDT 2007


Hi.  I just read the current draft of Bryan O'Sullivan's book
_Distributed revision control with Mercurial_, and I'm beyond
impressed.  Not only is the design of the tool itself exceptionally
well considered, but the organization of the book and the quality of
the prose are very, very good.

Reading the chapters about Mercurial Queues made me want to cry for
all the time I've wasted performing such tasks manually :)

However, it seems there are plenty of rough edges in the
implementation.  Getting up and running on the Russian version of
Windows 2000 with hgk and a graphical merge tool (kdiff3) has been a
painful process (but a successful one, after I built Mercurial from
source and installed Windows-compatible infrastructure for hgk and
hgmerge->kdiff3).

Currently, the major stumbling block is working with different
encodings in the same repository.  I've read the wiki entry
Character_Encoding_On_Windows, and I know about changing the code page
for the cmd.exe console, and so on.

But Mercurial's encoding support seems to have plenty of cracks (I'm
using the latest version from http://selenic.com/repo/hg ).  I'm
trying to figure out how to accomplish the following on Windows 2000 (in a
single repository):
---
  1) Explicitly specify the encoding of my Mercurial.ini file.
      Although Mercurial detects the system preferred encoding
(Windows-1251 Cyrillic), and correctly extracts the contents of
Mercurial.ini with that encoding, I also need to work on Windows
installations where the preferred encoding cannot represent characters
in my Cyrillic username.
      If I save Mercurial.ini as (for example) UTF-8, then specify
"--encoding=utf8" or environment variable HGENCODING=utf8, the
username emerges garbled.  If I save Mercurial.ini as UTF-16 and
then specify that encoding, Mercurial can't read the file at all.  It
appears that Mercurial pays no attention to the explicit encoding
setting when reading Mercurial.ini.
      I took a look at the source code, and it appears that this could
be fixed in mercurial/ui.py:ui:readconfig by changing the "fp =
open(f)" statement to open the file with "fp = codecs.open(f,
encoding=...explicitly_specified_encoding...)".

  2) Be able to see encoding-normalized output from commands that
might operate on files with different encodings.
      For example, "hg diff" when uncommitted changes have been made
in a UTF-8-encoded file and a Windows-1251-encoded file.  Currently, I
can specify "hg diff --encoding=utf8" and see garbage in the diff of
the Windows-1251 file, or "hg diff --encoding=cp1251" and see garbage
in the UTF-8.
      When generating text for display, wouldn't it be possible to
normalize the output to the encoding that the user has specified,
rather than just dumping whatever happens to be in the file?

  3) Be able to specify the encoding with "hg commit".  Currently,
since my username is non-UTF-8 (and must remain that way, since
Mercurial.ini currently must be in the system preferred encoding), if
I try to issue the command:
        hg commit --encoding=utf8 -m "blah"
      It fails because the Windows-1251 representation of my username
isn't valid UTF-8, and Mercurial apparently isn't encoding the
username to UTF-8 before attempting to include it with other UTF-8 for
the commit.

  4) Have the "hg serve" web interface work properly with multiple
encodings.  It currently sets the "charset" clause of the HTTP
"Content-Type" header properly, but suffers from the same problem as
"hg diff" when multiple encodings are involved.
      If there's a changeset that includes changes from files with
different encodings, I see garbage for all but one of the encodings,
depending on the encoding setting that was in force when the web
server started.  The output would ideally be normalized.

  5) Have Mercurial make the active encoding setting available when
calling external tools such as hgmerge.
      Mercurial could accomplish this by setting a subprocess
environment variable whenever it spawns an external tool.  For
example, if I have set my encoding via "--encoding=..." instead of via
the HGENCODING environment variable, then Mercurial should fabricate
the HGENCODING environment variable when it spawns a subprocess.
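
      To make point 5 concrete, here is a minimal sketch of what I have
in mind.  It is not taken from Mercurial's source: the helper names
env_with_encoding and run_external_tool are hypothetical, and I am
assuming the active encoding is already available as a plain string.

```python
import os
import subprocess


def env_with_encoding(active_encoding, base_env=None):
    """Return a copy of the environment with HGENCODING set.

    base_env defaults to the current process environment; the copy
    means the parent process's own environment is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["HGENCODING"] = active_encoding
    return env


def run_external_tool(argv, active_encoding):
    """Spawn an external tool (hypothetical helper) with HGENCODING set.

    Returns the tool's exit status.
    """
    return subprocess.call(argv, env=env_with_encoding(active_encoding))
```

      With something like this, a tool such as hgmerge would see the
same encoding whether it was chosen via "--encoding=..." or via the
HGENCODING environment variable.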
---
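
For points 1 and 3, the behaviour I am hoping for can be sketched
roughly as follows.  This is only an illustration under my own
assumptions, not a patch against Mercurial: read_config_text and
username_for_commit are hypothetical names, and I use io.open as a
stand-in for the codecs.open call mentioned above (both accept an
explicit encoding argument).

```python
import io


def read_config_text(path, encoding):
    """Read a config file, decoding it with an explicitly chosen encoding.

    'encoding' would come from --encoding or HGENCODING rather than
    from the system preferred encoding.
    """
    with io.open(path, "r", encoding=encoding) as fp:
        return fp.read()


def username_for_commit(raw_username, config_encoding):
    """Convert username bytes from the config file's encoding to UTF-8.

    Decoding first and then re-encoding avoids mixing, say,
    Windows-1251 bytes into data that is otherwise UTF-8.
    """
    return raw_username.decode(config_encoding).encode("utf-8")
```

The same decode-then-encode step is what I mean by "normalizing"
output in points 2 and 4, although there the hard part is knowing each
file's encoding in the first place, which this sketch does not address.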

Are there any fundamental limitations in Mercurial's internals that
make these problems insurmountable?

If not, are the Mercurial developers interested in accepting patches
to fix them?  If so, on which repository should the patches be based?
    http://selenic.com/repo/hg
    http://selenic.com/repo/hg-stable
    http://hg.intevation.org/mercurial/crew

Thanks.

