Managing multiple encodings in one repository

Thu Apr 5 11:11:06 CDT 2007

On 4/5/07, Matt Mackall <mpm at selenic.com> wrote:
> >      If I save Mercurial.ini as (for example) UTF-8, then specify
> > "--encoding=utf8" or environment variable HGENCODING=utf8, the
> > username emerges garbled.
>
> What precisely is happening? Is Mercurial properly reading your .ini
> as UTF-8 and then displaying it as UTF-8, which your console tries to
> interpret as Windows-1251? This will manifest as all the non-ASCII
> characters being represented as multiple characters.

No, that's not what's happening.  Mercurial is treating the contents
of Mercurial.ini as if they were stored in the system default
encoding, even when I specify another encoding.

Here's a simple way to reproduce the problem (on Windows, at least):
---
1) create an empty directory

2) within that directory,
     hg init

3) hg status
    now prints nothing, as expected.

4) Start wordpad.exe, and paste the following into it:
      [ui]
      username = Someone <someone at somewhere.com>

    Note that there are no non-ASCII characters there, so printing
them to the console should not present any problems, regardless of
what code page the console is configured to use.

5) Use {File->Save As} to cause wordpad to replace your Mercurial.ini
file, specifying "Text document (Unicode)" as the file type.  This
writes the file in UTF-16.

6) hg status --encoding=utf16
    Now dies with a message like:
"""
abort: Failed to parse C:\Documents and Settings\Rushby\mercurial.ini
File contains no section headers.
file: C:\Documents and Settings\Rushby\mercurial.ini, line: 1
'\xff\xfe[\x00u\x00i\x00]\x00\r\x00\n'
"""
---
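
The byte string quoted in that abort message is what the first line of
a UTF-16 file looks like when it is split on newline bytes without
being decoded first.  Here is a minimal Python sketch that reproduces
it.  This is not Mercurial code, and the helper name first_raw_line
is made up for illustration:

```python
import codecs

def first_raw_line(text):
    # Encode as little-endian UTF-16 with a byte order mark, which is
    # the layout implied by the '\xff\xfe' prefix in the abort message.
    raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    # Split on the newline *byte*, as a reader that never decodes would.
    return raw.split(b"\n", 1)[0] + b"\n"

line = first_raw_line("[ui]\r\nusername = Someone <someone at somewhere.com>\r\n")
print(repr(line))
# The line begins with the BOM rather than "[", so a byte-oriented
# parser finds no section header on it.
print(line.startswith(b"["))
```

The first print shows the same bytes that appear in the abort message,
which is consistent with the file never being decoded as UTF-16.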

Mercurial is going right ahead and trying to read Mercurial.ini as if
it were encoded in the system default encoding.  If I replace the line
  fp = open(f)
in ui.py:ui:readconfig with
  import codecs; fp = codecs.open(f, encoding='utf16')
then Mercurial is able to read a UTF-16-encoded Mercurial.ini.
Obviously, a real fix would need to use the active Mercurial encoding
instead of hard-coded 'utf16'.
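
To illustrate the idea outside of Mercurial, here is a rough sketch
using the standard library of a current Python 3 (configparser and the
built-in open), not Mercurial's own ui.py.  The names read_ini and
demo are hypothetical, and the encoding argument stands in for
whatever encoding Mercurial considers active:

```python
import configparser
import os
import tempfile

def read_ini(path, encoding):
    """Parse an INI file, decoding it with the caller's encoding."""
    parser = configparser.RawConfigParser()
    with open(path, encoding=encoding) as fp:
        parser.read_file(fp)
    return parser

def demo():
    # Write a UTF-16 file with a BOM and CRLF line endings, roughly
    # what WordPad's "Text document (Unicode)" option produces.
    text = "[ui]\nusername = Someone <someone at somewhere.com>\n"
    fd, path = tempfile.mkstemp(suffix=".ini")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-16", newline="\r\n") as fp:
            fp.write(text)
        # Reading succeeds because the encoding is passed through
        # instead of being assumed.
        return read_ini(path, "utf-16").get("ui", "username")
    finally:
        os.remove(path)

print(demo())
```

The point of the sketch is only that the encoding is a parameter of
the read, so the same code path works for any encoding the user names.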

> >  2) Be able to see encoding-normalized output from commands that
> > might operate on files with different encodings.
>
> There's really no good way to deal with this problem. Firstly, because
> there's no well-defined way to identify a file's encoding (indeed, it
> could have -many- in the same file). And secondly, because it's a bad
> idea for the tool to presume to modify data that it doesn't own.
> Character encoding/decoding is lossy, confusing, and frequently
> misconfigured, so it's a good way to silently corrupt things. So we
> only encode and decode Mercurial metadata, everything else is saved,
> stored, and displayed as-is. This is also known as being '8-bit
> clean'.

That makes sense.  What do you think of the feasibility of writing a
plug-in (for my own use) that would intercept any attempt by Mercurial
to read a file with a certain extension, examine the file for an
encoding specification, and load a normalized representation of the
file before Mercurial "gets its hands on" the contents?

In Python source files, for example, one can use an encoding directive
on the first or second line, like this:
  #-*- coding: utf8 -*-
to inform Python of the file's encoding.

If I could (for example) write a Mercurial plug-in that would
intercept Mercurial's attempts to read the contents of .py files, and
convert the text to a normalized representation before Mercurial
processed it, that would solve my problems.  However, it appears to me
that Mercurial currently has no hooks that would be adequate for that
purpose.
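
For concreteness, here is a sketch of the normalization step alone,
with no Mercurial integration at all.  It assumes a current Python 3,
where tokenize.detect_encoding reads the coding directive, and it
assumes UTF-8 as the normalized form.  The function name
normalize_python_source is hypothetical:

```python
import io
import tokenize

def normalize_python_source(raw):
    """Return (declared_encoding, utf8_bytes) for raw .py file bytes."""
    # detect_encoding looks for a BOM or a coding directive in the
    # first two lines, and falls back to UTF-8 if it finds neither.
    declared, _ = tokenize.detect_encoding(io.BytesIO(raw).readline)
    text = raw.decode(declared)
    return declared, text.encode("utf-8")

# Example input: a Latin-1 encoded source file with a directive.
source = "#-*- coding: latin-1 -*-\nname = 'caf\u00e9'\n"
declared, normalized = normalize_python_source(source.encode("latin-1"))
print(declared)
print(normalized == source.encode("utf-8"))
```

Note that the normalized bytes still carry the original coding line,
which no longer matches the content, so a real plug-in would have to
rewrite or strip that line as well.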

Thanks.