Managing multiple encodings in one repository

Thu Apr 5 11:47:50 CDT 2007

On Thu, Apr 05, 2007 at 08:11:06PM +0400, David Rushby wrote:
> On 4/5/07, Matt Mackall <mpm at selenic.com> wrote:
> >>      If I save Mercurial.ini as (for example) UTF-8, then specify
> >> "--encoding=utf8" or environment variable HGENCODING=utf8, the
> >> username emerges garbled.
> >
> >What precisely is happening? Is Mercurial properly reading your .ini
> >as UTF-8 and then displaying it as UTF-8, which your console tries to
> >interpret as Windows-1251? This will manifest as all the non-ASCII
> >characters being represented as multiple characters.
> 
> No, that's not what's happening.  Mercurial is trying to pretend that the
> contents of Mercurial.ini are stored in the system default encoding,
> even when I specify another encoding.
> 
> Here's a simple way to reproduce the problem (on Windows, at least):
> ---
> 1) create an empty directory
> 
> 2) within that directory,
>     hg init
> 
> 3) hg status
>    now prints nothing, as expected.
> 
> 4) Start wordpad.exe, and paste the following into it:
>      [ui]
>      username = Someone <someone at somewhere.com>
> 
>    Note that there are no non-ASCII characters there, so printing
> them to the console should not present any problems, regardless of
> what code page the console is configured to use.
> 
> 5) Use {File->Save As} to cause wordpad to replace your Mercurial.ini
> file, specifying "Text document (Unicode)" as the file type.  This
> writes the file in UTF-16.
> 
> 6) hg status --encoding=utf16
>    Now dies with a message like:
> """
> abort: Failed to parse C:\Documents and Settings\Rushby\mercurial.ini
> File contains no section headers.
> file: C:\Documents and Settings\Rushby\mercurial.ini, line: 1
> '\xff\xfe[\x00u\x00i\x00]\x00\r\x00\n'
> """

UTF-16 is a whole 'nother story. If this had been UTF-8 without a
stupid BOM (the \xff\xfe above is the UTF-16 one), this would have
worked just fine.
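
For illustration, here is a small Python sketch of what the first line
of that file looks like on disk under a few encodings. It is not
Mercurial code; the variable names are made up for the example.

```python
# Illustration only: the first line of the .ini as bytes on disk.
line = "[ui]\r\n"

ascii_bytes = line.encode("ascii")
utf8_bytes = line.encode("utf-8")          # UTF-8, no BOM
utf8_bom_bytes = line.encode("utf-8-sig")  # UTF-8 with a BOM prepended
# UTF-16 little-endian with its BOM, as in the error message above
utf16_bytes = b"\xff\xfe" + line.encode("utf-16-le")

for name, data in [
    ("ascii", ascii_bytes),
    ("utf-8", utf8_bytes),
    ("utf-8 + BOM", utf8_bom_bytes),
    ("utf-16", utf16_bytes),
]:
    print("%-12s %r" % (name, data))
```

For pure-ASCII content, the BOM-less UTF-8 bytes are identical to the
ASCII bytes, so a byte-oriented parser never notices the difference.
The other two variants put extra bytes in front of the "[".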

> Mercurial is going right ahead and trying to read Mercurial.ini as if
> it were encoded in the system default encoding.

No. It's reading it as raw bytes. Again, Mercurial -only- transcodes
those bits of data it knows how to transcode. Here, that means the
only line in the .ini file that gets transcoded is the username. That
happens after parsing. Everything else it leaves untouched. 8-bit
clean, remember? This breaks because you've gone to 16 bits!
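
A rough sketch of that order of operations, in Python. This is not
Mercurial's actual parser: parse_ini_bytes and read_username are
hypothetical names, and the .ini handling is deliberately minimal.

```python
# Hypothetical sketch: parse the file as raw bytes, then transcode
# only the one value that is known to be text (the username).

def parse_ini_bytes(data: bytes) -> dict:
    """Parse a minimal .ini; section names, keys and values stay bytes."""
    sections = {}
    current = None
    for raw in data.splitlines():
        line = raw.strip()
        if not line or line.startswith((b"#", b";")):
            continue  # blank line or comment
        if line.startswith(b"[") and line.endswith(b"]"):
            current = sections.setdefault(line[1:-1], {})
        elif current is None:
            # Content before any [section] header
            raise ValueError("File contains no section headers: %r" % raw)
        elif b"=" in line:
            key, value = line.split(b"=", 1)
            current[key.strip()] = value.strip()
    return sections


def read_username(data: bytes, encoding: str) -> str:
    """Decode only the username, and only after parsing has finished."""
    cfg = parse_ini_bytes(data)
    return cfg[b"ui"][b"username"].decode(encoding)
```

With a UTF-8 file the section header and key are plain ASCII bytes, so
parsing succeeds and the encoding only matters for the final decode.
With the UTF-16 file the first line does not start with b"[", so the
sketch fails during parsing, before any encoding setting is consulted.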

Ok, actually, it broke before that because it doesn't have any idea
what to make of the first two garbage bytes on the first line.
Otherwise, you would have gotten a nice header that looked like
"\0u\0i\0", which wouldn't have matched "ui".
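
To see this at the byte level, here is a short Python sketch. The
regular expression is a simplified stand-in for an .ini section-header
pattern, assumed for the example rather than taken from Mercurial.

```python
import re

# Simplified, assumed section-header pattern operating on bytes.
SECTION_RE = re.compile(rb"\[(?P<header>[^\]]+)\]")

# The first line from the error message above.
utf16_line = b"\xff\xfe[\x00u\x00i\x00]\x00\r\x00\n"

# With the BOM in place the line does not start with "[" at all.
with_bom = SECTION_RE.match(utf16_line)

# Skipping the two BOM bytes yields a header, but with NUL bytes in it.
without_bom = SECTION_RE.match(utf16_line[2:])
header = without_bom.group("header")

print(with_bom)         # no match
print(repr(header))     # NUL-interleaved bytes, not b"ui"
print(header == b"ui")
```

So even a parser that ignored the BOM would end up with a section name
that is not "ui", and the username would never be found.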

> >>  2) Be able to see encoding-normalized output from commands that
> >> might operate on files with different encodings.
> >
> >There's really no good way to deal with this problem. Firstly, because
> >there's no well-defined way to identify a file's encoding (indeed, it
> >could have -many- in the same file). And secondly, because it's a bad
> >idea for the tool to presume to modify data that it doesn't own.
> >Character encoding/decoding is lossy, confusing, and frequently
> >misconfigured, so it's a good way to silently corrupt things. So we
> >only encode and decode Mercurial metadata; everything else is saved,
> >stored, and displayed as-is. This is also known as being '8-bit
> >clean'.
> 
> That makes sense.  What do you think of the feasibility of writing a
> plug-in (for my own use) that would intercept any attempt by Mercurial
> to read a file with a certain extension, examine the file for an
> encoding specification, and load a normalized representation of the
> file before Mercurial "gets its hands on" the contents?

See the encode and decode filters:

http://www.selenic.com/mercurial/wiki/index.cgi/EncodeDecodeFilter
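
As a sketch of what such a filter command might do, the Python below
looks for a BOM and, if it finds one, re-encodes the content as
BOM-less UTF-8. Everything here is an assumption for illustration:
normalize_to_utf8 and main are hypothetical names, BOM sniffing is only
one possible way to find "an encoding specification", and how the
command gets registered is described on the wiki page above.

```python
import codecs
import sys

# BOMs this sketch recognises (an assumed, deliberately short list).
# UTF-32 is not handled; its little-endian BOM begins with the same
# two bytes as UTF-16-LE and would be misdetected.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def normalize_to_utf8(data: bytes) -> bytes:
    """Return data as BOM-less UTF-8 if it starts with a known BOM."""
    for bom, codec in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(codec).encode("utf-8")
    # No BOM: leave the bytes exactly as they are.
    return data


def main() -> None:
    """Filter entry point: read bytes on stdin, write bytes on stdout."""
    sys.stdout.buffer.write(normalize_to_utf8(sys.stdin.buffer.read()))
```

Note that this conversion is one-way as written: a matching filter for
the opposite direction would need to know which encoding to restore,
which is the kind of guesswork the "8-bit clean" policy avoids.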

-- 
Mathematics is the supreme nostalgia of our time.