Managing multiple encodings in one repository

Thu Apr 5 00:21:43 CDT 2007

On Wed, Apr 04, 2007 at 03:49:00PM +0400, David Rushby wrote:
> But Mercurial's encoding support seems to have plenty of cracks (I'm
> using the latest version from http://selenic.com/repo/hg ).  I'm
> trying to figure out how to accomplish the following on Windows 2000 (in a
> single repository):
> ---
>  1) Explicitly specify the encoding of my Mercurial.ini file.
>      Although Mercurial detects the system preferred encoding
> (Windows-1251 Cyrillic), and correctly extracts the contents of
> Mercurial.ini with that encoding, I also need to work on Windows
> installations where the preferred encoding cannot represent characters
> in my Cyrillic username.

So far, so good.

>      If I save Mercurial.ini as (for example) UTF-8, then specify
> "--encoding=utf8" or environment variable HGENCODING=utf8, the
> username emerges garbled.

What precisely is happening? Is Mercurial properly reading your .ini
as UTF-8 and then displaying it as UTF-8, which your console tries to
interpret as Windows-1251? This will manifest as all the non-ASCII
characters being represented as multiple characters.
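
To illustrate that symptom, here is a minimal Python sketch. The username
is made up for the example, and this only imitates what a Windows-1251
console would do with UTF-8 bytes; it is not Mercurial code.

```python
# Hypothetical username, used only for illustration.
name = "Давид"

# The bytes a UTF-8 Mercurial.ini would contain for this name.
utf8_bytes = name.encode("utf-8")

# What a Windows-1251 console shows when handed those bytes unchanged:
# every 2-byte UTF-8 sequence turns into two unrelated characters.
misread = utf8_bytes.decode("cp1251")

print(len(name), len(misread))
print(misread)
```

If the output you see has twice as many characters as the name, each
Cyrillic letter having become a pair, the data is probably fine and only
the console's interpretation is wrong.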

If this is the case, what you need is a UTF-8 console. Theoretically,
this can be done with 'chcp 65001'. But that's just what I hear on the
internets. Let me know if it helps.

>  2) Be able to see encoding-normalized output from commands that
> might operate on files with different encodings.
>      For example, "hg diff" when uncommitted changes have been made
> in a UTF-8-encoded file and a Windows-1251-encoded file.  Currently, I
> can specify "hg diff --encoding=utf8" and see garbage in the diff of
> the Windows-1251 file, or "hg diff --encoding=cp1251" and see garbage
> in the UTF-8.
>      When generating text for display, wouldn't it be possible to
> normalize the output to the encoding that the user has specified,
> rather than just dumping whatever happens to be in the file?

There's really no good way to deal with this problem. Firstly, because
there's no well-defined way to identify a file's encoding (indeed, it
could have -many- in the same file). And secondly, because it's a bad
idea for the tool to presume to modify data that it doesn't own.
Character encoding/decoding is lossy, confusing, and frequently
misconfigured, so it's a good way to silently corrupt things. So we
only encode and decode Mercurial metadata; everything else is saved,
stored, and displayed as-is. This is also known as being '8-bit
clean'.
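
A small Python sketch of why guessing is dangerous. The file content is
invented for the example and none of this is Mercurial code; it only
shows what happens when a tool assumes the wrong encoding for bytes it
does not own.

```python
# Hypothetical file content, stored as Windows-1251 bytes.
text = "привет"
cp1251_bytes = text.encode("cp1251")

# A strict decode with the wrong guess at least fails loudly.
try:
    cp1251_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decode "works", but the original bytes are gone for good
# once the result is written back out.
guessed = cp1251_bytes.decode("utf-8", errors="replace")
rewritten = guessed.encode("utf-8")

# Passing the bytes through untouched (8-bit clean) loses nothing.
passed_through = bytes(cp1251_bytes)

print(strict_ok, rewritten == cp1251_bytes, passed_through == cp1251_bytes)
```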

>  3) Be able to specify the encoding with "hg commit".  Currently,
> since my username is non-UTF-8 (and must remain that way, since
> Mercurial.ini currently must be in the system preferred encoding), if

I don't think this constraint is correct.

> I try to issue the command:
>        hg commit --encoding=utf8 -m "blah"
>      It fails because the Windows-1251 representation of my username
> isn't valid UTF-8, and Mercurial apparently isn't encoding the
> username to UTF-8 before attempting to include it with other UTF-8 for
> the commit.

You can use -u to work around this, but it really sounds like you want
to go to a full UTF-8 environment.
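
For reference, the conversion being asked for is simple when the source
encoding is known. This is a sketch only: the helper name to_utf8 and the
username are made up, and it is not a description of what Mercurial does
internally.

```python
# Hypothetical username bytes, as they would sit in a Windows-1251
# Mercurial.ini.
raw_username = "Давид".encode("cp1251")


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode with the known source encoding, then re-encode as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Treating the raw bytes as UTF-8 directly is what fails.
try:
    raw_username.decode("utf-8")
    raw_is_valid_utf8 = True
except UnicodeDecodeError:
    raw_is_valid_utf8 = False

utf8_username = to_utf8(raw_username, "cp1251")
print(raw_is_valid_utf8, utf8_username.decode("utf-8"))
```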

>  5) Have Mercurial make the active encoding setting available when
> calling external tools such as hgmerge.
>      Mercurial could accomplish this by setting a subprocess
> environment variable whenever it spawns an external tool.  For
> example, if I have set my encoding via "--encoding=..." instead of via
> the HGENCODING environment variable, then Mercurial should fabricate
> the HGENCODING environment variable when it spawns a subprocess.

This is definitely a good idea. We should always export an HGENCODING.
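
Roughly, the idea looks like this in Python. The helper name run_tool is
hypothetical, and a child Python process stands in for an external tool
such as hgmerge; this is a sketch of the approach, not the actual change.

```python
import os
import subprocess
import sys


def run_tool(cmd, encoding):
    """Run cmd with HGENCODING set to the active encoding (hypothetical helper)."""
    env = dict(os.environ)        # copy, so the parent's environment is untouched
    env["HGENCODING"] = encoding  # export regardless of how the encoding was chosen
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


# Stand-in for an external tool: a child process that reports the variable.
child = [sys.executable, "-c", "import os; print(os.environ['HGENCODING'])"]
result = run_tool(child, "utf8")
print(result.stdout.strip())
```

The point is that the child sees the same value whether the user passed
"--encoding=..." or set HGENCODING themselves.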

-- 
Mathematics is the supreme nostalgia of our time.