[PATCH] encoding: prefer the detected console encoding by default (issue2926)

Matt Mackall mpm at selenic.com
Wed Jul 27 15:11:59 CDT 2011


On Thu, 2011-07-28 at 02:30 +0700, Andrei Polushin wrote:
> 28.07.2011 1:51, Matt Mackall wrote:
> > On Thu, 2011-07-28 at 01:45 +0700, Andrei Polushin wrote:
> >> encoding: prefer the detected console encoding by default (issue2926)
> >>
> >> The default installation of Mercurial should use the Windows OEM encoding
> >> when running within the console window, which uses OEM encoding by default.
> > 
> > This won't work. Among the very long list of things this will break, it
> > will garble commit messages created by Notepad or other graphical tools
> > launched from the command line.
> > 
> 
> OK, I accept your objection about Notepad as a compatibility issue. Looks
> like it would be permissible to recode the console output separately from
> other things. What do you think?

a) That's an incredibly large piece of work
b) It would be very backwards-incompatible
c) It would STILL be buggy (because having different encodings on the
console and GUI, as Windows sometimes does, is inherently broken)

Point (c) isn't very obvious, so I'll elaborate. Consider this:

> hg log -r tip > log.out
> type log.out

What encoding should we get here? If we choose the GUI encoding, we'll
get a mysteriously wrong result on the console. If we choose the console
encoding, we'll get a similarly wrong result when we try to open log.out
in our editor. We can't know how log.out is going to be used, so it's
impossible to get it right every time. We just have to pick one. So we
pick the GUI one.
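
To make the dilemma concrete, here is a minimal Python sketch. It assumes a
Russian Windows setup where the GUI code page is cp1251 and the console code
page is cp866; the sample string and all variable names are made up for
illustration and are not part of Mercurial.

```python
# Illustrative sketch: the bytes in log.out can only be in one encoding.
# Assumption: Russian Windows, GUI code page cp1251, console code page cp866.
GUI_ENCODING = "cp1251"
CONSOLE_ENCODING = "cp866"

summary = "\u041f\u0440\u0438\u0432\u0435\u0442"  # "Privet" in Cyrillic letters

# Option 1: write log.out in the GUI encoding.
gui_bytes = summary.encode(GUI_ENCODING)
seen_in_editor = gui_bytes.decode(GUI_ENCODING)      # editor reads it correctly
seen_by_type = gui_bytes.decode(CONSOLE_ENCODING)    # "type log.out" shows gibberish

# Option 2: write log.out in the console encoding.
console_bytes = summary.encode(CONSOLE_ENCODING)
seen_by_type_2 = console_bytes.decode(CONSOLE_ENCODING)  # console reads it correctly
seen_in_editor_2 = console_bytes.decode(GUI_ENCODING)    # editor shows gibberish

print(seen_in_editor == summary, seen_by_type == summary)      # True False
print(seen_by_type_2 == summary, seen_in_editor_2 == summary)  # True False
```

Whichever encoding is written, exactly one of the two consumers sees the
original text.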

OK, so what if we're clever and do something different when stdout is
not redirected (putting aside for the moment that this would be a very
large change to Mercurial's design)? Now we get:

> hg log -r tip
changeset:   10:8d36e15e82c8
tag:         tip
user:        Matt Mackall <mpm at selenic.com>
date:        Mon Nov 15 16:56:48 2010 -0600
summary:     <some Cyrillic text>
> hg log -r tip > log.out
> type log.out
changeset:   10:8d36e15e82c8
tag:         tip
user:        Matt Mackall <mpm at selenic.com>
date:        Mon Nov 15 16:56:48 2010 -0600
summary:     <some unreadable gibberish>

(This also ignores that "changeset:" etc. may be translated.)

We haven't really fixed anything, we've just introduced magic. This
magic will be extremely confusing to anyone who comes along and tries to
write a tool wrapped around Mercurial because things that clearly work
on the command line will stop working when they try to use it in a
program. 
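
A rough Python sketch of that "clever" rule shows why it surprises tool
authors. Everything here is hypothetical: pick_output_encoding, FakeTerminal
and write_summary are illustrative names, not Mercurial code, and the two
code pages are assumed to be cp866 (console) and cp1251 (GUI).

```python
import io

GUI_ENCODING = "cp1251"      # assumed GUI code page
CONSOLE_ENCODING = "cp866"   # assumed console code page


def pick_output_encoding(stream):
    """Hypothetical 'clever' rule: console encoding only for a real terminal."""
    return CONSOLE_ENCODING if stream.isatty() else GUI_ENCODING


class FakeTerminal(io.BytesIO):
    """Stand-in for an interactive console; only isatty() differs."""

    def isatty(self):
        return True


def write_summary(stream, text):
    """Encode text according to what kind of stream it is going to."""
    stream.write(text.encode(pick_output_encoding(stream)))


summary = "\u041f\u0440\u0438\u0432\u0435\u0442"  # sample Cyrillic text

terminal = FakeTerminal()
redirected = io.BytesIO()    # plays the role of "> log.out" or "| more"

write_summary(terminal, summary)
write_summary(redirected, summary)

# The same command produced different bytes depending on where stdout went.
print(terminal.getvalue() == redirected.getvalue())  # False
```

A wrapper program that captures the output is in the position of the
redirected stream, so it never receives the bytes a user saw on the console.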

We also can't do very obvious things like:

> hg log -r tip | more
changeset:   10:8d36e15e82c8
tag:         tip
user:        Matt Mackall <mpm at selenic.com>
date:        Mon Nov 15 16:56:48 2010 -0600
summary:     <some unreadable gibberish>


The only sensible answers are to always use the GUI encoding and live
with broken output in the console or vice-versa. In 2011, I think it's
pretty clear that favoring the GUI is the right answer, even for a
console app.

> Otherwise, the issue still remains, and international Windows users
> should at least be clearly warned that the default installation
> requires additional tuning, and that currently there is no way to tune it
> once and for all.

I'm certainly open to improving the docs on this.

It's not actually specific to international users. US copies of Windows
have different console and GUI encodings, though they're largely immune
to the issue due to working almost entirely in ASCII. But if they're
collaborating with people with funny letters in their names or working
on internationalized apps (like hg!), they'll still run into it.

Lots of European locales have the same encoding for console and GUI and
again have a large enough overlap with ASCII that it's just a nuisance.
Not entirely sure what the situation is in Asia, but I think they may
also not have the split (i.e., Shift-JIS predates Windows).

Cyrillic seems to be a worst-case here: little overlap with ASCII and
two encodings. Unless you're working with DOS-era apps that actually
care about the old encoding (i.e., they want the line-drawing characters),
I'd recommend permanently switching your console to cp1251.
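
The caveat about line-drawing characters can be checked with a small Python
sketch. It assumes the old console encoding is cp866 (the message above does
not name it), and the variable names are only for illustration.

```python
# Assumption: the DOS-era console encoding is cp866.
box_char = "\u2500"  # BOX DRAWINGS LIGHT HORIZONTAL

# cp866 has a byte for the line-drawing character...
oem_byte = box_char.encode("cp866")

# ...but cp1251 has no such character at all.
try:
    box_char.encode("cp1251")
    representable_in_cp1251 = True
except UnicodeEncodeError:
    representable_in_cp1251 = False

# A DOS-era app that emits the cp866 byte on a cp1251 console gets a letter.
seen_after_switch = oem_byte.decode("cp1251")

print(representable_in_cp1251, seen_after_switch == box_char)  # False False
```

So the switch costs nothing unless something still depends on those
characters being drawn.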

-- 
Mathematics is the supreme nostalgia of our time.



