[PATCH] encoding: prefer the detected console encoding by default (issue2926)

Andrei Polushin polushin at gmail.com
Wed Jul 27 17:39:59 CDT 2011


(quoting the last paragraph first)

28.07.2011 3:11, Matt Mackall wrote:
> Cyrillic seems to be a worst-case here: little overlap with ASCII and
> two encodings. Unless you're working with DOS-era apps that actually
> care about the old encoding (ie they want the line-drawing characters),
> I'd recommend permanently switching your console to cp1251.

Let me shed some more light on this. Several years of experience have taught
us to write Cyrillic programs this way: use the OEM encoding for console
programs, but the ANSI encoding for GUI programs. This is the simplest
possible way to keep our programs working when they are later given to other
users, friends and colleagues. I just can't tell them all to switch their
console encodings "permanently", because it would require a sort of social
revolution.

Moreover, switching is mostly possible only when running from cmd, using
`chcp`. When a user starts a console program from Windows Explorer, they
don't even have the option to switch the console encoding themselves. In this
case, the encoding is initially OEM. While it could be switched
programmatically, that would require rewriting existing programs.

There should be a very strong reason to switch the console encoding
permanently, accompanied by a long list of issues to be solved.

The overall result is that I'm unable to recommend the recently translated
Mercurial to Russian users, because I'm still looking for a reliable way to
deal with the garbled translated messages it prints to the console.

> On Thu, 2011-07-28 at 02:30 +0700, Andrei Polushin wrote:
>> Looks
>> like it would be permissible to recode the console output separately from
>> other things. What do you think?
> 
> a) That's an incredibly large piece of work

It would be nice to have some more info on this.

> b) It would be very backwards-incompatible
> c) It would STILL be buggy (because having different encodings on the
> console and GUI as Windows sometimes does is inherently broken)
> 
> Point (c) isn't very obvious, so I'll elaborate. Consider this:
> 
>> hg log -r tip > log.out
>> type log.out
> 
> What encoding should we get here? If we choose the GUI encoding, we'll
> get a mysteriously wrong result on the console. If we choose the console
> encoding, we'll get a similarly wrong result when we try to open log.out
> in our editor. We can't know how log.out is going to be used, so its
> impossible to get it right every time. We just have to pick one. So we
> pick the GUI one.

I agree, and I think the user should have a choice of encoding here.

Actually, it doesn't matter what the file encoding is. Cyrillic users are
less concerned about file encoding; they simply avoid using `type` to print
a cp1251-encoded file onto a cp866-encoded console. There is no surprise
for them. They would prefer using a proper editor/viewer to view the file.
Personally, I run my console inside the Far Manager program; its editor
detects the file encoding automatically and has a quick command to switch
the encoding.

Our main concern is the _console_ output, not the file output.
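
To illustrate the mismatch, here is a minimal Python sketch of what happens
when cp1251 (ANSI) bytes reach a cp866 (OEM) console. The sample string and
the `recode` helper are assumptions made for the illustration; they are not
taken from Mercurial.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; the sample text is an assumption, not real Mercurial output.

message = u"изменения"  # Russian for "changes"

# Bytes as a program using the ANSI (GUI) code page would emit them.
ansi_bytes = message.encode("cp1251")

# What a console set to the OEM code page shows for those same bytes.
garbled = ansi_bytes.decode("cp866")

# Bytes the OEM console actually needs in order to display the text.
oem_bytes = message.encode("cp866")


def recode(data, src="cp1251", dst="cp866"):
    """Re-encode a byte string from one code page to another (hypothetical helper)."""
    return data.decode(src).encode(dst)
```

Here `garbled` differs from `message`, while `recode(ansi_bytes)` yields the
same bytes as `oem_bytes`, which is what the console expects.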

> Ok, so what if we're clever and do something different when a stdout is
> not redirected (putting aside for the moment that this would be a very
> large change to Mercurial's design)? Now we get:
> 
> [...]
> 
> We haven't really fixed anything, we've just introduced magic. This
> magic will be extremely confusing to anyone who comes along and tries to
> write a tool wrapped around Mercurial because things that clearly work
> on the command line will stop working when they try to use it in a
> program. 

From now on, I withdraw my initial suggestion about using a different
encoding for redirected stdout. It would be confusing, I agree.

Let's stick to a _single_ encoding for console output, whether it is
redirected or not.

> We also can't do very obvious things like:
> 
>> hg log -r tip | more
> [...]
> 
> The only sensible answers are to always use the GUI encoding and live
> with broken output in the console or vice-versa. In 2011, I think it's
> pretty clear that favoring the GUI is the right answer, even for a
> console app.

This case also has solutions in an OEM-encoded console.

One option is to accept that logs are better viewed in a GUI app, while
commands are better typed in the console, so just use different tools for
different tasks, i.e. use TortoiseHg to see logs and diffs.

Another option is to redirect to a file and view it. A similar option I
use occasionally is to redirect to a viewer/editor, as long as it supports
reading from a pipe and detects the encoding, as I mentioned above. That's
why I never use `more`; it's mostly useless in my world.

--

To repeat myself, I'm only concerned about those informational messages that
are output to the console.

As far as I can tell up to this point, setting HGENCODING=cp866 is not an
option for me, because it affects too much. I've discovered that it even
affects the TortoiseHg GUI: it starts displaying repository files in a
different encoding. This is unacceptable, because Cyrillic files are quite
normally exchanged in cp1251 encoding.

So the only workaround I've found is to specify the encoding explicitly on
the command line, using `hg --encoding=cp866`.

I'm still looking for a better solution, however. What if there were
another environment variable, say HGOUTPUTENCODING= (possible values:
ansi, oem, cp866, cp1251, etc.), that would affect _only_ the output printed
to the console and redirected from there, not the commit messages or the
like?
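
A rough sketch of how such a variable might be resolved is given below.
HGOUTPUTENCODING is only the name proposed in this message, not an existing
Mercurial setting, and `resolve_output_encoding`, `_windows_codepage` and the
"ascii" default are hypothetical. The `ansi` and `oem` values are mapped
through the Win32 `GetACP` and `GetOEMCP` calls; anywhere other than Windows
the sketch simply falls back to HGENCODING.

```python
import codecs
import os
import sys


def _windows_codepage(kind):
    """Return the ANSI or OEM code page name, e.g. 'cp1251' or 'cp866'."""
    import ctypes  # imported lazily; ctypes.windll exists only on Windows
    kernel32 = ctypes.windll.kernel32
    cp = kernel32.GetOEMCP() if kind == "oem" else kernel32.GetACP()
    return "cp%d" % cp


def resolve_output_encoding(environ=os.environ, default="ascii"):
    """Pick the encoding for console output only (hypothetical helper).

    Falls back to HGENCODING (or `default`, a placeholder) when the proposed
    HGOUTPUTENCODING variable is unset or names an unknown codec.
    """
    base = environ.get("HGENCODING", default)
    value = environ.get("HGOUTPUTENCODING")
    if not value:
        return base
    value = value.lower()
    if value in ("ansi", "oem"):
        if sys.platform != "win32":
            return base
        value = _windows_codepage(value)
    try:
        codecs.lookup(value)
    except LookupError:
        return base
    return value
```

With HGENCODING=cp1251 and HGOUTPUTENCODING=cp866 this returns "cp866"; with
the second variable unset it returns "cp1251", so existing setups behave as
before.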

Having such a distinct option should not break backwards compatibility
unless it is set explicitly. Even if set, it would only break those
programs that call Mercurial as a command-line program, not through the API.

I'm still looking for a place for a small patch to fix this. Couldn't it be
fixed by decoding/encoding in the winstdout class for now? I suppose the
strings passed to write() would be encoded as per HGENCODING, and the class
would re-encode them into HGOUTPUTENCODING before writing to the console.

This would also gracefully solve your `hg log | more` case for cp866.
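
The re-encoding idea can be sketched roughly as follows. This is not
Mercurial's actual winstdout class; `RecodingWriter` and its parameters are
hypothetical, and the sketch only assumes a binary stream with `write` and
`flush` methods. An incremental decoder is used so that a multi-byte
character split across two writes is still decoded correctly when the source
encoding is not a single-byte code page.

```python
import codecs


class RecodingWriter(object):
    """Hypothetical wrapper: accepts bytes in `src`, writes bytes in `dst`."""

    def __init__(self, fp, src, dst):
        self.fp = fp
        self._dst = dst
        # Incremental decoder keeps partial multi-byte sequences between writes.
        self._decoder = codecs.getincrementaldecoder(src)(errors="replace")

    def write(self, data):
        text = self._decoder.decode(data)
        # Characters missing from the target code page become '?'.
        self.fp.write(text.encode(self._dst, "replace"))

    def flush(self):
        self.fp.flush()
```

For example, wrapping a binary stdout as `RecodingWriter(out, "cp1251",
"cp866")` would turn cp1251 message bytes into cp866 bytes. The same bytes
would reach a pipe, which is why `more` running in a cp866 console would see
readable text.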

--
Andrei Polushin

