[PATCH 2 of 6] Determine default locale encoding and stdio encoding on start-up

Mon Nov 13 12:36:39 CST 2006

On 14 November 2006 (Tue) 00:00, Matt Mackall wrote:
> > I actually borrowed most of the code from
> > http://www.selenic.com/mercurial/bts/issue156. :-) Having two different
> > encodings is necessary on Windows and maybe on other esoteric systems.
> > And the 'stdio_encoding' option could be useful if autodetection of encoding
> > fails.
>
> Just because Windows does it doesn't mean it's useful. What are the
> scenarios where Windows needs it?

Although Windows claims to be Unicode-aware, it still uses 8-bit encodings
everywhere. :) For example, my investigation of a Windows installation with
the Russian locale showed that notepad.exe produces text files in the
Windows-1251 encoding. The same encoding is used for command-line arguments
passed to Python scripts, and it is the encoding returned by
locale.getpreferredencoding(). At the same time, Windows uses CP866 (a
legacy Cyrillic encoding from the DOS days) for console I/O, probably for
compatibility with old DOS apps. This means sys.stdin.read() returns byte
strings in CP866, and sys.stdout.write() requires its arguments to be
encoded in CP866. If we just use locale.getpreferredencoding() for that,
non-Latin log messages and other text will be displayed incorrectly when
written to stdout. So we really do have to use a different encoding for
stdio. And it is better to make it user-overridable, because no one knows
what other quirks Windows has. :) For me, Windows support is not of great
importance, but it is still nice to have.
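
To make the mismatch concrete, here is a rough sketch in Python. The helper
names (detect_encodings, recode_for_console) are made up for illustration
and are not part of the patch or of Mercurial; the CP1251/CP866 pair is just
the Russian-locale case described above.

```python
import locale
import sys


def detect_encodings():
    """Guess (locale_encoding, stdio_encoding); both are only best guesses."""
    locale_enc = locale.getpreferredencoding() or "ascii"
    # sys.stdout.encoding can be missing or None (e.g. redirected output),
    # so fall back to the locale encoding in that case.
    stdio_enc = getattr(sys.stdout, "encoding", None) or locale_enc
    return locale_enc, stdio_enc


def recode_for_console(data, locale_enc, stdio_enc):
    """Re-encode a byte string from the locale encoding to the console one."""
    text = data.decode(locale_enc)
    # "replace" avoids an exception for characters the console cannot show.
    return text.encode(stdio_enc, "replace")


# The same Cyrillic word has different bytes in the two encodings, so bytes
# in Windows-1251 read as CP866 come out as garbage on the console.
word = "\u041f\u0440\u0438\u0432\u0435\u0442"
as_cp1251 = word.encode("cp1251")
misread = as_cp1251.decode("cp866")
fixed = recode_for_console(as_cp1251, "cp1251", "cp866")
```

Here misread differs from word, while fixed decodes back to word when read
as CP866, which is what the console expects.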

> > It would be nice indeed to move that code to util.py, but it needs access
> > to config, and for some reason config loading is done in ui.py (I'd
> > personally prefer having a separate config.py module and reading the config
> > file on first module import). Could someone comment on this?
>
> I'm not yet convinced locale support needs access to the config. If
> the average internationalized app needed its own config tweaks,
> everyone would just give up and use ASCII.

As I have noticed, ASCII speakers tend to underestimate the importance of
proper support for other encodings. ;) Autodetection will probably work most
of the time, but not always, and those config options could be really
helpful for manually resolving the most complex cases. :)
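
Roughly, the override could work like the sketch below. The function name
choose_stdio_encoding and its parameters are hypothetical, and how the
configured value is read from the config file is left out on purpose; the
only point is that an explicit setting wins over autodetection, with a safe
fallback if both are unusable.

```python
import codecs


def choose_stdio_encoding(configured, detected, fallback="ascii"):
    """Pick the first usable encoding: config value, detected value, fallback."""
    for candidate in (configured, detected, fallback):
        if not candidate:
            continue
        try:
            # codecs.lookup() raises LookupError for unknown encoding names
            # and returns a normalized name for known ones.
            return codecs.lookup(candidate).name
        except LookupError:
            continue
    return fallback
```

For example, choose_stdio_encoding("cp866", "cp1251") picks the configured
CP866, while choose_stdio_encoding(None, "cp1251") falls back to whatever
autodetection found.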

Andrey