[PATCH 4 of 8] encode all output in stdio encoding

Mon Nov 20 14:04:04 CST 2006

On 21 November 2006 (Tue) 01:15, Matt Mackall wrote:
> On Mon, Nov 20, 2006 at 04:34:20PM -0200, Alexis S. L. Carvalho wrote:
> > Thus spake Alexis S. L. Carvalho:
> > > Thus spake Andrey:
> > > > I should better have written something like 'encode all output in
> > > > stdio encoding, if not already encoded' in commit message. :) That
> > > > ui.ui.encode() function leaves all non-Unicode strings untouched, so
> > > > hg cat works as expected.
> > >
> > > It prints a traceback with hg log --patch with a revision that changes
> > > the encoding of a file.
> >
> > Hmm...  ok, it doesn't even get to ui.write - the current log code puts
> > all strings in a list and does a ui.write("".join(strings)).  This
> > patchset changes some of these strings from str's to unicode's, and so
> > the "".join() raises an exception when it fails to convert the patch to
> > a unicode.
>
> This is a great example of why having a mix of Unicode and regular
> strings in an app travelling the same paths is generally Not A Good
> Idea. Especially as one of our primary concerns as an SCM is to pass
> all data through the system unmangled.
>
> Regular strings never throw exceptions. Functions that were written to
> work on regular strings will explode in unexpected places when passed
> unicode strings. That's bad. And retrofitting code to accept both is
> complicated.
>
> Especially given that we generally _don't_ know the encoding of the
> data we're manipulating. As far as I know, Unicode doesn't have an
> encoding that says "I don't know what this is, it might be binary for
> all I know, don't complain, and when you encode it back to 8-bit, it
> must be exactly identical."
>
> Going the other way, manipulations on regular encoded strings will
> generally work. Operations that fail are things like upper(), lower(),
> grep with mismatched encodings, and truncation that happens to chop
> inside a character. And their failure modes are relatively harmless.
> For instance, about the only significant user of lower is log -k,
> which will continue to work roughly as advertised.

Indeed, it was a bad idea to treat Unicode and byte strings in the same way.
But that does not mean we should not use Unicode at all. We just have to
clearly distinguish between Unicode data and byte data. For example, log
messages are obviously Unicode data, and so are user names, because they
represent textual information and their exact byte representation is
unimportant. The contents of revision-controlled files (and thus diffs and
grep results) are byte data, and it is a good idea to use byte strings for them.
The problems arise when we try to treat byte strings as Unicode strings and
vice versa. Byte strings must be sent to (or read from) the terminal as-is,
while Unicode strings have to be encoded with the proper encoding before
output and decoded after input. Every UnicodeDecodeError tells us that
something is wrong with our encodings and needs to be fixed, for example
that Unicode data is not properly encoded before being written to stdout.

If we use UTF-8 byte strings instead of Unicode, we will not get any
exceptions, but the bugs will not vanish by themselves. They will just
become less obvious. So I'd say that using UTF-8 byte strings is burying our
heads in the sand. In fact, Python 3000 is going to get rid of the good old
byte strings altogether: Unicode strings will be the default, and there will
also be a mutable 'bytes' type without any string-like methods, which will
never be coerced to Unicode automatically.
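
To make the distinction concrete, here is a minimal sketch of an output
boundary that encodes text and passes byte data through untouched. It is
written in Python 3 syntax (str for text, bytes for raw data), and the names
encode_for_output and write_all are hypothetical, not the actual ui code.
Note that in Python 3 the mixed "".join() fails with a TypeError rather than
the decoding error seen in the traceback discussed above.

```python
import io


def encode_for_output(data, encoding="utf-8"):
    """Return bytes ready for a byte stream (hypothetical helper).

    Text (str) is encoded with the output encoding; byte data is returned
    unchanged, so file contents and diffs are never re-encoded.
    """
    if isinstance(data, str):
        return data.encode(encoding, "replace")
    if isinstance(data, bytes):
        return data
    raise TypeError("expected str or bytes, got %s" % type(data).__name__)


def write_all(stream, chunks, encoding="utf-8"):
    # Encode chunk by chunk instead of joining mixed types first.
    for chunk in chunks:
        stream.write(encode_for_output(chunk, encoding))


# A log message is text; a diff hunk is file content in an unknown encoding.
log_message = "caf\u00e9 fix\n"
diff_hunk = b"+na\xefve\n"  # Latin-1 bytes, not valid UTF-8

# Joining text and byte data directly fails loudly.
try:
    "".join([log_message, diff_hunk])
    mixed_join_failed = False
except TypeError:
    mixed_join_failed = True

out = io.BytesIO()
write_all(out, [log_message, diff_hunk], encoding="utf-8")
result = out.getvalue()
```

Here the log message is encoded for the terminal while the diff bytes come
out exactly as they went in, which is the property we need for file contents.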

Andrey