Current py3k stage and next steps

Sun Jun 27 15:18:56 CDT 2010

On Sun, 2010-06-27 at 13:50 +0200, Martin Geisler wrote:
> Matt Mackall <mpm at selenic.com> writes:
> 
> > On Fri, 2010-06-25 at 07:44 +0000, Antoine Pitrou wrote:
> >> Hi,
> >> 
> >> > To do that, I'd have to define a compatibility layer for
> >> > str/bytes... Martin Geisler commented on IRC that I could use Uche
> >> > Mennel's ustr[1] to separate strings and unicode objects. Another
> >> > approach would be to use Martin v. Löwis' py3 module[2]. Maybe
> >> > integrating both approaches would be a nice way of doing it,
> >> > defining u to be ustr in 2.x and str in py3k...
> >> 
> >> I'm not sure what you need ustr for. If Mercurial already enforces
> >> proper bytes / unicode separation (which I assume it does), you
> >> shouldn't need an additional type to enforce it for you. Actually,
> >> porting to py3k is the way to verify that there is no issue there.
> >
> > There are basically no Unicode objects "in the wild" in Mercurial.
> > Their usage is more or less restricted to a couple of transcoding
> > functions in encoding.py where they can't hurt anybody.
> >
> > The tricky part is this:
> >
> > ui.write() and the like are used to handle three kinds of data:
> >
> > - utf-8 encoded metadata that's been transcoded to the local encoding
> > - internal ASCII messages that may or may not go through gettext()
> > before being presented to the user in the local encoding
> > - raw byte data that is presented to the user byte-for-byte as-is
> >
> > In the last case, it's unacceptable to do any form of transcoding even
> > if we knew what encoding the data was in (which we don't and which is
> > not possible in the general case). Also note that these strings may be
> > hundreds of megabytes - even an extra copy (let alone blowing it up
> > 2-4x) may not be acceptable.
> >
> > So we'll have:
> >
> > a) ui.write(repo[rev].user())  # username is transcoded to local
> > encoding
> > b) ui.write(_("abort: can't do that")) # translated and possibly
> > transcoded
> > c) ui.write("debug message") # debug messages aren't translated
> > d) ui.write(repo[rev][file].data()) # raw file data
> >
> > We also have many instances of:
> >
> > e) ui.write("debug message: %s\n" % somerawdata) # cases c and d
> > f) ui.write(_("some message: %s\n") % somerawdata) # cases b and d
> >
> > This generally all works smoothly because data is either 1) received
> > in the local encoding 2) uniformly converted to the local encoding as
> > soon as possible or 3) left completely unmolested.
> 
> I once made an experiment where I changed the type of user interface
> strings from str to unicode. This worked quite smoothly, except for this
> little part in encoding.fromlocal:

Ok, let's stop right here and go back and look at (f). Please propose a
solution where _() returns Unicode and doesn't break (i.e. matches current
behavior) when somerawdata is impossible to decode. Keep in mind that
your solution must be simple enough that it can be applied to every
place in the source where % and + are used on strings. It's going to
need to be maintainable on 2.4-3.x, and tests on 2.x need to catch all
the places where we do it wrong.

And your answer needs to be considerably more detailed than 'use ustr',
which doesn't begin to tell us how to actually deal with the % in the
example. 
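
To make the problem in (f) concrete, here is a minimal sketch of what
happens under Python 3 semantics. It is an illustration only, not
Mercurial code: the _() below is a hypothetical stand-in for gettext
that simply returns its argument as text, and the sample bytes are made
up. The last step relies on %-formatting for bytes, which only exists
in Python 3.5 and later.

```python
# Illustration only: hypothetical stand-ins, not Mercurial code.

def _(msg):
    """Pretend gettext: returns the message unchanged, as text (str)."""
    return msg

# Made-up raw data that is not valid UTF-8 (think: file contents).
somerawdata = b"caf\xe9 \xff\xfe"

# 1. Text template % raw bytes: Python 3 raises no error here, but it
#    silently formats the *repr* of the bytes object into the message,
#    so the output no longer matches the data byte-for-byte.
formatted = _("some message: %s\n") % somerawdata

# 2. Decoding the raw data first is not an option either: arbitrary
#    data may be undecodable in whatever encoding we guess.
try:
    somerawdata.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# 3. Encoding the template and formatting in bytes keeps the data
#    untouched (requires Python 3.5+ for bytes %-formatting).
template = _("some message: %s\n").encode("utf-8")
raw_out = template % somerawdata

if __name__ == "__main__":
    print(repr(formatted))
    print(decoded_ok)
    print(repr(raw_out))
```

Step 3 only shows that the bytes survive unmodified; it does not
address how to apply such a rule at every % and + in the source, or how
tests on 2.x would catch the places that get it wrong.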

-- 
Mathematics is the supreme nostalgia of our time.