Current py3k stage and next steps

Sun Jun 27 06:50:24 CDT 2010

Matt Mackall <mpm at selenic.com> writes:

> On Fri, 2010-06-25 at 07:44 +0000, Antoine Pitrou wrote:
>> Hi,
>> 
>> > To do that, I'd have to define a compatibility layer for
>> > str/bytes... Martin Geisler commented on IRC that I could use Uche
>> > Mennel's ustr[1] to separate strings and unicode objects. Another
>> > approach would be to use Martin v. Löwis' py3 module[2]. Maybe
>> > integrating both approaches would be a nice way of doing it,
>> > defining u to be ustr in 2.x and str in py3k...
>> 
>> I'm not sure what you need ustr for. If Mercurial already enforces
>> proper bytes / unicode separation (which I assume it does), you
>> shouldn't need an additional type to enforce it for you. Actually,
>> porting to py3k is the way to verify that there is no issue there.
>
> There are basically no Unicode objects "in the wild" in Mercurial.
> Their usage is more or less restricted to a couple of transcoding
> functions in encoding.py where they can't hurt anybody.
>
> The tricky part is this:
>
> ui.write() and the like are used to handle three kinds of data:
>
> - utf-8 encoded metadata that's been transcoded to the local encoding
> - internal ASCII messages that may or may not go through gettext()
> before being presented to the user in the local encoding
> - raw byte data that is presented to the user byte-for-byte as-is
>
> In the last case, it's unacceptable to do any form of transcoding even
> if we knew what encoding the data was in (which we don't and which is
> not possible in the general case). Also note that these strings may be
> hundreds of megabytes - even an extra copy (let alone blowing it up
> 2-4x) may not be acceptable.
>
> So we'll have:
>
> a) ui.write(repo[rev].user())  # username is transcoded to local encoding
> b) ui.write(_("abort: can't do that"))  # translated and possibly transcoded
> c) ui.write("debug message") # debug messages aren't translated
> d) ui.write(repo[rev][file].data()) # raw file data
>
> We also have many instances of:
>
> e) ui.write("debug message: %s\n" % somerawdata) # cases c and d
> f) ui.write(_("some message: %s\n") % somerawdata) # cases b and d
>
> This generally all works smoothly because data is either 1) received
> in the local encoding 2) uniformly converted to the local encoding as
> soon as possible or 3) left completely unmolested.

I once made an experiment where I changed the type of user interface
strings from str to unicode. This worked quite smoothly, except for this
little part in encoding.fromlocal:

    try:
        return s.decode(encoding, encodingmode).encode("utf-8")
    except UnicodeDecodeError, inst:
        sub = s[max(0, inst.start - 10):inst.start + 10]
        raise error.Abort("decoding near '%s': %s!" % (sub, inst))

Here sub is part of a str which we know we cannot decode and yet we try
to output it directly to the user -- not good.
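
One way to avoid showing undecodable bytes to the user is to escape
them before they are put into the message. The following is only a
minimal sketch in Python 3 syntax: describe_failure is a hypothetical
helper, not part of encoding.py, and it assumes Python 3.5 or later,
where the "backslashreplace" error handler also works for decoding.

```python
def describe_failure(s, inst):
    """Build an ASCII-only message for a UnicodeDecodeError raised on s.

    Hypothetical helper for illustration; not Mercurial's actual code.
    """
    sub = s[max(0, inst.start - 10):inst.start + 10]
    # Every byte that is not ASCII becomes a literal \xNN escape, so the
    # result is always safe to show, whatever encoding s was really in.
    safe = sub.decode("ascii", "backslashreplace")
    return "decoding near '%s': %s!" % (safe, inst)


s = b"hello \xff world"  # not valid UTF-8
try:
    s.decode("utf-8")
except UnicodeDecodeError as inst:
    msg = describe_failure(s, inst)
```

The message then contains the four characters \xff instead of the raw
byte, so it can be written to any terminal.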

> Enter py3k. Cases (a) and (d) are pretty straightforward. And we can
> even manage (b) by teaching ui.write to handle Unicode objects
> containing ASCII without complaint.
>
> But (e) gets us in trouble before Mercurial's even involved, right at
> the % operator. This operation is only correct on bytestrings but we'd
> need to add a b"" to all our string manipulations to be safe. Which is
> a maintenance nightmare of epic proportions.

I think that is actually the case I'm talking about above -- it doesn't
even work today.
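
To make case (e) concrete, here is a small sketch of how the % operator
behaves on py3k. It assumes Python 3.5 or later, where bytes objects
support % formatting (PEP 461); earlier py3k releases do not. The
write() function is a hypothetical stand-in for ui.write, not
Mercurial's actual code.

```python
import io


def write(out, data):
    """Hypothetical stand-in for ui.write.

    Raw bytes pass through untouched; text is accepted only if it is
    pure ASCII, as suggested for case (b).
    """
    if isinstance(data, str):
        data = data.encode("ascii")  # raises UnicodeEncodeError otherwise
    out.write(data)


somerawdata = b"caf\xe9\xff"  # bytes in no particular encoding

# A text template combined with bytes does not fail: it silently embeds
# the repr() of the bytes object in the message.
mixed = "debug message: %s\n" % somerawdata

# A bytes template keeps the data byte-for-byte (Python 3.5+ only).
raw = b"debug message: %s\n" % somerawdata

out = io.BytesIO()
write(out, raw)
write(out, "abort: can't do that\n")
```

The first form is the dangerous one, since nothing raises and the user
just sees b'...' in the output.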

> 2to3 can be taught to do this, but we can't do it to the main codebase
> (even if we wanted to inflict such a horror on ourselves!) as we'll be
> supporting 2.4 and 2.5 for a few years yet.
>
> Relatedly, I expect many functions in the standard library are going to
> begin handing back Unicode results, so we'll have to wrap everything in
> a thick layer of duct tape.
>
>
> It's instructive to note a core asymmetry here:
>
> - Any Unicode string can be converted to a bytestring and back
> losslessly via UTF-8
>
> - An arbitrary bytestring -cannot- be converted losslessly to Unicode
> and back

Right, which is why you don't want to convert your bytestrings to
Unicode strings?
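
The asymmetry is easy to demonstrate. A minimal sketch in Python 3
syntax, assuming UTF-8 with the "replace" error handler as the
bytes-to-Unicode step (the function names are made up for this
example):

```python
def text_roundtrip(text):
    """Unicode -> UTF-8 bytes -> Unicode."""
    return text.encode("utf-8").decode("utf-8")


def bytes_roundtrip(raw):
    """Bytes -> Unicode -> bytes, replacing whatever cannot be decoded."""
    return raw.decode("utf-8", "replace").encode("utf-8")


text = "Martin v. L\u00f6wis"
raw = b"\xff\xfe raw file data \x80"  # not valid UTF-8

text_ok = text_roundtrip(text) == text  # the text survives unchanged
raw_ok = bytes_roundtrip(raw) == raw    # the raw bytes do not
```

Each undecodable byte comes back as the three-byte UTF-8 encoding of
U+FFFD, so the original data is lost.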

> If Unicode had, say, a codeplane to represent "unknown byte 0x??" such
> that arbitrary byte strings could round-trip losslessly to Unicode,
> none of this would be a problem (except for overhead). But since
> that's not possible, Unicode strings are a bad fit for much of what
> Mercurial does.

I think all the user interface strings could and should be unicode
objects. The raw data should be str objects. The goal of the ustr class
(the unicode-like class that doesn't mix with str objects) is to let us
introduce ustr objects in the code and make sure we don't mix them with
str objects by mistake.
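
As an illustration of the idea only, such a class could look roughly
like the sketch below. This is not the real ustr module: the class
body, the _check helper and the set of guarded operators are all
assumptions made for this example. It is written so that it runs on
both 2.x and py3k.

```python
import sys

if sys.version_info[0] >= 3:
    text_type, bytes_type = str, bytes
else:
    text_type, bytes_type = unicode, str  # noqa: F821 (2.x only)


class ustr(text_type):
    """Text that raises instead of silently mixing with byte strings."""

    @staticmethod
    def _check(value):
        if isinstance(value, bytes_type):
            raise TypeError("cannot mix ustr with byte string %r" % (value,))

    def __add__(self, other):
        self._check(other)
        return ustr(text_type.__add__(self, other))

    def __radd__(self, other):
        self._check(other)
        return ustr(text_type.__add__(other, self))

    def __mod__(self, args):
        # Look at every formatting argument before delegating.
        if isinstance(args, tuple):
            values = args
        elif isinstance(args, dict):
            values = args.values()
        else:
            values = (args,)
        for value in values:
            self._check(value)
        return ustr(text_type.__mod__(self, args))


msg = ustr("some message: %s\n")
ok = msg % ustr("text")  # fine, both sides are text
try:
    msg % b"\xff"        # raw data: refused instead of coerced
    mixed_allowed = True
except TypeError:
    mixed_allowed = False
```

With something like this, case (f) fails loudly at the % operator
rather than producing a mangled message later on.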

-- 
Martin Geisler

Mercurial links: http://mercurial.ch/