Current py3k stage and next steps

Matt Mackall mpm at selenic.com
Sun Jun 27 15:29:55 CDT 2010


On Fri, 2010-06-25 at 22:21 +0000, Antoine Pitrou wrote:
> Matt Mackall <mpm <at> selenic.com> writes:
> > 
> > The tricky part is this:
> > 
> > ui.write() and the like are used to handle three kinds of data:
> > 
> > - utf-8 encoded metadata that's been transcoded to the local encoding
> > - internal ASCII messages that may or may not go through gettext()
> > before being presented to the user in the local encoding
> > - raw byte data that is presented to the user byte-for-byte as-is
> > 
> > In the last case, it's unacceptable to do any form of transcoding even
> > if we knew what encoding the data was in (which we don't and which is
> > not possible in the general case).
> 
> The 3.x IO subsystem is layered: sys.stdout is a text (unicode) layer, but you
> can access sys.stdout.buffer which is the underlying buffered bytes layer. Of
> course, it is better to call flush() in-between (even though it might not appear
> necessary in interactive use):
> 
> >>> sys.stdout.encoding
> 'UTF-8'
> >>> sys.stdout.write("some unicode text: é\n")
> some unicode text: é
> 21
> >>> sys.stdout.buffer.write(b"some undecodable bytes: \x00\xff\n")
> some undecodable bytes: �
> 27

Yeah, I'm not even worried about the actual I/O part. I'm just using
ui.write as an example of a place where we're routinely combining
strings from different sources for formatting purposes. Focus on the
combining strings part, please.
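
To make the combining problem concrete, here is a minimal sketch of what
happens in Python 3 when the three kinds of data listed above meet in one
formatting step. Every name here (LOCAL_ENCODING, to_local_bytes, the sample
values) is hypothetical and not a Mercurial API, and the local encoding is
assumed to be Latin-1 purely for illustration.

```python
# Hypothetical illustration; none of these names are Mercurial APIs.
LOCAL_ENCODING = "latin-1"  # assumed local encoding for this example

author = "Andr\u00e9"             # metadata, already decoded from UTF-8
message = "changed by %s\n"       # internal ASCII message, held as text
raw_diff = b"+\xff\x00 binary\n"  # raw file data in an unknown encoding

# Python 3 refuses to combine text and bytes implicitly.
try:
    combined = (message % author) + raw_diff
except TypeError:
    combined = None  # str + bytes raises TypeError


def to_local_bytes(text):
    """Encode text for output; raw bytes never pass through this function."""
    return text.encode(LOCAL_ENCODING, "replace")


# One possible approach: encode text at the boundary and join as bytes,
# so the raw data is emitted byte-for-byte.
output = to_local_bytes(message % author) + raw_diff
```

Joining as bytes leaves the raw data untouched, but it means every caller has
to know which of the three kinds of data it is holding before formatting.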

> > If Unicode had, say, a codeplane to represent "unknown byte 0x??" such
> > that arbitrary byte strings could round-trip losslessly to Unicode, none
> > of this would be a problem (except for overhead). But since that's not
> > possible, Unicode strings are a bad fit for much of what Mercurial
> > does. 
> 
> Python 3.x does allow you to roundtrip all data through unicode and back to
> bytes, losslessly: by using the "surrogateescape" error handler. It translates
> all undecodable bytes to lone unicode surrogates, and does the reverse operation
> when encoding:
> 
> >>> b"valid UTF-8: \xc3\xa9 ; invalid UTF-8: \xff".decode("utf-8", "surrogateescape")
> 'valid UTF-8: é ; invalid UTF-8: \udcff'
> >>> 'valid UTF-8: é ; invalid UTF-8: \udcff'.encode("utf-8", "surrogateescape")
> b'valid UTF-8: \xc3\xa9 ; invalid UTF-8: \xff'
> 
> Moreover, the same unicode string cannot be produced by a legal UTF-8 sequence:
> 
> >>> 'valid UTF-8: é ; invalid UTF-8: \udcff'.encode("utf-8")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 32: surrogates not allowed
> 
> ... which ensures that the transformation is completely bijective.

Neat. Not sure it helps us though. We still have the issue that byte
strings may be unreasonably large.
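
As a rough sketch of both points, the snippet below checks the surrogateescape
round trip on a buffer containing every byte value, then compares the memory
footprint of the two representations. Treating "unreasonably large" as a
memory-overhead concern is only one reading of the remark above, and the
sizes reported by sys.getsizeof are specific to the interpreter in use, so no
particular numbers are claimed.

```python
import sys

# Every possible byte value, including ones that are invalid UTF-8.
raw = bytes(range(256)) * 4

# Undecodable bytes become lone surrogates in the range U+DC80..U+DCFF.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same error handler restores the original bytes.
restored = text.encode("utf-8", "surrogateescape")
round_trips = (restored == raw)

# Memory footprint of each representation (implementation-specific).
bytes_size = sys.getsizeof(raw)
str_size = sys.getsizeof(text)
```

On CPython the decoded string is expected to take more memory than the
original bytes, and the decode step itself has to walk the whole buffer, which
matters when the data is a large file rather than a short message.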

-- 
Mathematics is the supreme nostalgia of our time.



