Current py3k stage and next steps
Matt Mackall
mpm at selenic.com
Sun Jun 27 15:29:55 CDT 2010
On Fri, 2010-06-25 at 22:21 +0000, Antoine Pitrou wrote:
> Matt Mackall <mpm <at> selenic.com> writes:
> >
> > The tricky part is this:
> >
> > ui.write() and the like are used to handle three kinds of data:
> >
> > - utf-8 encoded metadata that's been transcoded to the local encoding
> > - internal ASCII messages that may or may not go through gettext()
> > before being presented to the user in the local encoding
> > - raw byte data that is presented to the user byte-for-byte as-is
> >
> > In the last case, it's unacceptable to do any form of transcoding even
> > if we knew what encoding the data was in (which we don't and which is
> > not possible in the general case).
>
> The 3.x IO subsystem is layered: sys.stdout is a text (unicode) layer, but you
> can access sys.stdout.buffer which is the underlying buffered bytes layer. Of
> course, it is better to call flush() in-between (even though it might not appear
> necessary in interactive use):
>
> >>> sys.stdout.encoding
> 'UTF-8'
> >>> sys.stdout.write("some unicode text: é\n")
> some unicode text: é
> 21
> >>> sys.stdout.buffer.write(b"some undecodable bytes: \x00\xff\n")
> some undecodable bytes: �
> 27
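The layering described above can be sketched without touching the real terminal. This is only an illustration: an in-memory `io.BytesIO` wrapped in an `io.TextIOWrapper` stands in for `sys.stdout`, and the names `raw`, `text` and `output` are made up for the example.

```python
import io

# Stand-in for stdout: a bytes buffer wrapped by a text (unicode) layer.
# newline="\n" disables platform newline translation so the bytes are predictable.
raw = io.BytesIO()
text = io.TextIOWrapper(raw, encoding="utf-8", newline="\n")

# Text goes through the unicode layer and is encoded on the way down.
text.write("some unicode text: é\n")
# Flush the text layer before using the bytes layer, so the output stays in order.
text.flush()

# Raw bytes bypass the encoder and are written exactly as given.
text.buffer.write(b"some undecodable bytes: \x00\xff\n")
text.buffer.flush()

output = raw.getvalue()
```

Without the `text.flush()` call the text layer may still be holding its data when the bytes are written, so the two writes could come out in the wrong order.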
Yeah, I'm not even worried about the actual I/O part. I'm just using
ui.write as an example of a place where we're routinely combining
strings from different sources for formatting purposes. Focus on the
combining strings part, please.
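A rough sketch of the combining problem under Python 3, with made-up values standing in for the three kinds of data (none of these names come from Mercurial's code):

```python
# Hypothetical stand-ins for the three kinds of data.
metadata = "user é"                  # metadata, already decoded to text
message = "changed by %s: "          # internal message, text
raw_line = b"\x00\xff diff line\n"   # file content of unknown encoding, bytes

# Text combined with text works as expected.
header = message % metadata

# Text combined with bytes is refused outright in Python 3.
try:
    combined = header + raw_line
except TypeError as exc:
    error = str(exc)

# Going down to bytes works, but only after picking an encoding for the text part.
combined = header.encode("utf-8") + raw_line
```

The last line is the crux: every place that mixes the two has to decide, explicitly, which side gets converted and with which encoding.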
> > If Unicode had, say, a codeplane to represent "unknown byte 0x??" such
> > that arbitrary byte strings could round-trip losslessly to Unicode, none
> > of this would be a problem (except for overhead). But since that's not
> > possible, Unicode strings are a bad fit for much of what Mercurial
> > does.
>
> Python 3.x does allow you to roundtrip all data through unicode and back to
> bytes, losslessly: by using the "surrogateescape" error handler. It translates
> all undecodable bytes to lone unicode surrogates, and does the reverse operation
> when encoding:
>
> >>> b"valid UTF-8: \xc3\xa9 ; invalid UTF-8: \xff".decode("utf-8",
> "surrogateescape")
> 'valid UTF-8: é ; invalid UTF-8: \udcff'
> >>> 'valid UTF-8: é ; invalid UTF-8: \udcff'.encode("utf-8", "surrogateescape")
> b'valid UTF-8: \xc3\xa9 ; invalid UTF-8: \xff'
>
> Moreover, the same unicode string cannot be produced by a legal UTF-8 sequence:
>
> >>> 'valid UTF-8: é ; invalid UTF-8: \udcff'.encode("utf-8")
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position
> 32: surrogates not allowed
>
> ... which ensures that the transformation is completely bijective.
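Applied to the formatting case, this might look like the sketch below. The values `label` and `raw_line` are hypothetical, and the example assumes the same codec and error handler are used in both directions, since the round trip is only lossless under that condition.

```python
# Hypothetical inputs: a text label and raw file data that is not valid UTF-8.
label = "r\u00e9sum\u00e9.txt: "
raw_line = b"invalid UTF-8: \xff\n"

# Undecodable bytes become lone surrogates instead of raising an error.
as_text = raw_line.decode("utf-8", "surrogateescape")

# Now both pieces are text and can be combined with ordinary string operations.
line = label + as_text

# Encoding with the same codec and handler restores the original bytes exactly.
out = line.encode("utf-8", "surrogateescape")
```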
Neat. Not sure it helps us though. We still have the issue that byte
strings may be unreasonably large.
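To make the size concern concrete, here is a small sketch comparing a byte string with its surrogateescape-decoded form. The blob is invented for the example, and the sizes reported by `sys.getsizeof` are a CPython implementation detail that varies between versions, so treat the comparison as indicative only.

```python
import sys

# Hypothetical large blob: mostly ASCII, with a single undecodable byte at the end.
data = b"x" * 1000000 + b"\xff"

# Decoding creates a second, separate object holding the whole content as text.
text = data.decode("utf-8", "surrogateescape")

# In-memory footprint of each object, as reported by the interpreter.
data_size = sys.getsizeof(data)
text_size = sys.getsizeof(text)

# Getting bytes back for output means yet another full copy.
restored = text.encode("utf-8", "surrogateescape")
```

On CPython the decoded string is expected to be larger than the bytes it came from, because the escaped surrogate does not fit in one byte per character, and the decode and encode steps each copy the full content.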
--
Mathematics is the supreme nostalgia of our time.