Current py3k stage and next steps

Martin Geisler mg at lazybytes.net
Mon Jun 28 16:05:19 CDT 2010


Matt Mackall <mpm at selenic.com> writes:

> On Mon, 2010-06-28 at 21:51 +0200, Martin Geisler wrote:
>> The thing I want to make explode is
>> 
>>   _("foo %s bar") % rawbytes
>
> But we do this.. everywhere a filename is mentioned, for starters.

There you go -- in my view, you've just pointed out a source of bugs.
You know I'm in the "we should decode filenames to unicode on input and
encode them to local encoding on output"-camp.
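
To make that camp's position concrete, here is a minimal sketch in Python 3
terms, where bytes and text are distinct types. The helper names
decode_input and encode_output and the hard-coded Latin-1 encoding are
assumptions for illustration only, not Mercurial's actual API:

```python
# Hypothetical helpers; LOCAL_ENCODING stands in for the user's locale.
LOCAL_ENCODING = "latin-1"

def decode_input(raw):
    """Turn bytes read from the outside world into text."""
    return raw.decode(LOCAL_ENCODING)

def encode_output(text):
    """Turn text back into bytes just before writing it out."""
    return text.encode(LOCAL_ENCODING, "replace")

raw_filename = "blåbær.txt".encode(LOCAL_ENCODING)  # as read from disk
filename = decode_input(raw_filename)               # text from here on
message = "adding %s" % filename                    # text % text, no mixing
output = encode_output(message)                     # bytes again at the edge
```

The point of the design is that encodings are only handled at the two
edges; everything in between works on one type.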

> That's a LOT of places. And some of them are more Unicodism unfriendly
> than others (think templating).
>
> And changing this doesn't gain us anything. If _() gives us Unicode
> and sys.write outputs that in the local encoding, _() might as well
> give us the local encoding to start with, because at least it doesn't
> break.

I think we have different definitions of what "breaks" means. To me,
things are broken when we mix encodings in user output (like the filenames
mixed with commit messages in 'hg log -v'). I think you only consider it
broken when a UnicodeDecodeError is thrown -- and if so, then I fully
understand if you want as much as possible to be of type str.

> About the only things that internal Unicode representation gets you is
> native character-by-character indexing (eg a[x]) and character-based
> length (len(a)). The latter isn't even useful - it'll break the moment
> you throw Japanese or various normalized forms at it. And we do
> basically none of either of those anyway. Using Unicode is basically
> all headache, no gain.

Yes, there will be headaches. We've already had some headaches when we
wrap text -- this used to be broken because we were naive and wrapped
the str objects directly.
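
A small sketch of that kind of wrapping problem, assuming UTF-8 encoded
input. The naive_wrap function is a deliberately simplified stand-in that
I made up for this example, not the code Mercurial used:

```python
import textwrap

text = "æøå æøå"
raw = text.encode("utf-8")      # 13 bytes for 7 characters

def naive_wrap(data, width):
    """Cut bytes every `width` bytes, ignoring character boundaries."""
    return [data[i:i + width] for i in range(0, len(data), width)]

def chunk_is_valid(chunk):
    """Report whether a byte chunk is still well-formed UTF-8."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

byte_chunks = naive_wrap(raw, 3)            # first chunk ends mid-character
text_lines = textwrap.wrap(text, width=3)   # wraps on characters instead
```

Wrapping the decoded text keeps every character intact. It still counts
characters rather than display columns, so wide East Asian characters need
extra care, which is the limitation you mention.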

> Your goal seems to be to introduce some sort of type-checking. But
> this is not a boundary where we care about that. "ui messages" and
> "user data" frequently get combined in output and that's good and
> natural.

I don't agree with that -- my example from above with 'hg log -v' is
actually pretty good, I think :-) It looks like this here

  % hg log -r tip -v
  ændring:     11456:88abbb046e66
  gren:        stable
  mærkat:      tip
  bruger:      Matt Mackall <mpm at selenic.com>
  dato:        Mon Jun 28 11:07:27 2010 -0500
  filer:       mercurial/revset.py tests/test-revset tests/test-revset.out
  beskrivelse:
  revset: deal with empty sets in range endpoints

  (spotted by Julian Cowley <julian at lava.net>)

Here we have strings which were translated and transcoded ("ændring"),
filenames which were output raw ("mercurial/revset.py"), and metadata
which was transcoded ("revset: deal...").

This is the mixed output that I don't like. It works here only because
all the filenames are ASCII, which is a subset of my Latin-1 encoding.
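
To illustrate what happens once a filename is not ASCII, here is a sketch
with a made-up filename stored on disk as UTF-8 and a terminal that expects
Latin-1. Both the filename and the encodings are assumptions for the
example:

```python
label = "filer:       ".encode("latin-1")       # translated, then transcoded
ascii_name = "mercurial/revset.py"
raw_filename = "blåbær.txt".encode("utf-8")     # hypothetical raw bytes on disk

# ASCII bytes are identical in both encodings, so the mix goes unnoticed.
same_bytes = ascii_name.encode("utf-8") == ascii_name.encode("latin-1")

line = label + raw_filename                     # no exception is raised
shown = line.decode("latin-1")                  # what a Latin-1 terminal shows
```

Nothing explodes, but the user sees garbage in place of the two non-ASCII
letters in the filename.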

> The places where we care is where we're going from local encoding
> (whatever it may be) to UTF-8 (the encoding Mercurial uses for its own
> metadata). Just about all of this happens down in changelog.add()
> where none of the rest of the code ever needs to think about it, but
> there is some complexity related to branch names here.

The idea of separating things into str and ustr objects was inspired by
the trouble Sune has been having with keeping track of branch names. He
has fixed a number of bugs there, and as far as I understood it, he is
still not certain that there aren't any more bugs in that code. I've
CCed him, so maybe he can speak for himself.
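
As a rough sketch of the kind of check I have in mind for the
_("foo %s bar") % rawbytes case: format_message below is a hypothetical
helper, written for Python 3 where raw data is of type bytes. It is not an
existing Mercurial function:

```python
def format_message(template, *args):
    """Format a translated (text) template, refusing raw bytes arguments."""
    for arg in args:
        if isinstance(arg, (bytes, bytearray)):
            raise TypeError("raw bytes in user message: %r" % (arg,))
    return template % args

ok = format_message("foo %s bar", "name")

try:
    format_message("foo %s bar", b"raw\xe6bytes")
    exploded = False
except TypeError:
    exploded = True     # the mix is caught where it happens
```

Without such a check, Python 3 formats the bytes argument using its repr
and raises nothing, so the mistake only shows up in the output.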

-- 
Martin Geisler

Mercurial links: http://mercurial.ch/