Would you expect bytes or unicode (or both) for the hglib API in Python 3?

Matt Mackall mpm at selenic.com
Tue Jan 27 17:19:16 CST 2015


On Tue, 2015-01-27 at 17:49 -0500, Brett Cannon wrote:
> I have a need to query Mercurial repositories for some log data and I want
> to do it from Python 3. I would like to use hglib but it currently doesn't
> support past Python 2.7. I'm willing to try and port it so it can (somehow)
> support Python 2.4 - Python 3.4, but before I do that I would like to know
> two things: 1) would you accept a porting of the library to Python 3 if it
> can still support Python 2.4 (although obviously my life would be easier if
> Python 2.6 was the cut-off =), and 2) what kind of Python 3 API would you
> want the library to have?

Thanks for looking into this!

For (1), yes, but I'd also consider a second branch in the repo if that
proved impossible. I wouldn't bother supporting 3.x < 3.4 though.

> For 2) what I'm specifically wondering about is whether the API should be
> returning bytes, Unicode strings, or should it depend on which method is
> called in hglib.client.hgclient? 

Project data (file content and filenames) are stored and communicated as
byte strings with Mercurial being willfully ignorant of their encoding
(or indeed whether they're even text!). You'll want to hand these back
as bytes as no other choice is possible.

Mercurial _metadata_ (author, commit comments, branch names, tags,
bookmarks, etc.) are stored in UTF-8. If an hglib client sets an
encoding of UTF-8, you'll get back strings suitable to hand to
unicode().
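
To make that concrete, here is a minimal Python 3 sketch of decoding
UTF-8 metadata. The byte strings and the decode_metadata helper are made
up for illustration; they are not real hglib output or API. In Python 3
the unicode() call becomes bytes.decode().

```python
# Hypothetical raw values standing in for what a client might hand back
# when the encoding is set to UTF-8 (not real hglib output).
raw_author = "Matt Mackall".encode("utf-8")
raw_summary = "préparer la version".encode("utf-8")


def decode_metadata(value, encoding="utf-8"):
    """Decode one metadata byte string; raises UnicodeDecodeError on bad input."""
    return value.decode(encoding)


author = decode_metadata(raw_author)
summary = decode_metadata(raw_summary)
print(author)
print(summary)
```

File contents get no such treatment: arbitrary bytes are not guaranteed
to decode at all.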

Two big caveats here though: 

- hglib by default uses the "local encoding" when talking to Mercurial,
not UTF-8, which lets most clients ignore encoding
- having a mix of bytes and unicode objects is a well-known headache

You might want to start with/default to all bytes. You could consider
later adding a mode parameter that says one of:

- "bytes please"
- "hand back metadata as Unicode, leave data as bytes"
- "hand back everything as Unicode, correctness be damned"

(Besides making it impossible to do the right thing with binaries,
that last mode means you're going to have fun with Windows paths.)
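
A rough sketch of what such a mode switch could look like, in Python 3.
Everything here is hypothetical: the mode names, the present() helper
and its is_metadata flag are invented for illustration and are not part
of hglib. The use of errors="replace" for the third mode is just one
(lossy) way to read "correctness be damned".

```python
# Hypothetical mode names; hglib defines nothing like these.
BYTES = "bytes"            # "bytes please"
METADATA = "metadata"      # metadata as str, data left as bytes
EVERYTHING = "everything"  # everything as str, lossy if need be


def present(value, is_metadata, mode=BYTES, encoding="utf-8"):
    """Return a raw byte string in the form the chosen mode asks for."""
    if mode == BYTES:
        return value
    if mode == METADATA:
        return value.decode(encoding) if is_metadata else value
    if mode == EVERYTHING:
        # Undecodable bytes turn into U+FFFD, so binary data is damaged.
        return value.decode(encoding, errors="replace")
    raise ValueError("unknown mode: %r" % (mode,))


print(present("é".encode("utf-8"), True, METADATA))
print(present(b"\xff", False, METADATA))
print(ascii(present(b"\xff", False, EVERYTHING)))
```

The last call shows the problem with the third mode: the original byte
is gone and cannot be recovered from the result.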

One last possibility is:

- "hand back everything as Unicode with surrogate encoding"

...which will kinda sorta work.
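
For reference, this is roughly what surrogate encoding looks like in
Python 3, using the standard "surrogateescape" error handler. The
filename is a made-up example, and UTF-8 as the codec is an assumption.

```python
# A Latin-1 encoded filename: the byte 0xE9 is not valid UTF-8 here.
raw = b"caf\xe9.txt"

# Undecodable bytes are smuggled through as lone surrogates.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))

# Encoding with the same handler restores the original bytes exactly.
roundtrip = text.encode("utf-8", errors="surrogateescape")
print(roundtrip == raw)

# The "kinda sorta" part: the string is not valid Unicode text, so a
# plain strict encode (as used when printing or writing it) fails.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)
```

The round trip is lossless, but every consumer of the string has to know
to use the same error handler.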

Related reading:

http://mercurial.selenic.com/wiki/EncodingStrategy
http://mercurial.selenic.com/wiki/WindowsUTF8Plan

-- 
Mathematics is the supreme nostalgia of our time.




More information about the Mercurial-devel mailing list