Would you expect bytes or unicode (or both) for the hglib API in Python 3?

Brett Cannon bcannon at gmail.com
Wed Jan 28 09:26:43 CST 2015


On Tue Jan 27 2015 at 6:19:30 PM Matt Mackall <mpm at selenic.com> wrote:

> On Tue, 2015-01-27 at 17:49 -0500, Brett Cannon wrote:
> > I have a need to query Mercurial repositories for some log data and I want
> > to do it from Python 3. I would like to use hglib but it currently doesn't
> > support past Python 2.7. I'm willing to try and port it so it can (somehow)
> > support Python 2.4 - Python 3.4, but before I do that I would like to know
> > two things: 1) would you accept a porting of the library to Python 3 if it
> > can still support Python 2.4 (although obviously my life would be easier if
> > Python 2.6 was the cut-off =), and 2) what kind of Python 3 API would you
> > want the library to have?
>
> Thanks for looking into this!
>
> For (1), yes, but I'd also consider a second branch in the repo if that
> proved impossible. I wouldn't bother supporting 3.x < 3.4 though.
>

Great! If Python 2.4 support becomes a hassle, would you want me to fork
for Python 2.6 and newer, or go pure Python 3.4? In the end I would expect
the only real difference to be the use of __future__ statements.


>
> > For 2) what I'm specifically wondering about is whether the API should be
> > returning bytes, Unicode strings, or should it depend on which method is
> > called in hglib.client.hgclient?
>
> Project data (file content and filenames) are stored and communicated as
> byte strings with Mercurial being willfully ignorant of their encoding
> (or indeed whether they're even text!). You'll want to hand these back
> as bytes as no other choice is possible.
>
> Mercurial _metadata_ (author, commit comments, branch names, tags,
> bookmarks, etc.) are stored in UTF-8. If an hglib client sets an
> encoding of UTF-8, you'll get back strings suitable to hand to
> unicode().
>
> Two big caveats here though:
>
> - hglib by default uses the "local encoding" when talking to Mercurial,
> not UTF-8, which lets most clients ignore encoding
> - having a mix of bytes and unicode objects is a well-known headache
>
> You might want to start with/default to all bytes.


I will start with that. From what I have seen in the code, it shifts the
changes from being solely in the library to being spread across both the
library and the tests, but in a more mechanical fashion (i.e., string
literals get marked as bytes everywhere, vs. containing all changes in the
library to handle string decoding for the API).
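
To make the "all bytes" option concrete, here is a minimal sketch of what
marking literals as bytes looks like. The helper name and the NUL-separated
record layout are assumptions made up for illustration; they are not taken
from hglib. Note that the b"" prefix is only accepted from Python 2.6
onward, so code written this way would not run on Python 2.4.

```python
# Hypothetical helper for illustration only; not part of hglib.
def parse_record(raw):
    """Split one NUL-separated bytes record into (rev, author, desc).

    Every separator is a bytes literal, so the fields stay bytes on both
    Python 2.6+ and Python 3 and no decoding is ever attempted.
    """
    rev, author, desc = raw.split(b"\x00")
    return rev, author, desc


# Assumed sample input; the field order is invented for this sketch.
record = b"42\x00Brett Cannon\x00port to Python 3"
fields = parse_record(record)
```

The caller gets bytes back for every field and decides for itself whether,
and with which encoding, to decode them.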


> You could consider
> later adding a mode parameter that says one of:
>
> - "bytes please"
> - "hand back metadata as Unicode, leave data as bytes"
>

Once I have working code we can discuss this before making it available
to the public. It sounds like metadata should be decoded, but I would want
to list out exactly where that makes sense so I don't overreach. (It might
also make the command server encoding parts of hglib unnecessary if the
code handles all decoding for data where HGENCODING comes into play.)
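
As a rough sketch of the "hand back metadata as Unicode, leave data as
bytes" mode: the function name, the mode strings, and the set of metadata
field names below are all hypothetical, and the sketch assumes the metadata
bytes arrive UTF-8 encoded (i.e., the client asked for UTF-8 rather than
the local encoding).

```python
# Hypothetical names throughout; this is not hglib's API.
METADATA_FIELDS = ("author", "desc", "branch")


def decode_metadata(entry, mode="bytes"):
    """Return a copy of entry, decoding metadata when mode == "metadata".

    Assumes metadata values are UTF-8 encoded bytes. Anything not listed
    in METADATA_FIELDS (paths, file content) is returned untouched.
    """
    if mode == "bytes":
        return dict(entry)
    if mode != "metadata":
        raise ValueError("unknown mode: %r" % (mode,))
    result = {}
    for key, value in entry.items():
        if key in METADATA_FIELDS:
            result[key] = value.decode("utf-8")
        else:
            result[key] = value
    return result


# Assumed sample entry: UTF-8 author name, path with a non-UTF-8 byte.
entry = {"author": b"Zo\xc3\xab", "path": b"docs/\xff.bin"}
decoded = decode_metadata(entry, mode="metadata")
```

Listing the decodable fields explicitly in one place is one way to keep the
decoding from overreaching into project data.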


> - "hand back everything as Unicode, correctness be damned"
>
> (In addition to it being impossible to do the right thing with binaries
> in that last mode, you're going to have fun with Windows paths.)
>
> One last possibility is:
>
> - "hand back everything as Unicode with surrogate encoding"
>
> ..which will kinda sorta work.
>

Sure, but yuck. =)
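
For illustration, a minimal Python 3 sketch of what surrogate encoding
means here, using the standard "surrogateescape" error handler. The sample
filename is an assumption; nothing below is hglib code.

```python
# Python 3 only. Latin-1 encoded bytes that are not valid UTF-8 (assumed sample).
raw = b"caf\xe9.txt"

# Undecodable bytes are smuggled through as lone surrogate code points.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
roundtrip = text.encode("utf-8", "surrogateescape")

# A strict encode of the same string fails, which is the "kinda sorta" part.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The round trip is lossless, but the intermediate string carries lone
surrogates, so anything that encodes it strictly will raise.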

-Brett


>
> Related reading:
>
> http://mercurial.selenic.com/wiki/EncodingStrategy
> http://mercurial.selenic.com/wiki/WindowsUTF8Plan