[PATCH 1 of 5] encoding: add getutf8char helper
Matt Mackall
mpm at selenic.com
Sun Nov 8 14:18:20 CST 2015
On Sun, 2015-11-08 at 18:12 +0900, Yuya Nishihara wrote:
> On Fri, 06 Nov 2015 15:48:46 -0600, Matt Mackall wrote:
> > # HG changeset patch
> > # User Matt Mackall <mpm at selenic.com>
> > # Date 1446763726 21600
> > # Thu Nov 05 16:48:46 2015 -0600
> > # Node ID 9939f038b1679e2729efe87f8e67af7ce2885a73
> > # Parent f9984f76fd90e439221425d751e29bae17bec995
> > encoding: add getutf8char helper
> >
> > This allows us to find character boundaries in byte strings when
> > trying to do custom encodings.
> >
> > diff -r f9984f76fd90 -r 9939f038b167 mercurial/encoding.py
> > --- a/mercurial/encoding.py Wed Nov 04 15:17:52 2015 -0600
> > +++ b/mercurial/encoding.py Thu Nov 05 16:48:46 2015 -0600
> > @@ -414,6 +414,25 @@
> >
> > return ''.join(_jsonmap[c] for c in toutf8b(s))
> >
> > +_utf8len = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 4]
> > +
> > +def getutf8char(s, pos):
> > + '''get the next full utf-8 character in the given string,
> > starting at pos
> > +
> > + Raises a UnicodeError if the given location does not start a
> > valid
> > + utf-8 character.
> > + '''
> > +
> > + # find how many bytes to attempt decoding from first nibble
> > + l = _utf8len[ord(s[pos]) >> 4]
> > + if not l: # ascii
> > + return s[pos]
> > +
> > + c = s[pos:pos + l]
> > + # validate with attempted decode
> > + c.decode("utf-8")
> > + return c
>
> The series looks great to me. Extended round-trip tests will follow?
Someone had already posted a (breaking) test using Hypothesis which is
probably the right thing to do. I don't know how that works, so I wrote
a very rough and dirty script to test it.
--
Mathematics is the supreme nostalgia of our time.
More information about the Mercurial-devel
mailing list