[PATCH] mdiff: split on unicode character boundaries when shortening function name

Thu Feb 22 11:06:28 EST 2018

On Thu, 22 Feb 2018 10:01:00 -0500, Josef 'Jeff' Sipek wrote:
> On Thu, Feb 22, 2018 at 21:01:44 +0900, Yuya Nishihara wrote:
> > On Thu, 22 Feb 2018 09:01:14 +0100, Denis Laxalde wrote:
> > > Josef 'Jeff' Sipek wrote:
> > > > # HG changeset patch
> > > > # User Josef 'Jeff' Sipek <jeffpc at josefsipek.net>
> > > > # Date 1519251311 18000
> > > > #      Wed Feb 21 17:15:11 2018 -0500
> > > > # Node ID b99df94fdd4813e0ce538a8caa682802da4a6cb2
> > > > # Parent  106872aa15af9919220705ed72c78459774e1575
> > > > mdiff: split on unicode character boundaries when shortening function name
> > > > 
> > > > Splitting the raw bytes may lead to truncating the string in the middle of a
> > > > UTF-8 character which would lead to the generated diff containing an invalid
> > > > byte sequence even though the original data had none.  For example, the
> > > > Unicode codepoint U+308B (る) gets represented as \xe3\x82\x8b in UTF-8.
> > > > Before this change a diff on i18n/ja.po would yield:
> > > > 
> > > > 	@@ -28953,7 +28953,7 @@ msgstr "Mercurial と SSH を併用す<E3><82>
> > > > 
> > > > After this change, the output is cleaner:
> > > > 
> > > > 	@@ -28953,7 +28953,7 @@ msgstr "Mercurial と SSH を併用する場合の注意点:"
> > > > 
> > > > diff --git a/mercurial/mdiff.py b/mercurial/mdiff.py
> > > > --- a/mercurial/mdiff.py
> > > > +++ b/mercurial/mdiff.py
> > > > @@ -348,7 +348,12 @@ def _unidiff(t1, t2, opts=defaultopts):
> > > >              # alphanumeric char.
> > > >              for i in xrange(astart - 1, lastpos - 1, -1):
> > > >                  if l1[i][0:1].isalnum():
> > > > -                    func = ' ' + l1[i].rstrip()[:40]
> > > > +                    func = l1[i].rstrip()
> > > > +                    try:
> > > > +                        func = func.decode("utf-8")[:40].encode("utf-8")
> > > > +                    except:
> > > > +                        func = func[:40]
> > > 
> > > I'd suggest catching exception types explicitly (UnicodeDecodeError and
> > > UnicodeEncodeError I guess) and avoid "bare" except.
> > > 
> > > (No idea if the change itself is correct by the way.)
> > 
> > Nah, it's wrong to assume external world is always UTF-8. It might be okayish
> > to split func as a UTF-8 as best effort, but which shouldn't involve encoding
> > conversion.
> 
> Yeah... I thought that might be an issue.  The code in the 'except' is meant
> as best-effort -

Ok, I didn't notice that. It's indeed better to catch the UnicodeError.

That said, UTF-8 is well designed encoding, we can easily find the nearest
multi-byte boundary by looking back a couple of bytes.

https://en.wikipedia.org/wiki/UTF-8#Description

> if there is any UTF-8 issue decoding/encoding, just fall
> back to previous method.  That of course wouldn't help if the input happened
> to be valid UTF-8 but wasn't actually UTF-8.
> 
> I had to do the encode step, otherwise I got a giant stack trace saying that
> unicode strings cannot be <something I don't remember> using ascii encoder.
> (Leaving it un-encoded would also mean that this for loop would output
> either a unicode string or a raw string - which seems unclean.)
> 
> I'm not really sure how to proceed.  Most UTF-8 decoders should handle the
> illegal byte sequence ok, but it still feels wrong to let it make a mess of
> valid data.  The answer might be to just ignore this issue.  :|

As an old Linux user, I would say yeah, don't bother about non-ascii characters,
it's just bytes. Alternatively, maybe we could take it a UTF-8 sequence and find
a possible boundary, but I'm not sure if it's a good idea.