[PATCH] mdiff: split on unicode character boundaries when shortening function name

Josef 'Jeff' Sipek jeffpc at josefsipek.net
Thu Feb 22 11:59:00 EST 2018


On Fri, Feb 23, 2018 at 01:06:28 +0900, Yuya Nishihara wrote:
> On Thu, 22 Feb 2018 10:01:00 -0500, Josef 'Jeff' Sipek wrote:
...
> > Yeah... I thought that might be an issue.  The code in the 'except' is meant
> > as best-effort -
> 
> Ok, I didn't notice that. It's indeed better to catch the UnicodeError.
> 
> That said, UTF-8 is a well-designed encoding; we can easily find the nearest
> multi-byte boundary by looking back a couple of bytes.
> 
> https://en.wikipedia.org/wiki/UTF-8#Description

Right, but isn't this code required to handle any-to-any situation?  That
is, the versioned data can be in any encoding, and the terminal can be in
any encoding.  Currently, the code "handles" it by just copying bytes.  This
obviously breaks down the moment multi-byte characters show up.
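
For illustration, here is a minimal sketch of the best-effort approach
discussed in this thread: decode as UTF-8, cut on character boundaries,
re-encode, and fall back to the plain byte copy on any UnicodeError.  The
helper name `shorten` and the default limit of 40 are hypothetical; this is
not the actual mdiff code.

```python
def shorten(line, limit=40):
    """Shorten a byte string to at most `limit` characters (best effort).

    Hypothetical helper for illustration only.  Assumes the input is
    *probably* UTF-8; anything that fails to decode is cut bytewise.
    """
    try:
        # Cut on character boundaries, then go back to bytes so the
        # caller always gets the same type out.
        return line.decode('utf-8')[:limit].encode('utf-8')
    except UnicodeError:
        # Not valid UTF-8: fall back to the old behaviour of copying bytes.
        return line[:limit]
```

Note that the limit counts characters on the UTF-8 path and bytes on the
fallback path, so the two branches can return different byte lengths.  The
fallback only triggers when decoding fails; input in another encoding that
happens to be valid UTF-8 is still cut as if it were UTF-8.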

UTF-8 being resilient is a good thing, but IMO that justifies leaving the
code alone.

I don't know if there is some weird variable length encoding (other than
UTF-8) out there that hg needs to handle.

> > if there is any UTF-8 issue decoding/encoding, just fall
> > back to previous method.  That of course wouldn't help if the input happened
> > to be valid UTF-8 but wasn't actually UTF-8.
> > 
> > I had to do the encode step, otherwise I got a giant stack trace saying that
> > unicode strings cannot be <something I don't remember> using ascii encoder.
> > (Leaving it un-encoded would also mean that this for loop would output
> > either a unicode string or a raw string - which seems unclean.)
> > 
> > I'm not really sure how to proceed.  Most UTF-8 decoders should handle the
> > illegal byte sequence ok, but it still feels wrong to let it make a mess of
> > valid data.  The answer might be to just ignore this issue.  :|
> 
> As an old Linux user, I would say yeah, don't bother about non-ASCII characters,
> it's just bytes. Alternatively, maybe we could treat it as a UTF-8 sequence and find
> a possible boundary, but I'm not sure if it's a good idea.

As in: implement a UTF-8 decoder to "seek" to the right place?  Eh.
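
For reference, the "seek" does not need a full decoder.  Continuation bytes
in UTF-8 always have the form 0b10xxxxxx, and a sequence is at most four
bytes long, so backing up at most three bytes from the cut point is enough.
A rough sketch, assuming the data is UTF-8 (the function name
`truncate_utf8` is made up for this example):

```python
def truncate_utf8(data, limit):
    """Cut `data` to at most `limit` bytes without splitting a UTF-8 sequence.

    Hypothetical helper for illustration only.  If the bytes around the cut
    do not look like UTF-8, it falls back to a plain bytewise cut.
    """
    buf = bytearray(data)  # indexing yields ints on Python 2 and 3
    if len(buf) <= limit:
        return bytes(buf)
    end = limit
    # If the first dropped byte is a continuation byte (0b10xxxxxx), the cut
    # is inside a character: back up towards its lead byte, 3 steps at most.
    while end > 0 and limit - end < 3 and (buf[end] & 0xC0) == 0x80:
        end -= 1
    if (buf[end] & 0xC0) == 0x80:
        # No lead byte within reach, so this is not UTF-8: cut bytewise.
        return bytes(buf[:limit])
    return bytes(buf[:end])
```

This only inspects the bytes near the cut, so it says nothing about whether
the rest of the line is valid UTF-8, and it does not help with other
variable-length encodings.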

I'm looking forward to the day when everything is only Unicode, but that'll
be a while...

Jeff.

-- 
Only two things are infinite, the universe and human stupidity, and I'm not
sure about the former.
		- Albert Einstein

