[PATCH] util: wrap lines with multi-byte characters correctly (issue2943)

Mon Aug 8 20:27:57 CDT 2011

Matt Mackall wrote, On 08/08/2011 08:49 PM:
> On Sat, 2011-08-06 at 23:53 +0200, Mads Kiilerich wrote:
>> # HG changeset patch
>> # User Mads Kiilerich<mads at kiilerich.com>
>> # Date 1312667540 -7200
>> # Branch stable
>> # Node ID 522ef2a25786c3666d4381d38944fe6d3aa64e5d
>> # Parent  f32a2989ff585f0f452f25806750477fc631fc9a
>> util: wrap lines with multi-byte characters correctly (issue2943)
>>
>> This re-introduces the unicode conversion what was lost in d320e70442a5 5 years

(I don't know how I came up with the 5 years.)

>> ago and had the comment:
>>    To avoid corrupting multi-byte characters in line, we must wrap
>>    a Unicode string instead of a bytestring.
> Based on this description, I'm generally inclined to reject this patch.

The comment and the associated code was introduced in 
http://www.selenic.com/hg/rev/44758742ad2e - without any tests. At the 
same time it also made sure wrapping was based on number of characters 
instead of number of encoded bytes.

http://www.selenic.com/hg/rev/b2d65ee49a72 re-introduced a 
non-unicode-aware code path when decoding failed. I don't know what made 
this necessary.

Then came http://www.selenic.com/hg/rev/d320e70442a5 and made a lot of 
unrelated changes:

1: Fixed some incorrect uses of util.wrap where initindent and 
hangindent wasn't used correctly and would render the output incorrectly 
if it actually wrapped the lines. (commands.py and no tests)

2: Be safe and assume that the width of east asian 'A' characters is 2 
as the unicode spec says. (encoding.py and tested in test-encoding)

3: Use util.wrap instead of textwrap directly to benefit from improved 
handling of encoded text. (minirst.py and templatefilters.py and no tests)

4: Don't rtrim whitespace on encoded strings as if it was ASCII but 
decode to unicode first. (templatefilters.py and no tests)

5: Introduce MBTextWrapper (util.py and no tests):

5a: Calculate width of east asian characters correctly in the case where 
a single word is longer than the wrap length and has to be broken up. 
(However, AFAICS textwrap will calculate the word (and line) length 
without considering wide characters and the word (and line) can thus be 
longer than the wrap length without ending up in this special case.)

5b: Remove handling of the case where the text can't be decoded to 
unicode and thus effectively back out b2d65ee49a72. I don't know why 
that haven't caused any problems. Perhaps because it only can happen 
when user data is wrapped, and that can only happen in custom templates 
where exceptions are hidden? Or because textwrap now (incorrectlyl) will 
try to wrap encoded data and only try to decode words longer than the 
wrap width?

5c: Changed the input of textwrap so it no longer works on unicode but 
on the encoded string and thus effectively backed out 44758742ad2e. As a 
consequence textwrap can then only count bytes and don't have any chance 
to count the number of actual characters or consider their width. AFAICS 
that also introduces the bug that makes the special handling of 
break_long_words in _handle_long_word necessary. Another new bug seems 
to be that it now can insert line breaks instead of any byte that looks 
like ASCII whitespace.

5c seems bad in many ways. I don't see any indication that this change 
was intentional or had any positive side-effects. I essentially propose 
to back that part out. That fixes issue2943.

The MBTextWrapper code we have now seems wrong and has no tests, so it 
is very possible that any changes here could introduce "unintended 
behaviour". That should however not prevent bugfixes and cleanups in 
this area forever - especially not if bugfixes come with tests.

AFAICS d320e70442a5 wasn't ready for inclusion and there are several 
reasons it shouldn't have been accepted as it is, so an oscillation that 
should cause it to be redone would actually not be so bad. (The 
primarily objections is that it try to do several things in one patch, 
as a consequence of that the description is bad, it doesn't add tests, 
it back out previous changes without any explanation, and it introduces 
code that either is wrong or need a good explanation.)

[Meta comment: It would be great if we all made sure we actually 
followed our own ContributingChanges guidelines... How about giving a 
monthly award for best and worst patch? That would give some awareness.]

> If I have one patch that says:
>
>   "algorithm A was wrong because of bug X, switch to algorithm B"
>
> and a second one that says:
>
>   "algorithm B was wrong because of bug Y, switch back to A"
>
> ...I will reject the second. If I don't, it's inevitable that someone
> will send the first patch again and we'll oscillate forever.

AFAICS the second patch that should have been rejected is d320.

It is however not clear what X and Y is. AFAICS we don't have any Y for 5c.

> Instead I need to either see:
>
>   "algorithm C fixes bug Y while not regressing bug X"

That could be: "reintroducing unicode input to textwrap (which was 
introduced in 44758 but removed in d320) fixes issue2943 while not 
regressing on breaking words with east asian wide characters and longer 
than the wrap width as introduced in d320"

> or
>
>   "algorithm A was more correct than B because of bug Y is more
> significant than X and algorithm C is prohibitive"
>
> Ideally, a patch here actually documents that oscillating back to B is
> not allowed in the source.

> Now it seems to me that there are several considerations for a
> Unicode-aware text wrapper:
>
> a) single characters may be one or more bytes
> b) single characters may be single- or double-width or even an unknown
> font-dependent width
> c) the rules for where text may be wrapped are somewhat language
> dependent (especially with ideographic languages)
>
> I'm guessing we ignore (c) and only wrapping on whitespace.

No. break_long_words is True by default, so textwrap will wrap between 
characters too if that is the only option. I guess that is the only 
feasible way to add line breaks to ideographic langauges that don't use 
spaces.

> Which means
> there are basically three algorithms that are worth considering:
>
> 1) encoding-agnostic: count bytes only

This is what we have in MBTextWrapper now for ordinary lines that are 
wrapped without any unicode awareness. That is what I would like to fix.

> 2) encoding-aware: count characters

For the record: The only feasible way to do that is to decode to unicode 
and work on the unicode string. That is what 44758 introduced and d320 
removed and I would like to re-introduce.

> 3) width-aware: count Unicode character widths

This was apparently the main motivation for introducing MBTextWrapper.

>   3a) naive: assume ambiguous-width characters are narrow
>   3b) safe: assume ambiguous-width characters are wide

This is what d320 introduced in encoding.py (and textwrapper).

>   3c) too clever: use 3b when using a MBCS or UTF-8, use 3a otherwise
>
> This last is a bit dicey: legacy single-character encodings will never
> have double-width characters, but decoding them into Unicode may give
> "A"-width characters.

Interesting. Can you give some examples of this?

> Similarly, if we're in a Shift-JIS locale,
> basically all double-byte characters will be double-width, but may again
> fall into the "A" class in Unicode. But you'd have to do a thorough
> survey of encodings to decide whether this was always right.
>
> Note that the only algorithms that are ever 'wrong' in terms of
> generating over-long lines are 3a and potentially 3c. 1 and 2 are merely
> suboptimal.

If someone considers 3b) a problem then someone could introduce 
something like
encoding.wideclasses = os.environ.get("HGWIDECLASSES", "WFA")

> So, in terms of the above, what's our story so far, and where does this
> patch take us?
It seems like your model is correct, and the previous and current mess in this area can relate to your model.

This patch do not add anything new but will take us back to 2) from 1) while preserving 3) where it applied.

This patch will perhaps make it more obvious that b2d6 should be re-introduced too.

/Mads