[PATCH] util: wrap lines with multi-byte characters correctly (issue2943)
Mads Kiilerich
mads at kiilerich.com
Mon Aug 8 20:27:57 CDT 2011
Matt Mackall wrote, On 08/08/2011 08:49 PM:
> On Sat, 2011-08-06 at 23:53 +0200, Mads Kiilerich wrote:
>> # HG changeset patch
>> # User Mads Kiilerich<mads at kiilerich.com>
>> # Date 1312667540 -7200
>> # Branch stable
>> # Node ID 522ef2a25786c3666d4381d38944fe6d3aa64e5d
>> # Parent f32a2989ff585f0f452f25806750477fc631fc9a
>> util: wrap lines with multi-byte characters correctly (issue2943)
>>
>> This re-introduces the unicode conversion what was lost in d320e70442a5 5 years
(I don't know how I came up with the 5 years.)
>> ago and had the comment:
>> To avoid corrupting multi-byte characters in line, we must wrap
>> a Unicode string instead of a bytestring.
> Based on this description, I'm generally inclined to reject this patch.
The comment and the associated code was introduced in
http://www.selenic.com/hg/rev/44758742ad2e - without any tests. At the
same time it also made sure wrapping was based on number of characters
instead of number of encoded bytes.
http://www.selenic.com/hg/rev/b2d65ee49a72 re-introduced a
non-unicode-aware code path when decoding failed. I don't know what made
this necessary.
Then came http://www.selenic.com/hg/rev/d320e70442a5 and made a lot of
unrelated changes:
1: Fixed some incorrect uses of util.wrap where initindent and
hangindent wasn't used correctly and would render the output incorrectly
if it actually wrapped the lines. (commands.py and no tests)
2: Be safe and assume that the width of east asian 'A' characters is 2
as the unicode spec says. (encoding.py and tested in test-encoding)
3: Use util.wrap instead of textwrap directly to benefit from improved
handling of encoded text. (minirst.py and templatefilters.py and no tests)
4: Don't rtrim whitespace on encoded strings as if it was ASCII but
decode to unicode first. (templatefilters.py and no tests)
5: Introduce MBTextWrapper (util.py and no tests):
5a: Calculate width of east asian characters correctly in the case where
a single word is longer than the wrap length and has to be broken up.
(However, AFAICS textwrap will calculate the word (and line) length
without considering wide characters and the word (and line) can thus be
longer than the wrap length without ending up in this special case.)
5b: Remove handling of the case where the text can't be decoded to
unicode and thus effectively back out b2d65ee49a72. I don't know why
that haven't caused any problems. Perhaps because it only can happen
when user data is wrapped, and that can only happen in custom templates
where exceptions are hidden? Or because textwrap now (incorrectlyl) will
try to wrap encoded data and only try to decode words longer than the
wrap width?
5c: Changed the input of textwrap so it no longer works on unicode but
on the encoded string and thus effectively backed out 44758742ad2e. As a
consequence textwrap can then only count bytes and don't have any chance
to count the number of actual characters or consider their width. AFAICS
that also introduces the bug that makes the special handling of
break_long_words in _handle_long_word necessary. Another new bug seems
to be that it now can insert line breaks instead of any byte that looks
like ASCII whitespace.
5c seems bad in many ways. I don't see any indication that this change
was intentional or had any positive side-effects. I essentially propose
to back that part out. That fixes issue2943.
The MBTextWrapper code we have now seems wrong and has no tests, so it
is very possible that any changes here could introduce "unintended
behaviour". That should however not prevent bugfixes and cleanups in
this area forever - especially not if bugfixes come with tests.
AFAICS d320e70442a5 wasn't ready for inclusion and there are several
reasons it shouldn't have been accepted as it is, so an oscillation that
should cause it to be redone would actually not be so bad. (The
primarily objections is that it try to do several things in one patch,
as a consequence of that the description is bad, it doesn't add tests,
it back out previous changes without any explanation, and it introduces
code that either is wrong or need a good explanation.)
[Meta comment: It would be great if we all made sure we actually
followed our own ContributingChanges guidelines... How about giving a
monthly award for best and worst patch? That would give some awareness.]
> If I have one patch that says:
>
> "algorithm A was wrong because of bug X, switch to algorithm B"
>
> and a second one that says:
>
> "algorithm B was wrong because of bug Y, switch back to A"
>
> ...I will reject the second. If I don't, it's inevitable that someone
> will send the first patch again and we'll oscillate forever.
AFAICS the second patch that should have been rejected is d320.
It is however not clear what X and Y is. AFAICS we don't have any Y for 5c.
> Instead I need to either see:
>
> "algorithm C fixes bug Y while not regressing bug X"
That could be: "reintroducing unicode input to textwrap (which was
introduced in 44758 but removed in d320) fixes issue2943 while not
regressing on breaking words with east asian wide characters and longer
than the wrap width as introduced in d320"
> or
>
> "algorithm A was more correct than B because of bug Y is more
> significant than X and algorithm C is prohibitive"
>
> Ideally, a patch here actually documents that oscillating back to B is
> not allowed in the source.
> Now it seems to me that there are several considerations for a
> Unicode-aware text wrapper:
>
> a) single characters may be one or more bytes
> b) single characters may be single- or double-width or even an unknown
> font-dependent width
> c) the rules for where text may be wrapped are somewhat language
> dependent (especially with ideographic languages)
>
> I'm guessing we ignore (c) and only wrapping on whitespace.
No. break_long_words is True by default, so textwrap will wrap between
characters too if that is the only option. I guess that is the only
feasible way to add line breaks to ideographic langauges that don't use
spaces.
> Which means
> there are basically three algorithms that are worth considering:
>
> 1) encoding-agnostic: count bytes only
This is what we have in MBTextWrapper now for ordinary lines that are
wrapped without any unicode awareness. That is what I would like to fix.
> 2) encoding-aware: count characters
For the record: The only feasible way to do that is to decode to unicode
and work on the unicode string. That is what 44758 introduced and d320
removed and I would like to re-introduce.
> 3) width-aware: count Unicode character widths
This was apparently the main motivation for introducing MBTextWrapper.
> 3a) naive: assume ambiguous-width characters are narrow
> 3b) safe: assume ambiguous-width characters are wide
This is what d320 introduced in encoding.py (and textwrapper).
> 3c) too clever: use 3b when using a MBCS or UTF-8, use 3a otherwise
>
> This last is a bit dicey: legacy single-character encodings will never
> have double-width characters, but decoding them into Unicode may give
> "A"-width characters.
Interesting. Can you give some examples of this?
> Similarly, if we're in a Shift-JIS locale,
> basically all double-byte characters will be double-width, but may again
> fall into the "A" class in Unicode. But you'd have to do a thorough
> survey of encodings to decide whether this was always right.
>
> Note that the only algorithms that are ever 'wrong' in terms of
> generating over-long lines are 3a and potentially 3c. 1 and 2 are merely
> suboptimal.
If someone considers 3b) a problem then someone could introduce
something like
encoding.wideclasses = os.environ.get("HGWIDECLASSES", "WFA")
> So, in terms of the above, what's our story so far, and where does this
> patch take us?
It seems like your model is correct, and the previous and current mess in this area can relate to your model.
This patch do not add anything new but will take us back to 2) from 1) while preserving 3) where it applied.
This patch will perhaps make it more obvious that b2d6 should be re-introduced too.
/Mads
More information about the Mercurial-devel
mailing list