[PATCH] util: wrap lines with multi-byte characters correctly (issue2943)

Matt Mackall mpm at selenic.com
Mon Aug 8 13:49:09 CDT 2011


On Sat, 2011-08-06 at 23:53 +0200, Mads Kiilerich wrote:
> # HG changeset patch
> # User Mads Kiilerich <mads at kiilerich.com>
> # Date 1312667540 -7200
> # Branch stable
> # Node ID 522ef2a25786c3666d4381d38944fe6d3aa64e5d
> # Parent  f32a2989ff585f0f452f25806750477fc631fc9a
> util: wrap lines with multi-byte characters correctly (issue2943)
> 
> This re-introduces the unicode conversion what was lost in d320e70442a5 5 years
> ago and had the comment:
>   To avoid corrupting multi-byte characters in line, we must wrap
>   a Unicode string instead of a bytestring.

Based on this description, I'm generally inclined to reject this patch.

If I have one patch that says:

 "algorithm A was wrong because of bug X, switch to algorithm B"

and a second one that says:

 "algorithm B was wrong because of bug Y, switch back to A"

...I will reject the second. If I don't, it's inevitable that someone
will send the first patch again and we'll oscillate forever.

Instead I need to either see:

 "algorithm C fixes bug Y while not regressing bug X"

or

 "algorithm A was more correct than B because of bug Y is more
significant than X and algorithm C is prohibitive"

Ideally, a patch here actually documents that oscillating back to B is
not allowed in the source.


Now it seems to me that there are several considerations for a
Unicode-aware text wrapper:

a) single characters may be one or more bytes
b) single characters may be single- or double-width or even an unknown
font-dependent width
c) the rules for where text may be wrapped are somewhat language
dependent (especially with ideographic languages)

I'm guessing we ignore (c) and only wrapping on whitespace. Which means
there are basically three algorithms that are worth considering:

1) encoding-agnostic: count bytes only
2) encoding-aware: count characters
3) width-aware: count Unicode character widths
 3a) naive: assume ambiguous-width characters are narrow
 3b) safe: assume ambiguous-width characters are wide
 3c) too clever: use 3b when using a MBCS or UTF-8, use 3a otherwise

This last is a bit dicey: legacy single-character encodings will never
have double-width characters, but decoding them into Unicode may give
"A"-width characters. Similarly, if we're in a Shift-JIS locale,
basically all double-byte characters will be double-width, but may again
fall into the "A" class in Unicode. But you'd have to do a thorough
survey of encodings to decide whether this was always right.

Note that the only algorithms that are ever 'wrong' in terms of
generating over-long lines are 3a and potentially 3c. 1 and 2 are merely
suboptimal.

So, in terms of the above, what's our story so far, and where does this
patch take us?


> diff --git a/mercurial/util.py b/mercurial/util.py
> --- a/mercurial/util.py
> +++ b/mercurial/util.py
> @@ -1148,16 +1148,14 @@
>          def __init__(self, **kwargs):
>              textwrap.TextWrapper.__init__(self, **kwargs)
>  
> -        def _cutdown(self, str, space_left):
> +        def _cutdown(self, ucstr, space_left):
>              l = 0
> -            ucstr = unicode(str, encoding.encoding)
>              colwidth = unicodedata.east_asian_width
>              for i in xrange(len(ucstr)):
>                  l += colwidth(ucstr[i]) in 'WFA' and 2 or 1
>                  if space_left < l:
> -                    return (ucstr[:i].encode(encoding.encoding),
> -                            ucstr[i:].encode(encoding.encoding))
> -            return str, ''
> +                    return (ucstr[:i], ucstr[i:])
> +            return ucstr, ''
>  
>          # overriding of base class
>          def _handle_long_word(self, reversed_chunks, cur_line, cur_len, width):
> @@ -1179,10 +1177,13 @@
>      if width <= maxindent:
>          # adjust for weird terminal size
>          width = max(78, maxindent + 1)
> +    line = line.decode(encoding.encoding, encoding.encodingmode)
> +    initindent = initindent.decode(encoding.encoding, encoding.encodingmode)
> +    hangindent = hangindent.decode(encoding.encoding, encoding.encodingmode)
>      wrapper = MBTextWrapper(width=width,
>                              initial_indent=initindent,
>                              subsequent_indent=hangindent)
> -    return wrapper.fill(line)
> +    return wrapper.fill(line).encode(encoding.encoding)
>  
>  def iterlines(iterator):
>      for chunk in iterator:
> diff --git a/tests/test-encoding-align.t b/tests/test-encoding-align.t
> --- a/tests/test-encoding-align.t
> +++ b/tests/test-encoding-align.t
> @@ -22,14 +22,14 @@
>    > cmdtable = {
>    >     'showoptlist':
>    >         (showoptlist,
> -  >          [('s', 'opt1', '', 'short width',  '""" + s + """'),
> -  >           ('m', 'opt2', '', 'middle width', '""" + m + """'),
> -  >           ('l', 'opt3', '', 'long width',   '""" + l + """')
> +  >          [('s', 'opt1', '', 'short width'  + ' %(s)s' * 8, '%(s)s'),
> +  >           ('m', 'opt2', '', 'middle width' + ' %(m)s' * 8, '%(m)s'),
> +  >           ('l', 'opt3', '', 'long width'   + ' %(l)s' * 8, '%(l)s')
>    >          ],
>    >          ""
>    >         )
>    > }
> -  > """)
> +  > """ % globals())
>    > f.close()
>    > EOF
>    $ S=`cat s`
> @@ -52,9 +52,11 @@
>    
>    options:
>    
> -   -s --opt1 \xe7\x9f\xad\xe5\x90\x8d          short width (esc)
> -   -m --opt2 MIDDLE_       middle width
> -   -l --opt3 \xe9\x95\xb7\xe3\x81\x84\xe9\x95\xb7\xe3\x81\x84\xe5\x90\x8d\xe5\x89\x8d  long width (esc)
> +   -s --opt1 \xe7\x9f\xad\xe5\x90\x8d          short width \xe7\x9f\xad\xe5\x90\x8d \xe7\x9f\xad\xe5\x90\x8d \xe7\x9f\xad\xe5\x90\x8d \xe7\x9f\xad\xe5\x90\x8d \xe7\x9f\xad\xe5\x90\x8d \xe7\x9f\xad\xe5\x90\x8d \xe7\x9f\xad\xe5\x90\x8d \xe7\x9f\xad\xe5\x90\x8d (esc)
> +   -m --opt2 MIDDLE_       middle width MIDDLE_ MIDDLE_ MIDDLE_ MIDDLE_ MIDDLE_
> +                           MIDDLE_ MIDDLE_ MIDDLE_
> +   -l --opt3 \xe9\x95\xb7\xe3\x81\x84\xe9\x95\xb7\xe3\x81\x84\xe5\x90\x8d\xe5\x89\x8d  long width \xe9\x95\xb7\xe3\x81\x84\xe9\x95\xb7\xe3\x81\x84\xe5\x90\x8d\xe5\x89\x8d \xe9\x95\xb7\xe3\x81\x84\xe9\x95\xb7\xe3\x81\x84\xe5\x90\x8d\xe5\x89\x8d \xe9\x95\xb7\xe3\x81\x84\xe9\x95\xb7\xe3\x81\x84\xe5\x90\x8d\xe5\x89\x8d \xe9\x95\xb7\xe3\x81\x84\xe9\x95\xb7\xe3\x81\x84\xe5\x90\x8d\xe5\x89\x8d \xe9\x95\xb7\xe3\x81\x84\xe9\x95\xb7\xe3\x81\x84\xe5\x90\x8d\xe5\x89\x8d \xe9\x95\xb7\xe3\x81\x84\xe9\x95\xb7\xe3\x81\x84\xe5\x90\x8d\xe5\x89\x8d \xe9\x95\xb7\xe3\x81\x84\xe9\x95\xb7\xe3\x81\x84\xe5\x90\x8d\xe5\x89\x8d (esc)
> +                           \xe9\x95\xb7\xe3\x81\x84\xe9\x95\xb7\xe3\x81\x84\xe5\x90\x8d\xe5\x89\x8d (esc)
>    
>    use "hg -v help showoptlist" to show global options
>  
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel at selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel


-- 
Mathematics is the supreme nostalgia of our time.




More information about the Mercurial-devel mailing list