[PATCH 0 of 1] replace Python standard textwrap by MBCS sensitive one for i18n text

Martin Geisler mg at lazybytes.net
Sun May 16 10:33:50 CDT 2010


FUJIWARA Katsunori <fujiwara at ascade.co.jp> writes:

Hi Katsunori,

Thanks for looking at this! I think you should include most of this nice
big explanation in the commit message directly. There is almost never a
need for a separate introductory email.

I'm CCing Henrik since he knew a lot about this the last time we
discussed it.

> Mercurial has problem around text wrapping/filling in MBCS encoding
> environment, because standard 'textwrap' module of Python can not
> treat it correctly. It splits byte sequence for one character into two
> lines.

Right, that's why we decode the bytestrings into Unicode strings in
util.wrap -- I guess we should have used that all over the place in
minirst too.
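
As a minimal sketch of the decode-first idea (written for Python 3; the
helper name `wrap_decoded` is hypothetical and is not Mercurial's actual
util.wrap), wrapping the decoded string means a line break can never
land in the middle of a multi-byte character:

```python
import textwrap


def wrap_decoded(data: bytes, width: int, encoding: str = "utf-8") -> bytes:
    """Wrap a byte string without splitting multi-byte characters.

    Hypothetical helper for illustration only: decode to a Unicode
    string, wrap on code points, then encode the result again.
    """
    text = data.decode(encoding)
    wrapped = textwrap.fill(text, width=width)
    return wrapped.encode(encoding)


sample = "日本語 テキスト".encode("utf-8")
result = wrap_decoded(sample, width=4)

# Every output line is still valid in the original encoding.
for line in result.split(b"\n"):
    line.decode("utf-8")
```

Note that this only fixes the split-character problem. The width is
still counted in code points, not in terminal columns.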

When that problem is solved, the problem of computing the length of the
string remains. In your patch, you override _handle_long_word in the
textwrap class. I don't think that is 100% correct: the original class
will only call _handle_long_word when it detects that the chunk is long,
i.e., when it has computed the length incorrectly and determined that
the word is too large for the line width.
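
The following sketch tries to illustrate that concern. It assumes
Python 3 and CPython's textwrap, where `_handle_long_word` is a private
method; `RecordingWrapper` and `columns` are hypothetical names used
only for this demonstration. A word made of six wide characters has
len() 6, so the wrapper considers it short enough for a width of 8 and
never calls `_handle_long_word`, even though the word would occupy 12
columns:

```python
import textwrap
import unicodedata


def columns(text: str) -> int:
    """Rough column count: wide and full-width characters count as 2."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


class RecordingWrapper(textwrap.TextWrapper):
    """TextWrapper that counts how often _handle_long_word is invoked."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.long_word_calls = 0

    def _handle_long_word(self, reversed_chunks, cur_line, cur_len, width):
        self.long_word_calls += 1
        super()._handle_long_word(reversed_chunks, cur_line, cur_len, width)


word = "日本語日本語"  # 6 code points, but 12 terminal columns
wrapper = RecordingWrapper(width=8)
lines = wrapper.wrap(word)

print(lines, wrapper.long_word_calls, columns(lines[0]))
```

So an override of `_handle_long_word` alone would not see this word at
all; the length computation itself has to change.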

I think Henrik wrote a custom text wrapper for TortoiseHg? Perhaps to
get it right from the beginning...

> I wrote this patch to replace Python standard textwrap by MBCS
> sensitive one.
>
> # this can be applied only on default(= non-stable),
> # because diff-context is changed in minirst.py
>
> This seems to work correctly, but I worry about determining column
> width of east asian characters for unicode.
>
> # http://www.unicode.org/reports/tr11/
>
> According to unicode specification, result of "east asian width" are:
>
>    W(ide), N(arrow), F(ull-width), H(alf-width), A(mbiguous)
>
> W/N/F/H can be always recognized as 2/1/2/1 bytes in byte sequence,
> but 'A' can not. Size of 'A' depends on language in which it is used.

So this means that the terminal chooses a different glyph for the same
Unicode character depending on the locale it runs in?

> Unicode specification says:
>
>    If the context(= language) cannot be established reliably they
>    should be treated as narrow characters by default
>
> but many of 'A' characters are full-width, at least, in Japanese
> environment.
>
> In this patch, I introduce environment variable 'HGUCACWIDTH' to
> determine UniCode Ambiguous Character WIDTH.

Could we not just always treat them as full-width? That would mean that
some strings are wrapped too soon, but I don't see that as a problem. It
will only give the text a slightly more ragged appearance. The good
thing is that we would avoid using another environment variable.
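
A rough sketch of what I mean, assuming Python 3 and the standard
unicodedata module (`column_width` and `WIDE_CLASSES` are hypothetical
names, and this is not code from the patch). Characters in the
ambiguous class 'A' are simply counted as two columns, the same as
'W' and 'F':

```python
import unicodedata

# East Asian Width classes counted as two columns. Including "A"
# (ambiguous) is the simplification suggested above; it may overestimate
# the width in non-East-Asian terminals.
WIDE_CLASSES = ("W", "F", "A")


def column_width(text: str) -> int:
    """Estimate the terminal columns needed for a Unicode string.

    Combining characters and control characters are not handled
    specially here; each is counted as one column.
    """
    return sum(
        2 if unicodedata.east_asian_width(ch) in WIDE_CLASSES else 1
        for ch in text
    )


print(column_width("abc"))   # ASCII: one column each
print(column_width("日本"))  # wide: two columns each
print(column_width("α"))     # ambiguous: counted as two columns here
```

The only cost is that lines containing ambiguous characters may be
wrapped a little earlier than strictly necessary.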

-- 
Martin Geisler

Fast and powerful revision control: http://mercurial.selenic.com/