[PATCH 0 of 1] replace Python standard textwrap by MBCS sensitive one for i18n text

FUJIWARA Katsunori fujiwara at ascade.co.jp
Sun May 16 06:15:26 CDT 2010


Mercurial has a problem with text wrapping/filling in MBCS encoding
environments, because Python's standard 'textwrap' module cannot
handle such text correctly: it may split the byte sequence of a single
character across two lines.
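
The failure mode can be reproduced outside Mercurial. The following is
only a minimal sketch: it assumes EUC-JP as the MBCS encoding, and a
naive cut after a fixed number of bytes stands in for what a
byte-oriented wrapper does. The helper name 'cut_is_valid' is
hypothetical and not part of the patch.

```python
# Three hiragana characters; each one takes 2 bytes in EUC-JP.
text = u'\u3042\u3044\u3046'
encoded = text.encode('euc-jp')


def cut_is_valid(data, width, encoding='euc-jp'):
    """Return True if the first 'width' bytes of 'data' still decode."""
    try:
        data[:width].decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Cutting after 3 bytes lands inside the second character.
broken = cut_is_valid(encoded, 3)
# Cutting after 4 bytes lands on a character boundary.
intact = cut_is_valid(encoded, 4)
```

A wrapper that counts bytes has no way to tell these two cuts apart,
which is why the wrapping has to be done on decoded characters.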

I wrote this patch to replace Python's standard textwrap with an
MBCS-sensitive one.

# this can be applied only on default (= non-stable),
# because the diff context in minirst.py has changed

This seems to work correctly, but I am unsure how to determine the
column width of East Asian characters in Unicode.

# http://www.unicode.org/reports/tr11/

According to the Unicode specification, the values of the "East Asian
Width" property are:

   W(ide), Na(rrow), F(ullwidth), H(alfwidth), A(mbiguous), N(eutral)

W/Na/F/H can always be treated as 2/1/2/1 columns wide (N behaves like
Na in practice), but 'A' cannot: the width of an 'A' character depends
on the language in which it is used.
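
Python exposes this property through unicodedata.east_asian_width().
A small sketch of what it reports for a few sample characters follows;
the sample characters are my own choice for illustration. Note that
Python reports the narrow class as 'Na' and uses 'N' for neutral.

```python
import unicodedata

# (character, Unicode name) pairs covering several width classes.
samples = [
    (u'a', 'LATIN SMALL LETTER A'),
    (u'\u3042', 'HIRAGANA LETTER A'),
    (u'\uff21', 'FULLWIDTH LATIN CAPITAL LETTER A'),
    (u'\uff71', 'HALFWIDTH KATAKANA LETTER A'),
    (u'\u03b1', 'GREEK SMALL LETTER ALPHA'),
]

# Look up the East Asian Width class of each sample character.
classes = [unicodedata.east_asian_width(ch) for ch, _ in samples]

for (ch, name), cls in zip(samples, classes):
    print('U+%04X %-34s %s' % (ord(ch), name, cls))
```

The Greek letter is the problematic case: it is classified 'A', so the
property alone does not say whether it occupies one column or two.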

The Unicode specification says:

   If the context(= language) cannot be established reliably they
   should be treated as narrow characters by default

but many 'A' characters are full-width, at least in a Japanese
environment.

In this patch, I introduce the environment variable 'HGUCACWIDTH' to
specify the UniCode Ambiguous Character WIDTH.
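
To make the idea concrete, here is a minimal sketch of a column-width
function driven by that variable. It is not the code from the patch:
the accepted values ('1' or '2'), the fallback to 1 (narrow, as the
Unicode specification recommends), and the names 'ambiguous_width' and
'column_width' are all assumptions made for illustration.

```python
import os
import unicodedata


def ambiguous_width(environ=None):
    """Read HGUCACWIDTH; assumed to hold '1' or '2', defaulting to 1."""
    env = os.environ if environ is None else environ
    try:
        width = int(env.get('HGUCACWIDTH', '1'))
    except ValueError:
        width = 1
    return width if width in (1, 2) else 1


def column_width(text, environ=None):
    """Sum display columns of a unicode string.

    Simplification: combining and control characters count as 1 column.
    """
    amb = ambiguous_width(environ)
    total = 0
    for ch in text:
        cls = unicodedata.east_asian_width(ch)
        if cls in ('W', 'F'):
            total += 2
        elif cls == 'A':
            total += amb
        else:
            total += 1
    return total
```

With this sketch, column_width(u'\u03b1') is 1 by default and becomes 2
when HGUCACWIDTH=2 is set in the environment.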

If there is any other easy (and appropriate) way to determine it in
Python code, please let me know!

If only a few languages other than Japanese require 2 (or more)
columns for 'A' characters, maintaining a language lookup table also
seems reasonable in terms of use and maintenance.
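
Such a lookup table could be as small as the sketch below. The table
contents are an assumption, not a verified list: 'ja' comes from the
discussion above, while 'zh' and 'ko' are added only because the
Unicode report describes ambiguous characters as wide in East Asian
legacy contexts. The function name is hypothetical.

```python
# Hypothetical table: language code -> column width of 'A' characters.
AMBIGUOUS_WIDTH_BY_LANG = {
    'ja': 2,
    'zh': 2,  # assumption, not verified against real terminals
    'ko': 2,  # assumption, not verified against real terminals
}


def ambiguous_width_for_locale(locale_name, default=1):
    """Map a locale name such as 'ja_JP.eucJP' to an 'A' width."""
    if not locale_name:
        return default
    # Strip the encoding ('.eucJP') and territory ('_JP') parts.
    lang = locale_name.split('.')[0].split('_')[0].lower()
    return AMBIGUOUS_WIDTH_BY_LANG.get(lang, default)
```

The locale name would typically come from LANG or LC_ALL; languages
missing from the table fall back to the narrow default.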

