Bug in description handling

Matt Mackall mpm at selenic.com
Fri Aug 13 09:42:31 CDT 2010

On Fri, 2010-08-13 at 15:42 +0200, Martijn Pieters wrote:
> I've found the following error in the handling of a description UTF-8
> text, when passed through changelog.add.
> Given a line ending in the UTF-8 character '\xc3\x85' (the letter Å),
> .rstrip() on this UTF-8 encoded bytestring will remove the \x85 byte
> as that's a control code in ASCII.

Um, it's not? ASCII is \x00-\x7f. Further, rstrip only considers
'whitespace', not all control characters. In particular:

>>> [chr(c) for c in range(256) if chr(c) != chr(c).rstrip()]
['\t', '\n', '\x0b', '\x0c', '\r', ' ']
>>> [c for c in range(256) if chr(c) != chr(c).rstrip()]
[9, 10, 11, 12, 13, 32]

Even if I hack the site import config to change Python's effectively
hardcoded default from 'ascii' to a charset that actually has control
characters in that range (iso_8859-1), I still can't make rstrip() strip
\x85 from a byte string.

Ahh, I've spotted the problem:

>>> [c for c in range(256) if unichr(c) != unichr(c).rstrip()]
[9, 10, 11, 12, 13, 28, 29, 30, 31, 32, 133, 160]

You're misusing Unicode and apparently passing your two-byte but
one-character UTF-8 string as the two-character string u'\xc3\x85'
("Ã<NEL>") rather than u'\xc5' ("Å"). rstrip() then correctly strips
the trailing <NEL>, which Unicode treats as whitespace.

Passing Unicode objects to something that's expecting strings (i.e. all
of Mercurial) is a good way to get unexpected results (usually
tracebacks).
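
For anyone reproducing this on Python 3, here is a minimal sketch of the
same effect. It assumes that bytes plays the role of the Python 2 str and
str the role of the Python 2 unicode object; the variable names are only
illustrative.

```python
# Python 3 sketch: bytes ~ Python 2 str, str ~ Python 2 unicode.
raw = b"\xc3\x85"  # UTF-8 encoding of U+00C5 (the letter A with ring)

# Byte strings only strip ASCII whitespace, so both bytes survive.
bytes_result = raw.rstrip()

# Decoding with the wrong codec yields two characters:
# U+00C3 followed by U+0085 (NEL, a C1 control character).
wrong = raw.decode("latin-1")
wrong_result = wrong.rstrip()  # NEL counts as Unicode whitespace

# Decoding as UTF-8 yields the single character U+00C5,
# which is not whitespace and is left alone.
right = raw.decode("utf-8")
right_result = right.rstrip()

print(repr(bytes_result), repr(wrong_result), repr(right_result))
```

Only the mis-decoded string loses its last character; the byte string
and the correctly decoded string both come back unchanged.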

Also, be aware that changelog.add (and basically all of Mercurial's
internals) expects strings in the encoding reported by
sys.getdefaultencoding(), set by HGENCODING, or manually overridden in
encoding.encoding, which may not be UTF-8.
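
As a rough illustration of that last point, a caller holding Unicode
text could encode it explicitly before handing it over. This is a
Python 3 sketch; to_local() is a hypothetical helper, not part of
Mercurial's API, and the target encoding is passed in by hand here
rather than read from HGENCODING.

```python
def to_local(text, local_encoding):
    """Hypothetical helper: return a byte string in the expected encoding."""
    if isinstance(text, bytes):
        # Already a byte string; assume the caller encoded it correctly.
        return text
    return text.encode(local_encoding)


desc = "Line ending in \u00c5"

# The same description produces different bytes per target encoding.
as_utf8 = to_local(desc, "utf-8")         # ends in b"\xc3\x85"
as_latin1 = to_local(desc, "iso-8859-1")  # ends in b"\xc5"

# Either way, rstrip() on the byte string leaves the last character intact.
print(as_utf8.rstrip() == as_utf8, as_latin1.rstrip() == as_latin1)
```

The point is only that the bytes depend on the encoding in effect, so
assuming UTF-8 on the caller's side may produce a description the rest
of the code misreads.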

Mathematics is the supreme nostalgia of our time.
