[Bug 4927] New: mercurial.encoding.toutf8b produces output fromutf8b cannot decode
mercurial-bugs at selenic.com
Mon Nov 2 13:55:55 UTC 2015
https://bz.mercurial-scm.org/show_bug.cgi?id=4927
Bug ID: 4927
Summary: mercurial.encoding.toutf8b produces output fromutf8b
cannot decode
Product: Mercurial
Version: stable branch
Hardware: PC
OS: Linux
Status: UNCONFIRMED
Severity: feature
Priority: wish
Component: Mercurial
Assignee: bugzilla at selenic.com
Reporter: david at drmaciver.com
CC: mercurial-devel at selenic.com
We discovered this during the sprint at Facebook London. The internal encoding
to UTF-8b does not always round-trip. In particular:
fromutf8b(toutf8b('\xc2\xc2\x80'))
produces the following exception:
  File "/home/david/external/hg/mercurial/encoding.py", line 485, in fromutf8b
    u = s.decode("utf-8")
  File "/home/david/.pyenv/versions/2.7.10/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 0: invalid continuation byte
This is because toutf8b is producing invalid utf-8:
>>> toutf8b('\xc2\xc2\x80')
'\xc2\xed\xb3\x82\x80'
I think this happens because the encoder is confused by the two \xc2 bytes.
The input with its invalid UTF-8 stripped out is:
>>> '\xc2\xc2\x80'.decode('utf-8', 'ignore').encode('utf-8')
'\xc2\x80'
So the check for whether a byte can be included verbatim in the output sees
that the first \xc2 matches the start of the stripped string and copies it
through, even though that byte is the invalid one and should have been escaped.
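The mismatch can be made visible with a small trace. This is only a sketch, written for Python 3 with bytes objects, of the greedy comparison described above. It is not Mercurial's actual code, and the name trace_greedy_match is hypothetical. It labels each input byte as copied verbatim or escaped, on the assumption that a byte is copied whenever it equals the next unconsumed byte of the stripped string.

```python
def trace_greedy_match(s: bytes):
    """Label each input byte the way the greedy check described above would."""
    # Valid UTF-8 that is left once undecodable bytes are dropped.
    stripped = s.decode("utf-8", "ignore").encode("utf-8")
    pos = 0  # index of the next unconsumed byte in `stripped`
    trace = []
    for value in s:  # iterating over bytes yields ints in Python 3
        if stripped[pos:pos + 1] == bytes([value]):
            trace.append((value, "verbatim"))
            pos += 1
        else:
            trace.append((value, "escaped"))
    return trace


for value, action in trace_greedy_match(b"\xc2\xc2\x80"):
    print(f"0x{value:02x} {action}")
# 0xc2 verbatim
# 0xc2 escaped
# 0x80 verbatim
```

The first \xc2 is the stray byte, but it is the one that gets copied through, while the \xc2 that actually starts the valid \xc2\x80 sequence is the one that gets escaped.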
This was discovered using Hypothesis
(https://hypothesis.readthedocs.org/en/latest/) to test the round-tripping
behaviour.
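The round-trip property can also be checked by brute force with only the standard library. The sketch below is for Python 3 and makes several assumptions: naive_toutf8b and fromutf8b_sketch are hypothetical stand-ins that mimic the behaviour described in this report, not the functions in mercurial/encoding.py, and the three-byte alphabet is chosen by hand. A property-based tool such as Hypothesis generates the inputs itself instead of enumerating them.

```python
from itertools import product


def naive_toutf8b(s: bytes) -> bytes:
    """Stand-in encoder using the greedy check described in this report."""
    try:
        s.decode("utf-8")
        return s  # already valid UTF-8, nothing to escape
    except UnicodeDecodeError:
        pass
    stripped = s.decode("utf-8", "ignore").encode("utf-8")
    out = bytearray()
    pos = 0
    for value in s:
        if stripped[pos:pos + 1] == bytes([value]):
            out.append(value)  # copied verbatim
            pos += 1
        else:
            # Escape the byte as the lone surrogate U+DC00 + value.
            out += chr(0xDC00 + value).encode("utf-8", "surrogatepass")
    return bytes(out)


def fromutf8b_sketch(s: bytes) -> bytes:
    """Stand-in decoder: turn U+DCxx surrogates back into raw bytes."""
    text = s.decode("utf-8", "surrogatepass")
    out = bytearray()
    for ch in text:
        if 0xDC00 <= ord(ch) <= 0xDCFF:
            out.append(ord(ch) & 0xFF)
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


def round_trip_failures(alphabet=(0x41, 0x80, 0xC2), max_len=3):
    """Return every short input for which decode(encode(x)) != x or raises."""
    failures = []
    for n in range(1, max_len + 1):
        for combo in product(alphabet, repeat=n):
            data = bytes(combo)
            try:
                ok = fromutf8b_sketch(naive_toutf8b(data)) == data
            except UnicodeDecodeError:
                ok = False
            if not ok:
                failures.append(data)
    return failures


print(round_trip_failures())
```

Under these assumptions, the input from this report, b'\xc2\xc2\x80', shows up among the failures.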