[Bug 4927] New: mercurial.encoding.toutf8b produces output fromutf8b cannot decode

Mon Nov 2 13:55:55 UTC 2015

https://bz.mercurial-scm.org/show_bug.cgi?id=4927

            Bug ID: 4927
           Summary: mercurial.encoding.toutf8b produces output fromutf8b
                    cannot decode
           Product: Mercurial
           Version: stable branch
          Hardware: PC
                OS: Linux
            Status: UNCONFIRMED
          Severity: feature
          Priority: wish
         Component: Mercurial
          Assignee: bugzilla at selenic.com
          Reporter: david at drmaciver.com
                CC: mercurial-devel at selenic.com

We discovered this during the sprint at Facebook London. The internal encoding
to utf-8b does not always round-trip: toutf8b can produce output that
fromutf8b cannot decode. In particular:

fromutf8b(toutf8b('\xc2\xc2\x80'))

produces the following exception:

File "/home/david/external/hg/mercurial/encoding.py", line 485, in fromutf8b
    u = s.decode("utf-8")
File "/home/david/.pyenv/versions/2.7.10/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 0: invalid continuation byte

This is because toutf8b is producing invalid utf-8:

>>> toutf8b('\xc2\xc2\x80')
'\xc2\xed\xb3\x82\x80'
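
To see why the decoder rejects this output, it helps to take the bytes apart
by hand. The following is only a sketch in Python 3 (the report above uses
Python 2.7), and the variable names are illustrative, not taken from
Mercurial:

```python
# Python 3 sketch; `produced` is the toutf8b output quoted in the report.
produced = b'\xc2\xed\xb3\x82\x80'

# Strict UTF-8 decoding fails: 0xc2 opens a two-byte sequence, but the next
# byte (0xed) is not a continuation byte (0x80-0xbf).
try:
    produced.decode('utf-8')
    decodes_cleanly = True
except UnicodeDecodeError:
    decodes_cleanly = False

# The middle three bytes are the UTF-8 form of a lone surrogate, which the
# 'surrogatepass' error handler lets us decode for inspection.
escaped = produced[1:4].decode('utf-8', 'surrogatepass')
escaped_codepoint = ord(escaped)  # 0xdc00 + 0xc2

print(decodes_cleanly, hex(escaped_codepoint))
```

In other words, only the second \xc2 was escaped (as U+DCC2). The first \xc2
and the trailing \x80 were copied through unchanged, and with the escape
sitting between them they no longer form a valid sequence.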

I think this happens because the encoder gets confused by the two \xc2 bytes.
The version of this string with the invalid UTF-8 stripped out is:

>>> '\xc2\xc2\x80'.decode('utf-8', 'ignore').encode('utf-8')
'\xc2\x80'

So the check for whether a byte can be included verbatim in the output decides
that the first \xc2 byte is fine to include, even though it is not: it is the
second \xc2 that belongs to the valid \xc2\x80 sequence.
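
The failure mode described above can be reproduced with a small sketch. This
is a hypothetical reconstruction in Python 3 of a "compare against the
stripped string" check, written from the description in this report; it is
not copied from mercurial/encoding.py, and the name greedy_toutf8b is made up
for illustration:

```python
# Hypothetical reconstruction of the matching logic described in the report.
# It is NOT Mercurial's actual code; names are illustrative only.
def greedy_toutf8b(raw: bytes) -> bytes:
    # Drop every byte that is not part of a valid UTF-8 sequence.
    stripped = raw.decode('utf-8', 'ignore').encode('utf-8')
    out = bytearray()
    pos = 0  # index of the next unmatched byte in `stripped`
    for i in range(len(raw)):
        byte = raw[i:i + 1]
        if stripped[pos:pos + 1] == byte:
            # The byte value matches, so it is assumed to be valid and kept.
            out += byte
            pos += 1
        else:
            # Otherwise escape it as the lone surrogate U+DC00 + byte value.
            out += chr(0xdc00 + byte[0]).encode('utf-8', 'surrogatepass')
    return bytes(out)

result = greedy_toutf8b(b'\xc2\xc2\x80')
print(result)
```

Comparing by value alone cannot tell the two \xc2 bytes apart: the invalid
first \xc2 matches the \xc2 of the stripped string and is kept, so the valid
lead byte is the one that ends up escaped.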

This was discovered using Hypothesis
(https://hypothesis.readthedocs.org/en/latest/) to test the round-tripping
behaviour.
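
For reference, the round-trip property being tested can be sketched as
follows. This is an illustration under several assumptions: it uses Python
3's built-in 'surrogateescape' and 'surrogatepass' error handlers (not
available in Python 2.7) in place of Mercurial's functions, the names
to_utf8b, from_utf8b and check_round_trip are hypothetical, and a seeded
random loop from the standard library stands in for a Hypothesis strategy
such as binary():

```python
import random

def to_utf8b(raw: bytes) -> bytes:
    # Invalid bytes become lone surrogates U+DC80..U+DCFF, which are then
    # written out in their three-byte UTF-8 form.
    text = raw.decode('utf-8', 'surrogateescape')
    return text.encode('utf-8', 'surrogatepass')

def from_utf8b(encoded: bytes) -> bytes:
    # Reverse direction: read the surrogates back, then turn each one into
    # the raw byte it stands for.
    text = encoded.decode('utf-8', 'surrogatepass')
    return text.encode('utf-8', 'surrogateescape')

def check_round_trip(trials: int = 2000, seed: int = 0) -> bool:
    # Stand-in for a property-based test: try many short random byte strings.
    rng = random.Random(seed)
    for _ in range(trials):
        raw = bytes(rng.randrange(256) for _ in range(rng.randrange(8)))
        if from_utf8b(to_utf8b(raw)) != raw:
            return False
    return True

print(to_utf8b(b'\xc2\xc2\x80'))
print(check_round_trip())
```

With this sketch the failing input encodes to '\xed\xb3\x82\xc2\x80', that
is, the first \xc2 is escaped and the valid \xc2\x80 pair is left alone.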
