[PATCH 1 of 2] encoding: make utf8b encoder more robust (issue4927)

Wed Nov 4 07:46:26 CST 2015

On Mon, 02 Nov 2015 17:27:19 -0600, Matt Mackall wrote:
> # HG changeset patch
> # User Matt Mackall <mpm at selenic.com>
> # Date 1446506176 21600
> #      Mon Nov 02 17:16:16 2015 -0600
> # Node ID 6bee6f327de32755da038193f453aa6bed6810c7
> # Parent  859f453e8b4e2b42b6b6552b79c5c5e7e2fc1cf7
> encoding: make utf8b encoder more robust (issue4927)
> 
> It could lose sync if it saw a dropped character. The new code
> explicitly looks for a new replacement character sequence (U+fffd) appearing.
> This requires rewriting the loop to allow lookahead on the source so
> that we can see if the replacement sequence is on both sides.
> 
> diff -r 859f453e8b4e -r 6bee6f327de3 mercurial/encoding.py
> --- a/mercurial/encoding.py	Mon Nov 02 12:12:24 2015 -0800
> +++ b/mercurial/encoding.py	Mon Nov 02 17:16:16 2015 -0600
> @@ -452,15 +452,24 @@
>          return s
>      except UnicodeDecodeError:
>          # surrogate-encode any characters that don't round-trip
> -        s2 = s.decode('utf-8', 'ignore').encode('utf-8')
> +        s2 = s.decode('utf-8', 'replace').encode('utf-8')
>          r = ""
> -        pos = 0
> -        for c in s:
> -            if s2[pos:pos + 1] == c:
> -                r += c
> -                pos += 1
> +        pos1 = 0
> +        pos2 = 0
> +        l = len(s)
> +        while pos1 < l:
> +            if (s2[pos2] == "\xef" and
> +                s2[pos2:pos2 + 3] == "\xef\xbf\xbd" and
> +                s[pos1:pos1 + 3] != "\xef\xbf\xbd"):
> +                # character got replaced by U+fffd, add surrogate
> +                r += unichr(0xdc00 + ord(s[pos1])).encode('utf-8')
> +                # skip over replacement character
> +                pos1 += 1
> +                pos2 += 3

I got an IndexError:

  In [12]: encoding.toutf8b('\xe0\x80\x20')
  IndexError                                Traceback (most recent call last)
  <ipython-input-12-6e2e56e3584b> in <module>()
  ----> 1 encoding.toutf8b('\xe0\x80\x20')

  mercurial/encoding.py in toutf8b(s)
      459         l = len(s)
      460         while pos1 < l:
  --> 461             if (s2[pos2] == "\xef" and
      462                 s2[pos2:pos2 + 3] == "\xef\xbf\xbd" and
      463                 s[pos1:pos1 + 3] != "\xef\xbf\xbd"):

probably because two bytes are replaced by a single '\ufffd' character, so
the loop loses sync and pos2 runs past the end of s2:

  In [14]: '\xe0\x80\x20'.decode('utf-8', 'replace')
  Out[14]: u'\ufffd '
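
To make the length mismatch concrete, here is a small sketch. Note that it
uses a different input, '\xe2\x82 ' (a three-byte sequence cut short), which
is my own pick for illustration: newer Python versions may split the
'\xe0\x80' case into two replacement characters, while this one should
collapse into a single U+FFFD either way.

```python
# Illustrative input (an assumption, not the one from the traceback above):
# a three-byte UTF-8 sequence cut short after its second byte.
s = b'\xe2\x82 '

decoded = s.decode('utf-8', 'replace')
s2 = decoded.encode('utf-8')

# Both bytes of the truncated sequence collapse into one U+FFFD, so a loop
# that consumes one source byte per replacement character cannot stay
# aligned with s2.
replaced = decoded.count(u'\ufffd')
print(len(s), len(s2), replaced)
```

Two source bytes need escaping here, but only one replacement sequence shows
up in s2.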

It might be possible to use a codec error handler to map invalid bytes to
\udcxx, but I've never tried it, and it seems the handler table is global.

https://docs.python.org/2.7/library/codecs.html#codecs.register_error
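
A rough, untested-in-hg sketch of what such a handler could look like. The
handler name 'utf8b-sketch' and the function name are made up for
illustration, and this is not what the patch does. It maps every byte in the
range the decoder reports as bad, so it should not matter whether the decoder
reports one byte or two per error.

```python
import codecs

try:
    unichr  # Python 2
except NameError:
    unichr = chr  # Python 3

def utf8b_sketch_errors(exc):
    """Map each undecodable byte to a lone surrogate U+DC00 + byte."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    # exc.start:exc.end is the byte range the decoder rejected.
    bad = bytearray(exc.object[exc.start:exc.end])
    # Return the replacement text and the position to resume decoding at.
    return u''.join(unichr(0xdc00 + b) for b in bad), exc.end

# Hypothetical name; the registry is process-wide, so the name would need
# to be unique enough not to clash with other users of codecs.
codecs.register_error('utf8b-sketch', utf8b_sketch_errors)

u = b'\xe0\x80\x20'.decode('utf-8', 'utf8b-sketch')
```

Turning the result back into bytes is a separate step: Python 2 encodes lone
surrogates to UTF-8 as-is, whereas Python 3 would need the 'surrogatepass'
error handler on encode.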