[PATCH 1 of 2] encoding: make utf8b encoder more robust (issue4927)

Fri Nov 6 08:03:56 CST 2015

On Wed, 4 Nov 2015 22:46:26 +0900, Yuya Nishihara wrote:
> It might be possible to use the error handler to map invalid chars to \udcxx,
> but I've never tried it and it seems the handler table is global.
> 
> https://docs.python.org/2.7/library/codecs.html#codecs.register_error

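Catching the error won't work if the source string already contains a valid
surrogate-encoded sequence: decoding succeeds, so no error is raised and
the handler never gets a chance to escape it.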

  >>> s = u'\udc00'.encode('utf-8')
  >>> encoding.toutf8b(s)
  '\xed\xb0\x80'  # should be '\xed\xb3\xad\xed\xb2\xb0\xed\xb2\x80' ?
  >>> encoding.fromutf8b(encoding.toutf8b(s))
  '\x00'
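
For anyone who wants to reproduce this outside of hg, here is a rough
Python 3 sketch of the same round trip. It is not the code in encoding.py;
all three function names are made up for illustration. The 'surrogatepass'
handler stands in for Python 2's lenient utf-8 decoder, and the strict
variant relies on Python 3's decoder rejecting encoded surrogates, so each
of the three bytes gets escaped separately.

```python
def lenient_toutf8b(s: bytes) -> bytes:
    """Hypothetical encoder: passes through anything a lenient decoder accepts."""
    try:
        # 'surrogatepass' accepts encoded lone surrogates, like Python 2 does
        s.decode("utf-8", "surrogatepass")
        return s
    except UnicodeDecodeError:
        # escape undecodable bytes as U+DCxx, then serialize as UTF-8
        return s.decode("utf-8", "surrogateescape").encode("utf-8", "surrogatepass")


def strict_toutf8b(s: bytes) -> bytes:
    """Hypothetical encoder: treats encoded surrogates in the input as invalid bytes."""
    return s.decode("utf-8", "surrogateescape").encode("utf-8", "surrogatepass")


def fromutf8b(s: bytes) -> bytes:
    """Hypothetical decoder: maps every U+DCxx back to the raw byte xx."""
    out = bytearray()
    for ch in s.decode("utf-8", "surrogatepass"):
        cp = ord(ch)
        if 0xDC00 <= cp <= 0xDCFF:
            out.append(cp & 0xFF)
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


s = "\udc00".encode("utf-8", "surrogatepass")  # b'\xed\xb0\x80'

# Lenient path: the input looks valid, is passed through, and decodes wrongly.
print(lenient_toutf8b(s), fromutf8b(lenient_toutf8b(s)))

# Strict path: each input byte is escaped, so the round trip is lossless.
print(strict_toutf8b(s), fromutf8b(strict_toutf8b(s)))
```

With the strict variant the three input bytes become U+DCED, U+DCB0 and
U+DC80, which is the nine-byte sequence suggested in the comment above.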