[PATCH 1 of 2] encoding: make utf8b encoder more robust (issue4927)

Fri Nov 6 11:09:35 CST 2015

On Fri, 2015-11-06 at 23:03 +0900, Yuya Nishihara wrote:
> On Wed, 4 Nov 2015 22:46:26 +0900, Yuya Nishihara wrote:
> > It might be possible to use the error handler to map invalid chars
> > to \udcxx, but I've never tried it and it seems the handler table is
> > global.
> > 
> > https://docs.python.org/2.7/library/codecs.html#codecs.register_error
> 
> Catching the error won't work if the source string contains a valid
> surrogate-encoded sequence.
> 
>   >>> s = u'\udc00'.encode('utf-8')
>   >>> encoding.toutf8b(s)
>   '\xed\xb0\x80'  # should be '\xed\xb3\xad\xed\xb2\xb0\xed\xb2\x80'?
>   >>> encoding.fromutf8b(encoding.toutf8b(s))
>   '\x00'

Don't worry, I've got a stack of changes that fixes this and holds up
under thorough fuzz testing.
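
For reference, here is a rough sketch of the round-trip property in
question. It is written for Python 3 and is not the code in Mercurial's
encoding module: the function names are borrowed from the thread, but
the bodies are an assumption built on the standard "surrogateescape"
and "surrogatepass" error handlers. Python 3's strict UTF-8 decoder
rejects encoded surrogates, so the bytes from the example above get
escaped one by one instead of passing through unchanged.

```python
import random


def toutf8b(raw: bytes) -> bytes:
    """Sketch of a UTF-8b encoder (assumed implementation, Python 3 only).

    Bytes that are not valid UTF-8 are mapped to lone surrogates
    U+DC80..U+DCFF by "surrogateescape"; "surrogatepass" then writes
    those surrogates out as three-byte sequences.
    """
    return raw.decode("utf-8", "surrogateescape").encode("utf-8", "surrogatepass")


def fromutf8b(encoded: bytes) -> bytes:
    """Inverse of the toutf8b sketch above (assumed implementation)."""
    return encoded.decode("utf-8", "surrogatepass").encode("utf-8", "surrogateescape")


def fuzz_roundtrip(iterations: int = 1000, seed: int = 0) -> bool:
    """Check fromutf8b(toutf8b(s)) == s on random byte strings."""
    rng = random.Random(seed)
    for _ in range(iterations):
        length = rng.randrange(0, 16)
        s = bytes(rng.randrange(256) for _ in range(length))
        if fromutf8b(toutf8b(s)) != s:
            return False
    return True


# The case from the thread: the UTF-8 form of U+DC00 must be escaped
# byte by byte rather than passed through.
s = b"\xed\xb0\x80"
assert toutf8b(s) == b"\xed\xb3\xad\xed\xb2\xb0\xed\xb2\x80"
assert fromutf8b(toutf8b(s)) == s
assert fuzz_roundtrip()
```

The fuzz loop is only a toy; it checks that decoding undoes encoding
for arbitrary byte strings, which is the property the example above
violates.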

-- 
Mathematics is the supreme nostalgia of our time.