[PATCH] encoding: handle UTF-16 internal limit with fromutf8b (issue5031)

Mon Jan 11 09:05:49 CST 2016

On Sun, 10 Jan 2016 17:09:45 -0600, Matt Mackall wrote:
> # HG changeset patch
> # User Matt Mackall <mpm at selenic.com>
> # Date 1452200277 21600
> #      Thu Jan 07 14:57:57 2016 -0600
> # Node ID 33819d463ddc7895bcd917d85b9ae1ab502c1547
> # Parent  b8405d739149cdd6d8d9bd5e3dd2ad8487b1f09a
> encoding: handle UTF-16 internal limit with fromutf8b (issue5031)
> 
> Default builds of Python have a Unicode type that isn't actually full
> Unicode but UTF-16, which encodes each non-BMP codepoint as a pair of
> surrogate code units. Since our UTF-8b escaping hack uses a range
> (U+DCxx) that overlaps with the UTF-16 surrogates, this gets extra
> complicated. In addition, unichr() for codepoints greater than
> U+FFFF may not work either.
> 
> This changes the code to reuse getutf8char to walk the byte string, so we
> only rely on Python for unpacking our U+DCxx characters.
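
For readers following along, here is a rough illustration of the overlap described in the commit message. It is not code from the patch: the helper names are made up for this example, and it assumes the UTF-8b escapes sit at U+DC80..U+DCFF (the U+DCxx range mentioned above, one escape per undecodable byte 0x80..0xFF).

```python
# Illustrative sketch only; these helpers are hypothetical and are not
# part of Mercurial's encoding module.

def utf16_surrogates(codepoint):
    """Split a non-BMP codepoint into its UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("not a non-BMP codepoint: %r" % (codepoint,))
    offset = codepoint - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def in_utf8b_escape_range(unit):
    """True if a 16-bit unit falls in the assumed UTF-8b range U+DC80..U+DCFF."""
    return 0xDC80 <= unit <= 0xDCFF


high, low = utf16_surrogates(0x1F4A9)
# The low half of this ordinary non-BMP character lands in the escape range,
# so a UTF-16 build cannot tell it apart from an escaped byte by value alone.
print("U+%04X U+%04X escape-like: %s" % (high, low, in_utf8b_escape_range(low)))
```

Not every low surrogate collides (the range starts at U+DC00, the escapes at U+DC80), but enough do that scanning a UTF-16 unicode object for U+DCxx is unreliable, which is presumably why walking the byte string is the safer approach.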

Looks great. Pushed to the clowncopter, thanks.

> @@ -9,6 +9,8 @@
>  
>  import locale
>  import os
> +import struct
> +import sys

Dropped these unused imports, left over from the previous version of the patch.