[PATCH] encoding: handle UTF-16 internal limit with fromutf8b (issue5031)
Matt Mackall
mpm at selenic.com
Sun Jan 10 23:09:45 UTC 2016
# HG changeset patch
# User Matt Mackall <mpm at selenic.com>
# Date 1452200277 21600
# Thu Jan 07 14:57:57 2016 -0600
# Node ID 33819d463ddc7895bcd917d85b9ae1ab502c1547
# Parent b8405d739149cdd6d8d9bd5e3dd2ad8487b1f09a
encoding: handle UTF-16 internal limit with fromutf8b (issue5031)
Default builds of Python have a Unicode type that isn't actually full
Unicode but UTF-16, which encodes non-BMP codepoints to a pair of BMP
codepoints with surrogate escaping. Since our UTF-8b hack escaping
uses a plane that overlaps with the UTF-16 escaping system, this gets
extra complicated. In addition, unichr() raises ValueError for
codepoints greater than U+FFFF on these builds.
This changes the code to reuse getutf8char to walk the byte string, so we
only rely on Python for unpacking our U+DCxx characters.
diff -r b8405d739149 -r 33819d463ddc mercurial/encoding.py
--- a/mercurial/encoding.py Sat Jan 02 02:13:56 2016 +0100
+++ b/mercurial/encoding.py Thu Jan 07 14:57:57 2016 -0600
@@ -9,6 +9,8 @@
 import locale
 import os
+import struct
+import sys
 import unicodedata
 from . import (
@@ -516,17 +518,27 @@
     True
     >>> roundtrip("\\xef\\xef\\xbf\\xbd")
     True
+    >>> roundtrip("\\xf1\\x80\\x80\\x80\\x80")
+    True
     '''
     # fast path - look for uDxxx prefixes in s
     if "\xed" not in s:
         return s
-    u = s.decode("utf-8")
+    # We could do this with the unicode type but some Python builds
+    # use UTF-16 internally (issue5031) which causes non-BMP code
+    # points to be escaped. Instead, we use our handy getutf8char
+    # helper again to walk the string without "decoding" it.
+
     r = ""
-    for c in u:
-        if ord(c) & 0xffff00 == 0xdc00:
-            r += chr(ord(c) & 0xff)
-        else:
-            r += c.encode("utf-8")
+    pos = 0
+    l = len(s)
+    while pos < l:
+        c = getutf8char(s, pos)
+        pos += len(c)
+        # unescape U+DCxx characters
+        if "\xed\xb0\x80" <= c <= "\xed\xb3\xbf":
+            c = chr(ord(c.decode("utf-8")) & 0xff)
+        r += c
     return r
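
For readers who want to try the byte-walking approach outside Mercurial,
here is a minimal standalone sketch. It is not the patched code: it is
written for Python 3 bytes (which has no UTF-16 internal limit, so it only
illustrates the control flow), and next_utf8_char and from_utf8b are
hypothetical stand-ins for getutf8char and fromutf8b. The "surrogatepass"
error handler is needed because Python 3's UTF-8 codec otherwise rejects
U+DCxx sequences.

```python
# Hypothetical Python 3 sketch; names are illustrative, not Mercurial's API.

# UTF-8 sequence length keyed by the high nibble of the lead byte:
# 0x0-0x7 ASCII (1 byte), 0x8-0xB continuation (invalid as a lead byte),
# 0xC-0xD two bytes, 0xE three bytes, 0xF four bytes.
_UTF8_LEN = [1] * 8 + [0] * 4 + [2, 2, 3, 4]


def next_utf8_char(s: bytes, pos: int) -> bytes:
    """Return the bytes of the UTF-8 sequence that starts at s[pos]."""
    n = _UTF8_LEN[s[pos] >> 4]
    if n == 0:
        raise ValueError("continuation byte at position %d" % pos)
    c = s[pos:pos + n]
    # Validate only; "surrogatepass" lets the U+DCxx escapes through.
    c.decode("utf-8", "surrogatepass")
    return c


def from_utf8b(s: bytes) -> bytes:
    """Turn UTF-8b escapes (U+DC00..U+DCFF) back into the raw bytes."""
    # Fast path: every U+Dxxx sequence starts with the lead byte 0xED.
    if b"\xed" not in s:
        return s
    out = bytearray()
    pos = 0
    while pos < len(s):
        c = next_utf8_char(s, pos)
        pos += len(c)
        # Only single U+DCxx characters are decoded; everything else,
        # including 4-byte non-BMP sequences, is copied through untouched.
        if b"\xed\xb0\x80" <= c <= b"\xed\xb3\xbf":
            c = bytes([ord(c.decode("utf-8", "surrogatepass")) & 0xFF])
        out += c
    return bytes(out)
```

The input b"\xf1\x80\x80\x80\xed\xb2\x80" corresponds to the new doctest
case after escaping: a 4-byte non-BMP sequence followed by an escaped stray
0x80 byte. The sketch copies the first four bytes verbatim and collapses
the escape back to a single 0x80.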