RFC: safe pattern matching for problematic encoding

Martin Geisler martin at geisler.net
Sat May 26 10:07:02 CDT 2012


Matt Mackall <mpm at selenic.com> writes:

> On Fri, 2012-05-25 at 12:03 +0200, Martin Geisler wrote:
>> FUJIWARA Katsunori <foozy at lares.dti.ne.jp> writes:
>
>> Okay, I can see why there might be some problems there. But for 99.9%
>> of the cases I think Python's Unicode support is okay. Things that
>> breaks must be pretty obscure, right? In those cases I would tell
>> users that their filename isn't supported.
>
> Try to decode an NFD string into any byte encoding other than UTF-8.
> Not even Python 3 does this right. Hurray for Unicode.

You're being unusually imprecise about how this doesn't work right. But
I'll assume you're talking about this case you mentioned on IRC some
days ago:

<mpm> unicodedata.normalize("NFD", '\xfc'.decode('cp1252')).encode('cp1252') -> FAIL
<mpm> Broken in Py3.2 as well.

To me, it looks like you're confused about Unicode. First, you decompose
'ü' to 'u\u0308' where \u0308 is a combining diaeresis:

  >>> unicodedata.normalize('NFD', u'\xfc')
  u'u\u0308'

The combining diaeresis U+0308 has no mapping in cp1252 (only the
precomposed 'ü' at 0xFC does), so the encode call raises
UnicodeEncodeError:

  >>> u'u\u0308'.encode('cp1252')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/usr/lib/python2.7/encodings/cp1252.py", line 12, in encode
      return codecs.charmap_encode(input,errors,encoding_table)
  UnicodeEncodeError: 'charmap' codec can't encode character u'\u0308'
  in position 1: character maps to <undefined>

What else should it return? Maybe you expected the encode method to
compose the characters before encoding? I guess that would be handy, but
the documentation for str.encode doesn't say anything about it being
that friendly.
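
For illustration, here is a minimal sketch of what "compose before
encoding" could look like if you do it yourself. It assumes Python 3
string literals, and encode_composed is a hypothetical helper name, not
something that exists in Python or Mercurial:

```python
import unicodedata


def encode_composed(text, encoding):
    """Compose to NFC first, then encode (hypothetical helper)."""
    # NFC recombines 'u' + U+0308 into the single code point U+00FC,
    # which cp1252 can represent as the byte 0xFC.
    return unicodedata.normalize("NFC", text).encode(encoding)


# Start from the decomposed form: 'u' followed by a combining diaeresis.
decomposed = unicodedata.normalize("NFD", "\xfc")
result = encode_composed(decomposed, "cp1252")
print(result)
```

This only helps when a precomposed form exists in the target encoding.
A combining sequence with no composed equivalent there would still
raise UnicodeEncodeError.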

Trying the same in Python 3.2 fails the same way, and I don't see why
that is the least bit surprising.
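
As a rough Python 3 version of the IRC one-liner (a sketch, with
variable names of my own choosing), the following decodes the cp1252
byte, decomposes it, and records where the encode step gives up:

```python
import unicodedata

# Decode the cp1252 byte 0xFC ('ü') and decompose it to NFD.
decomposed = unicodedata.normalize("NFD", b"\xfc".decode("cp1252"))

failed_at = None
bad = None
try:
    decomposed.encode("cp1252")
except UnicodeEncodeError as exc:
    # exc.start/exc.end delimit the part of the input that has no mapping.
    failed_at = exc.start
    bad = exc.object[exc.start:exc.end]

print(repr(decomposed), failed_at, repr(bad))
```

The error points at position 1, the combining diaeresis, just as in the
Python 2.7 traceback above.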

-- 
Martin Geisler

aragost Trifork
Commercial Mercurial support
http://aragost.com/mercurial/