xml style doesn't generate valid xml

Tue Nov 23 17:19:25 CST 2010

>I think trying to encode potentially binary data in binary-hostile
>Unicode (let alone text-hostile XML) is pretty hopeless. You'd probably
>be better off with something like:
>
>def xmlescape(text):
>    try:
>        u = encoding.fromlocal(text) # convert back to UTF-8
>        if containscontrolchars(u):
>            u = repr(text)

I think we'd need to call repr() all the time.  Otherwise values with backslashes in them can't be parsed back unambiguously.  E.g. with the above code
  xmlescape("\xaa")
and
  xmlescape("\\xaa")

both map to the same four characters: \ x a a
so you can't reverse the mapping.
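
To make the collision concrete, here is a rough sketch. It assumes Python 3 bytes rather than the Python 2 strings in the quoted code, uses a plain UTF-8 decode as a stand-in for encoding.fromlocal(), and strips the b'...' wrapper in the fallback path to mimic the bare escaped text discussed above. The names escape_sometimes, escape_always and unescape_always are made up for illustration.

```python
import ast

def escape_sometimes(raw: bytes) -> str:
    """Stand-in for the quoted logic: keep decodable text, escape the rest."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # repr() of bytes looks like b'\xaa'; drop the b'...' wrapper
        # (assumption, to match the "four characters" described above).
        return repr(raw)[2:-1]

def escape_always(raw: bytes) -> str:
    """Always use repr(), so backslashes in the input get doubled."""
    return repr(raw)

def unescape_always(escaped: str) -> bytes:
    """Reverse escape_always() without evaluating arbitrary code."""
    return ast.literal_eval(escaped)

high_byte = b"\xaa"      # one byte, 0xAA
literal_text = b"\\xaa"  # four bytes: backslash, x, a, a

# Conditional escaping: two different inputs, one output.
collision = escape_sometimes(high_byte) == escape_sometimes(literal_text)

# Unconditional repr(): outputs differ and each one round-trips.
distinct = escape_always(high_byte) != escape_always(literal_text)
```

With conditional escaping a reader cannot tell which input produced the output; with repr() applied unconditionally, unescape_always() recovers each original value.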

>    except UnicodeDecodeError:
>        u = repr(text)
>    u = (u.replace('&', '&amp;')
>            .replace('<', '&lt;')
>            .replace('>', '&gt;')
>            .replace('"', '&quot;')
>            .replace("'", '&#39;')) # &apos; invalid in HTML
>    return u
>
>In other words, things that can be cleanly converted are, everything
>else gets converted to escaped ASCII. You can probably use some XMLish
>attribute or tag to mark the escaped ASCII as binary too.

Hmm... so anything that wanted to use the value would need to know to un-repr() it to get the actual value.  I suppose that's more reasonable than wrestling with XML parsers trying to get them to accept supposedly invalid character references like &#00;.  It's hard enough to get an XML parser that even accepts XML 1.1 to allow all the other character references. (I just tried :( )
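
For what it's worth, the parser side is easy to check. This is only a sketch, assuming Python 3 and its expat-based xml.etree.ElementTree; the helper name parses is made up, and other parsers may behave differently.

```python
import xml.etree.ElementTree as ET

def parses(doc: str) -> bool:
    """Return True if ElementTree accepts doc as well-formed XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

# A character reference to NUL is rejected by the parser.
nul_reference_ok = parses("<v>&#0;</v>")

# The repr()-style escape is ordinary text (backslash, x, 0, 0), so it is accepted.
repr_escape_ok = parses("<v>b'\\x00'</v>")
```

The second document carries the same information as the first, but a consumer has to un-repr() the text to get the NUL byte back.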

However, I think that fromlocal() call is going to mess things up for fields other than commit comments, so we probably need something like two functions: xmlescape_text and xmlescape_bin.
Or maybe just an explicit separate "toutf8" filter?
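
A rough sketch of what the two-function split might look like, purely for discussion. It assumes Python 3, takes the local encoding as a parameter in place of encoding.fromlocal(), and lets decode errors propagate from the text variant; xmlunescape_bin is a hypothetical helper added only to show the reverse direction.

```python
import ast
from xml.sax.saxutils import escape, unescape

# Extra entities beyond &, < and >, matching the quoted replace() chain.
_QUOTES = {'"': "&quot;", "'": "&#39;"}
_UNQUOTES = {"&quot;": '"', "&#39;": "'"}

def xmlescape_text(raw: bytes, local_encoding: str = "utf-8") -> str:
    """For human-readable fields: decode from the local encoding, then escape.

    Raises UnicodeDecodeError if raw is not valid in local_encoding.
    """
    return escape(raw.decode(local_encoding), _QUOTES)

def xmlescape_bin(raw: bytes) -> str:
    """For arbitrary bytes: always repr(), never decode, then escape."""
    return escape(repr(raw), _QUOTES)

def xmlunescape_bin(escaped: str) -> bytes:
    """Hypothetical inverse of xmlescape_bin(): undo entities, then un-repr."""
    return ast.literal_eval(unescape(escaped, _UNQUOTES))

text_out = xmlescape_text(b"caf\xe9 & <b>", "latin-1")
bin_out = xmlescape_bin(b"a<b\x00")
```

Here the caller picks the function per field, so nothing has to guess whether a value is text or binary.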

eric