xml style doesn't generate valid xml

Tue Nov 23 17:26:42 CST 2010

On Tue, 2010-11-23 at 23:19 +0000, Haszlakiewicz, Eric wrote:
> >I think trying to encode potentially binary data in binary-hostile
> >Unicode (let alone text-hostile XML) is pretty hopeless. You'd probably
> >be better off with something like:
> >
> >def xmlescape(text):
> >    try:
> >        u = encoding.fromlocal(text) # convert back to UTF-8
> >        if containscontrolchars(u):
> >            u = repr(text)
> 
> I think we'd need to call repr all the time.  Otherwise values with backslashes in them can't be parsed back unambiguously.  E.g., with the above code
>   xmlescape("\xaa")
> and
>   xmlescape("\\xaa")
> 
> both map to the four characters: \ x a a
> and you can't reverse the mapping.

Hence my comment about marking which it is with an XML attribute.

> >    except UnicodeDecodeError:
> >        u = repr(text)
> >    u = (u.replace('&', '&amp;')
> >            .replace('<', '&lt;')
> >            .replace('>', '&gt;')
> >            .replace('"', '&quot;')
> >            .replace("'", '&#39;')) # &apos; invalid in HTML
> >    return u
> >
> >In other words, things that can be cleanly converted are, everything
> >else gets converted to escaped ASCII. You can probably use some XMLish
> >attribute or tag to mark the escaped ASCII as binary too.
> 
> hmm... so anything that wanted to use the value would need to know to
> un-repr() it to get the actual value.

Yes.

> However, I think that fromlocal() call is going to mess things up for
> fields other than commit comments, so we probably need something like
> two functions: xmlescape_text and xmlescape_bin

The thing is: no one knows what's in the extra field. It could be
binary, it could be ASCII, it could be UTF-8, it could even be a mix.
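
So one function plus a marker is probably enough. Here's a rough sketch
of what I mean, written in Python 3 terms where the bytes/text split is
explicit. The names (xml_encode_value, xml_decode_value, the "text" and
"repr" markers) are made up for illustration and are not Mercurial API,
and it assumes UTF-8 rather than going through encoding.fromlocal():

```python
import ast
from xml.sax.saxutils import escape, unescape

# Quote entities on top of the &, <, > that escape()/unescape() handle.
_QUOTES = {'"': "&quot;", "'": "&#39;"}
_UNQUOTES = {v: k for k, v in _QUOTES.items()}


def has_unsafe_chars(text):
    """True if text holds characters that won't survive an XML round trip."""
    for c in text:
        # XML 1.0 allows only tab, LF and CR below U+0020; CR is flagged
        # too because parsers normalize line endings to LF.
        if ord(c) < 0x20 and c not in "\t\n":
            return True
        # U+FFFE and U+FFFF are not legal XML characters either.
        if c in "\ufffe\uffff":
            return True
    return False


def xml_encode_value(raw):
    """Return (marker, escaped) for a byte string of unknown content."""
    try:
        text = raw.decode("utf-8")
        if has_unsafe_chars(text):
            raise ValueError("not representable as XML text")
        marker = "text"
    except (UnicodeDecodeError, ValueError):
        # Fall back to escaped ASCII, e.g. b'\xaa' for the byte 0xaa.
        text = repr(raw)
        marker = "repr"
    return marker, escape(text, _QUOTES)


def xml_decode_value(marker, escaped):
    """Reverse xml_encode_value(), using the marker to pick the decoding."""
    text = unescape(escaped, _UNQUOTES)
    if marker == "repr":
        # literal_eval only accepts literals, so it is safe on untrusted input.
        return ast.literal_eval(text)
    return text.encode("utf-8")
```

The marker would go out as an attribute on whatever element carries the
value. With it, the byte 0xaa and the four characters \ x a a come back
as different values, and a consumer only has to un-repr() the values
marked "repr".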

-- 
Mathematics is the supreme nostalgia of our time.