xml style doesn't generate valid xml

Tue Nov 23 16:02:25 CST 2010

On Tue, 2010-11-23 at 21:19 +0000, Haszlakiewicz, Eric wrote:
> >-----Original Message-----
> >From: Matt Mackall [mailto:mpm at selenic.com]
> >
> >> However, for the issue with xmlescape turning things into spaces, I
> >> think that's because there's an explicit line of code in xmlescape
> >> that does that!  In templatefilters.py, the last line of xmlescape is:
> >>     return re.sub('[\x00-\x08\x0B\x0C\x0E-\x1F]', ' ', text)
> >
> >Ugh, who ordered that.
> >
> 
> Well, I poked around a bit and came up with this as a better implementation of xmlescape.  What do you think:
> 
> def xmlescape(text):
>     text = (text
>             .replace('&', '&amp;')
>             .replace('<', '&lt;')
>             .replace('>', '&gt;')
>             .replace('"', '&quot;')
>             .replace("'", '&#39;')) # &apos; invalid in HTML
>     moretoencode = True
>     new_s = ""
>     while moretoencode:
while 1:

>         try:
>             text.decode("UTF-8", "strict")
>             new_s += text
>             moretoencode = False
break
>         except UnicodeDecodeError, inst:
>             preerror = text[0:inst.start]
>             new_s += preerror
>             escaped = "&#" + "%d" % ord(text[inst.start]) + ";"

"&#%d;" %

?

>             new_s += escaped
>             text = text[inst.start+1:]
>     def fixupcontrols(matchobj):
>         return "&#" + "%d" % ord(matchobj.group(0)) + ";"
>     return re.sub('[\x00-\x08\x0B\x0C\x0E-\x1F]', fixupcontrols, new_s)
> 
> eric

I think trying to encode potentially binary data in binary-hostile
Unicode (let alone text-hostile XML) is pretty hopeless. You'd probably
be better off with something like:

def xmlescape(text):
    try:
        u = encoding.fromlocal(text) # convert back to UTF-8
        if containscontrolchars(u):
            u = repr(text)
    except UnicodeDecodeError:
        u = repr(text)
    u = (u.replace('&', '&amp;')
            .replace('<', '&lt;')
            .replace('>', '&gt;')
            .replace('"', '&quot;')
            .replace("'", '&#39;')) # &apos; invalid in HTML
    return u

In other words, things that can be cleanly converted are, everything
else gets converted to escaped ASCII. You can probably use some XMLish
attribute or tag to mark the escaped ASCII as binary too.
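
To make the idea above concrete, here is a rough, self-contained sketch
of it. It is only an illustration under some assumptions: it targets
Python 3 byte strings, it swaps encoding.fromlocal for a plain strict
UTF-8 decode, containscontrolchars is a hypothetical helper that does
not exist anywhere yet, and returning a flag alongside the text is just
one possible way to let the caller mark the value as binary.

```python
import re

# Control characters that XML 1.0 does not allow (tab, LF and CR are fine).
_CONTROL_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')


def containscontrolchars(s):
    """Hypothetical helper: True if s holds a character XML 1.0 rejects."""
    return _CONTROL_CHARS.search(s) is not None


def xmlescape(data):
    """Return (escaped_text, is_binary) for a byte string.

    Assumption: the input is meant to be UTF-8. Anything that does not
    decode cleanly, or that contains control characters, is replaced by
    its repr(), which is plain escaped ASCII.
    """
    try:
        text = data.decode('utf-8', 'strict')
        is_binary = containscontrolchars(text)
    except UnicodeDecodeError:
        is_binary = True
    if is_binary:
        text = repr(data)
    text = (text.replace('&', '&amp;')
                .replace('<', '&lt;')
                .replace('>', '&gt;')
                .replace('"', '&quot;')
                .replace("'", '&#39;'))  # &apos; invalid in HTML
    return text, is_binary
```

A template could then use the second value to decide whether to emit
some marker attribute on the enclosing element; the name and form of
that marker are left open here.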

-- 
Mathematics is the supreme nostalgia of our time.