xml style doesn't generate valid xml

Matt Mackall mpm at selenic.com
Tue Nov 23 13:05:09 CST 2010


On Tue, 2010-11-23 at 16:57 +0000, Haszlakiewicz, Eric wrote:
> >-----Original Message-----
> >From: Matt Mackall [mailto:mpm at selenic.com]
> >On Mon, 2010-11-22 at 23:31 +0000, Haszlakiewicz, Eric wrote:
> >> It seems like xmlescape isn't escaping everything it needs to.  I'll
> >> bet you'll run into the same problem in other places, such as
> >> filenames, log messages, etc.: anywhere you could have bytes
> >> that aren't necessarily UTF-8 encoded.
> >
> >You raise an interesting point here. By the time a commit message
> >reaches xmlescape, it's already been converted from the UTF-8 we stored
> >it in to the local encoding. And if that encoding isn't UTF-8,
> >converting it back to UTF-8 will be lossy.
> 
> Oh, commit messages have to be in UTF-8?  I didn't realize that.  Sure
> enough, trying to enter arbitrary binary data in a commit message
> results in a "codec can't decode byte" error from Mercurial.  That
> seems quite sensible.  (Re-encoding back and forth, not so much.)
> 
> The same restriction does not appear to apply to filenames.

Indeed. See here: http://mercurial.selenic.com/wiki/EncodingStrategy
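
For a concrete picture of the lossy round trip mentioned above, here is a
rough Python sketch. It does not use Mercurial's own encoding code; the
roundtrip() helper, the sample string and the "replace" error handling
are all assumptions made up for illustration.

```python
def roundtrip(text, local_encoding):
    """Hypothetical helper: UTF-8 -> local encoding -> UTF-8."""
    stored = text.encode("utf-8")  # what the repository would hold
    # Convert to the local encoding; characters it cannot represent
    # become "?" (assumed error handling, not necessarily Mercurial's).
    local = stored.decode("utf-8").encode(local_encoding, "replace")
    # Convert back to UTF-8, as an XML writer would need to.
    return local.decode(local_encoding).encode("utf-8")


original = "caf\u00e9 \u2603"  # e-acute plus a snowman

utf8_result = roundtrip(original, "utf-8")      # nothing lost
latin1_result = roundtrip(original, "latin-1")  # snowman lost
ascii_result = roundtrip(original, "ascii")     # both non-ASCII chars lost

print(utf8_result == original.encode("utf-8"))
print(latin1_result)
print(ascii_result)
```

Only the UTF-8 locale gets the original bytes back; in the other two the
replaced characters are gone for good by the time anything downstream
sees the text.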

> However, the issue with xmlescape turning control characters into
> spaces is, I think, because there's an explicit line of code in xmlescape
> that does that!  In templatefilters.py, the last line of xmlescape is:
>     return re.sub('[\x00-\x08\x0B\x0C\x0E-\x1F]', ' ', text)

Ugh, who ordered that.
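
For reference, here is a small standalone sketch of what that quoted
re.sub line does. It is not the xmlescape source; blank_control_chars()
and the sample string are hypothetical names for illustration. The
character class covers the C0 control characters that XML 1.0 does not
allow, and leaves tab, LF and CR alone.

```python
import re

# Same character class as the quoted line, written as a raw string so
# the re module interprets the \x escapes itself.
_CONTROL_CHARS = re.compile(r'[\x00-\x08\x0B\x0C\x0E-\x1F]')


def blank_control_chars(text):
    """Hypothetical stand-in for the last step of xmlescape."""
    return _CONTROL_CHARS.sub(' ', text)


# \x01 and \x1b (ESC) are replaced; \n, \t and \r pass through.
sample = 'line one\n\tindented\x01\x1b[0m\r\n'
cleaned = blank_control_chars(sample)
print(repr(cleaned))
```

So the output is well-formed as far as those characters go, but the
original bytes are silently replaced rather than escaped.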

-- 
Mathematics is the supreme nostalgia of our time.



