xml style doesn't generate valid xml

Tue Nov 23 10:57:53 CST 2010

>-----Original Message-----
>From: Matt Mackall [mailto:mpm at selenic.com]
>On Mon, 2010-11-22 at 23:31 +0000, Haszlakiewicz, Eric wrote:
>> It seems like xmlescape isn't escaping everything it needs to.  I'll
>> bet you'll run into the same problem in other places, such as
>> filenames, log messages, etc... anywhere where you could have bytes
>> that aren't necessarily utf8 encoded
>
>You raise an interesting point here. By the time a commit message
>reaches xmlescape, it's already been converted from the UTF-8 we stored
>it in to the local encoding. And if that encoding isn't UTF-8,
>converting it back to UTF-8 will be lossy.

Oh, commit messages have to be in UTF-8?  I didn't realize that.  Sure enough, trying to enter arbitrary binary data in a commit message results in a "codec can't decode byte" error from mercurial.  That seems quite sensible.  (re-encoding back and forth, not so much)

The same restriction does not appear to apply to filenames.  You can create a file with any name, but then the hg output has the same problem with creating invalid xml as my original example:
  perl -e 'printf("echo foo > %c%c.txt", 5, 200);' > cmd.sh
  sh cmd.sh
  hg add *txt
  hg ci -m test1
  hg log -r tip --style xml -v > log.out
  xmllint --format log.out

Based on some testing with switching my locale back and forth mercurial actually does *not* try to switch the encoding when displaying filenames (or values in "extra") even though it does for commit messages. 

For the commit messages, I guess template map files would need some way to indicate that they don't want the automatic encoding conversion to happen.  I couldn't figure out where that occurs.

However, for the issue with xmlescape turning things into spaces, I think that's because there's an explicit line of code in xmlescape that does that!  In templatefilters.py, the last line of xmlescape is:
    return re.sub('[\x00-\x08\x0B\x0C\x0E-\x1F]', ' ', text)
To just get all the characters a "fix" would be to simply "return text", but of course then it still generates invalid xml.
The real fix would mean figuring out which characters are valid in the currently selected encoding (or assume utf8 for xml), and emit an character reference for those that aren't.  I don't know enough about coding in python or within the mercurial code to be able to figure out how to do that.

eric