xml style doesn't generate valid xml

Matt Mackall mpm at selenic.com
Mon Nov 22 22:35:32 CST 2010


On Mon, 2010-11-22 at 23:31 +0000, Haszlakiewicz, Eric wrote:
> >-----Original Message-----
> >From: Matt Mackall [mailto:mpm at selenic.com]
> >
> >On Mon, 2010-11-22 at 22:49 +0000, Haszlakiewicz, Eric wrote:
> >> I'm not going to paste the actual output into an email because it's
> >> not plain text.  However, here's a sample of what "less log.out"
> >> displays, minus the terminal dependent highlighting to distinguish
> >> between specially displayed characters and actual angle brackets:
> >> <extra key="transplant_source"><CD><B9><AA>Q<FF><A3>}<AC>JI<D5><8E><D8>zWL,<FB>;<F7></extra>
> >
> >Thanks. This tells me about 10 times more than your original message.
> >Perhaps a CDATA section is appropriate here:
> >
> >diff -r 77aa74fe0e0b mercurial/templates/map-cmdline.xml
> >--- a/mercurial/templates/map-cmdline.xml	Mon Nov 22 13:11:46 2010 -0600
> >+++ b/mercurial/templates/map-cmdline.xml	Mon Nov 22 17:20:47 2010 -0600
> >@@ -16,4 +16,4 @@
> > parent = '<parent revision="{rev}" node="{node}" />\n'
> > branch = '<branch>{branch|xmlescape}</branch>\n'
> > tag = '<tag>{tag|xmlescape}</tag>\n'
> >-extra = '<extra key="{key|xmlescape}">{value|xmlescape}</extra>\n'
> >+extra = '<extra key="{key|xmlescape}"><![CDATA[{value}]]></extra>\n'
> 
> Yeah, that's not going to help.  There's nothing that prevents the
> value from containing "]]>" and ending the CDATA section early.
> Also, that doesn't do anything about the encoding issue.  A CDATA
> section tells the XML parser not to interpret markup inside it, but
> the characters still need to be valid XML characters in the first
> place.

I see.
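
To make those two failure modes concrete, here's a rough Python sketch.
It is not Mercurial code: wrap_in_cdata and is_well_formed are made-up
helper names, and the standard-library ElementTree parser is only
standing in for whatever consumer reads the log output.

```python
import xml.etree.ElementTree as ET


def wrap_in_cdata(value: bytes) -> bytes:
    """Mimic the proposed template: drop the raw value into a CDATA section."""
    return b'<extra key="k"><![CDATA[' + value + b']]></extra>'


def is_well_formed(document: bytes) -> bool:
    """Return True if the standard-library parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Markup characters are fine inside CDATA; that part of the idea works.
plain_ok = is_well_formed(wrap_in_cdata(b"a < b & c"))

# A value containing "]]>" closes the section early and corrupts what follows.
early_close_ok = is_well_formed(wrap_in_cdata(b"x]]><oops"))

# Raw bytes that are not valid UTF-8 are rejected, CDATA or not.
raw_bytes_ok = is_well_formed(wrap_in_cdata(b"\xff\xa3}\xac"))

print(plain_ok, early_close_ok, raw_bytes_ok)
```

With CPython's expat-based parser I'd expect only the first document to
be accepted, which matches both objections above.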

> It seems like xmlescape isn't escaping everything it needs to.  I'll
> bet you'll run into the same problem in other places, such as
> filenames, log messages, etc.: anywhere you could have bytes that
> aren't necessarily UTF-8 encoded.

You raise an interesting point here. By the time a commit message
reaches xmlescape, it's already been converted from the UTF-8 we stored
it in to the local encoding. And if that encoding isn't UTF-8,
converting it back to UTF-8 will be lossy.
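
A small illustration of that round trip, as a sketch only: to_local and
back_to_utf8 are hypothetical helpers, and Python's errors="replace" is
an assumption standing in for however the real conversion handles
characters the local encoding can't represent.

```python
def to_local(stored_utf8: bytes, local_encoding: str) -> bytes:
    """Convert stored UTF-8 to the local encoding, replacing what won't fit."""
    text = stored_utf8.decode("utf-8")
    return text.encode(local_encoding, errors="replace")


def back_to_utf8(local_bytes: bytes, local_encoding: str) -> bytes:
    """Convert local-encoding bytes back to UTF-8 for XML output."""
    return local_bytes.decode(local_encoding).encode("utf-8")


# A commit message as stored: four characters, two of them outside Latin-1.
stored = "\u0141\u00f3d\u017a".encode("utf-8")

# Round trip through a Latin-1 locale versus a UTF-8 locale.
via_latin1 = back_to_utf8(to_local(stored, "latin-1"), "latin-1")
via_utf8 = back_to_utf8(to_local(stored, "utf-8"), "utf-8")

print(via_utf8 == stored)    # the UTF-8 locale preserves the message
print(via_latin1 == stored)  # the Latin-1 locale does not
```

The characters Latin-1 can't hold come back as "?", so the XML would be
valid but would no longer say what was committed.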

-- 
Mathematics is the supreme nostalgia of our time.



