xml style doesn't generate valid xml

Matt Mackall mpm at selenic.com
Tue Nov 23 17:58:32 CST 2010


On Tue, 2010-11-23 at 23:51 +0000, Haszlakiewicz, Eric wrote:
> >-----Original Message-----
> >From: Matt Mackall [mailto:mpm at selenic.com]
> >
> >On Tue, 2010-11-23 at 23:19 +0000, Haszlakiewicz, Eric wrote:
> >> >I think trying to encode potentially binary data in binary-hostile
> >> >Unicode (let alone text-hostile XML) is pretty hopeless. You'd probably
> >> >be better off with something like:
> >> >
> >> >def xmlescape(text):
> >> >    try:
> >> >        u = encoding.fromlocal(text) # convert back to UTF-8
> >> >        if containscontrolchars(u):
> >> >            u = repr(text)
> >>
> >> I think we'd need to call repr all the time.  Otherwise values with
> >> backslashes in them won't be correctly parseable.  e.g. with the above
> >> code
> >>   xmlescape("\xaa")
> >> and
> >>   xmlescape("\\xaa")
> >>
> >> both map to the four characters: \ x a a
> >> and you can't reverse the mapping.
> >
> >Hence my comment about marking which it is with an XML attribute.
> 
> Oh, I missed that.  That's going to be tricky to do, since xmlescape is
> just working with the value, not creating the entire tag.
> 
> >
> >> >    except UnicodeDecodeError:
> >> >        u = repr(text)
> >> >    u = (u.replace('&', '&amp;')
> >> >            .replace('<', '&lt;')
> >> >            .replace('>', '&gt;')
> >> >            .replace('"', '&quot;')
> >> >            .replace("'", '&#39;')) # &apos; invalid in HTML
> >> >    return u
> >> >
> >> >In other words, things that can be cleanly converted are, everything
> >> >else gets converted to escaped ASCII. You can probably use some XMLish
> >> >attribute or tag to mark the escaped ASCII as binary too.
> >>
> >> hmm... so anything that wanted to use the value would need to know to
> >> un-repr() it to get the actual value.
> >
> >Yes.
> >
> >> However, I think that fromlocal() call is going to mess things up for
> >> fields other than commit comments, so we probably need something like
> >> two functions: xmlescape_text and xmlescape_bin
> >
> >The thing is: no one knows what's in the extra field. It could be
> >binary, it could be ASCII, it could be UTF-8, it could even be a mix.
> 
> huh?  Since the output I got for the extra field didn't change when I
> switched my locale around I figured that it was being treated as a
> binary field.  How does mercurial decide whether to treat it as binary
> or ASCII or UTF-8?

extra["transplant_whatever"] -> binary
extra["branch"] -> UTF-8. 

The log code doesn't know anything about that and assumes a) everything
is potentially binary and b) it's ok to print it anyway, because that's
what the user asked for.
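
To make the "mark it with an attribute" idea concrete, here is a rough,
untested-against-hg sketch in standalone Python 3 (not Mercurial's
actual code). Everything in it is an assumption for illustration: the
helper names, the hypothetical `encoding` attribute on the `<extra>`
element, and the choice to treat values as UTF-8 rather than going
through encoding.fromlocal(). The escaper returns both the escaped text
and a marker saying whether the text is literal or a repr(), so the tag
builder can emit the marker and a consumer can reverse the mapping.

```python
import ast
from xml.sax.saxutils import escape

# Extra entities on top of the &, <, > that saxutils.escape() handles.
_QUOTES = {'"': "&quot;", "'": "&#39;"}


def contains_control_chars(u):
    """True if u has characters that plain XML 1.0 text cannot round-trip.

    CR is included because parsers normalize it to LF. This check is a
    simplification, not a full implementation of the XML Char production.
    """
    return any((ord(c) < 0x20 and c not in "\t\n") or c in "\ufffe\uffff"
               for c in u)


def xmlescape(data):
    """Return (escaped_text, kind) for a bytes value.

    kind is "text" when the value decoded cleanly as UTF-8, and "repr"
    when it had to be written out as escaped ASCII via repr().
    """
    kind = "text"
    try:
        u = data.decode("utf-8")
        if contains_control_chars(u):
            raise ValueError("control characters present")
    except ValueError:  # UnicodeDecodeError is a subclass of ValueError
        u = repr(data)
        kind = "repr"
    return escape(u, _QUOTES), kind


def extra_element(key, value):
    """Build a hypothetical <extra> element that records the kind."""
    text, kind = xmlescape(value)
    return '<extra key="%s" encoding="%s">%s</extra>' % (
        escape(key, _QUOTES), kind, text)


def decode_value(text, kind):
    """Reverse xmlescape(), given text already unescaped by an XML parser."""
    text = text or ""  # parsers report empty element content as None
    if kind == "repr":
        return ast.literal_eval(text)
    return text.encode("utf-8")
```

With this, "\xaa" comes out marked as "repr" and "\\xaa" comes out
marked as "text", so the two no longer collide. The cost is the one
already mentioned: anything consuming the output has to look at the
marker and un-repr() the value when it says so.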

-- 
Mathematics is the supreme nostalgia of our time.



