xml style doesn't generate valid xml

Tue Nov 23 17:51:16 CST 2010

>-----Original Message-----
>From: Matt Mackall [mailto:mpm at selenic.com]
>
>On Tue, 2010-11-23 at 23:19 +0000, Haszlakiewicz, Eric wrote:
>> >I think trying to encode potentially binary data in binary-hostile
>> >Unicode (let alone text-hostile XML) is pretty hopeless. You'd probably
>> >be better off with something like:
>> >
>> >def xmlescape(text):
>> >    try:
>> >        u = encoding.fromlocal(text) # convert back to UTF-8
>> >        if containscontrolchars(u):
>> >            u = repr(text)
>>
>> I think we'd need to call repr all the time.  Otherwise values with
>> backslashes in them won't be correctly parseable.  e.g. with the above code
>>   xmlescape("\xaa")
>> and
>>   xmlescape("\\xaa")
>>
>> both map to the four characters: \ x a a
>> and you can't reverse the mapping.
>
>Hence my comment about marking which it is with an XML attribute.

Oh, I missed that.  That's going to be tricky to do, since xmlescape only works with the value and doesn't create the entire tag.
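
One way around that might be to have the escaping function hand back a flag along with the value, so whatever builds the tag can add the attribute.  A rough sketch only, written for Python 3 and not taken from the Mercurial source: escape_marked, element and the encoding="repr" attribute are all made-up names, and plain UTF-8 decoding stands in for encoding.fromlocal().

```python
import re
from xml.sax.saxutils import escape

# C0 control characters other than tab, LF and CR, which XML 1.0 does not allow.
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def escape_marked(raw):
    """Return (escaped_text, is_repr) for a bytes value (hypothetical helper)."""
    try:
        text = raw.decode("utf-8")  # stand-in for encoding.fromlocal()
        is_repr = _CONTROL.search(text) is not None
    except UnicodeDecodeError:
        is_repr = True
    if is_repr:
        # Keep the full repr (b'...') so a reader can reverse it,
        # e.g. with ast.literal_eval after XML-unescaping.
        text = repr(raw)
    # escape() handles &, < and >; the extra entities cover both quote styles.
    return escape(text, {'"': "&quot;", "'": "&#39;"}), is_repr


def element(name, raw):
    """Build a whole tag, marking repr-escaped values with an attribute."""
    text, is_repr = escape_marked(raw)
    attr = ' encoding="repr"' if is_repr else ""
    return "<%s%s>%s</%s>" % (name, attr, text, name)


print(element("extra", b"\xaa"))   # the single byte 0xaa
print(element("extra", b"\\xaa"))  # the four characters \ x a a
```

With the marker, the two inputs from the example above no longer collide: only the first one carries the attribute, so a consumer knows which values need to be un-repr()'d.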

>
>> >    except UnicodeDecodeError:
>> >        u = repr(text)
>> >    u = (u.replace('&', '&amp;')
>> >            .replace('<', '&lt;')
>> >            .replace('>', '&gt;')
>> >            .replace('"', '&quot;')
>> >            .replace("'", '&#39;')) # &apos; invalid in HTML
>> >    return u
>> >
>> >In other words, things that can be cleanly converted are, everything
>> >else gets converted to escaped ASCII. You can probably use some XMLish
>> >attribute or tag to mark the escaped ASCII as binary too.
>>
>> hmm... so anything that wanted to use the value would need to know to
>> un-repr() it to get the actual value.
>
>Yes.
>
>> However, I think that fromlocal() call is going to mess things up for
>> fields other than commit comments, so we probably need something like
>> two functions: xmlescape_text and xmlescape_bin
>
>The thing is: no one knows what's in the extra field. It could be
>binary, it could be ASCII, it could be UTF-8, it could even be a mix.

Huh?  Since the output I got for the extra field didn't change when I switched my locale around, I figured that it was being treated as a binary field.  How does Mercurial decide whether to treat it as binary, ASCII, or UTF-8?

eric