[PATCH] convert/darcs: support changelogs with bytes 0x7F-0xFF (issue2411)

Sat Oct 2 04:47:33 CDT 2010

On Fri, Oct 1, 2010 at 9:22 PM, Mads Kiilerich <mads at kiilerich.com> wrote:
>  Brodie Rao wrote, On 10/01/2010 05:15 PM:
>>
>> # HG changeset patch
>> # User Brodie Rao<brodie at bitheap.org>
>> # Date 1285946104 18000
>> # Branch stable
>> # Node ID d0a4690bd027fcfd875726a764d4f1d8187fe0f8
>> # Parent  10dcfba4f16d42f2fd80faf91818a0d6607591da
>> convert/darcs: support changelogs with bytes 0x7F-0xFF (issue2411)
>>
>> This is a followup to 4481f8a93c7a, which only fixed the conversion of
>> patches with UTF-8 metadata.
>>
>> This patch allows a changelog to have any bytes with values
>> 0x7F-0xFF. It parses the XML changelog as Latin-1 and uses
>> converter_source.recode() to decode the data as UTF-8/Latin-1.
>
> What a hack! I'm not sure if it is the elegant or ugly kind of hack ;-)
>
> Wouldn't it be simpler to read all the data from the command output stream
> and convert the raw xml to unicode with the "right" decoder and then convert
> to utf-8 to get something that can be parsed? It might increase the memory
> usage, but I doubt it will make a real difference.

There is no "right" decoder. When the XML changelog can contain
arbitrary data and doesn't conform to any single encoding, how are you
going to parse it? Telling the parser to interpret it as Latin-1 seems
like the simplest solution to me.

Doing the "recode" calls after the XML parsing also makes it much more
forgiving about what it allows in the XML. This seems better than
having it just blow up if it can't convert the changelog into a
certain encoding before parsing it.

>> Caveats:
>>
>> - Since the convert extension doesn't provide any way to specify the
>>   source encoding, users are still limited to UTF-8 and Latin-1.
>
> As far as I can see the other converter sources uses encoding.encoding both
> as source and Mercurial encoding. Shouldn't the same be done here?

The cvs and gnuarch converters reference encoding.encoding, but
convcmd sets encoding.encoding to 'UTF-8' while running converters, so
it's pretty much useless.

3145951e50fe added the usage of encoding.encoding in those two
converters (replacing locale.getpreferredencoding). It should probably
be amended to set "encoding = encoding.encoding" at the class level,
instead of in __init__. It won't be overwritten by convcmd at that
point.

> Functionality for changing the encoding while converting could be nice, but
> that should perhaps introduced for all converter sources at once.
>
>> - etree will still complain if the changelog has bytes with values
>>   0x00-0x19. XML only allows printable characters.
>>
>> diff --git a/hgext/convert/darcs.py b/hgext/convert/darcs.py
>> --- a/hgext/convert/darcs.py
>> +++ b/hgext/convert/darcs.py
>> @@ -7,22 +7,22 @@
>>
>>  from common import NoRepo, checktool, commandline, commit,
>> converter_source
>>  from mercurial.i18n import _
>> -from mercurial import util
>> +from mercurial import encoding, util
>>  import os, shutil, tempfile, re
>>
>>  # The naming drift of ElementTree is fun!
>>
>>  try:
>> -    from xml.etree.cElementTree import ElementTree
>> +    from xml.etree.cElementTree import ElementTree, XMLParser
>>  except ImportError:
>>      try:
>> -        from xml.etree.ElementTree import ElementTree
>> +        from xml.etree.ElementTree import ElementTree, XMLParser
>>      except ImportError:
>>          try:
>> -            from elementtree.cElementTree import ElementTree
>> +            from elementtree.cElementTree import ElementTree, XMLParser
>>          except ImportError:
>>              try:
>> -                from elementtree.ElementTree import ElementTree
>> +                from elementtree.ElementTree import ElementTree,
>> XMLParser
>>              except ImportError:
>>                  ElementTree = None
>>
>> @@ -88,12 +88,24 @@ class darcs_source(converter_source, com
>>          self.ui.debug('cleaning up %s\n' % self.tmppath)
>>          shutil.rmtree(self.tmppath, ignore_errors=True)
>>
>> +    def recode(self, s, encoding=None):
>> +        if isinstance(s, unicode):
>> +            # XMLParser returns unicode objects for anything it can't
>> +            # encode into ASCII. We convert them back to str to get
>> +            # recode's normal conversion behavior.
>> +            s = s.encode('latin-1')
>> +        return super(darcs_source, self).recode(s, encoding)
>>
>
> This is where it starts to hurt. Converting to utf-8 before converting would
> make this change go away. Please?

Are you saying I should do s.decode('latin-1').encode('utf-8') before parsing?

>>      def xml(self, cmd, **kwargs):
>>          # NOTE: darcs is currently encoding agnostic and will print
>>          # patch metadata byte-for-byte, even in the XML changelog.
>>          etree = ElementTree()
>> +        # While we are decoding the XML as latin-1 to be as liberal as
>> +        # possible, etree will still raise an exception if any
>> +        # non-printable characters are in the XML changelog.
>> +        parser = XMLParser(encoding='latin-1')
>>          fp = self._run(cmd, **kwargs)
>> -        etree.parse(fp)
>> +        etree.parse(fp, parser=parser)
>>          self.checkexit(fp.close())
>>          return etree.getroot()
>
> /Mads
>