[PATCH] convert/darcs: support changelogs with bytes 0x7F-0xFF (issue2411)

Fri Oct 1 20:22:38 CDT 2010

  Brodie Rao wrote, On 10/01/2010 05:15 PM:
> # HG changeset patch
> # User Brodie Rao<brodie at bitheap.org>
> # Date 1285946104 18000
> # Branch stable
> # Node ID d0a4690bd027fcfd875726a764d4f1d8187fe0f8
> # Parent  10dcfba4f16d42f2fd80faf91818a0d6607591da
> convert/darcs: support changelogs with bytes 0x7F-0xFF (issue2411)
>
> This is a followup to 4481f8a93c7a, which only fixed the conversion of
> patches with UTF-8 metadata.
>
> This patch allows a changelog to have any bytes with values
> 0x7F-0xFF. It parses the XML changelog as Latin-1 and uses
> converter_source.recode() to decode the data as UTF-8/Latin-1.

What a hack! I'm not sure if it is the elegant or ugly kind of hack ;-)

Wouldn't it be simpler to read all the data from the command output 
stream and convert the raw xml to unicode with the "right" decoder and 
then convert to utf-8 to get something that can be parsed? It might 
increase the memory usage, but I doubt it will make a real difference.

> Caveats:
>
> - Since the convert extension doesn't provide any way to specify the
>    source encoding, users are still limited to UTF-8 and Latin-1.

As far as I can see the other converter sources uses encoding.encoding 
both as source and Mercurial encoding. Shouldn't the same be done here?

Functionality for changing the encoding while converting could be nice, 
but that should perhaps introduced for all converter sources at once.

> - etree will still complain if the changelog has bytes with values
>    0x00-0x19. XML only allows printable characters.
>
> diff --git a/hgext/convert/darcs.py b/hgext/convert/darcs.py
> --- a/hgext/convert/darcs.py
> +++ b/hgext/convert/darcs.py
> @@ -7,22 +7,22 @@
>
>   from common import NoRepo, checktool, commandline, commit, converter_source
>   from mercurial.i18n import _
> -from mercurial import util
> +from mercurial import encoding, util
>   import os, shutil, tempfile, re
>
>   # The naming drift of ElementTree is fun!
>
>   try:
> -    from xml.etree.cElementTree import ElementTree
> +    from xml.etree.cElementTree import ElementTree, XMLParser
>   except ImportError:
>       try:
> -        from xml.etree.ElementTree import ElementTree
> +        from xml.etree.ElementTree import ElementTree, XMLParser
>       except ImportError:
>           try:
> -            from elementtree.cElementTree import ElementTree
> +            from elementtree.cElementTree import ElementTree, XMLParser
>           except ImportError:
>               try:
> -                from elementtree.ElementTree import ElementTree
> +                from elementtree.ElementTree import ElementTree, XMLParser
>               except ImportError:
>                   ElementTree = None
>
> @@ -88,12 +88,24 @@ class darcs_source(converter_source, com
>           self.ui.debug('cleaning up %s\n' % self.tmppath)
>           shutil.rmtree(self.tmppath, ignore_errors=True)
>
> +    def recode(self, s, encoding=None):
> +        if isinstance(s, unicode):
> +            # XMLParser returns unicode objects for anything it can't
> +            # encode into ASCII. We convert them back to str to get
> +            # recode's normal conversion behavior.
> +            s = s.encode('latin-1')
> +        return super(darcs_source, self).recode(s, encoding)
>

This is where it starts to hurt. Converting to utf-8 before converting 
would make this change go away. Please?

>       def xml(self, cmd, **kwargs):
>           # NOTE: darcs is currently encoding agnostic and will print
>           # patch metadata byte-for-byte, even in the XML changelog.
>           etree = ElementTree()
> +        # While we are decoding the XML as latin-1 to be as liberal as
> +        # possible, etree will still raise an exception if any
> +        # non-printable characters are in the XML changelog.
> +        parser = XMLParser(encoding='latin-1')
>           fp = self._run(cmd, **kwargs)
> -        etree.parse(fp)
> +        etree.parse(fp, parser=parser)
>           self.checkexit(fp.close())
>           return etree.getroot()

/Mads