[PATCH stable] convert/darcs: handle non-ASCII metadata in darcs changelog (issue2354)

Tue Aug 31 15:14:32 CDT 2010

  Brodie Rao wrote, On 08/31/2010 09:53 PM:
> # HG changeset patch
> # User Brodie Rao<brodie at bitheap.org>
> # Date 1283284282 18000
> # Branch stable
> # Node ID 5f75837cd52207169d66498c9b045843b2122871
> # Parent  ee601a6264e0e78caa36a43709494669214fccfe
> convert/darcs: handle non-ASCII metadata in darcs changelog (issue2354)
>
> Given a commit author or message with non-ASCII characters in a darcs
> repo, convert would raise a UnicodeEncodeError when adding changesets
> to the hg changelog.
>
> This happened because etree returns back unicode objects for any text
> it can't encode into ASCII. convert was passing these objects to
> changelog.add(), which would then attempt encoding.fromlocal() on
> them.
>
> This patch ensures converter_source.recode() is called on each piece
> of commit data returned by etree.
>
> diff -r ee601a6264e0 -r 5f75837cd522 hgext/convert/darcs.py
> --- a/hgext/convert/darcs.py	Mon Aug 30 22:47:38 2010 +0200
> +++ b/hgext/convert/darcs.py	Tue Aug 31 14:51:22 2010 -0500
> @@ -106,9 +106,11 @@ class darcs_source(converter_source, com
>       def getcommit(self, rev):
>           elt = self.changes[rev]
>           date = util.strdate(elt.get('local_date'), '%a %b %d %H:%M:%S %Z %Y')
> -        desc = elt.findtext('name') + '\n' + elt.findtext('comment', '')
> -        return commit(author=elt.get('author'), date=util.datestr(date),
> -                      desc=desc.strip(), parents=self.parents[rev])
> +        desc = (self.recode(elt.findtext('name')) + '\n' +
> +                self.recode(elt.findtext('comment', '')))

Why call recode twice? OK, it makes it easier to verify in review that
every value taken from elt has been recoded. But the invariant could
just as well be that all parameters passed to commit have been recoded.
Never mind...
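
For illustration, the alternative invariant (recode everything in one place, just before building the commit) could look roughly like the sketch below. It is standalone Python 3, not Mercurial code: `recode` and `commit_fields` are hypothetical stand-ins for `converter_source.recode()` and the `commit(...)` call, and the XML snippet only imitates the attributes used in the patch.

```python
import xml.etree.ElementTree as ET


def recode(value, encoding="utf-8"):
    """Hypothetical stand-in: return value as UTF-8 bytes.

    Accepts either text (str) or already-encoded bytes.
    """
    if isinstance(value, str):
        return value.encode("utf-8")
    return value.decode(encoding).encode("utf-8")


def commit_fields(elt):
    """Gather the raw values from elt, then recode them all in one place."""
    raw = {
        "author": elt.get("author"),
        "desc": (elt.findtext("name") + "\n" +
                 elt.findtext("comment", "")).strip(),
    }
    # Single point where the "everything is recoded" invariant is enforced.
    return {key: recode(value) for key, value in raw.items()}


# Made-up changelog entry, shaped like the attributes the patch reads.
patch = ET.fromstring('<patch author="ñ"><name>p4: ñ</name></patch>')
fields = commit_fields(patch)
```

The trade-off is the one noted above: recoding per value makes each call site obviously safe, while recoding once keeps the rule in a single spot.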

> +        return commit(author=self.recode(elt.get('author')),

Should we add a test of unicode in author name too?

> +                      date=util.datestr(date), desc=desc.strip(),
> +                      parents=self.parents[rev])
>
>       def pull(self, rev):
>           output, status = self.run('pull', self.path, all=True,
> diff -r ee601a6264e0 -r 5f75837cd522 tests/test-convert-darcs
> --- a/tests/test-convert-darcs	Mon Aug 30 22:47:38 2010 +0200
> +++ b/tests/test-convert-darcs	Tue Aug 31 14:51:22 2010 -0500
> @@ -56,13 +56,17 @@ darcs remove dir/d2
>   rm dir/d2
>   darcs mv dir dir2
>   darcs record -a -l -m p3
> -cd ..
> +
> +echo % test utf-8 metadata
> +echo g>  g
> +darcs record -a -l -m 'p4: ñ' -A 'ñ'

So we assume that darcs uses and understands UTF-8? OK ... I guess?

>
>   glog()
>   {
> -    hg glog --template '{rev} "{desc|firstline}" files: {files}\n' "$@"
> +    HGENCODING=utf-8 hg glog --template '{rev} "{desc|firstline}" ({author}) files: {files}\n' "$@"
>   }
>
> +cd ..
>   hg convert darcs-repo darcs-repo-hg

I'm a bit surprised that we don't have to set HGENCODING here. But OK,
we rely on it falling back to utf-8 if ascii fails ...
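
The kind of fallback meant here can be sketched as follows. This is an assumption-laden illustration in Python 3, not what convert actually does: `decode_with_fallback` is a hypothetical helper, and the ascii-then-utf-8 order is taken only from the comment above.

```python
def decode_with_fallback(data, encodings=("ascii", "utf-8")):
    """Try each encoding in turn and return the first successful decode.

    If every encoding fails, decode with the last one and substitute
    U+FFFD for undecodable bytes instead of raising.
    """
    for enc in encodings:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    return data.decode(encodings[-1], "replace")


ascii_text = decode_with_fallback(b"p3")
utf8_text = decode_with_fallback("p4: ñ".encode("utf-8"))
```

With such a scheme, pure ASCII metadata never reaches the second encoding, and UTF-8 metadata from darcs is still decoded without HGENCODING being set.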

/Mads