Bug 2354 - unknown exception using convert due to an encoding issue
Summary: unknown exception using convert due to an encoding issue
Status: RESOLVED FIXED
Alias: None
Product: Mercurial
Classification: Unclassified
Component: Mercurial
Version: unspecified
Hardware: All All
Importance: normal bug
Assignee: Bugzilla
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-08-30 21:46 UTC by Lucas Chiesa
Modified: 2012-05-13 05:04 UTC
6 users

See Also:
Python Version: ---


Attachments
(34 bytes, application/x-gzip)
2010-08-31 12:07 UTC, brodie

Description Lucas Chiesa 2010-08-30 21:46 UTC
I'm trying to convert a Darcs repo to a Mercurial one.

When doing this, I get the following error:
http://paste.lisp.org/display/114070

My system is in utf-8:
$ locale
LANG=es_AR.UTF-8
LC_CTYPE="es_AR.UTF-8"
LC_NUMERIC="es_AR.UTF-8"
LC_TIME="es_AR.UTF-8"
LC_COLLATE="es_AR.UTF-8"
LC_MONETARY="es_AR.UTF-8"
LC_MESSAGES="es_AR.UTF-8"
LC_PAPER="es_AR.UTF-8"
LC_NAME="es_AR.UTF-8"
LC_ADDRESS="es_AR.UTF-8"
LC_TELEPHONE="es_AR.UTF-8"
LC_MEASUREMENT="es_AR.UTF-8"
LC_IDENTIFICATION="es_AR.UTF-8"
LC_ALL=

The offending patch seems to contain a character that is not valid UTF-8.
If I use darcs changes to see the patch description, I see...

"Saque el tama\c3\b1o en [..]" where the \c3\b1 is the problematic character.

This patch was written by someone using Windows, so it is most likely in
latin1. I tried adding LANG=es_AR.ISO-8859-1 and HGENCODING=latin-1 to
the convert line, with the same output.

Thanks!
Comment 1 kiilerix 2010-08-31 02:53 UTC
$ hg convert usbtinyisp2/ programador/
scanning source...
sorting...
converting...
9 Saque el tamaño en la funcion SPI,ahora viene en la trama de configure. NO ANDA
source: 20080716021925-6d91c-632461a7d8c5c5462f7bc19fa66dd86c88b37b1a.gz
spi/main.c
transaction abort!
rollback completed
** unknown exception encountered, details follow
** report bug details to http://mercurial.selenic.com/bts/
** or mercurial@selenic.com
** Python 2.6.6rc1+ (r266rc1:83691, Aug  5 2010, 17:07:04) [GCC 4.4.5 20100728 (prerelease)]
** Mercurial Distributed SCM (version 1.6.2)
** Extensions loaded: convert
Traceback (most recent call last):
  File "/usr/bin/hg", line 27, in <module>
    mercurial.dispatch.run()
  File "/usr/lib/pymodules/python2.6/mercurial/dispatch.py", line 16, in run
    sys.exit(dispatch(sys.argv[1:]))
  File "/usr/lib/pymodules/python2.6/mercurial/dispatch.py", line 34, in dispatch
    return _runcatch(u, args)
  File "/usr/lib/pymodules/python2.6/mercurial/dispatch.py", line 54, in _runcatch
    return _dispatch(ui, args)
  File "/usr/lib/pymodules/python2.6/mercurial/dispatch.py", line 490, in _dispatch
    cmdpats, cmdoptions)
  File "/usr/lib/pymodules/python2.6/mercurial/dispatch.py", line 351, in runcommand
    ret = _runcommand(ui, options, cmd, d)
  File "/usr/lib/pymodules/python2.6/mercurial/dispatch.py", line 541, in _runcommand
    return checkargs()
  File "/usr/lib/pymodules/python2.6/mercurial/dispatch.py", line 495, in checkargs
    return cmdfunc()
  File "/usr/lib/pymodules/python2.6/mercurial/dispatch.py", line 488, in <lambda>
    d = lambda: util.checksignature(func)(ui, *args, **cmdoptions)
  File "/usr/lib/pymodules/python2.6/mercurial/util.py", line 420, in check
    return func(*args, **kwargs)
  File "/usr/lib/pymodules/python2.6/hgext/convert/__init__.py", line 243, in convert
    return convcmd.convert(ui, src, dest, revmapfile, **opts)
  File "/usr/lib/pymodules/python2.6/hgext/convert/convcmd.py", line 429, in convert
    c.convert(sortmode)
  File "/usr/lib/pymodules/python2.6/hgext/convert/convcmd.py", line 359, in convert
    self.copy(c)
  File "/usr/lib/pymodules/python2.6/hgext/convert/convcmd.py", line 328, in copy
    source, self.map)
  File "/usr/lib/pymodules/python2.6/hgext/convert/hg.py", line 171, in putcommit
    self.repo.commitctx(ctx)
  File "/usr/lib/pymodules/python2.6/mercurial/localrepo.py", line 966, in commitctx
    user, ctx.date(), ctx.extra().copy())
  File "/usr/lib/pymodules/python2.6/mercurial/changelog.py", line 215, in add
    user, desc = encoding.fromlocal(user), encoding.fromlocal(desc)
  File "/usr/lib/pymodules/python2.6/mercurial/encoding.py", line 63, in fromlocal
    return s.decode(encoding, encodingmode).encode("utf-8")
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 13: ordinal not in range(128)
Comment 2 kiilerix 2010-08-31 03:18 UTC
"\xc3\xb1" is the utf-8 encoding of u'\xf1' (ñ), so I don't think there is
any latin-1 issue here.

The failure happens while converting u'\xf1' to utf-8, but it is strange
that the ascii encoder gives an error message even though we are using the
utf-8 encoder.

Does this work for you:
python -c 'print repr(u"\xf1".encode("utf-8"))'
?
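
The byte-level claim in this comment can be checked directly. Below is a minimal sketch in Python 3 syntax (the thread itself uses Python 2); the variable names are illustrative only and do not come from Mercurial:

```python
# Python 3 sketch; all names here are illustrative, not Mercurial code.
n_tilde = "\xf1"  # U+00F1, the "ñ" in "tamaño"

# The same character has a two-byte UTF-8 form and a one-byte Latin-1 form.
utf8_bytes = n_tilde.encode("utf-8")
latin1_bytes = n_tilde.encode("latin-1")

# If UTF-8 bytes were misread as Latin-1, they would appear as two characters,
# which is what a "\c3\b1"-style dump of the patch name suggests is stored.
misread = utf8_bytes.decode("latin-1")

print(repr(utf8_bytes), repr(latin1_bytes), repr(misread))
```

If the patch really were latin-1, the stored byte would be the single byte 0xf1 rather than the pair 0xc3 0xb1.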
Comment 3 Lucas Chiesa 2010-08-31 07:55 UTC
Hi!

The python command works:

$ python -c 'print repr(u"\xf1".encode("utf-8"))'
'\xc3\xb1'

Thanks
Comment 4 brodie 2010-08-31 12:07 UTC
I'm attaching a test repo that fails. It's a fresh repo with just one dummy changeset 
that has "Saque el tamaño" as the commit message.

I get the same error with ee601a6264e0 on Python 2.6.5/Ubuntu 10.04.

Steps I took to create the repo:

  $ mkdir foo
  $ cd foo
  $ darcs init
  $ echo a > a
  $ darcs add a
  $ darcs record -m 'Saque el tamaño'

Then to convert:

  $ cd ..
  $ hg convert foo foo-hg

My LANG was set to en_US.UTF-8 when I ran those commands.
Comment 5 brodie 2010-08-31 13:06 UTC
It looks like xml.etree.ElementTree.XMLParser by default assumes all input is UTF-8 (or 
something similar), and wherever it returns text from the document it tries to encode that 
text into ASCII. If that fails, it returns unicode objects.

So that commit message with "ñ" gets passed into changelog.add() as a unicode object, and it 
blows up trying to do encoding.fromlocal().

Another thing to keep in mind is that the XML changelog from darcs is what's in each patch 
verbatim; there's no consistent encoding, despite it being XML. etree will raise SyntaxError 
for data that isn't valid UTF-8 from what I can tell.
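
The mechanism described in this comment can be sketched as follows. This is an emulation in Python 3, not the Mercurial code: py2_style_decode and parses are hypothetical helpers. The first mimics what Python 2 does when .decode() is called on a unicode object (an implicit ASCII encode first), which is assumed here to be what happens inside encoding.fromlocal() in the traceback above.

```python
import xml.etree.ElementTree as ET


def py2_style_decode(value, encoding="utf-8"):
    """Hypothetical helper emulating Python 2's unicode.decode()."""
    if isinstance(value, str):
        # Python 2 silently encodes unicode to ASCII before decoding;
        # this hidden step is the one that fails on non-ASCII text.
        value = value.encode("ascii")
    return value.decode(encoding)


# etree hands back text (not bytes) for a UTF-8 encoded patch name.
desc = ET.fromstring("<name>Saque el tama\xf1o</name>".encode("utf-8")).text

try:
    py2_style_decode(desc)
    error = None
except UnicodeEncodeError as exc:
    error = exc  # same error class and codec as in the traceback


def parses(data):
    """Return True if etree accepts the raw bytes, False on a parse error."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:  # a subclass of SyntaxError
        return False


# Latin-1 bytes with no encoding declaration are not valid UTF-8.
latin1_ok = parses("<name>Saque el tama\xf1o</name>".encode("latin-1"))

print(error)
print(latin1_ok)
```

The failing index reported by the emulated error points at the "ñ" in "Saque el tamaño", in line with the position given in the original traceback.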
Comment 6 kiilerix 2010-08-31 13:29 UTC
This seems to fix it:

--- a/hgext/convert/darcs.py
+++ b/hgext/convert/darcs.py
@@ -108,7 +108,7 @@
         date = util.strdate(elt.get('local_date'), '%a %b %d %H:%M:%S %Z %Y')
         desc = elt.findtext('name') + '\n' + elt.findtext('comment', '')
         return commit(author=elt.get('author'), date=util.datestr(date),
-                      desc=desc.strip(), parents=self.parents[rev])
+                      desc=self.recode(desc.strip()), parents=self.parents[rev])
 
     def pull(self, rev):
         output, status = self.run('pull', self.path, all=True,

(I hadn't seen Brodie's last mail, so there might be some duplicate work
here...)
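
The body of self.recode is not shown in this thread, so the following is only a sketch of the kind of normalization such a helper could perform, written in Python 3. The name to_utf8_bytes, its signature, and the latin-1 fallback are all assumptions made for illustration:

```python
def to_utf8_bytes(text, encoding="utf-8"):
    """Hypothetical helper: return UTF-8 bytes for text or bytes input."""
    if isinstance(text, str):
        # Already decoded text (what etree returns): just encode it.
        return text.encode("utf-8")
    try:
        # Bytes in the expected encoding: decode, then re-encode as UTF-8.
        return text.decode(encoding).encode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback: latin-1 maps every byte to a code point,
        # so this decode cannot fail.
        return text.decode("latin-1").encode("utf-8")


print(to_utf8_bytes("Saque el tama\xf1o"))
print(to_utf8_bytes(b"Saque el tama\xf1o"))
```

The point of the sketch is that the commit description reaches changelog.add() as bytes in one known encoding, whether the XML layer returned text or bytes.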
Comment 7 HG Bot 2010-09-12 07:00 UTC
Fixed by http://hg.intevation.org/mercurial/crew/rev/4481f8a93c7a
Brodie Rao <brodie@bitheap.org>
convert/darcs: handle non-ASCII metadata in darcs changelog (issue2354)
Comment 8 Bugzilla 2012-05-12 09:12 UTC

--- Bug imported by bugzilla@serpentine.com 2012-05-12 09:12 EDT  ---

This bug was previously known as _bug_ 2354 at http://mercurial.selenic.com/bts/issue2354
Imported an attachment (id=1451)