[PATCH v2] patch: when importing from email, RFC2047-decode From/Subject headers

Julien Cristau jcristau at debian.org
Thu Mar 3 15:44:26 EST 2016


On Thu, Mar  3, 2016 at 12:49:22 -0600, Matt Mackall wrote:

> On Thu, 2016-03-03 at 18:55 +0100, Julien Cristau wrote:
> > # HG changeset patch
> > # User Julien Cristau <julien.cristau at logilab.fr>
> > # Date 1457026459 -3600
> > #      Thu Mar 03 18:34:19 2016 +0100
> > # Node ID 6c153cbad4a032861417dbba9d1d90332964ab5f
> > # Parent  549ff28a345f595cad7e06fb08c2ac6973e2f030
> > patch: when importing from email, RFC2047-decode From/Subject headers
> > 
> > I'm not too sure about the Subject part: it should be possible to use
> > the charset information from the email (RFC2047 encoding and the
> > Content-Type header), but mercurial seems to use its own encoding
> > instead (in the test, that means the commit message ends up as "????"
> > if the import is done without --encoding utf-8).  Advice welcome.
> > 
> > Reported at https://bugs.debian.org/737498
> 
> You should probably immediately relay such reports upstream.
> 
Indeed.  I spent some time tidying https://bugs.debian.org/src:mercurial
today.  Of the remaining bugs (other than this one), one is a packaging
issue, three are six-year-old zeroconf extension issues (I know nothing
about that extension), another is a six-year-old demandimport
performance issue which should probably just be closed at this point,
and the rest are either already forwarded to hg bz or marked wontfix.

New attempt at a fix below, which should address your comments.  Changes
in v2:
- moved decoding to a new mercurial.mail.headdecode function
- fall back to utf-8, then latin1, instead of ascii
- renamed the parts variable to uparts, as it contains unicode objects
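
To illustrate the fallback order listed above (declared charset, then
UTF-8, then ISO-8859-1), here is a rough standalone sketch for Python 3.
It is not the patch itself: the name headdecode_sketch is hypothetical,
it uses email.header.decode_header (the Python 3 spelling of the
email.Header module), it returns a str instead of going through
encoding.tolocal, and it joins parts with an empty string on the
assumption that Python 3 leaves the separating whitespace inside the
unencoded parts.

```python
from email.header import decode_header


def headdecode_sketch(s):
    """Decode an RFC 2047 header value to str (illustrative sketch only).

    Tries the charset declared in each encoded word first, then UTF-8,
    and finally ISO-8859-1, which accepts any byte sequence.
    """
    uparts = []
    for part, charset in decode_header(s):
        if isinstance(part, str):
            # Python 3 returns str when the header has no encoded words.
            uparts.append(part)
            continue
        if charset is not None:
            try:
                uparts.append(part.decode(charset))
                continue
            except (UnicodeDecodeError, LookupError):
                # Wrong or unknown charset label: fall through.
                pass
        try:
            uparts.append(part.decode('utf-8'))
            continue
        except UnicodeDecodeError:
            pass
        # ISO-8859-1 maps every byte to a code point, so this cannot fail.
        uparts.append(part.decode('iso-8859-1'))
    # Assumption: whitespace between parts is already present in the parts.
    return ''.join(uparts)


if __name__ == '__main__':
    print(headdecode_sketch('=?UTF-8?q?Rapha=C3=ABl=20Hertzog?='))
```

The last branch matters for mislabelled headers: a word declared as
us-ascii but containing the byte 0xE9 fails both the declared charset and
UTF-8, and is only recovered by the ISO-8859-1 fallback.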

Thanks,
Julien

# HG changeset patch
# User Julien Cristau <julien.cristau at logilab.fr>
# Date 1457026459 -3600
#      Thu Mar 03 18:34:19 2016 +0100
# Node ID 981e5fd56a9973e0069173b5f6c03639d9e176aa
# Parent  e00e57d836535aadcb13337613d2f891492d8e04
patch: when importing from email, RFC2047-decode From/Subject headers

Reported at https://bugs.debian.org/737498

diff --git a/mercurial/mail.py b/mercurial/mail.py
--- a/mercurial/mail.py
+++ b/mercurial/mail.py
@@ -332,3 +332,21 @@ def mimeencode(ui, s, charsets=None, dis
     if not display:
         s, cs = _encode(ui, s, charsets)
     return mimetextqp(s, 'plain', cs)
+
+def headdecode(s):
+    '''Decodes RFC-2047 header'''
+    uparts = []
+    for part, charset in email.Header.decode_header(s):
+        if charset is not None:
+            try:
+                uparts.append(part.decode(charset))
+                continue
+            except UnicodeDecodeError:
+                pass
+        try:
+            uparts.append(part.decode('UTF-8'))
+            continue
+        except UnicodeDecodeError:
+            pass
+        uparts.append(part.decode('ISO-8859-1'))
+    return encoding.tolocal(u' '.join(uparts).encode('UTF-8'))
diff --git a/mercurial/patch.py b/mercurial/patch.py
--- a/mercurial/patch.py
+++ b/mercurial/patch.py
@@ -31,6 +31,7 @@ from . import (
     diffhelpers,
     encoding,
     error,
+    mail,
     mdiff,
     pathutil,
     scmutil,
@@ -210,8 +211,8 @@ def extract(ui, fileobj):
     try:
         msg = email.Parser.Parser().parse(fileobj)
 
-        subject = msg['Subject']
-        data['user'] = msg['From']
+        subject = msg['Subject'] and mail.headdecode(msg['Subject'])
+        data['user'] = msg['From'] and mail.headdecode(msg['From'])
         if not subject and not data['user']:
             # Not an email, restore parsed headers if any
             subject = '\n'.join(': '.join(h) for h in msg.items()) + '\n'
diff --git a/tests/test-import-git.t b/tests/test-import-git.t
--- a/tests/test-import-git.t
+++ b/tests/test-import-git.t
@@ -822,4 +822,27 @@ Test corner case involving copies and mu
   > EOF
   applying patch from stdin
 
+Test email metadata
+
+  $ hg revert -qa
+  $ hg --encoding utf-8 import - <<EOF
+  > From: =?UTF-8?q?Rapha=C3=ABl=20Hertzog?= <hertzog at debian.org>
+  > Subject: [PATCH] =?UTF-8?q?=C5=A7=E2=82=AC=C3=9F=E1=B9=AA?=
+  > 
+  > diff --git a/a b/a
+  > --- a/a
+  > +++ b/a
+  > @@ -1,1 +1,2 @@
+  >  a
+  > +a
+  > EOF
+  applying patch from stdin
+  $ hg --encoding utf-8 log -r .
+  changeset:   2:* (glob)
+  tag:         tip
+  user:        Rapha\xc3\xabl Hertzog <hertzog at debian.org> (esc)
+  date:        * (glob)
+  summary:     \xc5\xa7\xe2\x82\xac\xc3\x9f\xe1\xb9\xaa (esc)
+  
+
   $ cd ..

