[PATCH 1 of 3 RFC] mercurial: implement a source transforming module loader on Python 3

Gregory Szorc gregory.szorc at gmail.com
Mon May 16 04:02:51 UTC 2016


# HG changeset patch
# User Gregory Szorc <gregory.szorc at gmail.com>
# Date 1463370916 25200
#      Sun May 15 20:55:16 2016 -0700
# Node ID 7c5d1f8db9618f511f40bc4089145310671ca57b
# Parent  f8b87a779c87586aa043bcd6030369715edfc9c1
mercurial: implement a source transforming module loader on Python 3

The most painful part of ensuring Python code runs on both Python 2
and 3 is string encoding. The difficulty is that string literals are
bytes in Python 2 and unicode in Python 3. So, to ensure consistent
types are used, you have to use "from __future__ import
unicode_literals" and/or prefix literals with their type (e.g. b'foo'
or u'foo').

Nearly every string in Mercurial is bytes. So, to use the same source
code on both Python 2 and 3 would require prefixing nearly every
string literal with "b" to make it a byte literal. This is ugly and
not something mpm is willing to do.

This patch implements a custom module loader on Python 3 that performs
source transformation to convert string literals (unicode in Python 3)
to byte literals. In effect, it changes Python 3's string literals to
behave like Python 2's.

The module loader is only used on mercurial.* and hgext.* modules.
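
To illustrate the prefix filtering, here is a minimal standalone
sketch, not the patch itself. The names prefixfinder, demoloader and
PREFIXES are hypothetical, and the search is delegated to the standard
importlib.machinery.PathFinder rather than to the other entries on
sys.meta_path as the patch does.

```python
import importlib.abc
import importlib.machinery

# Hypothetical constant for this sketch; the patch hard-codes the tuple.
PREFIXES = ('mercurial.', 'hgext.')


class demoloader(importlib.machinery.SourceFileLoader):
    """Stand-in for a source-transforming loader (no transform here)."""


class prefixfinder(importlib.abc.MetaPathFinder):
    """Only claim modules that live under the chosen packages."""

    def find_spec(self, fullname, path, target=None):
        # Leave every module outside the chosen packages alone.
        if not fullname.startswith(PREFIXES):
            return None

        # Delegate the actual search to the standard path-based finder.
        spec = importlib.machinery.PathFinder.find_spec(fullname, path,
                                                        target)
        if spec is None:
            return None

        # Only plain source files can be re-read and transformed.
        if not isinstance(spec.loader,
                          importlib.machinery.SourceFileLoader):
            return None

        spec.loader = demoloader(spec.name, spec.origin)
        return spec


finder = prefixfinder()
print(finder.find_spec('json', None))       # None: not one of our prefixes
print(finder.find_spec('mercurial', None))  # None: no trailing dot
```

Note that the top-level package name ("mercurial" without a trailing
dot) does not match, so in this sketch the package's own __init__.py is
never handed to the custom loader.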

The loader works by tokenizing the loaded source and replacing
"string" tokens if necessary. The modified token stream is
untokenized back to source and loaded as normal. This does add some
overhead. However, the transformation occurs before bytecode caching,
so .pyc files should cache the version with byte literals.
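
The tokenize round trip can be tried outside the import machinery. The
following is a rough sketch that mirrors the replacetoken() function
from the diff below. The transform() helper and the sample source are
made up for illustration.

```python
import io
import token
import tokenize


def replacetoken(t):
    """Prefix plain string literals with b; leave everything else alone."""
    if t.type != token.STRING:
        return t

    s = t.string

    # Triple-quoted strings are treated as docstrings and kept as str.
    if s[0:3] in ("'''", '"""'):
        return t

    # Literals that already carry a prefix (b'', u'', r'') are left as is.
    if s[0] not in ("'", '"'):
        return t

    return tokenize.TokenInfo(t.type, 'b' + s, t.start, t.end, t.line)


def transform(source):
    """Hypothetical helper: rewrite source bytes via the token stream."""
    tokens = tokenize.tokenize(io.BytesIO(source).readline)
    # The stream starts with an ENCODING token, so untokenize() returns
    # bytes in that encoding.
    return tokenize.untokenize(replacetoken(t) for t in tokens)


src = b"x = 'abc'\ny = u'def'\n"
namespace = {}
exec(compile(transform(src), '<demo>', 'exec'), namespace)
print(type(namespace['x']), type(namespace['y']))
```

After the transform, x is bound to a bytes object while the explicitly
prefixed u'def' stays a str.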

This patch isn't suitable for checkin. There are a few deficiencies,
including that changes to the loader don't invalidate cached .pyc
files. As part of testing this, I've had to manually blow away
__pycache__ directories. We'll likely need to hack up cache checking
as well so the cache is invalidated when mercurial/__init__.py
changes. This is going to be ugly.

diff --git a/mercurial/__init__.py b/mercurial/__init__.py
--- a/mercurial/__init__.py
+++ b/mercurial/__init__.py
@@ -139,14 +139,89 @@ class hgimporter(object):
             if not modinfo:
                 raise ImportError('could not find mercurial module %s' %
                                   name)
 
         mod = imp.load_module(name, *modinfo)
         sys.modules[name] = mod
         return mod
 
+if sys.version_info[0] >= 3:
+    from . import pure
+    import importlib
+    import io
+    import token
+    import tokenize
+
+    class hgpathentryfinder(importlib.abc.MetaPathFinder):
+        """A sys.meta_path finder."""
+        def find_spec(self, fullname, path, target=None):
+            # Our custom loader rewrites source code and Python code
+            # that doesn't belong to Mercurial doesn't expect this.
+            if not fullname.startswith(('mercurial.', 'hgext.')):
+                return None
+
+            # This assumes Python 3 doesn't support loading C modules.
+            if fullname in _dualmodules:
+                stem = fullname.split('.')[-1]
+                fullname = 'mercurial.pure.%s' % stem
+                target = pure
+                assert len(path) == 1
+                path = [os.path.join(path[0], 'pure')]
+
+            # Try to find the module using other registered finders.
+            spec = None
+            for finder in sys.meta_path:
+                if finder is self:
+                    continue
+
+                spec = finder.find_spec(fullname, path, target=target)
+                if spec:
+                    break
+
+            if not spec:
+                return None
+
+            if fullname.startswith('mercurial.pure.'):
+                spec.name = spec.name.replace('.pure.', '.')
+
+            # TODO need to support loaders from alternate specs, like zip
+            # loaders.
+            spec.loader = hgloader(spec.name, spec.origin)
+            return spec
+
+    def replacetoken(t):
+        if t.type == token.STRING:
+            s = t.string
+
+            # If a docstring, keep it as a string literal.
+            if s[0:3] in ("'''", '"""'):
+                return t
+
+            if s[0] not in ("'", '"'):
+                return t
+
+            # String literal. Prefix to make a b'' string.
+            return tokenize.TokenInfo(t.type, 'b%s' % s, t.start, t.end, t.line)
+
+        return t
+
+    class hgloader(importlib.machinery.SourceFileLoader):
+        """Custom module loader that transforms source code.
+
+        When the source code is converted to code, we first transform
+        string literals to byte literals using the tokenize API.
+        """
+        def source_to_code(self, data, path):
+            buf = io.BytesIO(data)
+            tokens = tokenize.tokenize(buf.readline)
+            data = tokenize.untokenize(replacetoken(t) for t in tokens)
+            return super(hgloader, self).source_to_code(data, path)
+
 # We automagically register our custom importer as a side-effect of loading.
 # This is necessary to ensure that any entry points are able to import
 # mercurial.* modules without having to perform this registration themselves.
-if not any(isinstance(x, hgimporter) for x in sys.meta_path):
-    # meta_path is used before any implicit finders and before sys.path.
-    sys.meta_path.insert(0, hgimporter())
+if sys.version_info[0] >= 3:
+    sys.meta_path.insert(0, hgpathentryfinder())
+else:
+    if not any(isinstance(x, hgimporter) for x in sys.meta_path):
+        # meta_path is used before any implicit finders and before sys.path.
+        sys.meta_path.insert(0, hgimporter())

