[PATCH 1 of 3 RFC] mercurial: implement a source transforming module loader on Python 3

Gregory Szorc gregory.szorc at gmail.com
Mon May 16 02:12:43 EDT 2016


On Sun, May 15, 2016 at 9:02 PM, Gregory Szorc <gregory.szorc at gmail.com>
wrote:

> # HG changeset patch
> # User Gregory Szorc <gregory.szorc at gmail.com>
> # Date 1463370916 25200
> #      Sun May 15 20:55:16 2016 -0700
> # Node ID 7c5d1f8db9618f511f40bc4089145310671ca57b
> # Parent  f8b87a779c87586aa043bcd6030369715edfc9c1
> mercurial: implement a source transforming module loader on Python 3
>
> The most painful part of ensuring Python code runs on both Python 2
> and 3 is string encoding. Making this difficult is that string
> literals in Python 2 are bytes and string literals in Python 3 are
> unicode. So, to ensure consistent types are used, you have to
> use "from __future__ import unicode_literals" and/or prefix literals
> with their type (e.g. b'foo' or u'foo').
>
> Nearly every string in Mercurial is bytes. So, to use the same source
> code on both Python 2 and 3 would require prefixing nearly every
> string literal with "b" to make it a byte literal. This is ugly and
> not something mpm is willing to do.
>
> This patch implements a custom module loader on Python 3 that performs
> source transformation to convert string literals (unicode in Python 3)
> to byte literals. In effect, it changes Python 3's string literals to
> behave like Python 2's.
>
> The module loader is only used on mercurial.* and hgext.* modules.
>
> The loader works by tokenizing the loaded source and replacing
> "string" tokens if necessary. The modified token stream is
> untokenized back to source and loaded like normal. This does add some
> overhead. However, this all occurs before caching. So .pyc files should
> cache the version with byte literals.
>
> This patch isn't suitable for checkin. There are a few deficiencies,
> including that changes to the loader won't result in the cache
> being invalidated. As part of testing this, I've had to manually
> blow away __pycache__ directories. We'll likely need to hack up
> cache checking as well so caching is invalidated when
> mercurial/__init__.py changes. This is going to be ugly.
>

Slightly more context for this patch.

I initially tried to implement things at the AST level. However, Python's
AST APIs are quite convoluted. lib2to3 has a bit nicer "framework" for
doing source transformations, but the API isn't stable. There are 3rd party
libraries, but I don't want to introduce the dependency.

After realizing AST manipulation would be too much work, I briefly looked
at the "parser" module. It was also a bit gnarly. So, I went to the next
lowest level - the tokenize module - and realized a solution was pretty
trivial to implement. The cost is that it might be too low-level, so more
advanced rewriting could be difficult in the future. It might also be a bit
more expensive than AST transforms. But it should be "fast enough,"
especially with .pyc caching.
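
To make the token-level approach concrete, here is a minimal standalone
sketch of the rewrite, separate from any module loader. The names
to_bytes_literal and rewrite are made up for illustration and the sample
source is invented. Like the patch, it leaves prefixed and triple-quoted
literals untouched.

```python
import io
import token
import tokenize


def to_bytes_literal(tok):
    """Prefix plain string literals with b; leave everything else alone."""
    if tok.type != token.STRING:
        return tok
    s = tok.string
    # Already prefixed (b'', u'', r'', ...) or triple-quoted: keep as-is.
    if s[0] not in ("'", '"') or s[:3] in ("'''", '"""'):
        return tok
    # TokenInfo is a namedtuple, so _replace returns a modified copy.
    return tok._replace(string='b' + s)


def rewrite(source_bytes):
    """Return source_bytes with plain string literals turned into bytes."""
    tokens = tokenize.tokenize(io.BytesIO(source_bytes).readline)
    # The stream starts with an ENCODING token, so untokenize returns bytes.
    return tokenize.untokenize(to_bytes_literal(t) for t in tokens)


# Invented sample: one plain literal and one explicitly unicode literal.
sample = b"x = 'abc'\ny = u'def'\n"
namespace = {}
exec(compile(rewrite(sample), '<sample>', 'exec'), namespace)
```

After running this, namespace['x'] is a bytes object and namespace['y']
is still a str, which is the Python 2-like behavior we're after.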

Now that we're operating at the source level and we're effectively changing
the source string in its entirety, it would be possible to reimplement this
source transformation as a "# coding:" hack, as Jun suggested and I earlier
discounted (because I thought we'd have to operate at the AST level; I
didn't realize the tokenize module could facilitate what we needed). Given
the gotchas with hacking up module importing, a "# coding:" hack seems
attractive again. However, I think we still have the .pyc caching problem:
if we change features of the source rewriter, we need a way to tell the
module loader that the .pyc is invalid (.pyc validation looks at source
file size and mtime to check cache freshness). We'd need to use a custom
module loader on Python 3 to provide custom validation functionality. Or
we'd need to do something really ugly like rewrite the installed files to
have a randomly generated "# coding:" value. Yuck.
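
To illustrate why the cache can't notice a change to the rewriter, the
sketch below compiles a throwaway file and reads back the .pyc header. It
assumes Python 3.7 or newer, where a timestamp-based .pyc starts with a
16-byte header (magic, flags, source mtime, source size); older Python 3
releases use a shorter header without the flags field. The helper
read_pyc_header and the file names are invented for this example.

```python
import importlib.util
import os
import py_compile
import tempfile


def read_pyc_header(pyc_path):
    """Return (magic, flags, mtime, size) from a timestamp-based .pyc.

    Assumes the 16-byte header layout used by Python 3.7 and newer.
    """
    with open(pyc_path, 'rb') as fh:
        header = fh.read(16)
    magic = header[0:4]
    flags = int.from_bytes(header[4:8], 'little')
    mtime = int.from_bytes(header[8:12], 'little')
    size = int.from_bytes(header[12:16], 'little')
    return magic, flags, mtime, size


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'example.py')
    with open(src, 'w') as fh:
        fh.write("x = 'abc'\n")
    # Force timestamp-based invalidation so the header carries mtime/size.
    pyc = py_compile.compile(
        src, cfile=os.path.join(tmp, 'example.pyc'), doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP)
    magic, flags, mtime, size = read_pyc_header(pyc)
    st = os.stat(src)
```

Everything in that header comes from the interpreter version and the
source file itself. Nothing records which transformation produced the
code object, so a changed rewriter leaves existing .pyc files looking
fresh.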

Anyway, before I do any more work on this, I'd like feedback. I'm
optimistic mpm will like it because it feels like the least invasive
approach, even if it does require sprinkling some u'' around the source. We
might even be able to undo some of the iteritems() and xrange() rewrites
we've done...


>
> diff --git a/mercurial/__init__.py b/mercurial/__init__.py
> --- a/mercurial/__init__.py
> +++ b/mercurial/__init__.py
> @@ -139,14 +139,89 @@ class hgimporter(object):
>              if not modinfo:
>                  raise ImportError('could not find mercurial module %s' %
>                                    name)
>
>          mod = imp.load_module(name, *modinfo)
>          sys.modules[name] = mod
>          return mod
>
> +if sys.version_info[0] >= 3:
> +    from . import pure
> +    import importlib
> +    import io
> +    import token
> +    import tokenize
> +
> +    class hgpathentryfinder(importlib.abc.PathEntryFinder):
> +        """A sys.meta_path finder."""
> +        def find_spec(self, fullname, path, target=None):
> +            # Our custom loader rewrites source code and Python code
> +            # that doesn't belong to Mercurial doesn't expect this.
> +            if not fullname.startswith(('mercurial.', 'hgext.')):
> +                return None
> +
> +            # This assumes Python 3 doesn't support loading C modules.
> +            if fullname in _dualmodules:
> +                stem = fullname.split('.')[-1]
> +                fullname = 'mercurial.pure.%s' % stem
> +                target = pure
> +                assert len(path) == 1
> +                path = [os.path.join(path[0], 'pure')]
> +
> +            # Try to find the module using other registered finders.
> +            spec = None
> +            for finder in sys.meta_path:
> +                if finder == self:
> +                    continue
> +
> +                spec = finder.find_spec(fullname, path, target=target)
> +                if spec:
> +                    break
> +
> +            if not spec:
> +                return None
> +
> +            if fullname.startswith('mercurial.pure.'):
> +                spec.name = spec.name.replace('.pure.', '.')
> +
> +            # TODO need to support loaders from alternate specs, like zip
> +            # loaders.
> +            spec.loader = hgloader(spec.name, spec.origin)
> +            return spec
> +
> +    def replacetoken(t):
> +        if t.type == token.STRING:
> +            s = t.string
> +
> +            # If a docstring, keep it as a string literal.
> +            if s[0:3] in ("'''", '"""'):
> +                return t
> +
> +            if s[0] not in ("'", '"'):
> +                return t
> +
> +            # String literal. Prefix to make a b'' string.
> +            return tokenize.TokenInfo(t.type, 'b%s' % s, t.start, t.end, t.line)
> +
> +        return t
> +
> +    class hgloader(importlib.machinery.SourceFileLoader):
> +        """Custom module loader that transforms source code.
> +
> +        When the source code is converted to code, we first transform
> +        string literals to byte literals using the tokenize API.
> +        """
> +        def source_to_code(self, data, path):
> +            buf = io.BytesIO(data)
> +            tokens = tokenize.tokenize(buf.readline)
> +            data = tokenize.untokenize(replacetoken(t) for t in tokens)
> +            return super(hgloader, self).source_to_code(data, path)
> +
>  # We automagically register our custom importer as a side-effect of loading.
>  # This is necessary to ensure that any entry points are able to import
>  # mercurial.* modules without having to perform this registration themselves.
> -if not any(isinstance(x, hgimporter) for x in sys.meta_path):
> -    # meta_path is used before any implicit finders and before sys.path.
> -    sys.meta_path.insert(0, hgimporter())
> +if sys.version_info[0] >= 3:
> +    sys.meta_path.insert(0, hgpathentryfinder())
> +else:
> +    if not any(isinstance(x, hgimporter) for x in sys.meta_path):
> +        # meta_path is used before any implicit finders and before sys.path.
> +        sys.meta_path.insert(0, hgimporter())
>
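
For anyone who wants to poke at the mechanism outside of Mercurial, here
is a rough end-to-end sketch: a sys.meta_path finder that swaps in a
source-transforming loader for a single throwaway module. It is not the
patch itself. The module name hg_demo_module and the class names are
invented, it subclasses importlib.abc.MetaPathFinder because the object
sits on sys.meta_path, it imports importlib.abc and importlib.machinery
explicitly, and it delegates lookup to the stock PathFinder instead of
looping over the other finders.

```python
import importlib
import importlib.abc
import importlib.machinery
import io
import os
import sys
import tempfile
import token
import tokenize

MODULE_NAME = 'hg_demo_module'  # hypothetical module used only for this demo


def replacetoken(t):
    """Turn unprefixed, single-quoted-style string literals into b''."""
    if t.type == token.STRING:
        s = t.string
        if s[:3] not in ("'''", '"""') and s[0] in ("'", '"'):
            return t._replace(string='b' + s)
    return t


class BytesLiteralLoader(importlib.machinery.SourceFileLoader):
    """Source loader that rewrites string literals before compiling."""
    def source_to_code(self, data, path):
        tokens = tokenize.tokenize(io.BytesIO(data).readline)
        data = tokenize.untokenize(replacetoken(t) for t in tokens)
        return super().source_to_code(data, path)


class BytesLiteralFinder(importlib.abc.MetaPathFinder):
    """Meta path finder that only claims the demo module."""
    def find_spec(self, fullname, path, target=None):
        if fullname != MODULE_NAME:
            return None
        # Let the standard path-based finder locate the source file.
        spec = importlib.machinery.PathFinder.find_spec(fullname, path, target)
        if spec is None or not (spec.origin or '').endswith('.py'):
            return None
        spec.loader = BytesLiteralLoader(spec.name, spec.origin)
        return spec


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, MODULE_NAME + '.py'), 'w') as fh:
        fh.write("greeting = 'hello'\nlabel = u'text'\n")
    sys.path.insert(0, tmp)
    sys.meta_path.insert(0, BytesLiteralFinder())
    try:
        importlib.invalidate_caches()
        mod = importlib.import_module(MODULE_NAME)
    finally:
        # Undo the global registration so the demo has no lasting effect.
        sys.meta_path.pop(0)
        sys.path.remove(tmp)
```

After the import, mod.greeting is bytes and mod.label is str, and the
__pycache__ entry written for the module holds the already-transformed
code.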