[PATCH 2 of 2 V2] filterlang: add a small language to filter files

Matt Harbison mharbison72 at gmail.com
Thu Jan 11 00:17:39 EST 2018


# HG changeset patch
# User Matt Harbison <matt_harbison at yahoo.com>
# Date 1515641014 18000
#      Wed Jan 10 22:23:34 2018 -0500
# Node ID 548e748cb3f4eea0aedb36a2b2e9fe3b77ffb263
# Parent  962b2bdd70d094ce4bf9a8135495788166b04510
filterlang: add a small language to filter files

This patch was inspired by one that Jun Wu authored for the fb-experimental
repo, to avoid using matcher for efficiency[1].  We want a way to specify which
files will be converted to LFS at commit time.  Per discussion, we also want to
specify in another config option which files to skip, text diff, or merge.  The
current `lfs.threshold` config option cannot satisfy complex needs.  I'm
putting it in a core package because Augie floated the idea of also using it for
narrow and sparse.

Yuya suggested farming out to fileset.parse(), which added support for more
symbols.  The only fileset symbol not used is '-'.  I can see a use for it, but
haven't figured out how to implement it yet.  I also made the 'always' token a
predicate for consistency, and introduced 'never' to improve readability.
Finally, I changed the extension operator from '.' to '*'.  This matches how git
tracks by extension, but might be slightly confusing here because '**' recurses
in Mercurial, but '*' usually doesn't.
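
To make the '*' caveat concrete: in this patch the extension predicate is a
plain suffix test on the whole path, so it matches at any directory depth.  A
minimal sketch, where ext_pred is a hypothetical helper that mirrors the '*'
branch of _compile() and is not part of the patch:

```python
def ext_pred(pattern):
    """Return a (path, size) -> bool suffix test for a pattern like '*.zip'.

    Hypothetical helper mirroring the '*' branch of _compile().
    """
    suffix = pattern[1:]  # drop the leading '*'
    return lambda path, size: path.endswith(suffix)

is_zip = ext_pred('*.zip')
print(is_zip('a.zip', 0))           # file in the repo root
print(is_zip('deep/dir/a.zip', 0))  # nested file, same suffix test
print(is_zip('a.zip2', 0))          # different suffix
```

Because the test never looks at '/' separators, '*.zip' here behaves more like
a recursive '**.zip' glob than a single-directory '*.zip'.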

Supporting all of the comparison operators in size() may seem like overkill in
this context.  But I think it is important to do for consistency, and it will
make sense if 'minus' is implemented.
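
For readers unfamiliar with the size() syntax, here is a rough stand-in for the
comparison forms.  The patch itself delegates to fileset.sizematcher(); the
size_pred helper below is hypothetical, handles only the four comparison
operators, and assumes binary multiples (1 KB = 1024 bytes):

```python
import operator
import re

# Assumption: units are binary multiples; the real matcher may differ.
_UNITS = {'': 1, 'b': 1,
          'k': 2 ** 10, 'kb': 2 ** 10,
          'm': 2 ** 20, 'mb': 2 ** 20,
          'g': 2 ** 30, 'gb': 2 ** 30}
_OPS = {'<': operator.lt, '<=': operator.le,
        '>': operator.gt, '>=': operator.ge}
_SIZE_RE = re.compile(r'^(<=|>=|<|>)\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]*)$')

def size_pred(expr):
    """Return a size -> bool test for an expression such as '>20MB'."""
    m = _SIZE_RE.match(expr.strip())
    if not m:
        raise ValueError('unsupported size expression: %r' % expr)
    op, number, unit = m.groups()
    unit = unit.lower()
    if unit not in _UNITS:
        raise ValueError('unknown unit: %r' % unit)
    limit = int(float(number) * _UNITS[unit])
    return lambda size: _OPS[op](size, limit)

print(size_pred('>20')(123))
print(size_pred('<=1KB')(1024))
```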

There are probably fileset accessors that should be called (or copied), instead
of accessing the tree directly.  I tried doing that for handling the path
symbol.  But I must have missed a layer in fileset, because fileset.getset()
calls 'methods[x[0]](mctx, *x[1:])' to dispatch a function, but here the name is
in tree[1][1], and the args in tree[2].
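
The indexing described above can be illustrated without Mercurial.  The tuple
below is an assumed tree shape for size(">20MB"), inferred only from how
_compile() indexes it; the real output of fileset.parse() may differ in detail,
and describe() is a hypothetical helper:

```python
# Assumed shape of a 'func' node: (op, ('symbol', name), argument-subtree).
tree = ('func', ('symbol', 'size'), ('string', '>20MB'))

def describe(tree):
    """Split a 'func' node into (name, args) the way _compile() does."""
    if tree[0] != 'func':
        raise ValueError('not a function node: %r' % (tree,))
    name = tree[1][1]  # tree[1] is the ('symbol', name) node
    args = tree[2]     # argument subtree, not yet interpreted
    return name, args

print(describe(tree))
```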

Sample filter settings:

  always()                  # everything
  size(">20MB")             # larger than 20MB
  !*.txt                    # except for .txt files
  *.zip | *.tar.gz | *.7z   # some types of compressed files
  /bin                      # files under "bin" in the project root
  (*.php & size(">2MB")) | (/bin & !/bin/README) | size(">1GB")
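
As a rough illustration of what the last sample evaluates to, the sketch below
builds the same (path, size) -> bool function by hand.  All helper names are
hypothetical, the path tests mirror the lambdas in _compile(), and the MB/GB
constants assume binary multiples:

```python
def ext(suffix):
    # "*.php" style test: plain suffix match on the whole path
    return lambda n, s: n.endswith(suffix)

def under(prefix):
    # "/bin" style test: the path itself, or anything below that directory
    p = prefix.rstrip('/')
    pl = len(p)
    return lambda n, s: n.startswith(p) and (len(n) == pl or n[pl] == '/')

def larger(limit):
    return lambda n, s: s > limit

def and_(*fs):
    return lambda n, s: all(f(n, s) for f in fs)

def or_(*fs):
    return lambda n, s: any(f(n, s) for f in fs)

def not_(f):
    return lambda n, s: not f(n, s)

MB = 2 ** 20  # assumption: binary units
GB = 2 ** 30

# (*.php & size(">2MB")) | (/bin & !/bin/README) | size(">1GB")
f = or_(
    and_(ext('.php'), larger(2 * MB)),
    and_(under('bin'), not_(under('bin/README'))),
    larger(1 * GB),
)

print(f('a.php', 3 * MB), f('bin/tool', 0), f('bin/README', 0))
```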

[1] https://www.mercurial-scm.org/pipermail/mercurial-devel/2017-December/109387.html

diff --git a/mercurial/filterlang.py b/mercurial/filterlang.py
new file mode 100644
--- /dev/null
+++ b/mercurial/filterlang.py
@@ -0,0 +1,73 @@
+# filterlang.py - a simple language to select files
+#
+# Copyright 2017 Facebook, Inc.
+#
+# This software may be used and distributed according to the terms of the
+# GNU General Public License version 2 or any later version.
+
+from __future__ import absolute_import
+
+from .i18n import _
+from . import (
+    error,
+    fileset,
+)
+
+def _compile(tree):
+    op = tree[0]
+    if op in ('symbol', 'string'):
+        name = fileset.getstring(tree, 'invalid file pattern')
+        op = name[0]
+        if op == '*': # file extension test, ex. "*.tar.gz"
+            return lambda n, s: n.endswith(name[1:])
+        elif op == '/': # directory or full path test
+            p = name[1:].rstrip('/') # prefix
+            pl = len(p)
+            f = lambda n, s: n.startswith(p) and (len(n) == pl or n[pl] == '/')
+            return f
+        else:
+            raise error.ParseError('invalid symbol: %s' % name)
+    elif op in ['or', 'and']:
+        funcs = [_compile(t) for t in tree[1:]]
+        summary = {'or': any, 'and': all}[op]
+        return lambda n, s: summary(f(n, s) for f in funcs)
+    elif op == 'not':
+        return lambda n, s: not _compile(tree[1])(n, s)
+    elif op == 'group':
+        return _compile(tree[1])
+    elif op == 'func':
+        name = tree[1][1]
+        symbols = {
+            'always': lambda n, s: True,
+            'never': lambda n, s: False,
+            'size': lambda n, s: fileset.sizematcher(tree[2])(s),
+        }
+
+        if name in symbols:
+            return symbols[name]
+
+        raise error.UnknownIdentifier(name, symbols.keys())
+    elif op in ('negate', 'minus'):
+        raise error.ParseError('unsupported operator: %s' % '-')
+    elif op == 'list':
+        raise error.ParseError(_("can't use a list in this context"),
+                               hint=_('see hg help "filesets.x or y"'))
+    else:
+        raise error.ProgrammingError('illegal tree: %r' % (tree,))
+
+def compile(text):
+    """generate a function (path, size) -> bool from filter specification.
+
+    "text" could contain the operators defined by the fileset language for
+    common logic operations, and parentheses for grouping.  The supported path
+    predicates are "*.extname" for file extension test, and "/dir/subdir" for
+    directory test.  The ``size()`` predicate is borrowed from filesets to test
+    file size.  The predicates ``always()`` and ``never()`` are also supported.
+
+    For example, '(*.php & size(">10MB")) | *.zip | (/bin & !/bin/README)' will
+    catch all php files whose size is greater than 10 MB, all files whose name
+    ends with ".zip", and all files under "bin" in the repo root except for
+    "bin/README".
+    """
+    tree = fileset.parse(text)
+    return _compile(tree)
diff --git a/tests/test-filterlang.py b/tests/test-filterlang.py
new file mode 100644
--- /dev/null
+++ b/tests/test-filterlang.py
@@ -0,0 +1,36 @@
+from __future__ import absolute_import
+from __future__ import print_function
+
+import os
+import sys
+
+# make it runnable directly without run-tests.py
+sys.path[0:0] = [os.path.join(os.path.dirname(__file__), '..')]
+
+from mercurial import filterlang
+
+def check(text, truecases, falsecases):
+    f = filterlang.compile(text)
+    for args in truecases:
+        if not f(*args):
+            print('unexpected: %r should include %r' % (text, args))
+    for args in falsecases:
+        if f(*args):
+            print('unexpected: %r should exclude %r' % (text, args))
+
+check('always()', [('a.php', 123), ('b.txt', 0)], [])
+check('never()', [], [('a.php', 123), ('b.txt', 0)])
+check('!!!!((!(!!always())))', [], [('a.php', 123), ('b.txt', 0)])
+
+check('size(">20")', [('a.php', 123)], [('b.txt', 0)])
+
+check('/a & (*.b | *.c)', [('a/b.b', 0), ('a/c.c', 0)], [('b/c.c', 0)])
+check('(/a & *.b) | *.c', [('a/b.b', 0), ('a/c.c', 0), ('b/c.c', 0)], [])
+
+check('!!*.bin or size(">20B") + /bin or !size(">10") | never()',
+      [('a.bin', 11), ('b.txt', 21), ('bin/abc', 11)],
+      [('a.notbin', 11), ('b.txt', 11), ('bin2/abc', 11)])
+
+check('(*.php and size(">10KB")) | *.zip | (/bin & !/bin/README) | size(">1M")',
+      [('a.php', 15000), ('a.zip', 0), ('bin/a', 0), ('bin/README', 1e7)],
+      [('a.php', 5000), ('b.zip2', 0), ('t/bin/a', 0), ('bin/README', 1)])

