[PATCH 2 of 3 RFC] lfs: convert threshold testing to use a matcher

Fri Dec 29 16:43:23 EST 2017

# HG changeset patch
# User Matt Harbison <matt_harbison at yahoo.com>
# Date 1514528539 18000
#      Fri Dec 29 01:22:19 2017 -0500
# Node ID ab695d4cf960440e92aca0c07ec1753d0c26a56d
# Parent  3bd34c256588b974f8bc2064bf1873a81f2326b9
lfs: convert threshold testing to use a matcher

This is RFC, mostly because I'm interested to see if anyone has any insight into
the seeming bug in fileset.py, and/or thoughts/concerns with this, before I
waste a bunch of time polishing it up.

File pattern matching is offered in git.  I don't see a size based attribute,
but it seems reasonable, and is offered with largefiles.  Both .hgignore and
.hgeol are ultimately converted to a matcher, so that seems like the right path
here.  Using a matcher may also allow for using filesets.  Overridding
localrepo.commitctx() allows the matcher to be created once per commit/
conversion-commit/import --bypass.

It feels dirty monkeying with a matcher on this low layer (internal to filelog),
but I don't see another way.  The only other thing I can think of is a function
that could be overridden which is called in localrepo.commitctx() before it
loops over ctx.modified() + ctx.added(), which could return a map of files ->
flags.  That would have the benefit of 1) not needing to reverse engineer the
filename, and 2) not needing to store the matcher in svfs.options.  The problem
is, that code calls localrepo._filecommit() -> filelog.add() ->
filelog.addrevision().  And it's only here that the flags parameter appears.
I'm not sure how much we should abuse these APIs for a single case.

Paths should always be relative to the root, and I wonder if commitctx() should
be acquiring self.lock() here, before messing with svfs.options.  (I'm not
aware of much threading within a process, but IDK what the future holds, or how
messing with state like this plays with say, chg or command server.)

One additional feature that git has, is to track the configuration in the repo
itself.  A general proposal for tracking config settings was rejected over the
summer over security concerns around external revsets (and filesets?).
Hopefully this doesn't preclude adding a similar .hglfs file.  But since this is
a very specific use case, maybe it would be OK to allow, and filter out any
external filesets.  We would have to filter out filesets anyway if only patterns
were allowed, since they are just another pattern to the matcher.

diff --git a/hgext/lfs/__init__.py b/hgext/lfs/__init__.py
--- a/hgext/lfs/__init__.py
+++ b/hgext/lfs/__init__.py
@@ -121,6 +121,25 @@
     # Push hook
     repo.prepushoutgoinghooks.add('lfs', wrapper.prepush)
 
+    class lfsrepo(repo.__class__):
+        @localrepo.unfilteredmethod
+        def commitctx(self, ctx, error=False):
+            pats = []
+            threshold = self.ui.config('lfs', 'threshold')
+            if threshold:
+                pats.append("set:size('>= %s')" % threshold)
+
+            m = lambda x: False
+            if pats:
+                m = scmutil.match(ctx, pats=pats, default='glob')
+
+            self.svfs.options['matcher'] = m
+            x = super(lfsrepo, self).commitctx(ctx, error)
+            self.svfs.options['matcher'] = lambda x: False
+            return x
+
+    repo.__class__ = lfsrepo
+
     if 'lfs' not in repo.requirements:
         def checkrequireslfs(ui, repo, **kwargs):
             if 'lfs' not in repo.requirements:
diff --git a/hgext/lfs/wrapper.py b/hgext/lfs/wrapper.py
--- a/hgext/lfs/wrapper.py
+++ b/hgext/lfs/wrapper.py
@@ -135,14 +135,8 @@
 def filelogaddrevision(orig, self, text, transaction, link, p1, p2,
                        cachedelta=None, node=None,
                        flags=revlog.REVIDX_DEFAULT_FLAGS, **kwds):
-    threshold = self.opener.options['lfsthreshold']
-    textlen = len(text)
-    # exclude hg rename meta from file size
-    meta, offset = filelog.parsemeta(text)
-    if offset:
-        textlen -= offset
-
-    if threshold and textlen > threshold:
+    filename = _getfilename(self)
+    if self.opener.options['matcher'](filename):
         flags |= revlog.REVIDX_EXTSTORED
 
     return orig(self, text, transaction, link, p1, p2, cachedelta=cachedelta,
diff --git a/mercurial/fileset.py b/mercurial/fileset.py
--- a/mercurial/fileset.py
+++ b/mercurial/fileset.py
@@ -380,7 +380,14 @@
     else:
         raise error.ParseError(_("couldn't parse size: %s") % expr)
 
-    return [f for f in mctx.existing() if m(mctx.ctx[f].size())]
+    # This explodes when converting largefiles -> lfs.  It looks like the first
+    # commit converts fine, and then when converting the second commit,
+    # 'lfs.txt' is one of the files tested here, even though it wasn't modified
+    # in this commit (but it was added in the parent).
+
+    # XXX: 'f in mctx.ctx and ...' doesn't short circuit None.size(), so
+    # something appears to be wrong with the ctx for conversions.
+    return [f for f in mctx.existing() if mctx.ctx[f] and m(mctx.ctx[f].size())]
 
 @predicate('encoding(name)', callexisting=True)
 def encoding(mctx, x):