[PATCH 5 of 5 RFC] lfs: control tracked file selection via a tracked file

Matt Harbison mharbison72 at gmail.com
Sun Jan 14 23:58:52 EST 2018


# HG changeset patch
# User Matt Harbison <matt_harbison at yahoo.com>
# Date 1515971571 18000
#      Sun Jan 14 18:12:51 2018 -0500
# Node ID 3b651cef0884ad8108a19c6354d53103e378e12e
# Parent  7e5d513b38c169856b51b6207d992abbef10d2d5
lfs: control tracked file selection via a tracked file

Since the lfs tracking policy can dramatically affect the repository, it makes
more sense to have the policy file checked in, than to rely on all developers
configuring their .hgrc properly.  The inspiration for this is the .hgeol file.
The configuration lives under '[track]', so that other things can be added in
the future.  Eventually, the config option should be limited to `convert` only.

I'm sure that this is going to take a few iterations, so I'm ignoring the
documentation for now.  The tests have a bit of commentary, but the general idea
is to put the most specific config first.  Walk that list, and when the first
key matches, the value on the right side becomes the deciding expression.  That
means a general catchall rule can be added at the end.  See the test for an
example.

My initial thought was to read the file and change each "key = value" line into
"((key) & (value))", so that each line could be ORed together, and make a single
pass at compiling.  Unfortunately, that prevents exclusions if there's a
catchall rule.  Consider what happens to a large *.c file here:

  [track]
  **.c = none()
  ** = size('>1MB')
  # ((**.c) & (none())) | ((**) & (size('>1MB'))) => anything > 1MB

I also thought about having separate [include] and [exclude] sections.  But that
just seems to open things up to user mistakes.  Consider:

  [include]
  **.zip = all()
  **.php = size('>10MB')

  [exclude]
  **.zip = all()  # Who wins?
  **.php = none() # Effectively 'all()' (i.e. nothing excluded), or >10MB ?

Therefore, it just compiles each key and value separately, and walks until the
key matches something.  I'm not sure how to enforce just file patterns on LHS
without leaking knowledge about the minifileset here.  That means this will
allow odd looking lines like this:

  [track]
  **.c | **.txt = none()

But that's also fewer lines to compile, so slightly more efficient?  Some things
like 'none()' won't work as expected on LHS though, because that won't match, so
that line is skipped.

Jun previously expressed concern about efficiency when scaling to large repos,
so I tried avoiding 'repo[None]'.  (localrepo.commit() gets repo[None] already,
but doesn't tie it to the workingcommitctx used here.)  Therefore, I looked at
the passed context for 'AMR' status.  But that doesn't help with the normal case
where the policy file is tracked, but clean.  That requires looking up p1() to
read the file.  I don't see any way to get the content of one file without first
creating the full parent context.

I'm a bit puzzled by the way eol handles this.  It loads the file for p1() in
a preupdate hook (what if the update fails?).  It also directly reads the
filesystem in cases where it wants to examine 'repo[None]' (what if the file is
deleted or not tracked?).  There's more for me to figure out here, but I wanted
to float this trial balloon now, in case I'm off in the weeds.  It would be nice
to land this functionality before the freeze.

diff --git a/hgext/lfs/__init__.py b/hgext/lfs/__init__.py
--- a/hgext/lfs/__init__.py
+++ b/hgext/lfs/__init__.py
@@ -52,6 +52,7 @@
 from mercurial import (
     bundle2,
     changegroup,
+    config,
     context,
     exchange,
     extensions,
@@ -123,13 +124,21 @@
     if not repo.local():
         return
 
-    repo.svfs.options['lfstrack'] = _trackedmatcher(repo)
     repo.svfs.lfslocalblobstore = blobstore.local(repo)
     repo.svfs.lfsremoteblobstore = blobstore.remote(repo)
 
     # Push hook
     repo.prepushoutgoinghooks.add('lfs', wrapper.prepush)
 
+    class lfsrepo(repo.__class__):
+        @localrepo.unfilteredmethod
+        def commitctx(self, ctx, error=False):
+            # TODO: Ensure this gets called for import
+            repo.svfs.options['lfstrack'] = _trackedmatcher(self, ctx)
+            return super(lfsrepo, self).commitctx(ctx, error)
+
+    repo.__class__ = lfsrepo
+
     if 'lfs' not in repo.requirements:
         def checkrequireslfs(ui, repo, **kwargs):
             if 'lfs' not in repo.requirements:
@@ -149,18 +158,52 @@
         ui.setconfig('hooks', 'commit.lfs', checkrequireslfs, 'lfs')
         ui.setconfig('hooks', 'pretxnchangegroup.lfs', checkrequireslfs, 'lfs')
 
-def _trackedmatcher(repo):
+def _trackedmatcher(repo, ctx):
     """Return a function (path, size) -> bool indicating whether or not to
     track a given file with lfs."""
-    trackspec = repo.ui.config('lfs', 'track')
+    data = ''
+
+    if '.hglfs' in ctx.added() or '.hglfs' in ctx.modified():
+        data = ctx['.hglfs'].data()
+    elif '.hglfs' not in ctx.removed():
+        p1 = repo['.']
+
+        if '.hglfs' not in p1:
+            # No '.hglfs' in wdir or in parent.  Fallback to config
+            # for now.
+            trackspec = repo.ui.config('lfs', 'track')
+
+            # deprecated config: lfs.threshold
+            threshold = repo.ui.configbytes('lfs', 'threshold')
+            if threshold:
+                fileset.parse(trackspec)  # make sure syntax errors are confined
+                trackspec = "(%s) | size('>%d')" % (trackspec, threshold)
+
+            return minifileset.compile(trackspec)
+
+        data = p1['.hglfs'].data()
 
-    # deprecated config: lfs.threshold
-    threshold = repo.ui.configbytes('lfs', 'threshold')
-    if threshold:
-        fileset.parse(trackspec)  # make sure syntax errors are confined
-        trackspec = "(%s) | size('>%d')" % (trackspec, threshold)
+    # In removed, or not in parent
+    if not data:
+        return lambda p, s: False
+
+    # TODO: catch parse errors and block commit.  Bad files won't block access
+    # to data, but there's no reason to allow a commit if the user made a bad
+    # attempt at setting a policy.
+    cfg = config.config()
+    cfg.parse('.hglfs', data)
 
-    return minifileset.compile(trackspec)
+    rules = [(minifileset.compile(pattern), minifileset.compile(rule))
+             for pattern, rule in cfg.items('track')]
+
+    def _match(path, size):
+        for pat, rule in rules:
+            if pat(path, size):
+                return rule(path, size)
+
+        return False
+
+    return _match
 
 def wrapfilelog(filelog):
     wrapfunction = extensions.wrapfunction
diff --git a/tests/test-lfs.t b/tests/test-lfs.t
--- a/tests/test-lfs.t
+++ b/tests/test-lfs.t
@@ -937,6 +937,57 @@
   $ hg commit -m 'add A' -A A
   $ hg rm A
   $ hg commit -m 'rm A'
+
+XXX: '**' works out to mean all files.  It looks reasonable here, but I hadn't
+really considered it when implementing the minifileset language.  Good idea?
+
+  $ cat >> .hglfs << EOF
+  > [track]
+  > **.test = size(">5B")
+  > **.exclude = none()
+  > ** = size(">10B")
+  > EOF
+
+The LFS policy takes effect as the .hglfs file is committed
+
+  $ echo 'largefile' > lfs.test
+  $ echo '012345678901234567890' > nolfs.exclude
+  $ echo '01234567890123456' > lfs.catchall
+  $ hg ci -Aqm 'added .hglfs'
+  $ hg log -r . -T '{rev}: {lfs_files % "{lfs_file}: {oid}\n"}\n'
+  2: lfs.catchall: d4ec46c2869ba22eceb42a729377432052d9dd75d82fc40390ebaadecee87ee9
+  lfs.test: 5489e6ced8c36a7b267292bde9fd5242a5f80a7482e8f23fa0477393dfaa4d6c
+  
+The existing .hglfs file is used even when it is not in the 'A' or 'M' states
+
+  $ echo 'largefile2' > lfs.test
+  $ echo '012345678901234567890a' > nolfs.exclude
+  $ echo '01234567890123456a' > lfs.catchall
+  $ hg ci -qm 'unmodified .hglfs'
+  $ hg log -r . -T '{rev}: {lfs_files % "{lfs_file}: {oid}\n"}\n'
+  3: lfs.catchall: 31f43b9c62b540126b0ad5884dc013d21a61c9329b77de1fceeae2fc58511573
+  lfs.test: 8acd23467967bc7b8cc5a280056589b0ba0b17ff21dbd88a7b6474d6290378a6
+  
+Excluding the .hglfs file from the commit postpones the policy change
+
+  $ hg rm .hglfs
+  $ echo 'largefile3' > lfs.test
+  $ echo '012345678901234567890abc' > nolfs.exclude
+  $ echo '01234567890123456abc' > lfs.catchall
+  $ hg ci -qm 'file test' -X .hglfs
+  $ hg log -r . -T '{rev}: {lfs_files % "{lfs_file}: {oid}\n"}\n'
+  4: lfs.catchall: 6747cfb1b83965b4a884e7a6061813ae31e4122028bc6a88d2ac5e5f9e05c5af
+  lfs.test: 3f40b70c2294e91e0fa789ebcf73c5a1d1c7aef270f83e477e40cb0513237e8c
+  
+The policy change takes effect when the .hglfs is committed
+
+  $ echo 'largefile4' > lfs.test
+  $ echo '012345678901234567890abcdef' > nolfs.exclude
+  $ echo '01234567890123456abcdef' > lfs.catchall
+  $ hg ci -qm 'file test'
+  $ hg log -r . -T '{rev}: {lfs_files % "{lfs_file}: {oid}\n"}\n'
+  5: 
+
   $ cd ..
 
 Unbundling adds a requirement to a non-lfs repo, if necessary.


More information about the Mercurial-devel mailing list