[PATCH 2 of 3] lfs: add a small language to filter files

Sun Jan 7 03:17:12 EST 2018

On Thu, 04 Jan 2018 23:58:55 -0500, Matt Harbison wrote:
> # HG changeset patch
> # User Matt Harbison <matt_harbison at yahoo.com>
> # Date 1514704880 18000
> #      Sun Dec 31 02:21:20 2017 -0500
> # Node ID 8c20ade835ce43441c61e56e63d9bf92deaacd55
> # Parent  2798cb4faacdae2db46e84ba0f3beaf506848915
> lfs: add a small language to filter files
> 
> This patch was authored by Jun Wu for the fb-experimental repo, to avoid using
> matcher for efficiency[1].  All I've changed here is the package (hgext3rd ->
> hgext), and fixed up the imports in the test file (use absolute_import,
> print_function, and 'from lfs import ...' -> 'from hgext.lfs import...').
> 
> We want a way to specify which files are converted to LFS at commit time.
> Per discussion, we also want another config option to specify which files
> skip text diff or merge. The current `lfs.threshold` config option cannot
> satisfy such complex needs.
> 
> This diff adds a small language for that. It's self-explanatory, and handles
> both simple and complex cases. For example:
> 
>   always                 # everything
>   >20MB                  # larger than 20MB
>   !.txt                  # except for .txt files
>   .zip | .tar.gz | .7z   # some types of compressed files
>   /bin                   # files under "bin" in the project root
>   (.php & >2MB) | (.js & >5MB) | .tar.gz | (/bin & !/bin/README) | >1GB
> 
> [1] https://www.mercurial-scm.org/pipermail/mercurial-devel/2017-December/109387.html

Can't we make it a subset of the fileset language, so we can eventually switch
to filesets if the O(n) issue is solved?

I.e., _compile() the result of fileset.parse(), but abort if an unsupported
element is found.
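
A rough sketch of what I mean, not real Mercurial code. The nested-tuple tree
shape, the node names ('symbol', 'size>', 'not', 'and', 'or'), and the helper
names are all assumptions made up for illustration; the real tree would be
whatever fileset.parse() returns.

```python
# Sketch only: the tuple-tree shape and node names are assumed for
# illustration and are not taken from Mercurial's fileset module.

class UnsupportedElement(Exception):
    """Raised when the tree uses something the fast path cannot handle."""


def compile_tree(tree):
    """Turn a parsed tree into a predicate f(path, size) -> bool."""
    op = tree[0]
    if op == 'symbol':
        # hypothetical: a bare symbol is a file-name suffix such as '.zip'
        suffix = tree[1]
        return lambda path, size: path.endswith(suffix)
    if op == 'size>':
        # hypothetical: size threshold in bytes
        limit = tree[1]
        return lambda path, size: size > limit
    if op == 'not':
        inner = compile_tree(tree[1])
        return lambda path, size: not inner(path, size)
    if op in ('and', 'or'):
        left = compile_tree(tree[1])
        right = compile_tree(tree[2])
        if op == 'and':
            return lambda path, size: left(path, size) and right(path, size)
        return lambda path, size: left(path, size) or right(path, size)
    # anything else is outside the supported subset, so abort
    raise UnsupportedElement(op)
```

The point is that the supported subset is a whitelist: once the O(n) issue is
gone, the same expressions could be handed to the full fileset evaluator
without changing their syntax.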

> +def _tokenize(text):
> +    text = memoryview(text) # make slice zero-copy
> +    special = ' ()&|!'
> +    pos = 0
> +    l = len(text)
> +    while pos < l:
> +        symbol = ''.join(itertools.takewhile(lambda ch: ch not in special,
> +                                             text[pos:]))
> +        if symbol:
> +            yield ('symbol', symbol, pos)
> +            pos += len(symbol)
> +        else: # special char
> +            if text[pos] != ' ': # ignore space silently
> +                yield (text[pos], None, pos)

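Treating everything other than the special characters as part of a symbol
means we can't extend the language: any character we might later want as an
operator is already valid inside existing symbols.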
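
To illustrate, here is a simplified stand-alone tokenizer that follows the
same rule as the quoted code (anything not in the special set is symbol text).
It is a rewrite on plain str for demonstration only, not the patch's code, and
the names are made up.

```python
SPECIAL = ' ()&|!'


def tokenize(text):
    """Yield (type, value, pos); any non-special char becomes symbol text."""
    pos = 0
    while pos < len(text):
        end = pos
        while end < len(text) and text[end] not in SPECIAL:
            end += 1
        if end > pos:
            yield ('symbol', text[pos:end], pos)
            pos = end
        else:
            if text[pos] != ' ':  # spaces are skipped silently
                yield (text[pos], None, pos)
            pos += 1


# ',' and '=' are swallowed into the symbols, so they could not become
# operators later without changing what existing patterns mean.
tokens = list(tokenize('.zip,.7z & size=big'))
```

Here `tokens` contains only two symbols, '.zip,.7z' and 'size=big', joined by
'&'. A pattern like that is accepted today, so giving ',' or '=' a meaning
later would silently change its behavior.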