[PATCH] convert: support glob patterns to exclude/include files

Fri Mar 12 14:45:37 CST 2010

Le 07/03/10 22:21, Tessa Starkey a écrit :
> # HG changeset patch
> # User Tessa Starkey <testarkey at gmail.com>
> # Date 1267939187 18000
> # Node ID b894d8a84bef5dd0702e41c9f7959a62f7cbd9f6
> # Parent  efd3b71fc29315e79a29033fdd0d149b309eb398
> convert: support glob patterns to include/exclude files
> 
> Implemented by replacing the existing filename-matching logic and using
> matcher from the main match module instead. This was done to avoid
> re-implementing glob matching.

Looks interesting.

Unfortunately, the current implementation is full of traps, see below for more details. I second everything Greg said and will add some light to the regexp: part, which in in fact re:.

> diff --git a/hgext/convert/filemap.py b/hgext/convert/filemap.py
> --- a/hgext/convert/filemap.py
> +++ b/hgext/convert/filemap.py
> @@ -6,7 +6,7 @@
>  
>  import shlex
>  from mercurial.i18n import _
> -from mercurial import util
> +from mercurial import util, match
>  from common import SKIPREV, converter_source
>  
>  def rpairs(name):
> @@ -21,17 +21,20 @@
>      A name can be mapped to itself, a new name, or None (omit from new
>      repository).'''
>  
> -    def __init__(self, ui, path=None):
> +    def __init__(self, ui, root,  path=None):
>          self.ui = ui
> -        self.include = {}
> -        self.exclude = {}
> +        self.root = root
>          self.rename = {}
> +        self.includematch = []
> +        self.excludematch = []
>          if path:
>              if self.parse(path):
>                  raise util.Abort(_('errors in filemap'))
>  
>      def parse(self, path):
>          errs = 0
> +        include = []
> +        exclude = []
>          def check(name, mapping, listname):
>              if name in mapping:
>                  self.ui.warn(_('%s:%d: %r already in %s list\n') %

Too bad we cannot see it in the diff, but right there we split the expressions with shlex module in posix mode. It means posix shell delimiting and escaping rules are used here to split the command tokens. The lex.wordchars set is changed so the parsing is harmless to 99% of regular filenames and probably glob patterns too. But if I specify a regexp matcher with the 're:' or 'relre:' prefixes things are different:

    include "foobar\d+"

is parsed as expected, the expression is preserved. But:

    include foobar\d+

is parsed as ('include', 'foobard+') thanks to posix escaping rules with regard to double quotes.

It means we have to be very careful with documentation, and we might want to add some kind of debugging option to display parsed expressions (which would be useful even without this feature).

> @@ -44,17 +47,17 @@
>          while cmd:
>              if cmd == 'include':
>                  name = lex.get_token()
> -                errs += check(name, self.exclude, 'exclude')
> -                self.include[name] = name
> +                errs += check(name, exclude, 'exclude')
> +                self.includematch.append(match.match(self.root, '', [name]))

1- Why do you need self.root here? I think you can pass ''. The source path is usually meaningless, input paths will be relative to sink root.

2- Reusing match.match() is a good idea with one problem, for some reason glob and relpath expressions are filtered through util.canonpath() which forbids anything containing a path component equal to '.hg'. Rerun your tests with an additional rule like:

    exclude .hg/foobar

It will fail to start. This might looks a completely uninteresting case, except a filemap feature is being added to hgsubversion just to filter out such a directory from Jython repository...

I am not saying you should forget match.match() for this but we need something to avoid this particular behaviour. I think such an option existed in the past.

--
Patrick Mézard