File Name Patterns Plan

FUJIWARA Katsunori foozy at lares.dti.ne.jp
Sun Dec 4 12:26:49 EST 2016


At Sat, 3 Dec 2016 03:44:06 +0100,
Pierre-Yves David wrote:
> 
> On 11/24/2016 08:22 PM, FUJIWARA Katsunori wrote:
> > At Thu, 24 Nov 2016 17:04:38 +0100,
> > Pierre-Yves David wrote:
> >>
> >> Recently, Foozy created a Plan page for the matcher issues:
> >>
> >> https://www.mercurial-scm.org/wiki/FileNamePatternsPlan
> >>
> >> It is a good start but there is a couple of elements that are
> >> missing or still a bit fuzzy to me.
> >
> > Thank you for comments !
> >
> > I'll investigate and update Wiki page later, to prevent my sleepy
> > brain from incorrectly thinking :-)
> >
> > This reply is just FYI about easy/clear points, to refer until
> > updating Wiki page, even though you may know them already.
> >
> > We should open a new discussion thread in devel-ml ASAP, citing some
> > of the points this mail mentions, to avoid scattering the discussion
> > log here and there, shouldn't we ?
> >
> >
> >> 1) Default matcher:
> >>
> >>    What is the default pattern mode ?
> >>
> >>    When one does `hg files FOO`, how is FOO processed? It seems to be
> >>    'relpath:'. Double checking this would be useful and mentioning it
> >>    on the page is important.
> >
> > Basically:
> >
> >   =================== ======== =========
> >                       (default)
> >   case                type     recursion
> >   =================== ======== =========
> >   -I/-X               glob:       o
> >   "glob:" in hgignore relglob:    o (*1)
> >   pattern in fileset  glob:       x (*2)
> >   other "glob:"       glob:       x (*2)
> >   otherwise           relpath:    o (*2) (*3)
> >   =================== ======== =========
> >
> > (*1) treated as "include" of match.match() internally
> > (*2) treated as "pats" of match.match() internally
> > (*3) usually, via scmutil.match() with default="relpath"
> >
> > But:
> >
> >> 2) Difference in command behavior:
> >>
> >>    Some commands seem to behave differently from others; notably,
> >>    `hg locate` has some strange kind of
> >>    raw-non-recursive-any-rooted matching by default. It seems to go back to
> >>    'relpath:' when using -I
> >>
> >>    I wonder if there are other commands like this. It might be useful to
> >>    search for the default matcher on a command/flag basis.
> >
> > Oh, I overlooked that:
> >
> >   - "hg files" uses "relpath:" as default of scmutil.match(), but
> >   - "hg locate" uses "relglob:" explicitly
> >
> > (early commits introducing "hg files" may know why)
> 
> Not really, edf07a804ac41433e37d92a9809c6a9ec669c8ad is not very
> talkative about it:
> 
>    files: add new command unifying locate and manifest functionality

I investigated more deeply.

'hg locate' has used 'relglob:' since 0.9.4 (or e8ee8fdeddb1).

  change locate to use relglobs by default
  https://www.mercurial-scm.org/repo/hg/rev/e8ee8fdeddb1

e8ee8fdeddb1 changed the default pattern type of 'hg locate' from
'relpath:' to 'relglob:' to fix the issues below.

  https://bz.mercurial-scm.org/show_bug.cgi?id=108
  https://bz.mercurial-scm.org/show_bug.cgi?id=204

** IF I DARE TO DEFEND using 'relglob:' for 'hg locate' by default **

When matching fails (the numbers below are referenced in the table):

  1. it is difficult to reliably extract the "path" part
     from a "regexp" mode pattern

  2. showing a 'No such file ...' message is meaningless for
     any-of-path types, because such a pattern is neither root-ed nor cwd-ed

  https://www.mercurial-scm.org/repo/hg/file/4.0/mercurial/match.py#l612

  ============= ======= ======== ===========
  mode          root-ed cwd-ed   any-of-path
  ============= ======= ======== ===========
  wildcard      ---     glob:    (relglob:)/2
  regexp        (re:)/1 ---      (relre:)/1,2
  raw string    path:   relpath: ---
  ============= ======= ======== ===========

Therefore, 'No such ....' message is shown only for 'glob:', 'path:'
and 'relpath:' patterns, IMHO.
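
As a rough illustration, the table and the two reasons above can be
encoded as follows. This is only a sketch of my reading of the table;
PATTERN_TYPES and reports_no_such_file are hypothetical names made up
for this mail, not anything in match.py.

```python
# Hypothetical encoding of the mode/rooting table above.
# Each legacy type maps to (mode, rooting).
PATTERN_TYPES = {
    "glob": ("wildcard", "cwd"),
    "relglob": ("wildcard", "any"),
    "re": ("regexp", "root"),
    "relre": ("regexp", "any"),
    "path": ("raw", "root"),
    "relpath": ("raw", "cwd"),
}


def reports_no_such_file(kind):
    """Return True if a 'No such file ...' message makes sense for kind.

    Reason 1 rules out "regexp" mode (the path part cannot be extracted
    reliably); reason 2 rules out any-of-path rooting.
    """
    mode, rooting = PATTERN_TYPES[kind]
    return mode != "regexp" and rooting != "any"


# Only 'glob:', 'path:' and 'relpath:' remain, as stated above.
warned = sorted(k for k in PATTERN_TYPES if reports_no_such_file(k))
```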

If the invocation context mainly focuses on:

  - the working directory (= existing files):

    Showing the 'No such ....' message for an unmatched pattern seems
    useful.

    In addition, a wildcard in a pattern without the 'glob:' prefix may
    already have been expanded by the shell against EXISTING
    files/directories, before Mercurial expands it :-)

  - history information (or metadata of repo):

    Showing the 'No such ....' message seems useless.

According to the comments in the issues above, 'hg locate' seems to
focus mainly on history information rather than on the working
directory, even though it refers to the manifest of the parent of the
working directory (early 'hg locate' seems to have been expected to
work as 'hg manifest | grep PATTERN').

From this point of view, using 'relglob:' by default isn't so strange
for 'hg locate' :-)


> The fact 'hg locate' used a different pattern-type by default might have
> been one of the motivations. I wonder if there is any other "inconsistent"
> extension.

I found some other "default type" variations in Mercurial core.

    https://www.mercurial-scm.org/wiki/FileNamePatternsPlan#The_list_of_contexts.2C_in_which_pattern_is_specified

Types other than 'relpath:' (default of scmutil.match()) are used in
cases below:

  - glob: for
    - fileset
      (assuming that fileset patterns use wildcards in many cases ?)

    - diff() template function (using patterns as include/exclude)
    - files() template function
      No other template function performs pattern matching.

    - file() revset predicate
      The other predicates contains(), filelog(), adds(), modified() and
      removes() use 'relpath:' as the default type.

    - --include/--exclude

  - path: for
    - 'follow()' revset predicate
    - 'archive' web command

  - relre: for .hgignore
    OK, historical reason always wins :-)

  - relglob: for 'hg locate'
    see my guess above


> >> 3) Recursion behavior,
> >>
> >>     There is some data about this in the page, but I think we need more
> >>     formal representation to have a clear view of the situation.
> >>
> >>     The existing 'path:' and 'relpath:' are recursive in all cases,
> >>     while 'glob:' and 're:' variants are only recursive with -I/-E.
> >>     This is a key point because as far as I understand fixing this is a
> >>     core goal of the current plan.
> >
> >   while 'glob:' variants are only recursive with -I/-X
> >
> >   ('re:' is always recursive)
> 
> Gah… So we have:
> 
>   path → recursive
>   glob → not recursive (because you can make it recursive using '**')
>   re → not recursive (because you can make it non recursive using '$')

Nit picking again :-)

  re → ___ recursive (because you can make it non recursive using '$')

BTW, for hgignore, "regexp" types are recursive, even if the pattern
ends with '$' (I overlooked this before :-<)

  https://www.mercurial-scm.org/wiki/FileNamePatternsPlan#Recursion_of_ignore_patterns

This is the only exception to "regexp" recursion.

> >
> >>     However, Foozy points out that using 'set:' with -I disables the
> >>     automatic recursion for 're' and 'glob', but not for 'path', so we
> >>     have more "variants" here.
> >
> >   using 'set:' with -I disables the automatic recursion for 'glob', but
> >   not for 're' and 'path'
> >
> >   ('re:' is always recursive)
> 
> But in that context we can use '$' to disable the recursion ?

Yes, '$' disables recursion of regexp patterns. But 'set:' itself
doesn't.

Let me re-summarize recursion (= matching against intermediate
directories) for each mode.

  ============ ================== ================== =================
  mode         -I/-X              in 'set:'          -I/-X with 'set:'
  ============ ================== ================== =================
  wildcard     always             endswith("**")     endswith("**")
  regexp       not endswith("$")  not endswith("$")  not endswith("$")
  raw string   always             always             always
  ============ ================== ================== =================
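
The same summary, written as a small predicate. This is just a
minimal sketch of the table; the function name and the mode/context
labels are made up for this mail. It also tests the last character
naively, so a regexp ending in an escaped "\$" would be misjudged.

```python
def matches_recursively(mode, pattern, context):
    """Return True if pattern also matches files below a matched directory.

    mode:    "wildcard", "regexp" or "raw" (hypothetical labels)
    context: "include" (-I/-X), "set" (in 'set:') or
             "include+set" (-I/-X with 'set:')
    """
    if context not in ("include", "set", "include+set"):
        raise ValueError("unknown context: %r" % (context,))
    if mode == "raw":
        # raw strings are recursive in every context
        return True
    if mode == "regexp":
        # the user opts out of recursion with a trailing '$'
        return not pattern.endswith("$")
    if mode == "wildcard":
        # plain -I/-X always recurses; with 'set:' a trailing '**' is needed
        if context == "include":
            return True
        return pattern.endswith("**")
    raise ValueError("unknown mode: %r" % (mode,))
```

For example, a wildcard pattern "src/*" is recursive with plain -I,
but not once 'set:' is involved, unless it is written as "src/**".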


> >>     (bonus point: Rodrigo's use case can be fulfilled by adding 'set:'
> >>     to his selector.)
> >>
> >>     I also wonder if there are other variants than "pattern", "-I" and
> >>     "-I + set:".
> >>
> >>     Having a table with 'pattern-type / usage' listing the recursive
> >>     case would probably be a good start.
> >
> > I'll investigate.
> >
> >
> >> 4) Reading from file,
> >>
> >>    Foozy mentions that the pattern name in some files (hgignore) does
> >>    not match the pattern name on the command line.
> >>
> >>    I think it would be useful to be a bit more formal here. What kind
> >>    of file do we read patterns from? Do we have differences from one
> >>    file to another? What are the translations (and defaults), etc.
> >
> > match.readpatternfile() substitutes the pattern-type in the files it reads.
> >
> >     glob => relglob
> >     re   => relre
> >
> >     https://www.mercurial-scm.org/repo/hg/file/4.0/mercurial/match.py#l666
> >
> > In Mercurial core, .hgignore (and files indirectly included by it or
> > hgrc) is the only case.
> 
> We should probably make sure the wiki page contains this.

Sorry, I forgot to describe the difference between 'include:' and
'listfile:'.

I added an explanation about "Reading patterns from file" to the wiki
page.

  https://www.mercurial-scm.org/wiki/FileNamePatternsPlan#Reading_patterns_from_file


> >> 5) Pattern-type table
> >>
> >>    Foozy made many tables explaining how variants are covered by
> >>    pattern types. Having a pattern-centric summary will be useful.
> >>
> >>    Proposal for columns:
> >>
> >>    * pattern type;
> >>    * from cli or file;
> >>    * matching mode (raw, glob, or re),
> >>    * rooting (root, cwd or any),
> >>    * recursive when used as Pattern
> >>    * recursive when used with -I
> >>
> >>    Having the same table for the proposed keyword would help to
> >>    understand inconsistency and similarity with
> >
> > I'll update Wiki page.
> >
> >
> >> 6) file:/dir:
> >>
> >>    I'm a bit confused here because Mercurial does not really track/work
> >>    on directories. What is the benefit of 'dir:' ? 'dir:' seems very
> >>    similar to 'path', am I missing something important?
> >>
> >>    As I understand, 'file:' could be useful for the non-recursive
> >>    part if we want to cover every single case. Am I right?
> >
> > Yes, 'file:' is used for strict non-recursive matching. 'dir:' is
> > listed as the opposite of 'file:', for coverage :-)
> >
> > I have only one example use case for "dir:". If file and directory
> > names collide with each other at merging, all commits related not to
> > file FOO but to files under directory FOO can be found by either of:
> >
> >     $ hg log -r "file('path:FOO') and not file('file:FOO')"
> >     $ hg log -r "file('dir:FOO')"
> >
> > Therefore, I don't have a strong opinion on implementing 'dir:' itself.
> 
> It seems like the 'collision' case could be covered by a fileset 
> (surely, we have a fileset to distinguish between file and dir, right?)

Do you mean that introducing 'file:' and 'dir:' allows users to check
whether merging would cause a collision between files and directories,
without actually merging ? If so, yes (even though it isn't so
efficient for internal use).


> >> 7) compatibility conclusion
> >>
> >>    Getting a whole new set of matchers is a big step that carries a
> >>    significant risk of confusion, we have to get it right
> >>
> >>    We cannot change the default behavior (raw string) and this is what
> >>    people will find the most. So we have to be careful about
> >>    inconsistency here because we cannot change the behavior of this
> >>    current default. For example it is probably better that all the new
> >>    matchers are very consistent with each other and that the behavior
> >>    mismatch between raw and the new official ones is simple to grasp.
> >>
> >>    In the same way, I do not think we'll be able to alias the old
> >>    pattern-type to the new-ones. Because we cannot fix recursion
> >>    behavior of the old ones.
> >>    There will be online material with the old one and we won't be able
> >>    to fix them. This is a lesser issue but we should probably keep it
> >>    in mind. (Without any serious backing I expect that pattern for
> >>    hgignore are probably the most documented online).
> >
> > I think that existing (= legacy) "glob:" can be implemented as an
> > alias of new systematic pattern-type WITH "additional suffix"
> > controlling recursion of matching ("relglob:" can be so, similarly)
> >
> >   ================== ======== ========= ========= ==============
> >   case               type     recursion alias of  additional suffix
> >   ================== ======== ========= ========= ==============
> >   -I/-X              glob:       o      cwdglob:   (?:/|$)
> >   "glob" in hgignore relglob:    o      anyglob:   (?:/|$)
> >   pattern in fileset glob:       x      cwdglob:   $
> >   other "glob"       glob:       x      cwdglob:   $
> >   ================== ======== ========= ========= ==============
> 
> So if I understand this correctly, you mean that the old pattern-types
> can be implemented as "smart-aliases" to the new pattern-types. These
> "smart-aliases" will add the appropriate prefix/suffix to the pattern
> before calling the new code ?

Yes.

The current match.py implementation internally adds the prefix/suffix
regexps below to the specified pattern, according to what the pattern
is used for. For details, see the implementation of _regex() and
match._normalize(), and the _buildmatch() invocations in
match.__init__(), in match.py.

  =========== =============== =========== ========== =========
  type        used for        prefix      suffix     recursive
  =========== =============== =========== ========== =========
  `glob:`     pattern         "$CWD/"     "$"        endswith("**")
              include/exclude "$CWD/"     "(?:/|$)"  always

  `relglob:`  pattern         "(?:|.*/)"  "$"        endswith("**")
              include/exclude "(?:|.*/)"  "(?:/|$)"  always

  `re:`       (always)        (none)      (none)     not endswith("$")
  `relre:`    (always)        ".*" (*1)   (none)     not endswith("$")

  `path:`     (always)        "^" (*2)    "(?:/|$)"  always
  `relpath:`  (always)        "$CWD/"     "(?:/|$)"  always
  =========== =============== =========== ========== =========

  (*1) this prefix is added only if the pattern doesn't start with "^"

  (*2) (just nit picking) this may be redundant, because patterns are
       examined by "re.match()", which requires matching from the
       beginning of a target string.
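
To make the table concrete, here is a minimal sketch that assembles a
regexp from these prefixes/suffixes. It is NOT the match.py code:
legacy_affixes() and legacy_regex() are hypothetical names, "body" is
assumed to be already translated into a regexp fragment (so no glob
translation is shown), and "$CWD/" is assumed to expand to nothing at
the root of the repository.

```python
import re


def legacy_affixes(kind, used_as, cwd, body):
    """Return (prefix, suffix) for a legacy type, following the table.

    kind:    type name without the colon, e.g. "glob"
    used_as: "pattern" or "include" (= include/exclude)
    cwd:     repo-relative current directory, "" at the root (assumption)
    body:    pattern already translated into a regexp fragment
    """
    cwdprefix = re.escape(cwd) + "/" if cwd else ""
    if kind in ("glob", "relglob"):
        prefix = cwdprefix if kind == "glob" else "(?:|.*/)"
        suffix = "$" if used_as == "pattern" else "(?:/|$)"
    elif kind == "re":
        prefix, suffix = "", ""
    elif kind == "relre":
        # (*1) ".*" is added only if the pattern doesn't start with "^"
        prefix, suffix = ("" if body.startswith("^") else ".*"), ""
    elif kind == "path":
        prefix, suffix = "^", "(?:/|$)"
    elif kind == "relpath":
        prefix, suffix = cwdprefix, "(?:/|$)"
    else:
        raise ValueError("unknown type: %r" % (kind,))
    return prefix, suffix


def legacy_regex(kind, used_as, cwd, body):
    prefix, suffix = legacy_affixes(kind, used_as, cwd, body)
    return re.compile(prefix + body + suffix)


# 'glob:foo' as a pattern stops at "foo"; as -I/-X it also covers foo/*
as_pattern = legacy_regex("glob", "pattern", "", "foo").match("foo/bar")
as_include = legacy_regex("glob", "include", "", "foo").match("foo/bar")
```

Here as_pattern is None and as_include is a match object, which is
the "recursion only with -I/-X" behavior of legacy `glob:`.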

So, first, let the newly introduced types use the additional
prefix/suffix regexps below BY DEFAULT (controlling recursion in
"wildcard" and "regexp" modes then becomes the user's responsibility).

  =========== ============ =========== =======================
  type        prefix       suffix      recursive
  =========== ============ =========== =======================
  `rootglob:` (none)       "$"         endswith("**")
  `cwdglob:`  "$CWD/"      "$"         endswith("**")
  `anyglob:`  "(?:|.*/)"   "(?:/|$)"   always

  `rootre:`   (none)       (none)      not endswith("$")
  `cwdre:`    "$CWD/"      (none)      not endswith("$")
  `anyre:`    ".*"         (none)      not endswith("$")

  `rootpath:` (none)       "(?:/|$)"   always
  `cwdpath:`  "$CWD/"      "(?:/|$)"   always
  `anypath:`  "(?:|.*/)"   "(?:/|$)"   always
  =========== ============ =========== =======================
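
The defaults in this table could be kept as plain data. The sketch
below is only an illustration of the proposal, not existing code:
NEW_TYPE_AFFIXES and new_type_regex() are hypothetical names, "body"
is assumed to be already translated into a regexp fragment (for
example "foo/.*" for the glob "foo/**"), and "$CWD/" is assumed to
expand to nothing at the root of the repository.

```python
import re

# (prefix, suffix) defaults copied from the table above.
# "$CWD/" is a placeholder for the repo-relative current directory.
NEW_TYPE_AFFIXES = {
    "rootglob": ("", "$"),
    "cwdglob": ("$CWD/", "$"),
    "anyglob": ("(?:|.*/)", "(?:/|$)"),
    "rootre": ("", ""),
    "cwdre": ("$CWD/", ""),
    "anyre": (".*", ""),
    "rootpath": ("", "(?:/|$)"),
    "cwdpath": ("$CWD/", "(?:/|$)"),
    "anypath": ("(?:|.*/)", "(?:/|$)"),
}


def new_type_regex(kind, body, cwd=""):
    """Build the regexp for a proposed type from its default affixes."""
    prefix, suffix = NEW_TYPE_AFFIXES[kind]
    # substitute the placeholder; empty at the repository root (assumption)
    cwdprefix = re.escape(cwd) + "/" if cwd else ""
    prefix = prefix.replace("$CWD/", cwdprefix)
    return re.compile(prefix + body + suffix)


# 'rootglob:foo' does not descend into foo/, 'rootglob:foo/**' does
# (assuming "**" is translated to ".*").
plain = new_type_regex("rootglob", "foo").match("foo/bar")
starstar = new_type_regex("rootglob", "foo/.*").match("foo/bar/baz")
```

With these defaults, plain is None and starstar is a match object,
so recursion of `rootglob:` is controlled only by the trailing "**".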

Then, each legacy type can be emulated as an alias of a newly
introduced type, as below:

  =========== =============== =========== ===================
  type        used as         alias of    needed suffix
  =========== =============== =========== ===================
  `glob:`     pattern         `cwdglob:`  "$" (= default of `cwdglob:`)
              include/exclude `cwdglob:`  "(?:/|$)"

  `relglob:`  pattern         `anyglob:`  "$"
              include/exclude `anyglob:`  "(?:/|$)" (= default of `anyglob:`)

  `re:`       (always)        `rootre:`   (none) (= default of `rootre:`)
  `relre:`    (always)        `anyre:`    (none) (= default of `anyre:`)

  `path:`     (always)        `rootpath:` "(?:/|$)" (= default of `rootpath:`)
  `relpath:`  (always)        `cwdpath:`  "(?:/|$)" (= default of `cwdpath:`)
  =========== =============== =========== ===================

At this point, forcing the suffixes below for legacy `glob:` and
`relglob:` gives the same behavior as the current match.py
implementation.

  - "$" for pattern
  - "(?:/|$)" for include/exclude

Therefore, the aliasing should be easy to emulate.
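
A possible shape for such an alias lookup, as a sketch only. The
names NEW_DEFAULT_SUFFIX and resolve_legacy() are hypothetical, and
only the suffix column of the alias table is modeled here (prefix
handling is left out).

```python
# Default suffix of each proposed type that is the target of an alias.
NEW_DEFAULT_SUFFIX = {
    "cwdglob": "$",
    "anyglob": "(?:/|$)",
    "rootre": "",
    "anyre": "",
    "rootpath": "(?:/|$)",
    "cwdpath": "(?:/|$)",
}

# Legacy types whose suffix does not depend on how they are used.
FIXED_ALIASES = {
    "re": "rootre",
    "relre": "anyre",
    "path": "rootpath",
    "relpath": "cwdpath",
}


def resolve_legacy(kind, used_as):
    """Map a legacy type to (new_type, suffix, overrides_default).

    used_as is "pattern" or "include" (= include/exclude); it only
    matters for 'glob:' and 'relglob:'.
    """
    if kind in ("glob", "relglob"):
        new_type = "cwdglob" if kind == "glob" else "anyglob"
        # forced suffix: "$" for pattern, "(?:/|$)" for include/exclude
        suffix = "$" if used_as == "pattern" else "(?:/|$)"
    else:
        new_type = FIXED_ALIASES[kind]
        suffix = NEW_DEFAULT_SUFFIX[new_type]
    return new_type, suffix, suffix != NEW_DEFAULT_SUFFIX[new_type]
```

The third element tells whether the alias has to override the default
suffix of the new type: that is needed only for `glob:` via
include/exclude and for `relglob:` as a pattern.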


> > New systematic "*glob:" family doesn't match recursively, unless "**"
> > is specified at the end of pattern. Therefore, extra explanation about
> > recursion is needed only for "glob:" via -I/-X and hgignore.
> >
> > (sorry, if I misunderstand your suggestion)
> 
> I do not really have a suggestion for exact behavior here. Beside having 
> a giant table with all the data about the existing and planned 
> pattern-type so that we can check how consistent the new plan is.
> 
> Cheers
> 
> -- 
> Pierre-Yves David
> 

----------------------------------------------------------------------
[FUJIWARA Katsunori]                             foozy at lares.dti.ne.jp

