[PATCH 4 of 5 V4] match: avoid translating glob to matcher multiple times for large sets

Sat Dec 1 19:38:34 EST 2018

I have a few more questions. Sorry I didn't notice last time (because I
only considered the higher-level issues).

On Sat, Dec 1, 2018 at 4:07 AM Boris Feld <boris.feld at octobus.net> wrote:

> # HG changeset patch
> # User Boris Feld <boris.feld at octobus.net>
> # Date 1542916922 -3600
> #      Thu Nov 22 21:02:02 2018 +0100
> # Node ID 0a28ff17100a67c21a99ef363f15aef09e4dfa8b
> # Parent  066912081df7b43ed271bea4475e35661e16c1cf
> # EXP-Topic perf-ignore
> # Available At https://bitbucket.org/octobus/mercurial-devel/
> #              hg pull https://bitbucket.org/octobus/mercurial-devel/ -r
> 0a28ff17100a
> match: avoid translating glob to matcher multiple times for large sets
>
> For hgignore with many globs, the resulting regexp might not fit under the
> 20K
> length limit. So the patterns need to be broken up in smaller pieces.
>
> Before this change, the logic was re-starting the full process from scratch
> for each smaller pieces, including the translation of globs into regexp.
> Effectively doing the work over and over.
>
> If the 20K limit is reached, we are likely in a case where there is many
> such
> glob, so exporting them is especially expensive and we should be careful
> not
> to do that work more than once.
>
> To work around this, we now translate glob to regexp once and for all.
> Then,
> we assemble the resulting individual regexp into valid blocks.
>
> This raises a very significant performance win for large `.hgignore file`:
>
> Before: ! wall 0.153153 comb 0.150000 user 0.150000 sys 0.000000 (median
> of 66)
> After:  ! wall 0.059793 comb 0.060000 user 0.060000 sys 0.000000 (median
> of 100)
>
> diff --git a/mercurial/match.py b/mercurial/match.py
> --- a/mercurial/match.py
> +++ b/mercurial/match.py
> @@ -1185,6 +1185,7 @@ def _buildmatch(kindpats, globsuffix, li
>          return regex, lambda f: any(mf(f) for mf in matchfuncs)
>
>  MAX_RE_SIZE = 20000
> +_BASE_SIZE = len('(?:)') - 1
>

I think I asked before why MAX_RE_SIZE is "public" and _BASE_SIZE is
"private".

Also, why do you subtract 1 from BASE_SIZE?

>
>  def _joinregexes(regexps):
>      """gather multiple regular expressions into a single one"""
> @@ -1203,20 +1204,31 @@ def _buildregexmatch(kindpats, globsuffi
>      OverflowError
>      """
>      try:
> -        regex = _joinregexes([_regex(k, p, globsuffix)
> -                                     for (k, p, s) in kindpats])
> -        if len(regex) <= MAX_RE_SIZE:
> -            return regex, _rematcher(regex)
> -        # We're using a Python with a tiny regex engine and we
> -        # made it explode, so we'll divide the pattern list in two
> -        # until it works
> -        l = len(kindpats)
> -        if l < 2:
> -            # TODO: raise error.Abort here
> -            raise OverflowError
> -        regexa, a = _buildregexmatch(kindpats[:l//2], globsuffix)
> -        regexb, b = _buildregexmatch(kindpats[l//2:], globsuffix)
> -        return regex, lambda s: a(s) or b(s)
> +        allgroups = []
> +        regexps = [_regex(k, p, globsuffix) for (k, p, s) in kindpats]
> +        fullregexp = _joinregexes(regexps)
> +
> +        startidx = 0
> +        groupsize = _BASE_SIZE
> +        for idx, r in enumerate(regexps):
> +            piecesize = len(r)
>

This size is excluding the unnamed group wrapping, right? (I'm not asking
for a change here, just as context for later questions.)

+            if (piecesize + 4) > MAX_RE_SIZE:
>

What's this 4? I think it's for the unnamed group wrapping, i.e.
(BASE_SIZE + 1). Use that?

> +                raise OverflowError
> +            elif (groupsize + 1 + piecesize) > MAX_RE_SIZE:
> +                group = regexps[startidx:idx]
> +                allgroups.append(_joinregexes(group))
> +                startidx = idx
> +                groupsize = _BASE_SIZE
> +            groupsize += piecesize + 1
>

I suppose the "1" is for the "|"-separator. Shouldn't we add (BASE_SIZE +
1) here too (and maybe redefine BASE_SIZE to include the additional 1)?
Perhaps the

Would it be correct to initialize groupsize to 0 and add

+
> +        if startidx == 0:
> +            func = _rematcher(fullregexp)
> +        else:
> +            group = regexps[startidx:]
> +            allgroups.append(_joinregexes(group))
> +            allmatchers = [_rematcher(g) for g in allgroups]
> +            func = lambda s: any(m(s) for m in allmatchers)
> +        return fullregexp, func
>      except re.error:
>          for k, p, s in kindpats:
>              try:
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.mercurial-scm.org/pipermail/mercurial-devel/attachments/20181201/2b70de6a/attachment.html>