[PATCH 4 of 5 V4] match: avoid translating glob to matcher multiple times for large sets

Sun Dec 2 15:51:04 EST 2018

On Sun, Dec 2, 2018 at 7:26 AM Boris FELD <boris.feld at octobus.net> wrote:

>
> On 02/12/2018 01:38, Martin von Zweigbergk via Mercurial-devel wrote:
>
> I have a few more questions. Sorry I didn't notice last time (because I
> only considered the higher-level issues).
>
> On Sat, Dec 1, 2018 at 4:07 AM Boris Feld <boris.feld at octobus.net> wrote:
>
>> # HG changeset patch
>> # User Boris Feld <boris.feld at octobus.net>
>> # Date 1542916922 -3600
>> #      Thu Nov 22 21:02:02 2018 +0100
>> # Node ID 0a28ff17100a67c21a99ef363f15aef09e4dfa8b
>> # Parent  066912081df7b43ed271bea4475e35661e16c1cf
>> # EXP-Topic perf-ignore
>> # Available At https://bitbucket.org/octobus/mercurial-devel/
>> #              hg pull https://bitbucket.org/octobus/mercurial-devel/ -r
>> 0a28ff17100a
>> match: avoid translating glob to matcher multiple times for large sets
>>
>> For hgignore with many globs, the resulting regexp might not fit under
>> the 20K
>> length limit. So the patterns need to be broken up in smaller pieces.
>>
>> Before this change, the logic was re-starting the full process from
>> scratch
>> for each smaller pieces, including the translation of globs into regexp.
>> Effectively doing the work over and over.
>>
>> If the 20K limit is reached, we are likely in a case where there is many
>> such
>> glob, so exporting them is especially expensive and we should be careful
>> not
>> to do that work more than once.
>>
>> To work around this, we now translate glob to regexp once and for all.
>> Then,
>> we assemble the resulting individual regexp into valid blocks.
>>
>> This raises a very significant performance win for large `.hgignore file`:
>>
>> Before: ! wall 0.153153 comb 0.150000 user 0.150000 sys 0.000000 (median
>> of 66)
>> After:  ! wall 0.059793 comb 0.060000 user 0.060000 sys 0.000000 (median
>> of 100)
>>
>> diff --git a/mercurial/match.py b/mercurial/match.py
>> --- a/mercurial/match.py
>> +++ b/mercurial/match.py
>> @@ -1185,6 +1185,7 @@ def _buildmatch(kindpats, globsuffix, li
>>          return regex, lambda f: any(mf(f) for mf in matchfuncs)
>>
>>  MAX_RE_SIZE = 20000
>> +_BASE_SIZE = len('(?:)') - 1
>>
>
> I think I asked before why MAX_RE_SIZE is "public" and _BASE_SIZE is
> "private".
>
>
> MAX_RE_SIZE is a limit that impacts the external world (since RE larger
> than 20 000 are rejected). The other one is a pure implementation detail.
>
> I don't feel strongly about it and I'm fine aligning them one way or
> another if you feel like it.
>
>
> Also, why do you subtract 1 from BASE_SIZE?
>
> Because for N joined regexp there will be N-1 '|'. See lower for more
> details
>
>
>
>>
>>  def _joinregexes(regexps):
>>      """gather multiple regular expressions into a single one"""
>> @@ -1203,20 +1204,31 @@ def _buildregexmatch(kindpats, globsuffi
>>      OverflowError
>>      """
>>      try:
>> -        regex = _joinregexes([_regex(k, p, globsuffix)
>> -                                     for (k, p, s) in kindpats])
>> -        if len(regex) <= MAX_RE_SIZE:
>> -            return regex, _rematcher(regex)
>> -        # We're using a Python with a tiny regex engine and we
>> -        # made it explode, so we'll divide the pattern list in two
>> -        # until it works
>> -        l = len(kindpats)
>> -        if l < 2:
>> -            # TODO: raise error.Abort here
>> -            raise OverflowError
>> -        regexa, a = _buildregexmatch(kindpats[:l//2], globsuffix)
>> -        regexb, b = _buildregexmatch(kindpats[l//2:], globsuffix)
>> -        return regex, lambda s: a(s) or b(s)
>> +        allgroups = []
>> +        regexps = [_regex(k, p, globsuffix) for (k, p, s) in kindpats]
>> +        fullregexp = _joinregexes(regexps)
>> +
>> +        startidx = 0
>> +        groupsize = _BASE_SIZE
>> +        for idx, r in enumerate(regexps):
>> +            piecesize = len(r)
>>
>
> This size is excluding the unnamed group wrapping, right? (I'm not asking
> for a change here, just as context for later questions.)
>
> Yes, this is the pure size of the regexps produced by `_regex`.
>
>
> +            if (piecesize + 4) > MAX_RE_SIZE:
>>
>
> What's this 4? I think it's for the unnamed group wrapping, i.e.
> (BASE_SIZE + 1). Use that?
>
> Yes, another way to do this would be to avoid the `-1` on _BASE_SIZE and
> applying it explicitly in the `groupsize = _BASE_SIZE` lines (`groupsize =
> _BASE_SIZE + 1`)
>

Okay, as long as it's not wrong in this version of the patch, we can just
do that in a follow-up if we decide to do it at all. Queued, thanks.

>
>
>> +                raise OverflowError
>> +            elif (groupsize + 1 + piecesize) > MAX_RE_SIZE:
>> +                group = regexps[startidx:idx]
>> +                allgroups.append(_joinregexes(group))
>> +                startidx = idx
>> +                groupsize = _BASE_SIZE
>> +            groupsize += piecesize + 1
>>
>
> I suppose the "1" is for the "|"-separator. Shouldn't we add (BASE_SIZE +
> 1) here too
>
> The 1 is for the |. we must not add _BASE_SIZE since there is only a
> single external grouping all joined regexp. _BASE_SIZE added once for each
> group.
>
> (and maybe redefine BASE_SIZE to include the additional 1)? Perhaps the
>
> Sentence seems to have been cut.
>

Oops, yes. (I got interrupted and set it in rush.)

>
> Would it be correct to initialize groupsize to 0 and add
>
> This sentence seems to have been cut too.
>
>
> +
>> +        if startidx == 0:
>> +            func = _rematcher(fullregexp)
>> +        else:
>> +            group = regexps[startidx:]
>> +            allgroups.append(_joinregexes(group))
>> +            allmatchers = [_rematcher(g) for g in allgroups]
>> +            func = lambda s: any(m(s) for m in allmatchers)
>> +        return fullregexp, func
>>      except re.error:
>>          for k, p, s in kindpats:
>>              try:
>>
>
> _______________________________________________
> Mercurial-devel mailing listMercurial-devel at mercurial-scm.orghttps://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.mercurial-scm.org/pipermail/mercurial-devel/attachments/20181202/e3f38644/attachment.html>