[PATCH 5 of 6 V2] match: avoid translating glob to matcher multiple times for large sets

Boris FELD boris.feld at octobus.net
Wed Nov 28 10:41:13 EST 2018


On 24/11/2018 00:51, Martin von Zweigbergk via Mercurial-devel wrote:
>
>
> On Fri, Nov 23, 2018 at 9:20 AM Boris FELD <boris.feld at octobus.net> wrote:
>
>
>     On 23/11/2018 10:24, Yuya Nishihara wrote:
>     > On Fri, 23 Nov 2018 18:00:36 +0900, Yuya Nishihara wrote:
>     >> On Fri, 23 Nov 2018 00:00:36 -0800, Martin von Zweigbergk via
>     Mercurial-devel wrote:
>     >>> On Thu, Nov 22, 2018 at 11:44 PM Martin von Zweigbergk
>     >>> <martinvonz at google.com> wrote:
>     >>>> On Thu, Nov 22, 2018 at 2:26 PM Boris Feld
>     >>>> <boris.feld at octobus.net> wrote:
>     >>>>
>     >>>>> # HG changeset patch
>     >>>>> # User Boris Feld <boris.feld at octobus.net>
>     >>>>> # Date 1542916922 -3600
>     >>>>> #      Thu Nov 22 21:02:02 2018 +0100
>     >>>>> # Node ID 018578f3ab597d5ea573107e7310470de76a3907
>     >>>>> # Parent  4628c3cf1fc1052ca25296c8c1a42c4502b59dc9
>     >>>>> # EXP-Topic perf-ignore-2
>     >>>>> # Available At https://bitbucket.org/octobus/mercurial-devel/
>     >>>>> #              hg pull https://bitbucket.org/octobus/mercurial-devel/ -r 018578f3ab59
>     >>>>> match: avoid translating glob to matcher multiple times for large sets
>     >>>>>
>     >>>>> For hgignore with many globs, the resulting regexp might not fit
>     >>>>> under the 20K length limit, so the patterns need to be broken up
>     >>>>> into smaller pieces.
>     >>>>>
>     >>>> Did you see 0f6a1bdf89fb (match: handle large regexes,
>     2007-08-19)
>     >>>> and 59a9dc9562e2 (ignore: split up huge patterns,
>     2008-02-11)? It might be
>     >>>> worth trying to figure out what Python versions those commits
>     are talking
>     >>>> about. Maybe we've dropped support for those versions and we
>     can simplify
>     >>>> this code.
>     >>>>
>     >>> Oh, and what made me do the archaeology there was that you
>     seem to have
>     lost the handling of OverflowError from the regex engine. As I
>     said above, I
>     >>> suspect that's fine because we no longer support some very old
>     Python
>     >>> versions (but please try to figure out what version that
>     refers to). Still,
>     >>> if we decide to drop that OverflowError handling, I'd prefer
>     to see that in
>     >>> an explicit commit early in this series.
>     To me, 0f6a1bdf89fb (catching error from engine) is superseded by
>     59a9dc9562e2 (cannot trust the engine, preemptively raise our own
>     error).
>
>
> Yes, perhaps (if it was only expressions longer than 20k that raised
> OverflowError). My point was that if that was the case, we should
> rewrite to avoid using an internal exception for flow control, i.e.
> change from this:
>
>     try:
>         regex = # create regex
>         if len(regex) > MAX_RE_SIZE:
>             raise OverflowError
>         return regex, _rematcher(regex)
>     except OverflowError:
>         # break up into smaller
>
> to this:
>
>     regex = # create regex
>     if len(regex) < MAX_RE_SIZE:
>         return regex, _rematcher(regex)
>     # break up into smaller
I don't disagree with the point. However, I do not know why it is
relevant to the current discussion. The series in review is no longer
using an exception for flow control. What am I missing?
>
>     So I feel like it is fine to just rely on the size limit.
>     >> Perhaps it's been fixed since 2.7.4. The regexp code width was
>     >> extended from 16-bit to 32-bit (or Py_UCS4) integers. That should
>     >> be large enough to handle practical patterns.
>     >>
>     >> https://bugs.python.org/issue1160
>
>     Thanks for digging this out. It looks like we may be able to drop this
>     limit altogether. However, I would like to make it a change distinct
>     from this series.
>
>     The current code is very problematic for some people (to the point
>     where the majority of `hg status` time is spent in that function). I
>     would like to get fast code for the same semantics first, then look
>     into changing the semantics.
>
>
> Is your concern that you might regress in performance of something by
> changing how large the groups are? Or that it would be more work?

My concern is that we are delaying a simple code change that yields a
huge improvement because there is the option of a larger, riskier
change.

I'm not dismissing your findings about lifting the limit and avoiding
the regexp joining altogether. They look very promising and we are
planning to explore them. However, they are a different change.

The current series preserves all the assumptions we have been relying
on for the past 10 years, implementing faster code for the same result.
This is very low risk and solves the most immediate problem of a user
group we support.

Dropping the limit is a different matter. It is probably harmless to do
so, but Mercurial runs on a variety of systems and regexp engines. We'll
have to look at all the cases to make sure our new assumption is valid.
The same goes for dropping the grouping: we'll have to check the
behavior on various real-life use cases to make sure it holds up.

We are happy to explore these two ideas ourselves, but I would prefer to
see the faster grouping land first. It is a good intermediate step that
improves the current situation.

In addition, we won't have direct access to the real-life repository for
about two months, so it will be hard for us to check the timing of a
significantly different approach on real-life data.

>
> I tried creating a regex for *every* pattern and that actually seemed
> faster (to my surprise), both when creating the matcher and when
> evaluating it. I tried it on the mozilla-unified repo both with 1k
> files and with 10k files in the hgignores. I used the following patch
> on top of your series.
Could you share more details on the patterns you generated, the
operations you performed, and how you gathered the timings?
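
Something along the following lines would make the comparison easy to
reproduce (a rough sketch only: the pattern shapes and counts are
invented, and re.purge() is called so the module's regexp cache does not
skew the build timings):

    import re
    import timeit

    NPATTERNS = 1000
    # crude stand-ins for translated hgignore globs
    regexps = [r'dir%d/.*\.py$' % i for i in range(NPATTERNS)]

    def build_single():
        # one big alternation, as before your patch
        return re.compile('(?:%s)' % '|'.join(regexps)).match

    def build_per_pattern():
        # one compiled regexp per pattern, as in your patch
        matchers = [re.compile(r).match for r in regexps]
        return lambda s: any(m(s) for m in matchers)

    for name, build in [('single', build_single),
                        ('per-pattern', build_per_pattern)]:
        t_build = timeit.timeit(lambda: (re.purge(), build()), number=10)
        match = build()
        t_eval = timeit.timeit(lambda: match('dir999/foo.py'), number=1000)
        print('%-12s build: %.3fs  eval: %.3fs' % (name, t_build, t_eval))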
>
> diff --git a/mercurial/match.py b/mercurial/match.py
> --- a/mercurial/match.py
> +++ b/mercurial/match.py
> @@ -1184,51 +1184,15 @@ def _buildmatch(kindpats, globsuffix, li
>      else:
>          return regex, lambda f: any(mf(f) for mf in matchfuncs)
>
> -MAX_RE_SIZE = 20000
> -_BASE_SIZE = len('(?:)') - 1
> -
> -def _joinregexes(regexps):
> -    """gather multiple regular expressions into a single one"""
> -    return '(?:%s)' % '|'.join(regexps)
> -
>  def _buildregexmatch(kindpats, globsuffix):
>      """Build a match function from a list of kinds and kindpats,
>      return regexp string and a matcher function.
> -
> -    Test too large input
> -    >>> _buildregexmatch([
> -    ...     ('relglob', '?' * MAX_RE_SIZE, '')
> -    ... ], '$')
> -    Traceback (most recent call last):
> -    ...
> -    Abort: matcher pattern is too long (20009 bytes)
>      """
>      try:
> -        allgroups = []
>          regexps = [_regex(k, p, globsuffix) for (k, p, s) in kindpats]
> -        fullregexp = _joinregexes(regexps)
> -
> -        startidx = 0
> -        groupsize = _BASE_SIZE
> -        for idx, r in enumerate(regexps):
> -            piecesize = len(r)
> -            if (piecesize + 4) > MAX_RE_SIZE:
> -                msg = _("matcher pattern is too long (%d bytes)") % piecesize
> -                raise error.Abort(msg)
> -            elif (groupsize + 1 + piecesize) > MAX_RE_SIZE:
> -                group = regexps[startidx:idx]
> -                allgroups.append(_joinregexes(group))
> -                startidx = idx
> -                groupsize = _BASE_SIZE
> -            groupsize += piecesize + 1
> -
> -        if startidx == 0:
> -            func = _rematcher(fullregexp)
> -        else:
> -            group = regexps[startidx:]
> -            allgroups.append(_joinregexes(group))
> -            allmatchers = [_rematcher(g) for g in allgroups]
> -            func = lambda s: any(m(s) for m in allmatchers)
> +        fullregexp = '(?:%s)' % '|'.join(regexps)
> +        allmatchers = [_rematcher(r) for r in regexps]
> +        func = lambda s: any(m(s) for m in allmatchers)
>          return fullregexp, func
>      except re.error:
>          for k, p, s in kindpats:
>
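
To make sure we are measuring the same thing: the grouping logic this
experimental patch removes boils down to the standalone sketch below
(simplified from the series: MAX_RE_SIZE and the 4-byte per-piece
overhead are the same, the error handling is not):

    import re

    MAX_RE_SIZE = 20000
    _BASE_SIZE = len('(?:)') - 1  # group overhead minus one '|' separator

    def _joinregexes(regexps):
        """gather multiple regular expressions into a single one"""
        return '(?:%s)' % '|'.join(regexps)

    def groupedmatcher(regexps):
        """compile regexps in chunks, each joined chunk < MAX_RE_SIZE"""
        groups = []
        startidx = 0
        groupsize = _BASE_SIZE
        for idx, r in enumerate(regexps):
            piecesize = len(r)
            if piecesize + 4 > MAX_RE_SIZE:
                raise ValueError('pattern is too long (%d bytes)' % piecesize)
            elif groupsize + 1 + piecesize > MAX_RE_SIZE:
                # the current group is full: close it and start a new one
                groups.append(_joinregexes(regexps[startidx:idx]))
                startidx = idx
                groupsize = _BASE_SIZE
            groupsize += piecesize + 1
        groups.append(_joinregexes(regexps[startidx:]))
        matchers = [re.compile(g).match for g in groups]
        if len(matchers) == 1:
            return matchers[0]
        return lambda s: any(m(s) for m in matchers)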
>     > That said, combining more chunks of regex patterns might lead to
>     > another funny problem.
>     >
>     > % python -c 'import re; re.compile("(a)" * 100)'
>     > Traceback (most recent call last):
>     >   File "<string>", line 1, in <module>
>     >   File "/usr/lib/python2.7/re.py", line 194, in compile
>     >     return _compile(pattern, flags)
>     >   File "/usr/lib/python2.7/re.py", line 249, in _compile
>     >     p = sre_compile.compile(pattern, flags)
>     >   File "/usr/lib/python2.7/sre_compile.py", line 583, in compile
>     >     "sorry, but this version only supports 100 named groups"
>     > AssertionError: sorry, but this version only supports 100 named
>     groups
>     >
>     > It's unrelated to the OverflowError issue, but splitting patterns
>     > could help avoid the 100-named-group problem.
>
>     By chance, my current gigantic use case does not involve named groups.
>
>     Catching AssertionError will be fun. I wish there were some clean API
>     to expose and check engine limitations.
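
For illustration, splitting sidesteps the group limit like this (a toy
example; the 100-group cap applies to Python 2 and was lifted in newer
Python 3 releases):

    import re

    # 250 capturing groups in total; compiled as a single regexp this
    # triggers "sorry, but this version only supports 100 named groups"
    # on Python 2
    pats = ['(a%d)' % i for i in range(250)]

    CHUNK = 99  # stay under the 100-group limit per compiled regexp
    matchers = [re.compile('(?:%s)' % '|'.join(pats[i:i + CHUNK])).match
                for i in range(0, len(pats), CHUNK)]
    ignored = lambda s: any(m(s) for m in matchers)
    print(ignored('a42'))   # True
    print(ignored('zzzz'))  # False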