[PATCH 5 of 6 V2] match: avoid translating glob to matcher multiple times for large sets

Boris FELD boris.feld at octobus.net
Sat Dec 1 12:05:55 UTC 2018


On 28/11/2018 19:30, Martin von Zweigbergk via Mercurial-devel wrote:
>
>
> On Wed, Nov 28, 2018 at 7:41 AM Boris FELD <boris.feld at octobus.net
> <mailto:boris.feld at octobus.net>> wrote:
>
>
>     On 24/11/2018 00:51, Martin von Zweigbergk via Mercurial-devel wrote:
>>
>>
>>     On Fri, Nov 23, 2018 at 9:20 AM Boris FELD
>>     <boris.feld at octobus.net <mailto:boris.feld at octobus.net>> wrote:
>>
>>
>>         On 23/11/2018 10:24, Yuya Nishihara wrote:
>>         > On Fri, 23 Nov 2018 18:00:36 +0900, Yuya Nishihara wrote:
>>         >> On Fri, 23 Nov 2018 00:00:36 -0800, Martin von Zweigbergk
>>         via Mercurial-devel wrote:
>>         >>> On Thu, Nov 22, 2018 at 11:44 PM Martin von Zweigbergk <
>>         >>> martinvonz at google.com <mailto:martinvonz at google.com>> wrote:
>>         >>>> On Thu, Nov 22, 2018 at 2:26 PM Boris Feld
>>         <boris.feld at octobus.net <mailto:boris.feld at octobus.net>> wrote:
>>         >>>>
>>         >>>>> # HG changeset patch
>>         >>>>> # User Boris Feld <boris.feld at octobus.net
>>         <mailto:boris.feld at octobus.net>>
>>         >>>>> # Date 1542916922 -3600
>>         >>>>> #      Thu Nov 22 21:02:02 2018 +0100
>>         >>>>> # Node ID 018578f3ab597d5ea573107e7310470de76a3907
>>         >>>>> # Parent  4628c3cf1fc1052ca25296c8c1a42c4502b59dc9
>>         >>>>> # EXP-Topic perf-ignore-2
>>         >>>>> # Available At
>>         https://bitbucket.org/octobus/mercurial-devel/
>>         >>>>> #              hg pull
>>         https://bitbucket.org/octobus/mercurial-devel/ -r
>>         >>>>> 018578f3ab59
>>         >>>>> match: avoid translating glob to matcher multiple times
>>         for large sets
>>         >>>>>
>>         >>>>> For hgignore with many globs, the resulting regexp
>>         might not fit under
>>         >>>>> the 20K
>>         >>>>> length limit. So the patterns need to be broken up into
>>         smaller pieces.
>>         >>>>>
>>         >>>> Did you see 0f6a1bdf89fb (match: handle large regexes,
>>         2007-08-19)
>>         >>>> and 59a9dc9562e2 (ignore: split up huge patterns,
>>         2008-02-11)? It might be
>>         >>>> worth trying to figure out what Python versions those
>>         commits are talking
>>         >>>> about. Maybe we've dropped support for those versions
>>         and we can simplify
>>         >>>> this code.
>>         >>>>
>>         >>> Oh, and what made me do the archaeology there was that
>>         you seem to have
>>         >>> lost the handling of OverflowError from the regex engine.
>>         As I said above, I
>>         >>> suspect that's fine because we no longer support some
>>         very old Python
>>         >>> versions (but please try to figure out what version that
>>         refers to). Still,
>>         >>> if we decide to drop that OverflowError handling, I'd
>>         prefer to see that in
>>         >>> an explicit commit early in this series.
>>         To me, 0f6a1bdf89fb (catching error from engine) is superseded by
>>         59a9dc9562e2 (cannot trust the engine, preemptively raise our
>>         own error).
>>
>>
>>     Yes, perhaps (if it was only expressions longer than 20k that
>>     raised OverflowError). My point was that if that was the case, we
>>     should rewrite to avoid using an internal exception for flow
>>     control, i.e. change from this:
>>
>>         try:
>>             regex = # create regex
>>             if len(regex) > MAX_RE_SIZE:
>>                 raise OverflowError
>>             return regex, _rematcher(regex)
>>         except OverflowError:
>>             # break up into smaller
>>
>>     to this:
>>
>>         regex = # create regex
>>         if len(regex) < MAX_RE_SIZE:
>>             return regex, _rematcher(regex)
>>         # break up into smaller
>     I don't disagree with the point. However, I do not know why it is
>     relevant to the current discussion. The series in review is no
>     longer using an exception for flow control. What am I missing?
>
>
> My request was to make the removal of the support for OverflowError
> explicit. As I said above, it seems it was just lost in this patch
> without a mention. I've sent D5309 to show what I mean. Your patch 6/6
> would go well right after that. You could rebase your series on top,
> or maybe I'm just being nitpicky and I should drop the patch.
Oh! I see now; that makes sense. It has been integrated into the series.
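
For the record, the explicit-check shape Martin sketched above boils down
to something like the following self-contained example (MAX_RE_SIZE, the
helper names, and the simplistic fallback are illustrative only, not the
exact code of the series or of D5309):

    import re

    MAX_RE_SIZE = 20000  # historical limit the series keeps; illustrative value

    def _rematcher(regex):
        # compile a pattern and return its match function
        return re.compile(regex).match

    def buildmatcher(regexps):
        # join every piece into one non-capturing alternation
        fullregexp = '(?:%s)' % '|'.join(regexps)
        if len(fullregexp) <= MAX_RE_SIZE:
            # common case: one compiled regex, no exception-driven flow control
            return fullregexp, _rematcher(fullregexp)
        # too large: fall back to one matcher per piece (grouping left out here)
        matchers = [_rematcher('(?:%s)' % r) for r in regexps]
        return fullregexp, lambda f: any(m(f) for m in matchers)
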
>
>>
>>      
>>
>>
>>         So I feel like it is fine to just rely on the size limit.
>>         >> Perhaps it's been fixed since 2.7.4. The regexp code width
>>         was extended
>>         >> from a 16-bit to a 32-bit (or Py_UCS4) integer. That should be
>>         large enough to
>>         >> handle practical patterns.
>>         >>
>>         >> https://bugs.python.org/issue1160
>>
>>         Thanks for digging this out. It looks like we may be able to
>>         drop this
>>         limit altogether. However, I would like to make it a change
>>         distinct
>>         from this series.
>>
>>         The current code is very problematic for some people (to the
>>         point where
>>         the majority of `hg status` time is spent in that function).
>>         I would
>>         like to get fast code for the same semantics first. Then look into
>>         changing the semantics.
>>
>>
>>     Is your concern that you might regress in performance of
>>     something by changing how large the groups are? Or that it would
>>     be more work?
>
>     My concern is that we are currently delaying a simple code change
>     that yields a huge improvement because there is the option of a
>     larger, riskier one.
>
>     I'm not dismissing your findings about lifting the limit and
>     avoiding the joining of regexps. They look very promising and we
>     are planning to explore them. However, they are a different change.
>
>     The current series preserves all the assumptions we have been
>     using for the past 10 years, implementing faster code for the
>     same result. This is very low risk and solves the most immediate
>     problem of a user group we support.
>
>     Dropping the limit is a different matter. It is probably harmless
>     to do so, but Mercurial runs on a variety of systems and regexp
>     engines. We'll have to look at all the cases to make sure our new
>     assumption is valid.
>     The same goes for stopping the grouping; we'll have to check the
>     behavior on various real-life use cases to make sure it is valid.
>
>
> I suspect it will be best (depending on regex engine) to either not do
> any grouping, or to put all patterns in one group. I'd like (for
> someone) to see how re2 behaves here.
We'll look into it. Can we get the current changes landed in the meantime?
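
When we do, a rough way to compare the two extremes (one big alternation
versus one compiled regex per pattern) on the stock `re` module,
independent of any Mercurial code, could look like the sketch below; the
patterns and paths are synthetic and deliberately non-matching, which is
the worst case for the per-pattern variant:

    import re
    import timeit

    # synthetic glob-like patterns already translated to regexes
    regexps = [r'.*/dir%d/.*\.tmp$' % i for i in range(1000)]
    # paths that match none of the patterns (every matcher must be tried)
    paths = ['some/dir%d/file%d.txt' % (i % 1000, i) for i in range(10000)]

    onegroup = re.compile('(?:%s)' % '|'.join(regexps)).match
    permatchers = [re.compile(r).match for r in regexps]
    nogroup = lambda f: any(m(f) for m in permatchers)

    for name, func in [('one group', onegroup), ('no grouping', nogroup)]:
        t = timeit.timeit(lambda: [func(p) for p in paths], number=3)
        print('%-12s %.3fs' % (name, t))
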
>
>
>     We are happy to explore these two ideas ourselves,
>
>
> Thanks!
>  
>
>     but I would prefer to see the faster grouping land first. This is
>     a good intermediate step that improves the current situation.
>
>     In addition, we won't have direct access to the real-life
>     repository for about two months, so it will be hard for us to
>     check the timing of a significantly new approach on real-life data.
>
>>
>>     I tried creating a regex for *every* pattern and that actually
>>     seemed faster (to my surprise), both when creating the matcher
>>     and when evaluating it. I tried it on the mozilla-unified repo
>>     both with 1k files and with 10k files in the hgignores. I used
>>     the following patch on top of your series.
>     Could you share more details on the patterns you generated, the
>     operations you performed, and how you gathered the timings?
>
>
> Have to run to a meeting. Will get back to you later.
>
>>
>>     diff --git a/mercurial/match.py b/mercurial/match.py
>>     --- a/mercurial/match.py
>>     +++ b/mercurial/match.py
>>     @@ -1184,51 +1184,15 @@ def _buildmatch(kindpats, globsuffix, li
>>          else:
>>              return regex, lambda f: any(mf(f) for mf in matchfuncs)
>>
>>     -MAX_RE_SIZE = 20000
>>     -_BASE_SIZE = len('(?:)') - 1
>>     -
>>     -def _joinregexes(regexps):
>>     -    """gather multiple regular expressions into a single one"""
>>     -    return '(?:%s)' % '|'.join(regexps)
>>     -
>>      def _buildregexmatch(kindpats, globsuffix):
>>          """Build a match function from a list of kinds and kindpats,
>>          return regexp string and a matcher function.
>>     -
>>     -    Test too large input
>>     -    >>> _buildregexmatch([
>>     -    ...     ('relglob', '?' * MAX_RE_SIZE, '')
>>     -    ... ], '$')
>>     -    Traceback (most recent call last):
>>     -    ...
>>     -    Abort: matcher pattern is too long (20009 bytes)
>>          """
>>          try:
>>     -        allgroups = []
>>              regexps = [_regex(k, p, globsuffix) for (k, p, s) in kindpats]
>>     -        fullregexp = _joinregexes(regexps)
>>     -
>>     -        startidx = 0
>>     -        groupsize = _BASE_SIZE
>>     -        for idx, r in enumerate(regexps):
>>     -            piecesize = len(r)
>>     -            if (piecesize + 4) > MAX_RE_SIZE:
>>     -                msg = _("matcher pattern is too long (%d bytes)") % piecesize
>>     -                raise error.Abort(msg)
>>     -            elif (groupsize + 1 + piecesize) > MAX_RE_SIZE:
>>     -                group = regexps[startidx:idx]
>>     -                allgroups.append(_joinregexes(group))
>>     -                startidx = idx
>>     -                groupsize = _BASE_SIZE
>>     -            groupsize += piecesize + 1
>>     -
>>     -        if startidx == 0:
>>     -            func = _rematcher(fullregexp)
>>     -        else:
>>     -            group = regexps[startidx:]
>>     -            allgroups.append(_joinregexes(group))
>>     -            allmatchers = [_rematcher(g) for g in allgroups]
>>     -            func = lambda s: any(m(s) for m in allmatchers)
>>     +        fullregexp = '(?:%s)' % '|'.join(regexps)
>>     +        allmatchers = [_rematcher(r) for r in regexps]
>>     +        func = lambda s: any(m(s) for m in allmatchers)
>>              return fullregexp, func
>>          except re.error:
>>              for k, p, s in kindpats:
>>
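
For readers following along, the grouping that the experiment above
removes (and that this series introduces) does roughly the following;
this is a simplified standalone sketch that raises ValueError where the
real code uses error.Abort:

    MAX_RE_SIZE = 20000
    _BASE_SIZE = len('(?:)') - 1  # fixed overhead of the wrapper, minus one '|'

    def _joinregexes(regexps):
        # gather multiple regular expressions into a single alternation
        return '(?:%s)' % '|'.join(regexps)

    def groupregexps(regexps):
        # split regexps into chunks whose joined form stays under MAX_RE_SIZE
        groups, start, size = [], 0, _BASE_SIZE
        for idx, r in enumerate(regexps):
            if len(r) + 4 > MAX_RE_SIZE:
                raise ValueError('matcher pattern is too long (%d bytes)' % len(r))
            if size + 1 + len(r) > MAX_RE_SIZE:
                # current chunk is full: close it and start a new one
                groups.append(_joinregexes(regexps[start:idx]))
                start, size = idx, _BASE_SIZE
            size += len(r) + 1
        groups.append(_joinregexes(regexps[start:]))
        return groups

A caller then compiles each group and ORs the resulting match functions
together, as the removed allmatchers code above shows.
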
>>
>>      
>>
>>
>>         > That said, combining more chunks of regex patterns might
>>         > lead to another funny problem.
>>         >
>>         > % python -c 'import re; re.compile("(a)" * 100)'
>>         > Traceback (most recent call last):
>>         >   File "<string>", line 1, in <module>
>>         >   File "/usr/lib/python2.7/re.py", line 194, in compile
>>         >     return _compile(pattern, flags)
>>         >   File "/usr/lib/python2.7/re.py", line 249, in _compile
>>         >     p = sre_compile.compile(pattern, flags)
>>         >   File "/usr/lib/python2.7/sre_compile.py", line 583, in compile
>>         >     "sorry, but this version only supports 100 named groups"
>>         > AssertionError: sorry, but this version only supports 100
>>         named groups
>>         >
>>         > It's unrelated to the OverflowError issue, but splitting
>>         patterns could
>>         > help avoid the 100-named-group problem.
>>
>>         By chance, my current gigantic use case does not involve
>>         named groups.
>>
>>         Catching AssertionError will be fun. I wish there were some
>>         clean API
>>         to expose and check engine limitations.
>>