[PATCH 5 of 6 V2] match: avoid translating glob to matcher multiple times for large sets
Boris FELD
boris.feld at octobus.net
Sat Dec 1 12:05:55 UTC 2018
On 28/11/2018 19:30, Martin von Zweigbergk via Mercurial-devel wrote:
>
>
> On Wed, Nov 28, 2018 at 7:41 AM Boris FELD <boris.feld at octobus.net
> <mailto:boris.feld at octobus.net>> wrote:
>
>
> On 24/11/2018 00:51, Martin von Zweigbergk via Mercurial-devel wrote:
>>
>>
>> On Fri, Nov 23, 2018 at 9:20 AM Boris FELD
>> <boris.feld at octobus.net <mailto:boris.feld at octobus.net>> wrote:
>>
>>
>> On 23/11/2018 10:24, Yuya Nishihara wrote:
>> > On Fri, 23 Nov 2018 18:00:36 +0900, Yuya Nishihara wrote:
>> >> On Fri, 23 Nov 2018 00:00:36 -0800, Martin von Zweigbergk
>> via Mercurial-devel wrote:
>> >>> On Thu, Nov 22, 2018 at 11:44 PM Martin von Zweigbergk <
>> >>> martinvonz at google.com <mailto:martinvonz at google.com>> wrote:
>> >>>> On Thu, Nov 22, 2018 at 2:26 PM Boris Feld
>> <boris.feld at octobus.net <mailto:boris.feld at octobus.net>> wrote:
>> >>>>
>> >>>>> # HG changeset patch
>> >>>>> # User Boris Feld <boris.feld at octobus.net
>> <mailto:boris.feld at octobus.net>>
>> >>>>> # Date 1542916922 -3600
>> >>>>> #      Thu Nov 22 21:02:02 2018 +0100
>> >>>>> # Node ID 018578f3ab597d5ea573107e7310470de76a3907
>> >>>>> # Parent 4628c3cf1fc1052ca25296c8c1a42c4502b59dc9
>> >>>>> # EXP-Topic perf-ignore-2
>> >>>>> # Available At
>> https://bitbucket.org/octobus/mercurial-devel/
>> >>>>> #             hg pull
>> https://bitbucket.org/octobus/mercurial-devel/ -r
>> >>>>> 018578f3ab59
>> >>>>> match: avoid translating glob to matcher multiple times
>> for large sets
>> >>>>>
>> >>>>> For an hgignore with many globs, the resulting regexp might
>> >>>>> not fit under the 20K length limit, so the patterns need to
>> >>>>> be broken up into smaller pieces.
>> >>>>>
>> >>>> Did you see 0f6a1bdf89fb (match: handle large regexes,
>> 2007-08-19)
>> >>>> and 59a9dc9562e2 (ignore: split up huge patterns,
>> 2008-02-11)? It might be
>> >>>> worth trying to figure out what Python versions those
>> commits are talking
>> >>>> about. Maybe we've dropped support for those versions
>> and we can simplify
>> >>>> this code.
>> >>>>
>> >>> Oh, and what made me do the archaeology there was that
>> you seem to have
>> >>> lost the handling of OverlowError from the regex engine.
>> As I said above, I
>> >>> suspect that's fine because we no longer support some
>> very old Python
>> >>> versions (but please try to figure out what version that
>> refers to). Still,
>> >>> if we decide to drop that OverflowError handling, I'd
>> prefer to see that in
>> >>> an explicit commit early in this series.
>> To me, 0f6a1bdf89fb (catching error from engine) is superseded by
>> 59a9dc9562e2 (cannot trust the engine, preemptively raise our
>> own error).
>>
>>
>> Yes, perhaps (if it was only expressions longer than 20k that
>> raised OverflowError). My point was that if that was the case, we
>> should rewrite to avoid using an internal exception for flow
>> control, i.e. change from this:
>>
>>     try:
>>         regex = # create regex
>>         if len(regex) > MAX_RE_SIZE:
>>             raise OverflowError
>>         return regex, _rematcher(regex)
>>     except OverflowError:
>>         # break up into smaller
>>
>> to this:
>>
>>     regex = # create regex
>>     if len(regex) < MAX_RE_SIZE:
>>         return regex, _rematcher(regex)
>>     # break up into smaller
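The two shapes above can be compared in a small runnable sketch. `MAX_RE_SIZE` and `_rematcher` here are stand-ins mirroring the names in mercurial/match.py, not the real implementation:

```python
import re

MAX_RE_SIZE = 20000  # stand-in for the limit discussed in the thread

def _rematcher(regex):
    # stand-in for Mercurial's helper: compile and return the match method
    return re.compile(regex).match

def build(regexps):
    """Length-check variant: no exception-based flow control."""
    regex = '(?:%s)' % '|'.join(regexps)
    if len(regex) < MAX_RE_SIZE:
        return regex, _rematcher(regex)
    # too long to compile as one expression: break the patterns up
    # into individually compiled pieces and OR their results
    matchers = [_rematcher(r) for r in regexps]
    return regex, lambda s: any(m(s) for m in matchers)
```

The point of the second shape is that the size check happens before any compilation, so no exception from the engine is ever needed for control flow.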
> I don't disagree with the point. However, I do not know why it is
> relevant to the current discussion. The series in review is no
> longer using exception for flow control. What am I missing?
>
>
> My request was to make the removal of the support for OverflowError
> explicit. As I said above, it seems it was just lost in this patch
> without a mention. I've sent D5309 to show what I mean. Your patch 6/6
> would go well right after that. You could rebase your series on top,
> or maybe I'm just being nitpicky and I should drop the patch.
Oh! I see now; that makes sense. It has been integrated into the series.
>
>>
>>
>>
>>
>> So I feel like it is fine to just rely on the size limit.
>> >> Perhaps it's been fixed since 2.7.4. The regexp code width was
>> >> extended from a 16-bit to a 32-bit (Py_UCS4) integer. That
>> >> should be large enough to handle practical patterns.
>> >>
>> >> https://bugs.python.org/issue1160
>>
>> Thanks for digging this out. It looks like we may be able to
>> drop this
>> limit altogether. However, I would like to make it a change
>> distinct
>> from this series.
>>
>> The current code is very problematic for some people (to the
>> point where the majority of `hg status` time is spent in that
>> function). I would like to get fast code for the same semantics
>> first, then look into changing the semantics.
>>
>>
>> Is your concern that you might regress in performance of
>> something by changing how large the groups are? Or that it would
>> be more work?
>
> My concern is that we are currently delaying a simple code change
> that yields a huge improvement because there is the option of a
> larger, riskier one.
>
> I'm not dismissing your finding about lifting the limit and
> avoiding the regexps joining. They look very promising and we are
> planning to explore them. However, they are a different change.
>
> The current series preserves all the assumptions we have been
> using for the past 10 years, implementing faster code for the
> same result. This is very low risk and solves the most immediate
> problem of a user group we support.
>
> Dropping the limit is a different matter. It is probably harmless
> to do so, but Mercurial runs on a variety of systems and regexp
> engines. We'll have to look at all the cases to make sure our new
> assumption is valid.
> The same goes for stopping the grouping: we'll have to check the
> behavior on various real-life use cases to make sure it is valid.
>
>
> I suspect it will be best (depending on regex engine) to either not do
> any grouping, or to put all patterns in one group. I'd like (for
> someone) to see how re2 behaves here.
We'll look into it. Can we get the current changes landed in the meantime?
>
>
> We are happy to explore these two ideas ourselves,
>
>
> Thanks!
>
>
> but I would prefer to see the faster grouping in first. This is a
> good intermediate step that improves the current situation.
>
> In addition, we won't have direct access to the real-life
> repository for about two more months, so it will be hard for us to
> check the timing of a significantly new approach on real-life data.
>
>>
>> I tried creating a regex for *every* pattern and that actually
>> seemed faster (to my surprise), both when creating the matcher
>> and when evaluating it. I tried it on the mozilla-unified repo
>> both with 1k files and with 10k files in the hgignores. I used
>> the following patch on top of your series.
> Could you share more details on the patterns you generated, the
> operation you performed and how you gathered the timings?
>
>
> Have to run to a meeting. Will get back to you later.
>
>>
>> diff --git a/mercurial/match.py b/mercurial/match.py
>> --- a/mercurial/match.py
>> +++ b/mercurial/match.py
>> @@ -1184,51 +1184,15 @@ def _buildmatch(kindpats, globsuffix, li
>>      else:
>>          return regex, lambda f: any(mf(f) for mf in matchfuncs)
>>
>> -MAX_RE_SIZE = 20000
>> -_BASE_SIZE = len('(?:)') - 1
>> -
>> -def _joinregexes(regexps):
>> -    """gather multiple regular expressions into a single one"""
>> -    return '(?:%s)' % '|'.join(regexps)
>> -
>>  def _buildregexmatch(kindpats, globsuffix):
>>      """Build a match function from a list of kinds and kindpats,
>>      return regexp string and a matcher function.
>> -
>> -    Test too large input
>> -    >>> _buildregexmatch([
>> -    ...     ('relglob', '?' * MAX_RE_SIZE, '')
>> -    ... ], '$')
>> -    Traceback (most recent call last):
>> -    ...
>> -    Abort: matcher pattern is too long (20009 bytes)
>>      """
>>      try:
>> -        allgroups = []
>>          regexps = [_regex(k, p, globsuffix) for (k, p, s) in kindpats]
>> -        fullregexp = _joinregexes(regexps)
>> -
>> -        startidx = 0
>> -        groupsize = _BASE_SIZE
>> -        for idx, r in enumerate(regexps):
>> -            piecesize = len(r)
>> -            if (piecesize + 4) > MAX_RE_SIZE:
>> -                msg = _("matcher pattern is too long (%d bytes)") % piecesize
>> -                raise error.Abort(msg)
>> -            elif (groupsize + 1 + piecesize) > MAX_RE_SIZE:
>> -                group = regexps[startidx:idx]
>> -                allgroups.append(_joinregexes(group))
>> -                startidx = idx
>> -                groupsize = _BASE_SIZE
>> -            groupsize += piecesize + 1
>> -
>> -        if startidx == 0:
>> -            func = _rematcher(fullregexp)
>> -        else:
>> -            group = regexps[startidx:]
>> -            allgroups.append(_joinregexes(group))
>> -            allmatchers = [_rematcher(g) for g in allgroups]
>> -            func = lambda s: any(m(s) for m in allmatchers)
>> +        fullregexp = '(?:%s)' % '|'.join(regexps)
>> +        allmatchers = [_rematcher(r) for r in regexps]
>> +        func = lambda s: any(m(s) for m in allmatchers)
>>          return fullregexp, func
>>      except re.error:
>>          for k, p, s in kindpats:
>>
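For reference, the grouping scheme that the hunk above removes can be sketched in isolation. The constants mirror the patch under review, while `join` and `group_regexps` are illustrative names, not Mercurial's API:

```python
import re

MAX_RE_SIZE = 20000           # assumed cap, as in the series under review
_BASE_SIZE = len('(?:)') - 1  # wrapper overhead minus one '|' separator

def join(regexps):
    """Gather multiple regular expressions into a single one."""
    return '(?:%s)' % '|'.join(regexps)

def group_regexps(regexps, maxsize=MAX_RE_SIZE):
    """Batch patterns into joined groups that each stay under maxsize,
    preserving order (the scheme the patch in this thread implements)."""
    groups, start, size = [], 0, _BASE_SIZE
    for idx, r in enumerate(regexps):
        if len(r) + 4 > maxsize:
            # a single pattern that cannot fit even alone
            raise ValueError('pattern too long (%d bytes)' % len(r))
        if size + 1 + len(r) > maxsize:
            # adding this pattern would overflow: close the group
            groups.append(join(regexps[start:idx]))
            start, size = idx, _BASE_SIZE
        size += len(r) + 1
    groups.append(join(regexps[start:]))
    return groups
```

A string then matches the original pattern set iff it matches any of the joined groups, which is what the `any(m(s) for m in allmatchers)` lambda in the diff checks.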
>>
>>
>>
>>
>> > That said, combining more chunks of regex patterns might be
>> likely to
>> > lead to another funny problem.
>> >
>> > % python -c 'import re; re.compile("(a)" * 100)'
>> > Traceback (most recent call last):
>> >   File "<string>", line 1, in <module>
>> >   File "/usr/lib/python2.7/re.py", line 194, in compile
>> >     return _compile(pattern, flags)
>> >   File "/usr/lib/python2.7/re.py", line 249, in _compile
>> >     p = sre_compile.compile(pattern, flags)
>> >   File "/usr/lib/python2.7/sre_compile.py", line 583, in compile
>> >     "sorry, but this version only supports 100 named groups"
>> > AssertionError: sorry, but this version only supports 100 named groups
>> >
>> > It's unrelated to the OverflowError issue, but splitting
>> > patterns could help avoid the 100-named-group problem.
>>
>> By chance, my current gigantic use case does not involve
>> named groups.
>>
>> Catching AssertionError will be fun. I wish there were some
>> clean API to expose and check engine limitations.
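One way to work around such a hard group limit (older CPython builds asserted at 100 groups, as in the traceback above; recent releases have lifted that cap) would be to chunk patterns by their group count before joining. This is only an illustration of the idea, not what Mercurial actually does:

```python
import re

GROUP_CAP = 100  # the historical CPython limit discussed above

def chunk_by_groups(regexps, cap=GROUP_CAP):
    """Split patterns into chunks whose total capturing-group count
    stays under cap, so each chunk can be joined and compiled even on
    an engine with a hard group limit."""
    chunks, current, count = [], [], 0
    for r in regexps:
        # each individual pattern is small, so compiling it alone is safe
        n = re.compile(r).groups  # number of capturing groups in r
        if current and count + n >= cap:
            chunks.append(current)
            current, count = [], 0
        current.append(r)
        count += n
    if current:
        chunks.append(current)
    return chunks
```

As with the size-based grouping, a string would then be matched against each chunk's joined expression in turn.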
>>
>> > _______________________________________________
>> > Mercurial-devel mailing list
>> > Mercurial-devel at mercurial-scm.org
>> <mailto:Mercurial-devel at mercurial-scm.org>
>> > https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>>
>>
>
>