[PATCH 5 of 6 V2] match: avoid translating glob to matcher multiple times for large sets

Boris FELD boris.feld at octobus.net
Fri Nov 23 12:20:48 EST 2018


On 23/11/2018 10:24, Yuya Nishihara wrote:
> On Fri, 23 Nov 2018 18:00:36 +0900, Yuya Nishihara wrote:
>> On Fri, 23 Nov 2018 00:00:36 -0800, Martin von Zweigbergk via Mercurial-devel wrote:
>>> On Thu, Nov 22, 2018 at 11:44 PM Martin von Zweigbergk <
>>> martinvonz at google.com> wrote:
>>>> On Thu, Nov 22, 2018 at 2:26 PM Boris Feld <boris.feld at octobus.net> wrote:
>>>>
>>>>> # HG changeset patch
>>>>> # User Boris Feld <boris.feld at octobus.net>
>>>>> # Date 1542916922 -3600
>>>>> #      Thu Nov 22 21:02:02 2018 +0100
>>>>> # Node ID 018578f3ab597d5ea573107e7310470de76a3907
>>>>> # Parent  4628c3cf1fc1052ca25296c8c1a42c4502b59dc9
>>>>> # EXP-Topic perf-ignore-2
>>>>> # Available At https://bitbucket.org/octobus/mercurial-devel/
>>>>> #              hg pull https://bitbucket.org/octobus/mercurial-devel/ -r 018578f3ab59
>>>>> match: avoid translating glob to matcher multiple times for large sets
>>>>>
>>>>> For hgignore with many globs, the resulting regexp might not fit under
>>>>> the 20K length limit. So the patterns need to be broken up in smaller
>>>>> pieces.
>>>>>
>>>> Did you see 0f6a1bdf89fb (match: handle large regexes, 2007-08-19)
>>>> and 59a9dc9562e2 (ignore: split up huge patterns, 2008-02-11)? It might be
>>>> worth trying to figure out what Python versions those commits are talking
>>>> about. Maybe we've dropped support for those versions and we can simplify
>>>> this code.
>>>>
>>> Oh, and what made me do the archaeology there was that you seem to have
>>> lost the handling of OverflowError from the regex engine. As I said above, I
>>> suspect that's fine because we no longer support some very old Python
>>> versions (but please try to figure out what version that refers to). Still,
>>> if we decide to drop that OverflowError handling, I'd prefer to see that in
>>> an explicit commit early in this series.

To me, 0f6a1bdf89fb (catching the error from the engine) is superseded by
59a9dc9562e2 (we cannot trust the engine, so we preemptively raise our own
error).

So I feel it is fine to rely on the size limit alone.

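To illustrate what relying on the size limit alone could look like, here is a
minimal sketch, not the code from the patch. It assumes each glob has already
been translated to a regexp string exactly once, and it greedily packs those
strings into groups whose joined length stays under the limit. The names
`MAX_RE_SIZE`, `group_regexps` and `build_matcher` are hypothetical.

```python
import re

# Assumed value, taken from the "20K length limit" in the patch description.
MAX_RE_SIZE = 20000


def group_regexps(regexps, limit=MAX_RE_SIZE):
    """Yield lists of regexp strings whose joined form fits under `limit`.

    The input strings are assumed to be translated already, so nothing is
    re-translated when a group fills up. A single regexp longer than
    `limit` still gets a group of its own.
    """
    wrapper = len('(?:)')
    group = []
    size = wrapper
    for regexp in regexps:
        piece = len(regexp) + 1  # +1 for the '|' separator
        if group and size + piece > limit:
            yield group
            group = []
            size = wrapper
        group.append(regexp)
        size += piece
    if group:
        yield group


def build_matcher(regexps, limit=MAX_RE_SIZE):
    """Return a function that is true if any compiled group matches."""
    compiled = [re.compile('(?:%s)' % '|'.join(group))
                for group in group_regexps(regexps, limit)]

    def match(path):
        return any(c.match(path) for c in compiled)

    return match
```

The size check happens before compilation, so no exception from the engine is
involved; whether the real code should behave this way is exactly the question
discussed above.
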
>> Perhaps it's been fixed since 2.7.4. The regexp code width is extended
>> from 16bit to 32bit (or Py_UCS4) integer. That should be large enough to
>> handle practical patterns.
>>
>> https://bugs.python.org/issue1160

Thanks for digging this out. It looks like we may be able to drop this
limit altogether. However, I would like to make it a change distinct
from this series.

The current code is very problematic for some people (to the point where
the majority of `hg status` time is spent in that function). I would
like to get fast code with the same semantics first, then look into
changing the semantics.

> That said, combining more chunks of regex patterns might be likely to
> lead to another funny problem.
>
> % python -c 'import re; re.compile("(a)" * 100)'
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "/usr/lib/python2.7/re.py", line 194, in compile
>     return _compile(pattern, flags)
>   File "/usr/lib/python2.7/re.py", line 249, in _compile
>     p = sre_compile.compile(pattern, flags)
>   File "/usr/lib/python2.7/sre_compile.py", line 583, in compile
>     "sorry, but this version only supports 100 named groups"
> AssertionError: sorry, but this version only supports 100 named groups
>
> It's unrelated to the OverflowError issue, but splitting patterns could
> help avoid the 100-named-group problem.

By chance, my current gigantic use case does not involve named groups.

Catching AssertionError will be fun. I wish there were a clean API
to expose and check engine limitations.
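
Without such an API, one option is to probe the engine by attempting the
compile and splitting the pattern list when it refuses. The following is only
a sketch under assumptions: `ENGINE_LIMIT_ERRORS` and `compile_chunks` are
hypothetical names, and the set of exceptions is a guess based on the
AssertionError and OverflowError mentioned in this thread. Note that the
quoted reproduction uses plain `(a)` groups, so the limit seems to count every
capturing group, not only named ones.

```python
import re

# Assumption: the exceptions an engine may raise when a pattern is too big.
# re.error is included, so a genuinely invalid pattern also lands here and is
# re-raised once the list can no longer be split.
ENGINE_LIMIT_ERRORS = (AssertionError, OverflowError, re.error)


def compile_chunks(regexps, compile=re.compile):
    """Compile `regexps` as one alternation, halving the list on failure.

    Returns a list of compiled patterns. `compile` is injectable so a
    stricter engine can be simulated.
    """
    try:
        return [compile('(?:%s)' % '|'.join(regexps))]
    except ENGINE_LIMIT_ERRORS:
        if len(regexps) < 2:
            # A single pattern the engine rejects cannot be split further.
            raise
        mid = len(regexps) // 2
        return (compile_chunks(regexps[:mid], compile)
                + compile_chunks(regexps[mid:], compile))
```

The caller then matches a path against each returned pattern in turn. The cost
is that an oversized set is compiled several times before a workable split is
found, which is the kind of repeated work this series tries to avoid.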
