[PATCH] revset: improve performance of _generatorset.contains (issue 4201)

Tue Mar 25 18:13:20 CDT 2014

On 3/25/14 4:12 PM, "Gregory Szorc" <gregory.szorc at gmail.com> wrote:

>On 3/25/14, 3:38 PM, Durham Goode wrote:
>> On 3/25/14 2:26 PM, "Gregory Szorc" <gregory.szorc at gmail.com> wrote:
>>
>>> On 3/25/14, 12:31 PM, Augie Fackler wrote:
>>>> On Tue, Mar 25, 2014 at 05:42:43PM +0000, Durham Goode wrote:
>>>>> On 3/24/14 8:01 PM, "Gregory Szorc" <gregory.szorc at gmail.com> wrote:
>>>>>
>>>>>> # HG changeset patch
>>>>>> # User Gregory Szorc <gregory.szorc at gmail.com>
>>>>>> # Date 1395716418 25200
>>>>>> #      Mon Mar 24 20:00:18 2014 -0700
>>>>>> # Node ID 6f64e244c57706cfb123c32c4fadef63690eed40
>>>>>> # Parent  3879ac3858ffd9bb46e19fcc3a2b31d7bb2b54c5
>>>>>> revset: improve performance of _generatorset.__contains__ (issue
>>>>>>4201)
>>>>>
>>>>>
>>>>> Looks good to me.  I tried it in our repo using this revset:
>>>>
>>>> Agreed. Queued.
>>>
>>> FWIW, I've been running with this patch all day today and I've noticed
>>> performance improvements all across the board. Rebases are obviously
>>> faster. Evolve is much faster as well. Mercurial no longer feels
>>> sluggish when operating on my 200,000+ changeset repo.
>>
>> I just found a bug in this patch.  In:
>>
>>      def __iter__(self):
>>          if self._iterated:
>>              # At least a part of the list should be cached if
>>iteration has
>>              # started over the generatorset.
>>              for l in self._genlist:
>>                  yield l
>>
>>          for item in self._consumegen():
>>              yield item
>>
>>      def _consumegen(self):
>>          self._iterated = True
>>
>>          for item in self._gen:
>>              self._cache[item] = True
>>              self._genlist.append(item)
>>              yield item
>>
>>
>> If two revsets are running _consumegen on the same revset at the same
>> time, they could both be stepping through self._gen inside of their
>> respective _consumegen calls.  Causing them to see different, mutually
>> exclusive outputs of _consumegen.
>>
>> I noticed it due to changes I'm making to improve root() and
>> _descendants(), so I don't have a repro for @ at the moment.  I'll try
>>to
>> think up a fix.
>
>Oh, hmmm.
>
>I think you meant to say "respective __iter__ calls" right? I don't see
>the problem of having multiple consumers of _consumegen, as each
>_consumegen frame will be consuming from the same underlying generator
>and the results of consuming from that generator should be idempotent,
>regardless of which consumer was doing it.
>
>I see a few solutions:
>
>1) Check the expected length of self._genlist after each item from
>_consumegen is obtained in __iter__. If we didn't resume where we left
>off (someone else consumed via _consumegen), yield the missing elements
>first.
>
>2) Implement the iterator protocol manually. See
>https://urldefense.proofpoint.com/v1/url?u=http://docs.python.org/2/librar
>y/stdtypes.html%23generator-types&k=ZVNjlDMF0FElm4dQtryO4A%3D%3D%0A&r=pHOG
>6Hz51SkYmYr%2FxoTFzw%3D%3D%0A&m=vZ%2FOy4UTZKbqEw13a2p0h7jGROCW40ZPTQADNyt1
>0ok%3D%0A&s=171f576db00836698a9903475016ff0a991c99b2db8604948fde0b6f12e27c
>c6. 
>Essentially, "yield" does magical things. We could implement our own
>iterator type and/or perform manual traversal of the underlying generator.
>
>I think #2 - while more complicated - might be more readable. Although
>it involves light Python magic.
>

I've already created a patch that does #1.  Running tests on it now.  I'll
either send it out by itself, or with my other perf patches.

[PATCH] revset: improve performance of _generatorset.__contains__ (issue 4201)

[PATCH] revset: improve performance of _generatorset.contains (issue 4201)