[PATCH] revset: improve performance of _generatorset.__contains__ (issue 4201)

Gregory Szorc gregory.szorc at gmail.com
Tue Mar 25 18:12:01 CDT 2014


On 3/25/14, 3:38 PM, Durham Goode wrote:
> On 3/25/14 2:26 PM, "Gregory Szorc" <gregory.szorc at gmail.com> wrote:
>
>> On 3/25/14, 12:31 PM, Augie Fackler wrote:
>>> On Tue, Mar 25, 2014 at 05:42:43PM +0000, Durham Goode wrote:
>>>> On 3/24/14 8:01 PM, "Gregory Szorc" <gregory.szorc at gmail.com> wrote:
>>>>
>>>>> # HG changeset patch
>>>>> # User Gregory Szorc <gregory.szorc at gmail.com>
>>>>> # Date 1395716418 25200
>>>>> #      Mon Mar 24 20:00:18 2014 -0700
>>>>> # Node ID 6f64e244c57706cfb123c32c4fadef63690eed40
>>>>> # Parent  3879ac3858ffd9bb46e19fcc3a2b31d7bb2b54c5
>>>>> revset: improve performance of _generatorset.__contains__ (issue 4201)
>>>>
>>>>
>>>> Looks good to me.  I tried it in our repo using this revset:
>>>
>>> Agreed. Queued.
>>
>> FWIW, I've been running with this patch all day today and I've noticed
>> performance improvements all across the board. Rebases are obviously
>> faster. Evolve is much faster as well. Mercurial no longer feels
>> sluggish when operating on my 200,000+ changeset repo.
>
> I just found a bug in this patch.  In:
>
>      def __iter__(self):
>          if self._iterated:
>              # At least a part of the list should be cached if iteration has
>              # started over the generatorset.
>              for l in self._genlist:
>                  yield l
>
>          for item in self._consumegen():
>              yield item
>
>      def _consumegen(self):
>          self._iterated = True
>
>          for item in self._gen:
>              self._cache[item] = True
>              self._genlist.append(item)
>              yield item
>
>
> If two revsets are running _consumegen on the same revset at the same
> time, they could both be stepping through self._gen inside of their
> respective _consumegen calls.  Causing them to see different, mutually
> exclusive outputs of _consumegen.
>
> I noticed it due to changes I'm making to improve root() and
> _descendants(), so I don't have a repro for @ at the moment.  I'll try to
> think up a fix.

Oh, hmmm.

I think you meant to say "respective __iter__ calls" right? I don't see 
the problem of having multiple consumers of _consumegen, as each 
_consumegen frame will be consuming from the same underlying generator 
and the results of consuming from that generator should be idempotent, 
regardless of which consumer was doing it.

I see a few solutions:

1) Check the expected length of self._genlist after each item from 
_consumegen is obtained in __iter__. If we didn't resume where we left 
off (someone else consumed via _consumegen), yield the missing elements 
first.

2) Implement the iterator protocol manually. See 
http://docs.python.org/2/library/stdtypes.html#generator-types. 
Essentially, "yield" does magical things. We could implement our own 
iterator type and/or perform manual traversal of the underlying generator.

I think #2 - while more complicated - might be more readable. Although 
it involves light Python magic.


More information about the Mercurial-devel mailing list