[PATCH 3 of 6] revset: do not try to sort revisions unnecessarily by addset._iterator()

Thu May 14 00:38:32 CDT 2015

On 05/13/2015 07:39 AM, Yuya Nishihara wrote:
> On Tue, 12 May 2015 17:52:30 -0700, Pierre-Yves David wrote:
>> On 05/12/2015 03:59 PM, Yuya Nishihara wrote:
>>> # HG changeset patch
>>> # User Yuya Nishihara <yuya at tcha.org>
>>> # Date 1427716479 -32400
>>> #      Mon Mar 30 20:54:39 2015 +0900
>>> # Node ID cc7253d43150ef937aca205f2cb66a569378aeea
>>> # Parent  e487cb80cc333a4d92376daa29c08fa703c5b249
>>> revset: do not try to sort revisions unnecessarily by addset._iterator()
>>>
>>> This fixes addset to remove duplicated revisions from right-hand-side set
>>> even if one of the subset does not support fastasc/fastdesc.
>>>
>>> addset._iterator() did not do the right thing if _ascending is set, because
>>> addset._iterordered() expects that the given sets are sorted. This patch
>>> removes the broken implementation. There's no need for addset._iterator()
>>> to sort revisions because the result will be sorted by _trysetasclist()
>>> as necessary.
>>
>> I agree the current behavior is buggy. But I'm dubious about your the
>> way to solve it.
>>
>> The special casing of 'self._ascending' is a good correctness and
>> optimisation purpose.
>>
>> I see two potential issues here:
>>
>> 1) we use iterator from self._r1 and self._r2 without ensuring they are
>> sorted in the order we need them to be sorted (or using appropriate
>> directed iterator).
>>
>> 2) _iterordered is apparently not doing deduplication.
>>
>> Fixing these two should solve the bug without performance and
>> correctness impact?
>>
>> As far as I understand your patch,  sorting the addset would result into
>> a sorted output most case.
>
> _iterordered() can remove duplicates if both r1 and r2 are sorted. So we have
> to fix (1).

This small patch seems to fix the duplication issue and fix the problem.

diff --git a/mercurial/revset.py b/mercurial/revset.py
--- a/mercurial/revset.py
+++ b/mercurial/revset.py
@@ -3044,4 +3044,6 @@ class addset(abstractsmartset):
              gen = gen()
          else:
+            self._r1.sort(not self._ascending)
+            self._r2.sort(not self._ascending)
              iter1 = iter(self._r1)
              iter2 = iter(self._r2)

> But, when _iterator() is called with _ascending is not None (*1), we know at
> least one of r1 or r2 does not support fastasc/fastdesc. So we'll have to
> consume r1 or r2 anyway to get it sorted. I think the use of _iterordered() had
> little benefit in this case.
>
>   (*1) except when called from __len__(), which can be assumed to be slow

Is _ascending is not None, the iteration is requested to be done in a 
sorted order. I would assume than using _iterator here could give us 
data in the wrong order, but your test seems to disagree. I'll give a 
deeper look at why the code still behave right tomorrow.

In all cases, relying on sorting in the subset seems wise since each 
subset may have its own efficient way to get data in the right order.

>> I also would like you to check performance impact when touching revset
>> code. This means looking at the output from
>> 'contrib/revsetbenchmarks.py' with at least content of
>> 'contrib/revsetbenchmarks.t' (if the help of the script is too bad,
>> point it out and we'll improve that).
>
> Hmm, I got notable difference by "roots((0:tip)::)". This seems to be caused
> by the additional for-loop introduced by "revset: make addset class accepts
> more than two subsets". Perhaps, I'll have to optimize addset.__contains__ and
> _iterordered() for two subsets.

This is fairly hot code path we have to be careful.

Thanks for looking into this.


-- 
Pierre-Yves David