[PATCH 1 of 7] revset: introduce a _revinterval type and revsubset function

Matt Mackall mpm at selenic.com
Fri Jun 22 17:37:42 CDT 2012


On Fri, 2012-06-22 at 17:04 -0400, Bryan O'Sullivan wrote:
> On Mon, Jun 18, 2012 at 6:34 PM, Matt Mackall <mpm at selenic.com> wrote:
>         > +    def __getitem__(self, i):
>         > +        l = len(self)
>         > +        if isinstance(i, slice):
>         
>         
>         This method looks pretty heavy for a function we'll be
>         calling in an inner loop. Do we actually use slices?
> 
> 
> I would have sworn the answer was yes, as I have a strong memory of
> adding the slice check due to some obscure test failing. And yet when
> I try it now with that block disabled, nothing is failing. Go figure.
>  
>         If we take something like:
>         
>          r = [x for x in a if x in b]
>         
>         ..I'm rather skeptical that this sort of implementation
>         can beat the native set implementation unless a is much
>         smaller than b. That might happen a lot, but it's
>         worrisome.
> 
> 
> It is very common that b is set(range(len(repo))), in which case
> merely constructing the set is nastily expensive in time and memory on
> even a moderately large repo.

I believe it. But when a is _also_ large, it's not clear that it's a
win, due to the expense of _revinterval.__contains__.

Let's assume:

 len(a) = len(b) = n

Now compare:

 setb = set(b) # O(n) with small constant
 [x for x in a if x in setb] # O(n) with small constant

with:

 intb = interval(b) # O(1)!
 [x for x in a if x in intb] # O(n) with large constant

For your version, if time(x in intb) > ~2 * time(x in setb), we can't
win even if time(interval(b)) = 0: the set version does roughly 2n
cheap operations (n to build the set, n to probe it), while the
interval version does n expensive ones.

And I strongly suspect the cost of your __contains__ function is much
larger than 2x the built-in set method or the set constructor because
it's a nontrivial Python member function call. Just for starters, every
single self.foo lookup inside the call will be as slow as x in setb.

So while it wins for small a with large b, it loses for len(a) ~= len(b)
or len(a) > len(b). And I'm not really sure that len(a) << len(b) on
average.
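
If someone wants to put numbers on that, a quick micro-benchmark along
these lines should show it (untested sketch; interval here is just a
stand-in for a Python-level __contains__, not the actual _revinterval):

 import timeit

 class interval(object):
   # minimal stand-in for an interval type with a Python-level
   # __contains__
   def __init__(self, start, stop):
     self.start = start
     self.stop = stop
   def __contains__(self, x):
     # two self.foo lookups plus two comparisons on every probe
     return self.start <= x < self.stop

 n = 100000
 a = range(n)
 setb = set(a)
 intb = interval(0, n)

 for expr in 'x in setb', 'x in intb':
   t = timeit.timeit('for x in a: %s' % expr,
                     setup='from __main__ import a, setb, intb',
                     number=10)
   print expr, t

Note the second timing includes neither the set(b) construction nor
any allocation, and I'd still expect it to lose.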

> Honestly, this does all really really want to be a custom extension
> type written in C, using a tree of contiguous ranges. This would
> preserve order, maintain good cache performance, and keep memory use
> as low as possible.

I certainly think having a fancier data structure than a list or a set
is an interesting notion. But I don't know that we actually have many
queries that need a full interval tree. We tend to have either:

- a big fat range
- a sparse set

So we really just need a way to efficiently do these four ops:

 filter(sparse, sparse) # like most of the current code
 filter(sparse, range) # like your code
 filter(range, range) # this could be really fast
 filter(range, sparse) # tricky

Now imagine a type that looks like:

 class rset(object):
   def __init__(self, iterable=None, start=None, stop=None):
     # takes either a concrete iterable or interval endpoints
     ...
   def __len__(self):
     ...
   def list(self):
     # materializes (and caches) a list on demand
     ...
   def set(self):
     # materializes (and caches) a set on demand
     ...
   def filter(self, rset2):
     # picks one of the four strategies above
     ...

This can make its own decision about which of the four filter strategies
to use, and create and cache list and set objects as needed from
whatever you pass the constructor. And when you happen to filter an
interval by an interval, it can be O(1).

(Note intentional lack of "poking about inside" and "looking like a
normal iterable" methods.)
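
To make the dispatch concrete, here's a rough, untested sketch of how
filter might choose a strategy (all internals illustrative, not a
patch):

 class rset(object):
   def __init__(self, iterable=None, start=None, stop=None):
     # sparse form: a concrete list; range form: endpoints [start, stop)
     self._list = list(iterable) if iterable is not None else None
     self._start, self._stop = start, stop
     self._set = None
   def _isrange(self):
     return self._list is None
   def __len__(self):
     if self._isrange():
       return max(0, self._stop - self._start)
     return len(self._list)
   def list(self):
     if self._isrange():
       return range(self._start, self._stop)
     return self._list
   def set(self):
     # build a set only when someone actually needs one, then cache it
     if self._set is None:
       self._set = set(self.list())
     return self._set
   def filter(self, rset2):
     if self._isrange() and rset2._isrange():
       # range/range: intersect endpoints, O(1), nothing materialized
       return rset(start=max(self._start, rset2._start),
                   stop=min(self._stop, rset2._stop))
     if rset2._isrange():
       # sparse/range: two comparisons per element, no set at all
       return rset([x for x in self._list
                    if rset2._start <= x < rset2._stop])
     # range/sparse and sparse/sparse: fall back to a set of rset2
     s = rset2.set()
     return rset([x for x in self.list() if x in s])

The nice property is that filtering a range by a range never
materializes a list or a set at all.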

-- 
Mathematics is the supreme nostalgia of our time.



