Revset-based implementation of log command

Tue Apr 17 12:05:20 CDT 2012

On Tue, 2012-04-17 at 18:36 +0200, Patrick Mézard wrote:
> Le 16/04/12 20:06, Matt Mackall a écrit :
> > On Fri, 2012-04-13 at 22:15 +0200, Patrick Mézard wrote:
> >> Le 12/04/12 22:49, Matt Mackall a écrit :
> >>> On Thu, 2012-04-12 at 21:52 +0200, Patrick Mézard wrote:
> >>>> [3] could be fixed by turning revset.match() into a generator. It is
> >>>> not clear which revset functions can operate on a "subset" generator
> >>>> rather than a list but the simple ones like user() definitely can.
> >>>> Also, I have no idea how much overhead this will add compared to the
> >>>> list version.
> >>>
> >>> Any predicate can choose to turn its args from generators into lists, if
> >>> need be, so it shouldn't be all that hard architecturally to get some of
> >>> this working.
> >>
> >> To make this valuable for log command we need to turn the revsets used
> >> by graphlog._makelogrevset() into pure "filters", that is generator
> >> loops over the "subset" variable. It looks difficult for two of them:
> >> notset() and orset().
> >>
> >> notset() is used for --no-merge and --prune. One solution is to
> >> compute the negated expression set not over "subset" but over all
> >> revisions. This would not affect --prune much. And if we really want
> >> speed for --no-merge, we can write a dedicated revset implementing it
> >> directly instead of negating a revision set.
> > 
> > I agree that notset is tricky to turn into an iterator. The primary
> > difficulty is that given an iterator i for a predicate x, returning
> > things not in i without consuming all of i is hard unless x has helpful
> > properties like being sorted. So if we have a revset:
> 
> [...]
> 
> > Which all suggests that we could teach the optimizer to look at the
> > parse tree of not's subexpression and decide that a) it's going to be
> > expensive to evaluate and b) it's independent, so we use
> > independentnotset on it.
> 
> I agree with all this. At this time, I was more concerned about the
> two I was using to minimize the work to refactor log command. All
> these problems are pretty similar to what you get when dealing with an
> SQL optimizer.
>  
> > Which brings us back to not(merge()). Evaluating this in the current
> > revset code means visiting the entire index and checking for second
> > parents, but that's quite cheap, and it seems worth doing all the work
> > up front in any case. But for not(user(x)), we want to go the
> > independent route because the subexpression is expensive.
> > 
> >> orset() cannot be turned into a filter in the general case. Here, we
> >> often have an additional property: input revisions are sorted by
> >> value. In this case, we can write an orset() version computing both
> >> "or" branches on "subset" (instead of subset and (subset - "first
> >> branch output")) and use the revision order to merge back the
> >> iterable, then preserving the order again. Again, this would be a
> >> dedicated revset.
> > 
> > I don't see why this can't be an iterable. If we do 'x or y', we simply
> > need to yield all the elements of x before we start yielding element in
> > 'y - x'. And that just means we need to remember the elements of we've
> > seen of x as we go. Even if either set is a non-independent revset.
> 
> Right, and in fact I did this.
> 
> But for log command I need another property I stupidly removed from my
> previous post: I need orset() to preserve input order. If not, stuff
> like "--user foo --user bar" returns two concatenated stripes of
> history which is a regression, unless we sort at the end and lose the
> iterable property again. That is what my more constrained version
> deals with.

I think if we have "x or y" and both are independent revsets as defined
above, we can still make an orset() iterator that avoids duplicate
checks:

# both subsets are independent, iterate element by element
for r in subset:
  # the optimizer has arranged that checking x is cheaper than y
  if r in getset(repo, [r], x):
    yield r
  elif r in getset(repo, [r], y):
    yield r

Otherwise, yes, we have to collect the full subsets to calculate an or
with the query order preserved. I'm not sure whether we can/should
redefine orset() to preserve that order though.

-- 
Mathematics is the supreme nostalgia of our time.