[PATCH v4] strip: incrementally update the branchheads cache after a strip

Tue May 29 15:39:29 CDT 2012

comments inline

On 5/24/12 4:10 PM, "Matt Mackall" <mpm at selenic.com> wrote:

>On Wed, 2012-05-23 at 14:15 -0700, Joshua Redstone wrote:
>> # HG changeset patch
>> # User Joshua Redstone <joshua.redstone at fb.com>
>> # Date 1337370347 25200
>> # Node ID cb504dab484a34f21656fc594f0a58f65a653a23
>> # Parent  d0b9ebba41e9a1733294d5fa1b497ada5eda93c8
>> strip: incrementally update the branchheads cache after a strip
>> 
>> This function augments strip to incrementally update the branchheads
>>cache
>> rather than recompute it from scratch.  This speeds up the performance
>>of strip
>> and rebase on repos with long history.  The performance optimization
>>only
>> happens if the revisions stripped are all on the same branch and the
>>parents of
>> the stripped revisions are also on that same branch.
>> 
>> This also fixes a small performance bug in
>>localrepo._updatebranchcache.  That
>> function would iterate over new candidate heads in order from oldest to
>>newest,
>> in contrast to the comment that observes that iterating from newest to
>>oldest
>> results in fewer pases over the set of reachable nodes.
>> 
>> This adds a few test cases, particularly one that reproduces the extra
>>heads
>> that mpm observed.
>> 
>> diff -r d0b9ebba41e9 -r cb504dab484a hgext/mq.py
>> --- a/hgext/mq.py	Sun May 20 14:40:36 2012 -0500
>> +++ b/hgext/mq.py	Fri May 18 12:45:47 2012 -0700
>> @@ -3449,7 +3449,7 @@
>>              if start < qbase:
>>                  # update the cache (excluding the patches) and save it
>>                  ctxgen = (self[r] for r in xrange(lrev + 1, qbase))
>> -                self._updatebranchcache(partial, ctxgen)
>> +                self._updatebranchcache(partial, ctxgen, ctxisnew=True)
>
>I should probably add a note about keyword args to CodingStyle,
>something like:
>
>"Avoid using keyword and default arguments wherever reasonable. Use
>positional args for arguments that are almost always used."

Yeah, this also goes away because of your observation below.

>
>>                  self._writebranchcache(partial, cl.node(qbase - 1),
>>qbase - 1)
>>                  start = qbase
>>              # if start = qbase, the cache is as updated as it should
>>be.
>> @@ -3458,7 +3458,7 @@
>>  
>>              # update the cache up to the tip
>>              ctxgen = (self[r] for r in xrange(start, len(cl)))
>> -            self._updatebranchcache(partial, ctxgen)
>> +            self._updatebranchcache(partial, ctxgen, ctxisnew=True)
>>  
>>              return partial
>>  
>> diff -r d0b9ebba41e9 -r cb504dab484a mercurial/discovery.py
>> --- a/mercurial/discovery.py	Sun May 20 14:40:36 2012 -0500
>> +++ b/mercurial/discovery.py	Fri May 18 12:45:47 2012 -0700
>> @@ -201,7 +201,7 @@
>>          # 4. Update newmap with outgoing changes.
>>          # This will possibly add new heads and remove existing ones.
>>          ctxgen = (repo[n] for n in outgoing.missing)
>> -        repo._updatebranchcache(newmap, ctxgen)
>> +        repo._updatebranchcache(newmap, ctxgen, ctxisnew=True)
>>  
>>      else:
>>          # 1-4b. old servers: Check for new topological heads.
>> diff -r d0b9ebba41e9 -r cb504dab484a mercurial/localrepo.py
>> --- a/mercurial/localrepo.py	Sun May 20 14:40:36 2012 -0500
>> +++ b/mercurial/localrepo.py	Fri May 18 12:45:47 2012 -0700
>> @@ -480,7 +480,7 @@
>>          tiprev = len(self) - 1
>>          if lrev != tiprev:
>>              ctxgen = (self[r] for r in xrange(lrev + 1, tiprev + 1))
>> -            self._updatebranchcache(partial, ctxgen)
>> +            self._updatebranchcache(partial, ctxgen, ctxisnew=True)
>>              self._writebranchcache(partial, self.changelog.tip(),
>>tiprev)
>>  
>>          return partial
>> @@ -550,6 +550,9 @@
>>                      continue
>>                  node, label = l.split(" ", 1)
>>                  label = encoding.tolocal(label.strip())
>> +                if not node in self:
>> +                    raise ValueError('invalidating branch cache
>>because node '+
>> +                                     '%s does not exist' % node)
>>                  partial.setdefault(label, []).append(bin(node))
>>          except KeyboardInterrupt:
>>              raise
>> @@ -570,7 +573,14 @@
>>          except (IOError, OSError):
>>              pass
>>  
>> -    def _updatebranchcache(self, partial, ctxgen):
>> +    def _updatebranchcache(self, partial, ctxgen, ctxisnew):
>> +        """Given a branchhead cache, partial, that may have extra
>>nodes or be
>> +        missing heads, and a generator of nodes that are at least a
>>superset of
>> +        heads missing, this function updates partial to be correct.
>> +
>> +        ctxisnew is a bool specifying whether the nodes produced by
>>ctxgen are
>> +        guaranteed to not be ancestors of any existing heads.
>
>But this is neither a keyword arg nor a default. I'm actually a little
>surprised this works.
>
>> +        """
>>          # collect new branch entries
>>          newbranches = {}
>>          for c in ctxgen:
>> @@ -581,21 +591,55 @@
>>          for branch, newnodes in newbranches.iteritems():
>>              bheads = partial.setdefault(branch, [])
>>              bheads.extend(newnodes)
>> -            if len(bheads) <= 1:
>> -                continue
>> -            bheads = sorted(bheads, key=lambda x: self[x].rev())
>> -            # starting from tip means fewer passes over reachable
>> -            while newnodes:
>> -                latest = newnodes.pop()
>> -                if latest not in bheads:
>> -                    continue
>> -                minbhnode = self[bheads[0]].node()
>> -                reachable = self.changelog.reachable(latest, minbhnode)
>> -                reachable.remove(latest)
>> -                if reachable:
>> -                    bheads = [b for b in bheads if b not in reachable]
>> +            # Remove duplicates - nodes that are in newnodes and are
>>already in
>> +            # bheads.  This can happen if you strip a node whose
>>parent was
>> +            # already a head (because they're on different branches).
>> +            bheads = set(bheads)
>
>That comment's a bit excessive for a standard idiom.
>
>> +            # Remove candidate heads that no longer are in the repo
>>(e.g., as
>> +            # the result of a strip that just happened).  Avoid using
>>'bhead in
>> +            # self' here because that dives down into branchcache code
>>somewhat
>> +            # recrusively.
>
>recursively
>
>> +            bheads = [bhead for bhead in bheads \
>
>\ isn't needed here
>
>> +                          if self.changelog.hasnode(bhead)]
>> +            if len(bheads) > 1:
>> +                bheads = sorted(bheads, key=lambda x: self[x].rev())
>
>Here's some evidence that it might be better to do this whole thing in
>rev-space and convert back to node-space when we're done.
>
>Notably, ctxisnew is pretty trivial to determine from looking at revs:
>if the new nodes are all greater than max(bheadrevs), then it's true.
>And that hides this detail from the rest of the code.
>
>(The rule here, of course, is that if x is a descendant of y, x.rev() >
>y.rev(), which is the intrinsic rule of a topological ordering. Lots of
>algorithms can be simplified by taking advantage of the fact that we
>have a topological ordering handy. But you have to be careful: you can't
>infer any other relationships!)

For some reason I had the impression that sometimes descendants had lower
rev numbers.  Not sure where I got that from.
Yeah, if rev #'s of descendants are always higher, then I can simplify as
you suggest.

>
>> +                # starting from tip means fewer passes over reachable.
>> +                #
>> +                # If we know the new candidates are not ancestors of
>>existing
>> +                # heads, we don't have to examine ancestors of
>>existing heads
>> +                if ctxisnew:
>> +                    iternodes = sorted(newnodes, key=lambda x:
>>self[x].rev(),
>> +                                       reverse=True)
>> +                else:
>> +                    iternodes = sorted(bheads, key=lambda x:
>>self[x].rev(),
>> +                                       reverse=True)
>
>iternodes = bheads
>if ctxisnew:
>    iternodes = newnodes
>iternodes = sorted(iternodes,...)
>
>> +                # This loop prunes out two kinds of heads - heads that
>>are
>> +                # superceded by a head in newnodes, and newnodes that
>>are not
>> +                # heads because an existing head is their descendant.
>> +                while iternodes:
>> +                    latest = iternodes.pop()
>> +                    if latest not in bheads:
>> +                        continue
>> +                    minbhnode = self[bheads[0]].node()
>> +                    reachable = self.changelog.reachable(latest,
>>minbhnode)
>
>The reachable method is pretty unfortunate here as it's not a generator,
>and goes from node-space to rev-space and back, and it only works on one
>rev at a time. You could do this all in one pass with revlog.ancestors.
>It doesn't take a stop point, but since it's a generator, you can stop
>whenever you want.
>
>> +        # There may be branches that cease to exist when the last
>>commit in the
>> +        # branch was stripped.  This code filters them out.  Note that
>>the
>> +        # branch that ceased to exist may not be in newbranches because
>> +        # newbranches is the set of candidate heads, which when you
>>strip the
>> +        # last commit in a branch will be the parent branch.
>> +        for branch in partial.keys():
>> +            nodes = [head for head in partial[branch] \
>> +                         if self.changelog.hasnode(head)]
>> +            if len(nodes) < 1:
>> +                del partial[branch]
>
>This is mysterious to me. Let's suppose we have a history that looks
>like:
>
>a-b-b-b-b-b-b-b-a
>              ^ ^  <- branch heads
>
>..and we strip the rightmost a. How do we discover that there was an
>earlier a without visiting each changeset? In particular, note that we
>have no local information to distinguish from this case:
>
>b-b-b-b-b-b-b-b-a
>              ^ ^  <- branch heads
>
>I really don't believe we can handle all cases efficiently (ie by just
>looking at heads and the DAG).
>
>So I think the strategy here should be to incrementally fix one case at
>a time, while leaving the visit-all-history fallback for correctness.
>So:

Yeah, that's what we do.  This code is only called in the case when we
know the entire set of resulting branchheads.  For more complicated cases,
the slow path is executed.  The logic deciding that is in repair.py part
of the diff.

The block of code above is just removing entires from the hash table that
have arrays of zero length.  It's not making any subtle inferences (I
think).

I'll hopefully get an updated patch out soon - also addressing all your
other comments in it.

>
>- introduce ALL the tests (which should work correctly but slowly today)
>- convert to rev-based logic
>- introduce any external-to-case
>- fix case A
>- fix case B
>- fix case C
>- ...
>
>-- 
>Mathematics is the supreme nostalgia of our time.
>
>