Something changed shrink-revlog for the worse

Greg Ward greg at gerg.ca
Mon Jan 11 09:34:31 CST 2010


On Sun, Jan 10, 2010 at 1:36 PM, Greg Ward <greg at gerg.ca> wrote:
> Conclusion: there exist small perturbations to the source revlog that
> cause large changes in the size of shrink-revlog's output.  I'm going
> to go spend some pencil-and-paper time now to see if there exists a
> tweak to the toposort() algorithm that does not exhibit this
> behaviour.

Got it!  I wasted a couple of hours yesterday trying to come up with
little heuristic tweaks to the current toposort algorithm, but then
had a brainwave late last night: if the idea is to avoid "suboptimal"
manifest nodes, i.e. nodes whose predecessor is not their first
parent, then start at the heads and work backwards.  Then there's no
guessing required: you just follow the trail of first parents for as
long as you can.

I implemented this algorithm this morning and it's a big win.

Repo #1: dummy merges do not "close the board down", so current
toposort works pretty well, shrinking the manifest log from 2638 MB to
57 MB.  My new toposort shrinks to 52 MB: nice.  I was hoping it would
be no worse, and am pleasantly surprised that it's a bit better.

Repo #2: 30 dummy merges interrupt the flow of the trunk and reduce
toposort's options.  Current toposort shrinks from 2638 MB to 468 MB:
yuck.  My new toposort shrinks to 32 MB: awesome!!  I really did not
expect this; I thought it would be much better than 468 MB, but really
did not expect it to do *better* than repo #1.  A pleasant surprise
indeed.

Here is a rough patch.  I'd like some feedback on the algorithm first,
then I'll send a real patch.

description:
shrink: new toposort() which tries to avoid 'suboptimal' nodes.

A 'suboptimal' node is one whose first parent is not its predecessor,
i.e. one where revlog will compute a possibly large delta.  The
theory is that minimizing suboptimal nodes will result in a space-
optimal sort.  I have no evidence for this, but it feels right.


diff -r eb5bccc2d412 -r 3374976436bf contrib/shrink-revlog.py
--- a/contrib/shrink-revlog.py	Mon Jan 11 09:09:12 2010 -0500
+++ b/contrib/shrink-revlog.py	Mon Jan 11 10:32:34 2010 -0500
@@ -103,6 +103,77 @@
     ui.write('%d p2 is predecessor\n' % len(p2pred))
     return ret

+def toposort(ui, rl):
+    # map rev to list of parent revs (p2 first)
+    parents = {}
+    children = {}
+    ui.write('reading %d revs ' % len(rl))
+    try:
+        for rev in rl:
+            (p1, p2) = rl.parentrevs(rev)
+            if p1 == p2 == node.nullrev:
+                parents[rev] = ()       # root node
+            elif p1 == p2 or p2 == node.nullrev:
+                parents[rev] = (p1,)    # normal node
+            else:
+                parents[rev] = (p2, p1) # merge node
+
+            children[rev] = []
+
+            for p in parents[rev]:
+                assert p in children
+                children[p].append(rev)
+
+            if rev % 1000 == 0:
+                ui.write('.')
+    finally:
+        ui.write('\n')
+
+    # start at the heads and work backwards: this lets us deliberately
+    # avoid suboptimal nodes, i.e. nodes whose predecessor is not
+    # their first parent
+    ui.write('sorting ...')
+    stack = map(rl.rev, rl.heads())
+    stack.sort()                        # will pop tip first
+    pop = stack.pop
+    push = stack.append
+
+    suboptimal = 0
+    p2pred = 0
+
+    result = []                         # reverse topo sort
+    while stack:
+        cur = pop()
+
+        # if cur is not the first parent of its successor, then the
+        # successor is a suboptimal node
+        if result:
+            successorparents = rl.parentrevs(result[-1])
+            if successorparents[0] != cur:
+                suboptimal += 1
+            if successorparents[1] != node.nullrev and
successorparents[1] == cur:
+                p2pred += 1
+
+        result.append(cur)
+
+        for p in parents.pop(cur):
+            skip = False
+            for c in children[p]:
+                if c in parents:        # have not seen c yet
+                    skip = True
+                    break
+            if skip:
+                continue
+
+            push(p)
+
+    ui.write('\n')
+    ui.write('%d suboptimal\n' % suboptimal)
+    ui.write('%d p2 is predecessor\n' % p2pred)
+
+    result.reverse()
+    return result
+
 def writerevs(ui, r1, r2, order, tr):

     ui.write('writing %d revs ' % len(order))



Greg


More information about the Mercurial-devel mailing list