shrink-revlog bakeoff: preliminary results
Greg Ward
greg-hg at gerg.ca
Thu Feb 4 20:20:12 CST 2010
As threatened a while ago, I've started doing a comprehensive
comparison of the various competing "shrink-revlog" algorithms. You
may recall that the contenders are:
1) branchsort, the original shrink-revlog algorithm (also the default
toposort used by hg convert)
2) parentdelta, written by me in an attempt to mimic what the manifest
log would look like if Mercurial computed deltas relative to the first
parent rather than the previous revision
3) reversepostorder, Benoit Boissonot's rewrite of parentdelta as a
post-order iteration of the reverse-directed graph
4) postorderreverse, Benoit's other idea (reverse post-order of the
reverse-directed graph)
Executive summary:
* parentdelta and reversepostorder appear to be equivalent, both
from experiment and from working through the algorithms on paper (I
suspect a *real* computer scientist could prove that they are
equivalent, but I'm more of a gut instinct computer scientist)
* parentdelta and reversepostorder consistently perform very well,
and are both stable sorts
* branchsort is sometimes a little worse than
parentdelta/reversepostorder, and sometimes much worse
* postorderreverse is unstable, therefore not recommended. But it
sometimes performs markedly better than parentdelta/reversepostorder,
so we might want to keep it around for future examination and
experiment
My opinion:
* we should add the ability for pluggable shrink algorithms so that
it's easy to repeat these kind of experiments in future (I've already
posted the patches, and will repost soon once I've tweaked them a bit)
* we should pick one of parentdelta and reversepostorder as the
default algorithm and ditch the other. No need to have equivalent
algorithms kicking around.
* we should remove branchsort, since it is clearly inferior to
parentdelta/reversepostorder
* we might want to keep postorderreverse in our back pockets: in its
current form it is unsuitable because of its instability, but perhaps
it's possible to improve and make it more consistent. (It
demonstrates that it is possible to do better than parentdelta on at
least some data.)
I have a slight preference for parentdelta, but I'm biased: I wrote
it, so of course I find it easier to understand! Someone other than
Benoit or me should take a look at the code and see which one makes
more sense. They're both pretty subtle algorithms that demand careful
reading, pencil-and-paper working out, and a background in computer
science. ;-) Performance is irrelevant, since shrink-revlog runtime
is totally dominated by the time to rewrite the manifest log.
Anyways, here are detailed results on shrinking the manifest log for
two real-life repos: hg-crew and intelepacs. (That's my employer's
main product, which I recently converted from a branch-mad CVS repo to
a branch-mad hg repo; the tests were done on the immediate
post-CVS-conversion data). I'll do the same for python just as soon
as Dirkjan gets me a new conversion.
First, shrinking hg-crew with the four algorithms:
MiB shrinkage
0) original (unshrunk) 5.3 1.0
1) branchsort 2.8 1.8
2) parentdelta 2.5 2.1
3) reverse postorder 2.5 2.1
4) postorder reverse 4.9 1.1
Conclusion: parentdelta/reversepostorder have a slight edge.
Postorderreverse need not apply.
Next, intelepacs:
MiB shrinkage
0) original (unshrunk) 2636.1 1.0
1) branchsort 467.1 5.6
2) parentdelta 32.9 80.1
3) reverse postorder 32.9 80.1
4) postorder reverse 23.6 111.7
Oh yeah: this is with most almost all CVS branches merged as part of
the conversion. (Either CVS merges were detected and turned into
Mercurial merges, or failing that I just added dummy merges to avoid a
40-head repo.) Without the dummy merges, branchsort turned in a
respectable manifest log of ~55 MB. But parentdelta/reversepostorder
still kicks its you-know-what.
Curiously, postorder reverse really wins big here. But it's
inconsistent (see above) and wildly unstable: if I run it on the
just-shrunk 23.6 MiB manifest log, it blows up to 652 MiB!
Greg
More information about the Mercurial-devel
mailing list