shrink-revlog bakeoff: preliminary results

Benoit Boissinot benoit.boissinot at ens-lyon.org
Wed Feb 10 04:18:48 CST 2010


On Thu, Feb 04, 2010 at 09:20:12PM -0500, Greg Ward wrote:
> Executive summary:
>   * parentdelta and reversepostorder appear to be equivalent, both
> from experiment and from working through the algorithms on paper (I
> suspect a *real* computer scientist could prove that they are
> equivalent, but I'm more of a gut instinct computer scientist)
>   * parentdelta and reversepostorder consistently perform very well,
> and are both stable sorts
>   * branchsort is sometimes a little worse than
> parentdelta/reversepostorder, and sometimes much worse
>   * postorderreverse is unstable, therefore not recommended.  But it
> sometimes performs markedly better than parentdelta/reversepostorder,
> so we might want to keep it around for future examination and
> experiment
> 
> My opinion:
>   * we should add the ability for pluggable shrink algorithms so that
> it's easy to repeat these kind of experiments in future (I've already
> posted the patches, and will repost soon once I've tweaked them a bit)
>   * we should pick one of parentdelta and reversepostorder as the
> default algorithm and ditch the other.  No need to have equivalent
> algorithms kicking around.
>   * we should remove branchsort, since it is clearly inferior to
> parentdelta/reversepostorder
>   * we might want to keep postorderreverse in our back pockets: in its
> current form it is unsuitable because of its instability, but perhaps
> it's possible to improve and make it more consistent.  (It
> demonstrates that it is possible to do better than parentdelta on at
> least some data.)

It is very easy to do so. Postorder is stable, since the parents are
always taken in the same order, for postorder reverse we just need to
find a stable way to pick children (sorting by hashes is a (probably
not interesting) way to do it).

But better would be to find out what happens when it behaves better.
> 
> I have a slight preference for parentdelta, but I'm biased: I wrote
> it, so of course I find it easier to understand!  Someone other than
> Benoit or me should take a look at the code and see which one makes
> more sense.  They're both pretty subtle algorithms that demand careful
> reading, pencil-and-paper working out, and a background in computer
> science.  ;-)  Performance is irrelevant, since shrink-revlog runtime
> is totally dominated by the time to rewrite the manifest log.

I had hoped that the fact I took the textbook algorithm (you'll
find the same algorithm in e.g. wikipedia) and that it is smaller would
convince you ;)


cheers,

Benoit
-- 
:wq


More information about the Mercurial-devel mailing list