[RFC] [revlog] parent-delta

Matt Mackall mpm at selenic.com
Mon Aug 31 20:19:53 CDT 2009


On Tue, 2009-09-01 at 01:57 +0200, Benoit Boissinot wrote:
> After playing with the topo-sort algorithm for revlog shrinking [1],
> I've build a script to emulate the result of parent delta [2] on big
> revlogs.
> 
> The attached script can be used to dump information about a revlog in
> the following way:
> 
> $ python parent-delta.py dump .hg/store/00manifest.i > dump
> 
> This will create a file with some information about the compressed size
> of each revision, compressed size of the delta to each parent and the
> compressed size of the delta to tip.
> 
> Then with:
> 
> $ python parent-delta.py emul < dump
> 
> the script will output the estimated size of the revlog if we were using
> parent delta.
> 
> Currently the emulation of delta parent works in the following way:
> - it computes all the possibles deltas (parent1, parent2 and tip)
> - if the distance between the of the base delta chain is higher than a
>   certain threshold ("read factor", the value can be changed in the
>   source) it will discard the potential delta.
>   For example a threshold of 6 will only allow a given delta if the
>   distance between the base offset and the current offset is smaller
>   than 6 times the compressed full version.
> - from the remaining deltas, it chooses the smallest or insert a full
>   (compressed) version
> 
> Results with the linux kernel manifest (158k revs, about 29k files):
> 
> original manifest:
> 704803499 bytes (672 MiB)
> 
> dump is about 14MB and took some 4 hours to generate (I can try to put
> it online if some people are interested).
> 
> parent-delta manifest with a "read factor" of 6:
> 363655853 bytes ( 346.8 MiB)
> 
> parent-delta manifest with a "read factor of 4: 
> 464570388 bytes ( 443.0 MiB)

Hmm, that's a bit disappointing. In the limit of high read factors, this
should do better than toposort (better because at infinity, no full
versions are needed). But I was hoping that we'd converge on toposort
compression levels with read factors of around this size. 

In highly linear manifest graphs, runs of thousands of small deltas are
between full versions are common, even with an 'uncompressed read
factor' of only 2. Unfortunately, the kernel has evolved a development
style where it regularly holds off merges for months, then merges 10k
csets from dozens of different branches all at once, so each branch
merge has a large revision gap to the conceptually nearer of its
parents. 

How large does the read factor have to be to be comparable to toposort?
It'd be interesting to get 4 or 5 points on this curve (2, 4, 6, 10,
16?). I suspect at some factor we'll get to a point where the 'merge
window' fits inside the 'read factor' and compression will suddenly jump
substantially.

Note here that the compression of revision x can depend pretty
sensitively on the compression of revision x-1, especially if they're on
different branches. If x-1 gets stored as a full revision, this might
force x, x+1, x+2 to store full revisions as well, as their base
revisions are now too far away for delta storage.

> Notes about read factor:
> In our current implementation of tip-delta, the read factor was given in
> term of the uncompressed size of the full rev. Can someone give some
> insights to explain if using the compressed size is more (or less)
> meaningful?

I chose uncompressed so that the bound would be a (small) constant
multiple of the uncompressed filesize, in other words O(1). Also, we
don't have the compressed full size readily available in
revlog.addrevision because we compute it (expensive) after deciding not
to use a delta (the common case). 

Also note that it might make sense to make read factor a tunable. On
SSD, seek time and transfer rate are much smaller issues than disk space
usage (at least for now).

-- 
http://selenic.com : development and support for Mercurial and Linux




More information about the Mercurial-devel mailing list