[RFC] [revlog] parent-delta

Benoit Boissinot benoit.boissinot at ens-lyon.org
Mon Aug 31 18:57:22 CDT 2009


After playing with the topo-sort algorithm for revlog shrinking [1],
I've build a script to emulate the result of parent delta [2] on big
revlogs.

The attached script can be used to dump information about a revlog in
the following way:

$ python parent-delta.py dump .hg/store/00manifest.i > dump

This will create a file with some information about the compressed size
of each revision, compressed size of the delta to each parent and the
compressed size of the delta to tip.

Then with:

$ python parent-delta.py emul < dump

the script will output the estimated size of the revlog if we were using
parent delta.

Currently the emulation of delta parent works in the following way:
- it computes all the possibles deltas (parent1, parent2 and tip)
- if the distance between the of the base delta chain is higher than a
  certain threshold ("read factor", the value can be changed in the
  source) it will discard the potential delta.
  For example a threshold of 6 will only allow a given delta if the
  distance between the base offset and the current offset is smaller
  than 6 times the compressed full version.
- from the remaining deltas, it chooses the smallest or insert a full
  (compressed) version

Results with the linux kernel manifest (158k revs, about 29k files):

original manifest:
704803499 bytes (672 MiB)

dump is about 14MB and took some 4 hours to generate (I can try to put
it online if some people are interested).

parent-delta manifest with a "read factor" of 6:
363655853 bytes ( 346.8 MiB)

parent-delta manifest with a "read factor of 4: 
464570388 bytes ( 443.0 MiB)

Notes about read factor:
In our current implementation of tip-delta, the read factor was given in
term of the uncompressed size of the full rev. Can someone give some
insights to explain if using the compressed size is more (or less)
meaningful?

Notes about the topo-sort+parent-delta:
I need to check the percentage of parent-delta that gets inserted (vs.
full revs or tip-delta). But if they are high enough and if we change
the wire protocol to allow parent-delta as well, it would be much
cheaper to do topo-sort on the fly since deltas would not be recomputed
for every rev (as is the case today if we try to do topo-sort during
pull).

Cheers,

Benoit

[1] http://mercurial.markmail.org/search/?q=shrink%20manifest
[2] http://mercurial.selenic.com/wiki/ParentDeltaPlan
-- 
:wq
-------------- next part --------------
A non-text attachment was scrubbed...
Name: parent-delta.py
Type: text/x-python
Size: 2225 bytes
Desc: not available
Url : http://selenic.com/pipermail/mercurial-devel/attachments/20090901/1fcfac54/attachment.py 


More information about the Mercurial-devel mailing list