size of repository with many branches, vs. git

Sun Mar 30 18:12:36 CDT 2008

On Mon, 2008-03-31 at 00:24 +0200, Benoit Boissinot wrote:
> On Sun, Mar 30, 2008 at 11:04 PM, Matt Mackall <mpm at selenic.com> wrote:
> >  Here's a quick test to see what we might stand to gain from non-linear
> >  deltas on your repo:
> >
> >
> >  ie we save nearly 70% of the manifest size. This is only an
> >  approximation, because it doesn't include insertion of full revisions at
> >  intervals to keep extraction time bounded.
> >
> >
> 
> I modified it a bit to always do a delta in the delta vs tip case and
> I got (for the manifest of the kernel repo):

I'm not sure what you mean. All revisions are deltas against parent,
with no full revisions?

> 85000 226463816 43403672 19.17
> 86000 228332602 43588129 19.09
> 87000 230034416 43800074 19.04
> 88000 232135027 44029133 18.97
>      88247 232664773 44065644 18.94

I got basically the same result with the script I posted. Looks like the
full revisions account for a fair amount of the size.

I've written a modified version that actually puts in the full
revisions. It has an adjustable "factor" controlling the maximum amount
of data to reconstruct a revision. In stock Mercurial, it's 2. In the
above tests, it's infinite. With it set to 4 (and deltas against p1
only), I get the following on the kernel repo:

88011 283551764 64355029 22.70
103 11 82010 5991

5991 = number of nonlinear deltas (~7%)
103 = number of full revisions in the original
11 = number in the optimized version

So, not far from the asymptotic maximum, with 2x the current I/O.

One thing to note here is that we don't want deltas to pass the previous
full revision, otherwise we'll effectively have multiple interleaved
revlogs with their effective factors divided by the number of parallel
streams. I tested that and the result turned out to be worse!

(Also, because most of these deltas are still linear, we may be able to
get most of the benefits of this without modifying the wire protocol. Or
maybe not.)

Here's my new script:

----
#!/usr/bin/python

import sys
from mercurial import revlog

r = revlog.revlog(open, sys.argv[1])
frags = []
total = 0
optimized = 0
factor = 4
origfull = newfull = 0
linear = 0
parent = 0
lastfull = -1000000000000

for i in range(r.count()):
    n = r.node(i)
    d = r.length(i)
    if r.base(i) == i:
        origfull += 1
    total += d
    o = [d]
    p = r.parentrevs(i)[0]
    if optimized == lastfull:
        # we don't want to do a delta past the last full revision
        p = i - 1
    pd = revlog.compress(r.revdiff(p, i))
    d = len(pd[0]) + len(pd[1])
    if d + (optimized - lastfull) > r.size(i) * factor:
        fd = revlog.compress(r.revision(n))
        d = len(fd[0]) + len(fd[1])
        newfull += 1
        optimized += d
        lastfull = optimized
    else:
        if p != i-1:
            parent += 1
        else:
            linear += 1
        optimized += d

    if not i % 1000:
        print i, origfull, newfull, "%5.2f" % (100.0 * optimized/total)

print i, original, newfull, , "%5.2f" % (100.0 * optimized/total)
print total, optimized, linear, parent
----

> I tried sorting the revs to be linear and do a delta vs tip:

And here again. You've optimized the revision order and then did only
deltas?

> 85000 226463824 110756686 48.91
> 86000 228332610 111076561 48.65
> 87000 230034424 111751755 48.58
> 88000 232135035 112312502 48.38
>      88247 232664781 112395752 48.31

-- 
Mathematics is the supreme nostalgia of our time.