size of repository with many branches, vs. git

Sun Mar 30 16:18:58 CDT 2008

Matt Mackall wrote:
> On Sun, 2008-03-30 at 15:33 -0500, Matt Mackall wrote:
>> On Sun, 2008-03-30 at 22:23 +0300, Dov Feldstern wrote:
>>> Matt Mackall wrote:
>>>> On Sat, 2008-03-29 at 23:59 +0300, Dov Feldstern wrote:
>>>>> The conversion went well (with some help from Patrick), but the result
>>>>> was disappointing to me: the size of the cloned repository is between
>>>>> ~700MB (with no --datesort, converted in chunks of 1000 revisions at a
>>>>> time) and ~1GB (with --datesort, which probably better reflects what
>>>>> would happen over time as the project is tracked in real-time from svn). 
>>>>> By comparison, the entire git repository (freshly cloned) is only ~200MB!
>>>> Mercurial compression is suboptimal in the following ways:
>>>>
>>>> - every working directory file in the history has a backing repository
>>>> file so the typical repository will grow by (filesystem block size *
>>>> number of files in history)/2
>>> Ah, I guess that would explain why on two different machines (the one on 
>>> which conversion took place, and my local clone) the repositories vary 
>>> quite a bit in size (~1GB vs. ~1.4GB)? Both are ext3, but could it be 
>>> that they have different block sizes?
>> Basically all modern ext3 use a 4k block size, so I'd be surprised.
>> Perhaps you're using a really old hg somewhere that doesn't have
>> revlogng and thus doubles the file count?
>>
>>>> - copies and renames store a full new revision at the target
>>>> - revlog storage is linear so interleaving of branches in a single
>>>> revlog reduces compression
>>>>
>>> I assume that this last point is what causes a lot of the trouble in my 
>>> case --- I guessed that something like that must be going on, when I saw 
>>> the difference between the datesort-ed and the non-datesort-ed repos. 
>>> And in LyX, we normally have two main branches (trunk and the latest 
>>> stable release), both of which are committed to quite often (multiple
>>> times a day), and the development cycle of a release lasts about a year
>>> or more, meaning the branch diverges from the trunk quite a bit over 
>>> this time period... not to mention some users who have personal branches...
>>>
>>>> The last problem mostly appears in the manifest as it gets touched by
>>>> every commit on every branch. How many files are in your working dir,
>>>> how many files are in your store, how many changesets do you have, and
>>>> how big is your 00manifest.i?
>>>>
>>> Here are the numbers for my converted repository:
>>> *) working directory (actually, the number of files in the output of 'hg 
>>> manifest' on tip): 3281
>>> *) # of files in store: 84215 ('find .hg/store/data | wc -l')
>> Hmmm. Those two numbers are -very- different. For every file in the tip,
>> there are about 25 that once existed. Perhaps that means each file's
>> been renamed about 20 times?
>>
>> Anyway, those 84k files account for about 172MB of filesystem overhead,
>> perhaps more. You can find out exactly by generating an uncompressed
>> tarball of .hg and comparing the result with du. 
>>
>>> *) # of changesets: 21123
>>> *) size of 00manifest.i: 1351744 bytes
>>> *) size of 00manifest.d: 415874316 bytes! (i.e., about 35-40% of the 
>>> repository size is in this single file...)
>> Each manifest entry is taking about 20k compressed. In other words,
>> there are about 200 files changing in the average delta. Most of that is
>> bouncing between branches.
>>
>>> So, are there any thoughts on improving any of these issues? Or am I 
>>> trying to use mercurial in a way that is sufficiently different than the 
>>> way it's intended to be used (storing many branches in a single 
>>> repository, tracking a foreign repository with these branches very 
>>> closely, ...)?
>> We've kicked around revlog patches that don't do strictly linear deltas.
>> That would help immensely here - probably a 10x compression
>> improvement for the manifest. It's long past time we dusted those off.
> 
> Here's a quick test to see what we might stand to gain from non-linear
> deltas on your repo:
> 
> -----
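> #!/usr/bin/python
> # Estimate how much smaller a revlog would be if each revision were
> # stored as a delta against its best parent instead of linearly.
> 
> import sys
> from mercurial import revlog
> 
> r = revlog.revlog(open, sys.argv[1])
> total = 0      # bytes currently stored for all revisions so far
> optimized = 0  # bytes needed if each revision took its cheapest form
> 
> for i in range(r.count()):
>     c = r.length(i)  # stored (compressed) size of revision i
>     total += c
>     o = [c]  # candidates: the current size, plus one delta per parent
>     for p in r.parentrevs(i):
>         if p != -1:  # -1 marks a missing parent
>             d = revlog.compress(r.revdiff(p, i))
>             o.append(len(d[0]) + len(d[1]))
>     optimized += min(o)
>     if not i % 1000:
>         # columns: revision, current bytes, optimized bytes, percent
>         # (raises ZeroDivisionError if total is still 0 at this point)
>         print i, total, optimized, "%5.2f" % (100.0 * optimized/total)
> 
> print i, total, optimized, "%5.2f" % (100.0 * optimized/total)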
> -----
> 
> Just run "python revlogstat .hg/store/00manifest.i"
> 
> On Mercurial itself, we get:
> 0 453 453 100.00
> 1000 185808 131249 70.64
> 2000 415933 244696 58.83
> 3000 744285 367411 49.36
> 4000 1022945 490668 47.97
> 5000 1710896 614376 35.91
> 6000 2355557 741334 31.47
> 6388 2617967 799271 30.53
> 
> i.e., we save nearly 70% of the manifest size. This is only an
> approximation, because it doesn't include insertion of full revisions at
> intervals to keep extraction time bounded.
> 
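For anyone following along, here is a toy illustration (not Mercurial code) of why interleaved branches hurt a strictly linear delta chain. It is only a sketch under assumptions of mine: the manifest is faked as 200 "path hash" lines, the branch names trunk and stable and all function names are made up, and difflib plus zlib stand in for the real revlog delta and compression.

```python
import difflib
import hashlib
import zlib


def entry(owner, i, rev):
    # One fake manifest line: a path followed by a 40-character hex hash.
    digest = hashlib.sha1(("%s-%d-%d" % (owner, i, rev)).encode()).hexdigest()
    return "dir/file%04d.c %s" % (i, digest)


def fake_manifest(branch, nfiles=200):
    # Odd-numbered files get a branch-specific hash, so the two branches
    # start out differing in half of their entries.
    return [entry(branch if i % 2 else "common", i, 0) for i in range(nfiles)]


def commit(manifest, branch, rev):
    # A small commit: exactly one manifest entry changes.
    new = list(manifest)
    i = rev % len(new)
    new[i] = entry(branch, i, rev)
    return new


def delta_size(old, new):
    # Compressed size of a context-free unified diff between two manifests.
    diff = "\n".join(difflib.unified_diff(old, new, lineterm="", n=0))
    return len(zlib.compress(diff.encode()))


def simulate(ncommits=40):
    tips = {"trunk": fake_manifest("trunk"), "stable": fake_manifest("stable")}
    previous = tips["trunk"]  # last revision appended to the log
    linear = parent = 0
    for rev in range(1, ncommits + 1):
        branch = "trunk" if rev % 2 else "stable"  # strictly interleaved
        new = commit(tips[branch], branch, rev)
        linear += delta_size(previous, new)      # delta vs. previous stored rev
        parent += delta_size(tips[branch], new)  # delta vs. the real parent
        tips[branch] = new
        previous = new
    return linear, parent


if __name__ == "__main__":
    linear, parent = simulate()
    print("linear deltas: %d bytes, parent deltas: %d bytes" % (linear, parent))
```

Since nearly every linear delta here has to bounce from one branch to the other, the linear total should come out many times larger than the parent-based total. That is the same effect the script above measures on a real revlog.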

So here are the numbers for my repository:

0 0 0   (I added an if to not print the ratio if total == 0)
1000 2724549 670786 24.62
2000 8913458 1245021 13.97
3000 11583394 1529539 13.20
4000 13688481 1847968 13.50
5000 20006385 2194606 10.97
6000 27620817 2534312  9.18
7000 36007696 2882225  8.00
8000 46311519 3359232  7.25
9000 58804766 3795506  6.45
10000 76475314 4092805  5.35
11000 89139999 5240948  5.88
12000 182764754 7836535  4.29
13000 221057361 9123112  4.13
14000 232451389 9528460  4.10
15000 260941040 10060897  3.86
16000 300371968 10876508  3.62
17000 338490690 11752122  3.47
18000 368879711 12185596  3.30
19000 390890090 12457084  3.19
20000 407195952 12754832  3.13
21000 415139662 12940260  3.12
21120 415874316 12961588  3.12

So we'd be saving almost 97%?!
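
As a sanity check on the figures quoted in this thread, the arithmetic can be redone in a few lines of Python. This is just a sketch: the 4k block size is the assumption Matt mentions for ext3, and the variable names are mine.

```python
# Figures copied from this thread.
store_files = 84215           # 'find .hg/store/data | wc -l'
changesets = 21123
manifest_bytes = 415874316    # size of 00manifest.d
optimized_bytes = 12961588    # last "optimized" column printed by the script
block_size = 4096             # assumed ext3 block size

# On average half a block is wasted per file in the store.
overhead = store_files * block_size // 2

# Average manifest data stored per changeset.
per_changeset = manifest_bytes / float(changesets)

# Potential saving from deltas against parents rather than linear deltas.
saved_pct = 100.0 * (1 - optimized_bytes / float(manifest_bytes))
factor = manifest_bytes / float(optimized_bytes)

if __name__ == "__main__":
    print("filesystem overhead: %d bytes" % overhead)
    print("manifest bytes per changeset: %.0f" % per_changeset)
    print("potential saving: %.2f%% (factor %.1f)" % (saved_pct, factor))
```

The overhead works out to roughly 172MB and the per-changeset figure to just under 20k, in line with Matt's estimates. The saving corresponds to the manifest shrinking by a factor of about 32.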