size of repository with many branches, vs. git

Matt Mackall mpm at selenic.com
Sun Mar 30 15:33:11 CDT 2008


On Sun, 2008-03-30 at 22:23 +0300, Dov Feldstern wrote:
> Matt Mackall wrote:
> > On Sat, 2008-03-29 at 23:59 +0300, Dov Feldstern wrote:
> >> The conversion went well (with some help from Patrick), but the result 
> >> was disappointing to me: the size of the cloned repository is between 
> >> ~700MB (with no --datesort, converted in chunks of 1000 revisions at a 
> >> time) and ~1GB (with --datesort, which probably better reflects what 
> >> would happen over time as the project is tracked in real-time from svn). 
> >> By comparison, the entire git repository (freshly cloned) is only ~200MB!
> > 
> > Mercurial compression is suboptimal in the following ways:
> > 
> > - every working directory file in the history has a backing repository
> > file so the typical repository will grow by (filesystem block size *
> > number of files in history)/2
> 
> Ah, I guess that would explain why on two different machines (the one on 
> which conversion took place, and my local clone) the repositories vary 
> quite a bit in size (~1GB vs. ~1.4GB)? Both are ext3, but could it be 
> that they have different block sizes?

Basically all modern ext3 filesystems use a 4k block size, so I'd be
surprised if that were the cause.
Perhaps you're using a really old hg somewhere that doesn't have
revlogng and thus doubles the file count?

> > - copies and renames store a full new revision at the target
> > - revlog storage is linear so interleaving of branches in a single
> > revlog reduces compression
> > 
> 
> I assume that this last point is what causes a lot of the trouble in my 
> case --- I guessed that something like that must be going on, when I saw 
> the difference between the datesort-ed and the non-datesort-ed repos. 
> And in LyX, we normally have two main branches (trunk and the latest 
> stable release), both of which are committed-to quite often (multiple 
> times a day), and the development-cycle of a release lasts about a year 
> or more, meaning the branch diverges from the trunk quite a bit over 
> this time period... not to mention some users who have personal branches...
> 
> > The last problem mostly appears in the manifest as it gets touched by
> > every commit on every branch. How many files are in your working dir,
> > how many files are in your store, how many changesets do you have, and
> > how big is your 00manifest.i?
> > 
> 
> Here are the numbers for my converted repository:
> *) working directory (actually, the number of files in the output of 'hg 
> manifest' on tip): 3281
> *) # of files in store: 84215 ('find .hg/store/data | wc -l')

Hmmm. Those two numbers are -very- different. For every file in the tip,
there are about 25 that once existed. Perhaps that means each file's
been renamed about 20 times?

Anyway, those 84k files account for about 172MB of filesystem overhead,
perhaps more. You can find out exactly by generating an uncompressed
tarball of .hg and comparing the result with du. 
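
As a rough sketch of that estimate (assuming a 4k block size and that
each store file wastes half a block on average; the function names are
made up for illustration), with a direct stat walk swapped in for the
tarball-vs-du comparison:

```python
import os

def estimated_slack_bytes(file_count, block_size=4096):
    """Rule-of-thumb overhead: each file wastes half a block on average."""
    return file_count * block_size // 2

def measured_sizes(root):
    """Return (apparent_bytes, allocated_bytes) for all files under root.

    Unix only: st_blocks counts the 512-byte units allocated on disk.
    """
    apparent = allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            apparent += st.st_size
            allocated += st.st_blocks * 512
    return apparent, allocated

# 84215 is the store file count reported above; 4096 is an assumed block size.
print(f"estimate: {estimated_slack_bytes(84215) / 1e6:.0f} MB")
```

Calling measured_sizes(".hg") and subtracting the first value from the
second should land in the same neighborhood as the tarball-vs-du
difference, though directory entries and filesystem metadata are not
counted here.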

> *) # of changesets: 21123
> *) size of 00manifest.i: 1351744 bytes
> *) size of 00manifest.d: 415874316 bytes! (i.e., about 35-40% of the 
> repository size is in this single file...)

Each manifest revision is taking about 20k compressed (415874316 bytes
over 21123 changesets). In other words, there are about 200 files
changing in the average delta. Most of that comes from bouncing between
branches.
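
The arithmetic behind those two figures, as a small sketch. The 100
bytes per changed manifest line is an assumed round number picked to
match the "about 200 files" estimate, not something measured, and the
function names are only illustrative:

```python
def bytes_per_revision(data_bytes, revisions):
    """Average compressed size of one stored manifest revision."""
    return data_bytes / revisions

def changed_entries(per_revision_bytes, bytes_per_entry=100):
    """Rough count of manifest lines in an average delta.

    bytes_per_entry is an assumption (path plus hash, after compression).
    """
    return per_revision_bytes / bytes_per_entry

# Numbers reported above: size of 00manifest.d and the changeset count.
per_rev = bytes_per_revision(415874316, 21123)
print(f"{per_rev / 1000:.1f}k per manifest revision")
print(f"~{changed_entries(per_rev):.0f} changed files per delta")
```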

> So, are there any thoughts on improving any of these issues? Or am I 
> trying to use mercurial in a way that is sufficiently different than the 
> way it's intended to be used (storing many branches in a single 
> repository, tracking a foreign repository with these branches very 
> closely, ...)?

We've kicked around revlog patches that don't do strictly linear deltas.
That would help immensely here - probably a factor of 10 compression
improvement for the manifest. It's long past time we dusted those off.
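
To illustrate why, here is a toy model, not Mercurial's actual revlog
code. It assumes manifests are plain dicts, that delta size is just the
number of differing entries, and that commits alternate between two
branches that already differ in 200 files. It ignores compression and
the periodic full snapshots a real revlog stores:

```python
def delta_size(a, b):
    """Number of manifest entries that differ between two manifests."""
    return sum(1 for k in a.keys() | b.keys() if a.get(k) != b.get(k))

def simulate(num_files=3000, diverged=200, commits=100):
    """Return (linear_total, parent_total) delta sizes for the toy history."""
    trunk = {f"file{i}": "base" for i in range(num_files)}
    stable = dict(trunk)
    for i in range(diverged):
        stable[f"file{i}"] = "stable"      # files that differ between branches
    heads = {"trunk": trunk, "stable": stable}
    previous_stored = stable               # last revision written to the log
    linear_total = parent_total = 0
    for n in range(commits):
        branch = "trunk" if n % 2 == 0 else "stable"
        parent = heads[branch]
        new = dict(parent)
        new[f"file{num_files - 1 - n}"] = f"{branch}-{n}"  # touch one file
        # Linear storage: delta against whatever was stored last.
        linear_total += delta_size(previous_stored, new)
        # Non-linear storage: delta against the changeset's own parent.
        parent_total += delta_size(parent, new)
        heads[branch] = new
        previous_stored = new
    return linear_total, parent_total

linear, against_parent = simulate()
print(linear, against_parent)
```

In this model each commit touches one file, so the delta against the
parent is one entry, while the delta against the previously stored
revision also has to carry every file that differs between the two
branches.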

-- 
Mathematics is the supreme nostalgia of our time.


