size of repository with many branches, vs. git

Dov Feldstern dfeldstern at fastimap.com
Mon Mar 31 16:46:44 CDT 2008


(in which mercurial is exonerated, so read on!)

So yesterday I discovered that part of the problem was that this one 
branch --- personal --- actually contains subdirectories with lots of 
branches in them, but it was being converted as a single, very large 
branch, which was obviously very different from the other branches and 
had multiple copies of each file in it.

So here's what I tried: first, I cloned each branch (excluding personal) 
into a separate repository (with hg clone --rev), so I now had 23 
repositories. Their sizes are in Appendix 1 (interestingly, the 
repositories were apparently not hardlinked? --- which made life easier 
getting the sizes with du). The total size of all the branches came out 
to 914452 KB, which is still not great.

Next, I cloned the default branch into a new repository, and then pulled 
all the other branches into it, one at a time. So I now had a single 
repository of all the branches, but in which the revisions were 
basically as linear as possible. The size of this repository on disk was 
159608 KB, which is much, much better!
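
For anyone who wants to repeat the two experiments described so far, here 
is a rough sketch of the command sequence, written as a small Python helper 
that only builds the hg command lines and does not run them. The paths 
"lyx-hg", "split" and "combined" are made-up placeholders, and pulling each 
branch by name straight from the converted repository is an assumption of 
this sketch about how the second step could be done. Only `hg clone --rev`, 
`hg pull --rev` and the global `-R` option are used.

```python
def plan_commands(source, branches, split_dir="split", combined="combined"):
    """Build (but do not run) the hg commands for the two experiments."""
    # Experiment 1: one repository per branch, via `hg clone --rev BRANCH`.
    per_branch = [
        ["hg", "clone", "--rev", name, source, f"{split_dir}/{name}"]
        for name in branches
    ]
    # Experiment 2: clone the default branch first, then pull the other
    # branches into that clone one at a time.
    others = [name for name in branches if name != "default"]
    merged = [["hg", "clone", "--rev", "default", source, combined]]
    merged += [
        ["hg", "-R", combined, "pull", "--rev", name, source] for name in others
    ]
    return per_branch, merged


if __name__ == "__main__":
    # Three branch names taken from Appendix 1; "lyx-hg" is a placeholder path.
    per_branch, merged = plan_commands(
        "lyx-hg", ["default", "BRANCH_1_5_X", "gtkdevel"]
    )
    for cmd in per_branch + merged:
        print(" ".join(cmd))
```

The helper returns plain argument lists, so they can be inspected first and 
then handed to subprocess.run once the paths are right.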

Finally, to complete our tests, I converted this new repository into yet 
another one, with --datesort. And here's where I was really surprised: 
despite the datesorting, which causes quite a bit of interleaving 
between branches, the size of the datesorted repository is only 172616 KB!

So it seems that the non-standard branch layout in the original svn 
repository really was the culprit (recall that the size I started out 
with was ~1GB, datesorted). It's really worth noting --- especially for 
anyone converting to mercurial --- how sensitive mercurial is to changes 
in the layout of the tree between revisions. I guess that just moving 
files from one place to another in one branch and not in another would 
cause similar issues? Interestingly, git seems to have dealt with this 
well, even though it also converted personal as a single branch: the git 
repository is only 157300 KB, and that includes personal. I guess this is 
where tracking content vs. tracking files really makes a difference...
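
To put the figures quoted so far side by side, here is a small sketch that 
expresses each size as a multiple of the linear all-branches-in-one 
mercurial repository. The numbers are the ones given in the text; the 
labels and the choice of baseline are just for illustration.

```python
# Sizes in KB, as quoted in the text.
sizes_kb = {
    "23 separate per-branch repositories": 914452,
    "hg, all branches, linear pull order": 159608,
    "hg, all branches, --datesort": 172616,
    "git (including personal)": 157300,
}

baseline_kb = sizes_kb["hg, all branches, linear pull order"]


def relative_sizes(sizes, base):
    """Return each size as a multiple of the baseline size."""
    return {label: kb / base for label, kb in sizes.items()}


for label, ratio in relative_sizes(sizes_kb, baseline_kb).items():
    print(f"{label:38s} {sizes_kb[label]:>7d} KB  {ratio:5.2f}x")
```

By this measure the datesorted repository is about 8% larger than the 
linear one, the git repository about 1.4% smaller, and the 23 separate 
repositories together about 5.7 times larger.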

Now I'm still a bit stuck with LyX, because I don't have any way to keep 
the converted repository up-to-date without pulling personal in again. 
It would be really great if the convert extension allowed me to specify 
the full path within the original svn repository of each of the branches 
I want to convert, and/or to specify which branches I want to include or 
ignore. (Of course, this means that I would miss new branches that are 
added later on, until I updated the branch-path-map manually... but I 
don't see any way of doing this automatically, unless the svn repo itself 
stores information about which paths are actual copies of the trunk, as 
opposed to just directories created manually?)

Finally, just in case you're still interested, I'm adding the 
revlogstats output for both of the all-branches-in-one repositories --- 
the datesorted and the non-datesorted. It may still be worth pursuing 
this approach...

Again, thanks everyone for your help, and I hope that this can supply 
some constructive insights to others that may be having trouble with 
repository size...

Dov


Appendix 1: size on disk (in KB) of each branch in a separate repository
------------------------------------------------------------------------
3884    LyX-Team
4104    string-switch
4268    pathswitch
4312    debugstream
4980    runlatex
7560    rae
9008    dialogbase
12396   lyx-1_1_5
15120   obsolete
17436   BRANCH_new_insets
17736   BRANCH_1_1_6
18440   BRANCH_MVC
20884   BRANCH_NATBIB
33288   BRANCH-1_2_X
33380   BRANCH_GUII
49464   BRANCH_NOUPDATE
67552   CoordBranch
72736   BRANCH_1_3_X
81316   BooktabBranch
86824   gtkdevel
87464   BRANCH_1_4_X
127348  BRANCH_1_5_X
134952  default

914452  total
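
As a quick sanity check on the table above, the per-branch sizes can be 
parsed and summed. This is just the arithmetic behind the total, with the 
rows copied verbatim; the variable names are made up.

```python
appendix_1 = """
3884    LyX-Team
4104    string-switch
4268    pathswitch
4312    debugstream
4980    runlatex
7560    rae
9008    dialogbase
12396   lyx-1_1_5
15120   obsolete
17436   BRANCH_new_insets
17736   BRANCH_1_1_6
18440   BRANCH_MVC
20884   BRANCH_NATBIB
33288   BRANCH-1_2_X
33380   BRANCH_GUII
49464   BRANCH_NOUPDATE
67552   CoordBranch
72736   BRANCH_1_3_X
81316   BooktabBranch
86824   gtkdevel
87464   BRANCH_1_4_X
127348  BRANCH_1_5_X
134952  default
"""

# Each row is "<size in KB> <branch name>".
sizes = {}
for line in appendix_1.strip().splitlines():
    kb, name = line.split()
    sizes[name] = int(kb)

total_kb = sum(sizes.values())
print(len(sizes), "repositories,", total_kb, "KB in total")
# prints: 23 repositories, 914452 KB in total
```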


Appendix 2: revlogstats for the non-datesorted, all-branches-in-one repository:
-------------------------------------------------------------------------------
0 0 0
1000 664906 536408 80.67
2000 1215181 948095 78.02
3000 1557129 1239030 79.57
4000 1999908 1578821 78.94
5000 2420433 1920701 79.35
6000 2862426 2275816 79.51
7000 3427959 2780318 81.11
8000 3930118 3187958 81.12
9000 4281668 3430958 80.13
10000 4671308 3699924 79.21
11000 4953131 3981747 80.39
12000 5221487 4141631 79.32
13000 5641740 4451862 78.91
14000 5835105 4645227 79.61
15000 6199522 4897644 79.00
16000 6434720 5132842 79.77
17000 6714012 5307982 79.06
18000 7028967 5484834 78.03
19000 8010598 5923646 73.95
19129 8268318 6069547 73.41


Appendix 3: revlogstats for the datesorted, all-branches-in-one repository:
-------------------------------------------------------------------
0 0 0
1000 1396350 609093 43.62
2000 5239618 1139758 21.75
3000 5633760 1433992 25.45
4000 7001082 1752538 25.03
5000 11773611 2102094 17.85
6000 12207434 2453218 20.10
7000 12713697 2806189 22.07
8000 13400747 3337689 24.91
9000 14123493 3703632 26.22
10000 14541697 3952737 27.18
11000 14936154 4215431 28.22
12000 15243602 4471724 29.34
13000 15515288 4635210 29.88
14000 15907699 4920107 30.93
15000 16117844 5130252 31.83
16000 16476146 5376603 32.63
17000 16691398 5591855 33.50
18000 17026828 5781577 33.96
19000 17433378 5969085 34.24
19130 18317979 5991407 32.71
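
The column meanings are not spelled out in the output itself. What can be 
checked from the numbers alone is that the fourth column is the third 
divided by the second, as a percentage. A small sketch of that check, 
using the first and last non-zero rows of Appendices 2 and 3 (the function 
name is made up):

```python
rows = [
    # (first column, second column, third column, reported percentage)
    (1000, 664906, 536408, 80.67),      # Appendix 2, first non-zero row
    (19129, 8268318, 6069547, 73.41),   # Appendix 2, last row
    (1000, 1396350, 609093, 43.62),     # Appendix 3, first non-zero row
    (19130, 18317979, 5991407, 32.71),  # Appendix 3, last row
]


def percent(second, third):
    """Third column as a percentage of the second, to two decimals."""
    return round(100.0 * third / second, 2)


for first, second, third, reported in rows:
    print(first, percent(second, third), reported)
```

Note that in the last rows the third column is nearly the same for both 
repositories, while the second column is more than twice as large in the 
datesorted one.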


