Partial revlog shrinking

Greg Ward greg at gerg.ca
Sat Apr 17 14:34:10 CDT 2010


Here's some interesting data:

  Jan 18:
    converted 9 years of CVS history (~105,300 changesets) to hg
    manifest log was ~3.5 GB after conversion
    shrink-revlog squashed that down to ~31 MB: big win!
    that's ~300 bytes of manifest log per changeset

  Apr 16:
    in 3 months of Mercurial activity, we're at almost 108,000 changesets
    specifically, we've added 2569 changesets in 89 calendar days
    (which is scarily close to an average of 1 changeset per developer per day)
    of those 2569 changesets, ~1300 are cross-branch merges
    and our manifest log is up to 166 MB
    that's 52432 bytes of added manifest log per new changeset
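
For anyone who wants to check the arithmetic, here is a rough sketch in
Python.  It assumes decimal megabytes and uses the rounded sizes quoted
above, so it lands near the figures in the list rather than exactly on
them.  The second number is growth since the shrink divided by the
number of new changesets, not total size divided by total changesets.

```python
# Back-of-the-envelope check of the per-changeset figures.
# Assumption: 1 MB = 10**6 bytes, and the sizes are the rounded ones
# quoted in the post, so results are approximate.
MB = 10**6


def bytes_per_changeset(size_bytes, changesets):
    """Average manifest-log bytes per changeset."""
    return size_bytes / changesets


# Jan 18: whole manifest log after the full shrink.
after_shrink = bytes_per_changeset(31 * MB, 105300)

# Apr 16: only the growth since then, spread over the new changesets.
growth = bytes_per_changeset((166 - 31) * MB, 2569)

print(round(after_shrink), round(growth))
```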

Conclusion #1: changing to Mercurial has not changed our most common
workflow practice, which is to fix a bug on the earliest stable
release branch where it makes business sense to do so and then merge
forwards to the trunk ("default").  Since we have paying customers
running a wide variety of different branches, this means that it's
quite common for a fix to be merged to 3 or 4 branches -- this happens
several times per day.  Once or twice a week we have a fix merged
across 10-12 branches.  I guess that's just how things pan out when
you're selling a large, complex, expensive piece of software to
medical clinics who are conservative about upgrades.

Conclusion #2: my relationship with shrink-revlog is not over.  I
*could* just do a full run on the entire manifest now and I'm sure it
would shrink down quite a lot.  But that's awfully wasteful: it means
reading and writing 105375 manifest revisions that will not be reordered,
just to reorder the 2569 that really need sorting (shrinking).  And I
would like to give my users a command they can run on their
now-much-bigger-than-necessary working repos to shrink them; if this
command takes an hour when it should take a minute or two, that's
rather silly.

So I'd like to implement partial revlog shrinking.  The idea is this:

  hg shrink -r REVNUM

would copy revisions 0..(REVNUM-1) verbatim from the source revlog
with no resorting.  It would just copy the bytes, since
decoding/encoding all those revlog entries is what really makes
shrink-revlog take a long time.  Then it would sort and rewrite
revisions REVNUM..max.
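
To make the ordering part of the idea concrete, here is a toy sketch in
Python.  It is not shrink-revlog's actual sort and it does not touch
real revlogs: the history is modelled as a plain list of (p1, p2)
parent revnums with -1 for "no parent", and partial_order() is a
hypothetical helper name.  It keeps 0..(start-1) in place and emits the
tail depth-first so that a child tends to follow its parent, while
never emitting a revision before both of its parents.

```python
def partial_order(parents, start):
    """Return a new revision order: the prefix 0..start-1 unchanged,
    followed by a topological reordering of start..len(parents)-1.

    parents[rev] is a (p1, p2) tuple of revnums, -1 meaning no parent.
    Toy model only; not the algorithm used by shrink-revlog.
    """
    n = len(parents)
    order = list(range(start))       # prefix is copied as-is
    emitted = set(order)

    # Map each tail revision to its children within the tail.
    children = {}
    for rev in range(start, n):
        for p in parents[rev]:
            if p >= start:
                children.setdefault(p, []).append(rev)

    def ready(rev):
        # A revision may be written once all its parents are written.
        return all(p == -1 or p in emitted for p in parents[rev])

    # Seed the stack with tail revisions whose parents are all in the
    # prefix; the lowest revnum ends up on top and is visited first.
    stack = [rev for rev in range(n - 1, start - 1, -1) if ready(rev)]
    while stack:
        rev = stack.pop()
        if rev in emitted or not ready(rev):
            continue                 # re-pushed later by its other parent
        order.append(rev)
        emitted.add(rev)
        for child in reversed(children.get(rev, [])):
            stack.append(child)
    return order


# Two branches forked from rev 0 and merged at rev 5.
history = [(-1, -1), (0, -1), (0, -1), (1, -1), (2, -1), (3, 4)]
print(partial_order(history, 2))
```

With start=2, revisions 0 and 1 stay where they are and only 2..5 are
candidates for reordering.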

Additionally, it would be nice to have a "carry on where I left off"
feature.  That is, shrink-revlog could write the revnum where its
last run finished, and then optionally carry on from there on the next
run.  E.g.

  hg shrink

would reorder and rewrite the whole manifest log, writing the revnum
of tip to .hg/shrink-lastrev.  Then, a few days later, run

  hg shrink --continue

as shorthand for

  hg shrink -r `cat .hg/shrink-lastrev`
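
The bookkeeping for this could be as small as the following Python
sketch.  The file name .hg/shrink-lastrev is the one proposed above;
the function names, the fallback to 0 (meaning "shrink everything")
when no state file exists, and the write-then-rename step are all
assumptions of mine, not existing shrink-revlog behaviour.

```python
import os

STATE_FILE = "shrink-lastrev"   # name from the proposal; lives in .hg/


def read_lastrev(repo_root):
    """Return the revnum saved by the previous run, or 0 if there is
    no state file (assumption: 0 means a full shrink)."""
    path = os.path.join(repo_root, ".hg", STATE_FILE)
    try:
        with open(path) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return 0


def write_lastrev(repo_root, rev):
    """Record rev as the point where the next --continue run starts.

    Writes to a temporary file first and renames it into place, so a
    crash mid-write does not leave a truncated state file behind.
    """
    path = os.path.join(repo_root, ".hg", STATE_FILE)
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write("%d\n" % rev)
    os.replace(tmp, path)
```

A --continue run would then call read_lastrev() to pick its starting
revnum, and every successful run would finish with write_lastrev().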

Questions:

  * am I risking a horrible mess by reordering the manifest log for a
repo that's in live use across dozens of developer workstations, build
servers, etc?  I think I'm safe because the changelog references the
manifest log by node ID, not by revnum.  But if I'm missing some
subtle detail, please tell me!

  * the "-r" option here would only accept a decimal revision number
*in the revlog being shrunk*, not an arbitrary changeset identifier.
Is it therefore bad to use "-r" here?  Or can I get away with it
because shrink is a rather special-purpose extension that is probably
only used by people who understand what that sentence means?

  * what do you think of the .hg/shrink-lastrev and --continue
feature?  it's orthogonal to -r and I'll implement it as a separate
patch, of course.

Thanks --

Greg

