[PATCH 0 of 1] Wrong distance calculation in revlog causes huge manifests

Friedrich Kastner-Masilko kastner_masilko at at.festo.com
Wed Jul 11 05:54:12 CDT 2012


Hi there,

recently I had the opportunity (or the need, depending on how you see it) to convert a recent Linux kernel
repository from Git to Mercurial via Hg-Git. I did so on our company server with plenty of RAM and cores to
spare, so I didn't worry about size or duration. The Mercurial version is 2.0.1, Hg-Git is 0.3.2.

The Git repository has a total size of about 650MB (without working copy, of course). After about 8 hours
of conversion with Hg-Git, the Mercurial repository had a size of 5GB. It was interesting to see that the
manifest data file alone topped out at 4.1GB, while the data directory itself was a reasonable 960MB. All
this was without the generaldelta format.

I then converted the resulting HG repo via "hg clone -U --config format.generaldelta=1 kernel kernel2" to the
generaldelta format, in the hope of seeing a major reduction in manifest size. Unfortunately, it still created
a 3.4GB repository, with manifest data again dominating at 2.4GB and the data directory at 883MB.

I decided to take a look at the manifest index, and noticed very frequent full snapshots, despite the changes
leading to these snapshots being minimal. All the occurrences used a far-away parent, though, so I took a
look at the code to see what's going on. I found the distance calculation in revlog's
_addrevision.builddelta(rev) to be wrong.
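
The effect can be illustrated with a toy model. This is only a sketch under stated assumptions, not
Mercurial's code: it assumes the unpatched calculation measures the span of the data file from the start of
the chain base to the end of the new delta, that the patched one sums the stored sizes along the delta chain
itself, and that a full snapshot is forced once the distance exceeds twice the full text length. The names
Entry, span_distance and chain_distance, and the factor 2, are hypothetical.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    length: int  # stored size of this revision's data in the data file
    base: int    # revision the delta applies to; own index means full snapshot


def start(entries: List[Entry], rev: int) -> int:
    """Offset in the data file at which revision `rev` begins."""
    return sum(e.length for e in entries[:rev])


def span_distance(entries: List[Entry], parent: int, new_len: int) -> int:
    """Assumed old behaviour: file span from the chain base to the new delta's end."""
    rev = parent
    while entries[rev].base != rev:      # walk down to the full snapshot
        rev = entries[rev].base
    end_of_file = start(entries, len(entries))
    return new_len + end_of_file - start(entries, rev)


def chain_distance(entries: List[Entry], parent: int, new_len: int) -> int:
    """Assumed patched behaviour: bytes actually read to rebuild the text."""
    total = new_len
    rev = parent
    while entries[rev].base != rev:
        total += entries[rev].length
        rev = entries[rev].base
    return total + entries[rev].length   # include the full snapshot itself


# rev 0: full snapshot (1000 bytes); rev 1: small delta against rev 0;
# revs 2..101: later revisions that the new delta's chain does not pass through.
entries = [Entry(1000, 0), Entry(10, 0)] + [Entry(10, i - 1) for i in range(2, 102)]

text_len = 1000                                  # size of the full text
span = span_distance(entries, parent=1, new_len=10)
chain = chain_distance(entries, parent=1, new_len=10)

print(span, span > 2 * text_len)    # 2020 True  -> would force a full snapshot
print(chain, chain > 2 * text_len)  # 1020 False -> a delta would be stored
```

In this model the new revision's delta chain is only two entries long, yet the span-based distance grows with
every revision written after the far-away parent, which matches the pattern of frequent snapshots described
above.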

After applying the attached patch, another conversion to generaldelta format resulted in a 1.7GB repository
with 691MB manifest and 867MB data. This is the kind of overhead relative to Git's aggressive pack mechanism
that I would have expected anyway.

Unfortunately, I did not time the conversions, so it could very well be that the patch introduced a serious
time cost. I doubt it, because it follows the same algorithm that chainbase uses and is fairly
straightforward.

OTOH I did not see much use of the chainbase function itself, so maybe it could be refactored to also return
the length in order to save another chain iteration in chainlength. That said, the current implementation is
safe with regard to legacy behaviour, as it won't affect the standard repository format.

Please excuse my ignorance if this has been brought up before; I don't follow the devel list much.

regards,
Fritz

