[PATCH 0 of 1 ] Wrong distance calculation in revlog causes huge manifests

Sune Foldager cryo at cyanite.org
Mon Jul 16 11:50:04 CDT 2012


On 2012-07-12 13:21, Kastner Masilko, Friedrich wrote:
>> From: Benoit Boissinot [mailto:bboissin at gmail.com]
>>
>> You usually need some reordering in addition to general delta to take full advantage of the format.
>> I think the shrink stuff in contrib used to do that.
>
>If you mean shrink-revlog.py - originally written by yourself IIRC - I'm afraid that it is not helping here.
>I've tried both available orders of the shrink extension, but neither one resulted in any compression. I've
>tried it on the standard format repository and on the generaldelta format repository, both with zero
>compression.

Yes, shrink-revlog uses essentially the same code as the reorder heuristic for outgoing changegroups when GD is used.

>I think I now understand why. The shrink extension tries to reorder a given revlog in such a way that the
>standard format's shortcoming (only diffing to the previous revision, not necessarily the parent) is
>compensated by minimizing the sub-optimal cases. This is done by "serializing" topological branches.

This is exactly the point. The (wire) bundle format doesn't support GD, so we want to minimize data size and server load.
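
To make the effect of the ordering concrete, here is a toy model in Python. It is not Mercurial code: the revision names, the (name, parent) tuples and the helper suboptimal_deltas() are all made up for illustration. It only counts how often a format that always diffs against the previous stored revision ends up diffing against something other than the parent.

```python
# Toy model, NOT Mercurial code: a revision is a (name, parent_name) tuple,
# and "order" is the sequence in which revisions are stored or sent.

def suboptimal_deltas(order):
    """Count revisions whose predecessor in the given order is not their parent.

    This mimics a format that can only delta against the previous revision.
    """
    count = 0
    for prev, (name, parent) in zip(order, order[1:]):
        if parent != prev[0]:
            count += 1
    return count


# Two topological branches (a* and b*) growing from a common root "r".
r = ("r", None)
a1, a2, a3 = ("a1", "r"), ("a2", "a1"), ("a3", "a2")
b1, b2, b3 = ("b1", "r"), ("b2", "b1"), ("b3", "b2")

# Branches interleaved, e.g. in commit order.
interleaved = [r, a1, b1, a2, b2, a3, b3]
# Branches "serialized": one whole branch, then the other.
serialized = [r, a1, a2, a3, b1, b2, b3]

print(suboptimal_deltas(interleaved))  # 5
print(suboptimal_deltas(serialized))   # 1
```

In the serialized order only the first revision of the second branch is diffed against a non-parent, which is the kind of case the reordering tries to minimize.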

>The hg-git extension seems to do this implicitly already, so the standard format will not benefit from
>another ordering. The generaldelta's supposed advantage (diffing to the actual parent) is actually minimized
>by such a strategy, because the generaldelta implementation has to take the I/O bounds into account, too, as
>I've learned today from Matt. So using the shrink extension on a generaldelta repository seems to be the
>worst thing to do, because it maximizes the gaps between branches, thus maximizing the probability of
>hitting the I/O bounds limit.

I haven't performed extensive analysis, but it seems that shrinking (which can be accomplished simply by cloning a GD repo with --pull) doesn't really affect the repo size. If you want to experiment, the 'hg debugrevlog' command can be useful.

>Maybe implementing a third "order" into the shrink extension could help here. Instead of serializing the
>topological branches, it has to interleave them as much as possible, in order to not hit the I/O bounds
>window along the revlog.

This could help for storage, perhaps, but it would be awful for the wire bundle format. That said, we eventually want to replace the wire bundle format with one that supports GD.

-Sune


More information about the Mercurial-devel mailing list