[PATCH 0 of 1 ] Wrong distance calculation in revlog causes huge manifests

Sune Foldager cryo at cyanite.org
Mon Jul 16 11:45:41 CDT 2012


On 2012-07-12 08:04, Kastner Masilko, Friedrich wrote:

>So if I try to recap this in my own words, the intention is that a far away parent always has to introduce a full snapshot in order to let us read in as few bytes as possible. Worst case would be a commit - with parent revision 0 - that adds 1 byte to a 100MB (incompressible) file, with the revlog of that file already being x > 300MB (if default window factor 2 is used). This would mean creating an x+100MB revlog for the 1 byte change to save an x > 300MB read for a 100MB file. Instead, just the 100MB must be read, because we can start at the proper offset and get everything with one read. With my patch, we'd start at 0 and have to read whatever x is, just to throw away x-100MB for 1 additional byte.
>
>With this limitation in mind, I guess it would make sense to convert a repository stored in generaldelta format with the date-sort option, so the chance of a far away parent is minimized. In the standard format, this was discouraged, but here it might yield better results. Maybe a kind of "shuffle" option could maximize the results, i.e. reordering commits by means of a round-robin kind of "scheduling" of parallel branches.
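The trade-off recapped in the quoted text can be made concrete with a small sketch. This is not the actual revlog code; the function name, its parameters and the example sizes are assumptions chosen for illustration, and only the factor of 2 comes from the discussion above. It shows a rule that looks at how many bytes of the file a read would span, not at how many deltas are in the chain.

```python
MB = 1024 * 1024


def needs_full_snapshot(chain_base_start, delta_end, text_len, factor=2):
    """Hypothetical helper: decide between storing a delta and a full snapshot.

    chain_base_start: file offset where the delta chain's base revision starts
    delta_end:        file offset where the new delta would end if appended
    text_len:         size of the full (uncompressed) text of the new revision
    factor:           window factor (2 is the default mentioned in the thread)
    """
    # Bytes that would have to be read from disk to rebuild the revision.
    distance = delta_end - chain_base_start
    return distance > factor * text_len


# Far away parent: the chain base is revision 0 at offset 0, the revlog is
# already 350MB, and the new delta is 1 byte appended at the end.
far_parent = needs_full_snapshot(0, 350 * MB + 1, 100 * MB)

# Nearby parent: the chain base starts at 300MB in the same revlog.
near_parent = needs_full_snapshot(300 * MB, 350 * MB + 1, 100 * MB)

print(far_parent, near_parent)
```

In the first case the read would span the whole file, so a full snapshot is stored even though the change is 1 byte. In the second case the span is about 50MB, well under twice the text size, so a delta is kept.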

I implemented generaldelta a while back, although it's essentially based on an earlier idea, "parentdelta".

When generaldelta is used, changesets are reordered when leaving the repository, i.e. when you pull from a GD repository. The reordering uses a heuristic which aims both to lessen the work the server has to do to convert deltas (since the bundle format doesn't support GD) and to improve the storage on non-GD clients.

It's basically a greedy heuristic which tries to put each revision after its first parent, if possible, and otherwise after its second parent. It's the same basic code that was used in a repacker extension we had a while back. There might be room for improvement, but we can't spend too much time on the server.
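As a rough illustration of that kind of ordering, here is a minimal sketch. It is not the code Mercurial uses; the function name and the dict-based DAG representation are assumptions made for the example. It walks from the heads and emits each revision only after its parents, visiting the first parent last so that the revision lands directly after it whenever the first parent has not already been emitted.

```python
def greedy_order(parents, heads):
    """Hypothetical sketch: order revisions so each follows its first parent.

    parents: dict mapping a revision to a tuple of its parents (p1, p2),
             with 0, 1 or 2 entries
    heads:   revisions that have no children
    """
    order = []
    done = set()
    for head in sorted(heads):
        stack = [(head, False)]
        while stack:
            rev, expanded = stack.pop()
            if rev in done:
                continue
            if expanded:
                # All ancestors have been emitted, so emit the revision.
                order.append(rev)
                done.add(rev)
                continue
            stack.append((rev, True))
            # Push p1 before p2: p2 is popped first, so p1 and its ancestors
            # are emitted last and rev ends up right after p1.
            for p in parents.get(rev, ()):
                if p not in done:
                    stack.append((p, False))
    return order


# Two parallel branches (0-1-3 and 0-2-4) merged in revision 5.
dag = {0: (), 1: (0,), 2: (0,), 3: (1,), 4: (2,), 5: (3, 4)}
print(greedy_order(dag, [5]))
```

With numeric ordering the two branches interleave (0, 1, 2, 3, 4, 5). The sketch groups each branch instead, so most revisions sit directly after the parent they would be stored as a delta against.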

>Thanks for the explanation, in light of this my patch is indeed incorrect. I'd suggest rewording the comments in the appropriate code section accordingly, though. From what it is now, I really got the impression that it is the chain length that decides whether a full snapshot is made, not the position in the file. But then it could just be me ;)

Well I agree that those things are a bit implicit in the code.

-Sune

