[PATCH 0 of 1 ] Wrong distance calculation in revlog causes huge manifests

Kastner Masilko, Friedrich kastner-masilko at at.festo.com
Thu Jul 12 03:04:40 CDT 2012


> From: Matt Mackall [mailto:mpm at selenic.com]
> 
> If I understand what you've done here, this is probably incorrect.
> 
> In the non-generaldelta code, we bound both I/O and reconstruction time
> based on a small multiple of file size. Before this patch, we strictly
> bounded I/O, and thus reconstruction time. After this patch,
> reconstruction time is bound, but I/O is not.
> 
> Note that we intentionally do _a single read_ to pull in the entire
> span
> of deltas, including data we're skipping. This is important for
> performance on spinning media: reads are blocking and you have to wait
> for every read to complete before issuing another one. So if we're
> reconstructing a large delta chain, we could end up making hundreds to
> thousands of small round trips to the disk instead of one big, fast
> one.
> 
> So, with an unbounded I/O window, we could end up reading in a gigabyte
> of deltas (in one read) and then discarding most of them, just to
> reconstruct a megabyte of manifest data.

So if I try to recap this in my own words, the intention is that a far away parent always has to introduce a full snapshot in order to let us read in as few bytes as possible. Worst case would be a commit - with parent revision 0 - that adds 1 byte to a 100MB (uncompressible) file, with the revlog of that file already being x > 300MB (if default window factor 2 is used). This would mean creating a x+100MB revlog for the 1 byte change to save an x > 300MB read for a 100MB file. Instead, just the 100MB must be read, because we can start at the proper offset and get everything with one read. With my patch, we'd start at 0 and have to read whatever x is, just to through away x-100MB bytes for 1 additional byte.

With this limitation in mind, I guess it would make sense to convert a repository stored in generaldelta format with the date-sort option, so the chance of a far away parent is minimized. In the standard format, this was discouraged, but here it might yield better results. Maybe a kind of "shuffle" option could maximize the results, i.e. reordering commits by means of a round-robin kind of "scheduling" of parallel branches.

Thanks for the explanation, in light of this my patch is indeed incorrect. I'd suggest to reword the comments in the appropriate code section accordingly, though. From what it is now, I really got the impression that it is the chain-length that decides for a full snapshot, not the position in the file. But then it could just be me ;) .
 
> By contrast, git just mmaps its pack file and relies on the I/O
> scheduler to do the right thing. This sort-of works, by virtue of
> Linux's adaptive and aggressive read-ahead engine, but can degrade to
> worst-case random I/O behavior when the spacing between deltas in a
> chain is large. Also, it's safe to assume that mmap read-ahead works
> significantly less well on non-Linux systems.

Well, Git manages to work reasonably on Windows with this repository, especially regarding access times. Unfortunately, the fact that Mercurial inflates the manifest that seriously - even with the generaldelta format - makes it unusable for us in this case.

AFAIK, you once maintained a Git->Mercurial bridge on kernel.org , but it disappeared a while ago. Is there still something like this, and if so, what sizes do you see there? If not, do you use Git yourself to manage the Linux kernel? I'm asking because I don't have a reference of what size a Linux kernel Mercurial repository really should be. Somehow I suspect hg-git to be the reason for the blown-up manifest...

Sorry for the noise (@Bryan),
Fritz



Development Software Systems
Festo Gesellschaft m.b.H.
Linzer Strasse 227
Austria - 1140 Wien

Firmenbuch Wien
FN 38435y
UID: ATU14650108

Tel: +43(1)91075-198
Fax: +43(1)91075-282
www.festo.at

Der Inhalt dieses E-Mails ist ausschliesslich fuer den bezeichneten Adressaten bestimmt. Jede Form der Kenntnisnahme,
Veroeffentlichung, Vervielfaeltigung oder Weitergabe des Inhalts dieses E-Mails durch unberechtigte Dritte ist unzulaessig. Wir
bitten Sie, sich mit dem Absender des E-Mails in Verbindung zu setzen, falls Sie nicht der Adressat dieses E-Mails sind und das
Material von Ihrem Computer zu loeschen. 

This e-mail and any attachments are confidential and intended solely for the addressee. The perusal, publication, copying or
dissemination of the contents of this e-mail by unauthorised third parties is prohibited. If you are not the intended recipient of this
e-mail, please delete it and immediately notify the sender.



More information about the Mercurial-devel mailing list