On compressing revlogs

Matt Mackall mpm at selenic.com
Mon Jun 4 16:56:27 CDT 2012


On Mon, 2012-06-04 at 14:07 -0700, Bryan O'Sullivan wrote:
> Lately, I've identified zlib as a performance bottleneck for reading
> data from revlogs. I threw together a quick hack this morning to use
> the snappy compression algorithm instead. Here's what I've found.
> 
> Snappy compression is up to 15x faster than zlib (I haven't seen it
> less than 8x faster), while decompression is up to 4x faster (I
> haven't seen it less than 2x faster). Of course, there's a tradeoff:
> poorer compression, with output about 1.5x larger than zlib's in my
> tests.
> 
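For concreteness, a rough sketch of that kind of measurement, assuming
the python-snappy bindings (the input file name is made up; results
will vary with the data, and revlog chunks are mostly smallish text
deltas):

import time
import zlib
import snappy  # python-snappy bindings

def bench(name, compress, decompress, data, rounds=100):
    # Time repeated round-trips and report the compression ratio.
    start = time.time()
    for _ in range(rounds):
        packed = compress(data)
    ctime = time.time() - start
    start = time.time()
    for _ in range(rounds):
        decompress(packed)
    dtime = time.time() - start
    print("%-6s ratio %.2f  compress %.3fs  decompress %.3fs"
          % (name, len(packed) / float(len(data)), ctime, dtime))

data = open("sample-chunk.bin", "rb").read()  # hypothetical input
bench("zlib", zlib.compress, zlib.decompress, data)
bench("snappy", snappy.compress, snappy.decompress, data)
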
> The effect on downstream performance in Mercurial is fairly large.
> 
> In a linux-2.6 tree, updating from null to tip normally takes me 28
> seconds. Using snappy instead of zlib (and a clone --pull to convert
> the repository), the time drops to 21.5 seconds, about 25% better. As
> expected, there's a price to pay: the .hg directory grows from 2.5GB
> to 3.5GB.
> 
> Reconstructing versions of the manifest is both especially important
> and particularly painful. I have a repo with linear history, in which
> the most recent version of the manifest consists of a chain of 40,000
> deltas before we hit a fulltext.
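To make that cost concrete, here is a toy sketch of delta-chain
reconstruction (the helper names are illustrative, not revlog's actual
API): a read walks back to the nearest stored fulltext, then
decompresses and applies every delta on the way forward. With a
40,000-entry chain, that is 40,000 decompress-and-patch steps for a
single manifest read.

def reconstruct(rl, rev):
    # Walk back from the requested revision to the nearest fulltext.
    chain = []
    while not rl.isfulltext(rev):           # hypothetical helper
        chain.append(rev)
        rev = rl.deltabase(rev)             # hypothetical helper
    text = rl.decompress(rl.rawchunk(rev))  # base fulltext
    # Replay every intervening delta, oldest first.
    for r in reversed(chain):
        delta = rl.decompress(rl.rawchunk(r))
        text = rl.applydelta(text, delta)   # hypothetical helper
    return text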

One possibility is to throw a disk cache at it.

If we cached a manifest revision that was:

a) uncompressed
b) sufficiently close to tip to avoid the bulk of decompression on the
working set
c) sufficiently far from tip to encompass the working set

...then we could probably achieve performance close to optimal.
Deciding when to create/update such a cache is an interesting problem,
though.

(This is reminiscent of the RCS/CVS technique of storing the head
revision as an uncompressed fulltext plus a chain of reverse deltas.)
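
A minimal sketch of what such a cache might look like (the names and
on-disk layout here are made up, not an existing Mercurial feature):
one uncompressed manifest fulltext kept under .hg/cache, keyed by its
node, which the read path can use as a delta base instead of walking
all the way back to the last stored fulltext:

import os

class manifestcache(object):
    def __init__(self, cachedir):
        self.path = os.path.join(cachedir, "manifestfulltext")

    def save(self, node, text):
        # 20-byte binary node first, then the uncompressed fulltext.
        with open(self.path, "wb") as f:
            f.write(node + text)

    def load(self):
        # Returns (node, fulltext), or (None, None) if nothing cached.
        if not os.path.exists(self.path):
            return None, None
        with open(self.path, "rb") as f:
            data = f.read()
        return data[:20], data[20:]

One plausible refresh policy: rewrite the cache whenever the delta
chain from the cached node to the revision being read grows past some
threshold, which keeps conditions (b) and (c) roughly true as tip
advances.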

-- 
Mathematics is the supreme nostalgia of our time.
