<div dir="ltr"><div>A few of us started an off-list conversation with Yann Collett (author of lz4 and zstd and general compression guru) about deltas, compression, and zstd. Most of that conversation is suitable for the public domain since it revolves around the open source Mercurial project. So this thread will be the continuation of that discussion.<br><br></div>My initial message (with minor modifications) to Yann follows.<br><br>---<br><br>My immediate goals for Mercurial are to allow <span class="gmail-il">zstd</span> to be used as a drop-in replacement for zlib wherever zlib is used, without changing /too/ much of the architecture.<br><br>For streaming compression (producing standalone "bundle" files and transferring data over the wire protocol), <span class="gmail-il">zstd</span> has been nothing short of amazing. <a href="https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-December/091797.html" target="_blank">https://www.mercurial-scm.org/<wbr>pipermail/mercurial-devel/<wbr>2016-December/091797.html</a> shows ~60% CPU usage while yielding better compression ratios. You can't ask for much better than that!<br><br>The other major use of zlib in Mercurial is revlogs. You can read more about them at <a href="https://www.mercurial-scm.org/repo/hg/help/internals.revlogs" target="_blank">https://www.mercurial-scm.org/<wbr>repo/hg/help/internals.revlogs</a><wbr>.
In short, revlogs are our storage structure for "blob" data. Each
logical entity (the set of commits (changelog), the set of files in a
commit (manifest), and each file/path tracked in history (filelog)) has
its own revlog (read: data within a specific revlog tends to look very
similar). Revlogs store data in delta chains. The first entry is stored
as fulltext. Then we compute and store deltas/diffs for subsequent
entries. Rebuilding the fulltext for a revision involves finding a base
fulltext then walking the delta chain to apply deltas until you rebuild
the fulltext. We attempt to zlib compress all deltas today, storing the
zlib compressed result if it is smaller or plaintext if it isn't.<br><br>In <a href="https://www.mercurial-scm.org/pipermail/mercurial-devel/2017-January/091928.html" target="_blank">https://www.mercurial-scm.org/<wbr>pipermail/mercurial-devel/<wbr>2017-January/091928.html</a>, I'm attempting to replace zlib in revlogs with <span class="gmail-il">zstd</span>. It does "just work." But it isn't without its caveats. Many of them are listed in <a href="https://www.mercurial-scm.org/pipermail/mercurial-devel/2017-January/091982.html" target="_blank">https://www.mercurial-scm.org/<wbr>pipermail/mercurial-devel/<wbr>2017-January/091982.html</a>.<br><div><br></div><div>Here are some of the areas where I could use your expertise:<br><br></div><div>*
Optimizing read performance. We'd really like changelog reads
(changeset/commit data) to be as fast as possible. We've already removed
delta chains there. I've also noticed that decompression with
dictionaries is substantially faster than without them. Is there further
tuning of compression parameters(read: not just integer comp level) to
optimize for read speed? (Note: changelog entries tend to be 100-300
bytes raw, with some going into the tens of kilobytes or larger range.)<br><br></div><div>* Everything dictionary compression. See <a href="https://www.mercurial-scm.org/pipermail/mercurial-devel/2017-January/091986.html" target="_blank">https://www.mercurial-scm.org/<wbr>pipermail/mercurial-devel/<wbr>2017-January/091986.html</a>.
It seems like we could get some major compression ratio and speed wins
with dictionaries. But that requires complications to build, store, and
possibly transfer dictionaries. At the very least, I'd like your
thoughts on whether Mercurial should ship pre-built dictionaries for
specific use cases so people can get *some* dictionary compression
benefits today without us having to implement complex dictionary
management. If so, how could you provide guidance on training?<br><br></div><div>* Could we emit/format revlog deltas differently or in a special way to optimize for <span class="gmail-il">zstd</span>?<br></div><div><br>* When should we compress versus store uncompressed? Right now, our
heuristic for storing a compressed chunk in a revlog is to store
compressed if it is smaller than the raw input. This feels sub-optimal
to me because if your compressed data is only 1 byte smaller than raw,
it feels better to just sacrifice that 1 byte in the name of
performance. While I'm certainly capable of measuring this and making
this a configurable parameter, I was wondering if you could provide
insights for a good heuristic for determining a) should we compress
string X b) should we store the compressed result.<br><br></div><div>* Frame header overhead. In some cases, the longer <span class="gmail-il">zstd</span>
frame header is yielding larger storage sizes than zlib. This is mostly
for very small inputs (<150 bytes). We're storing the content size
in the frame header (so we can do one-shot decompression without over
heap allocating). I don't want to reinvent the wheel, but it is
certainly tempting to construct our own frame header for special cases
(namely very small inputs).<br></div><div><br></div><div>Those are shorter term questions.<br></div><div><br></div><div>Down
the road, we're thinking about more radical changes to the storage
model. We don't want 1 file/revlog per logical tracked entity (due to
inode scaling issues among other things). We want to put multiple
logical entities in the same container. There's a ton of possibilities
for radical compression and encoding techniques here.<br><br></div></div>