<div dir="ltr"><div>A few of us started an off-list conversation with Yann Collett (author of lz4 and zstd and general compression guru) about deltas, compression, and zstd. Most of that conversation is suitable for the public domain since it revolves around the open source Mercurial project. So this thread will be the continuation of that discussion.<br><br></div>My initial message (with minor modifications) to Yann follows.<br><br>---<br><br>My immediate goals for Mercurial are to allow <span class="gmail-il">zstd</span> to be used as a drop-in replacement for zlib wherever zlib is used, without changing /too/ much of the architecture.<br><br>For streaming compression (producing standalone "bundle" files and transferring data over the wire protocol), <span class="gmail-il">zstd</span> has been nothing short of amazing. <a href="https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-December/091797.html" target="_blank">https://www.mercurial-scm.org/<wbr>pipermail/mercurial-devel/<wbr>2016-December/091797.html</a> shows ~60% CPU usage while yielding better compression ratios. You can't ask for much better than that!<br><br>The other major use of zlib in Mercurial is revlogs. You can read more about them at <a href="https://www.mercurial-scm.org/repo/hg/help/internals.revlogs" target="_blank">https://www.mercurial-scm.org/<wbr>repo/hg/help/internals.revlogs</a><wbr>.

 In short, revlogs are our storage structure for "blob" data. Each 

logical entity (the set of commits (changelog), the set of files in a 

commit (manifest), and each file/path tracked in history (filelog)) has 

its own revlog (read: data within a specific revlog tends to look very 

similar). Revlogs store data in delta chains. The first entry is stored 

as fulltext. Then we compute and store deltas/diffs for subsequent 

entries. Rebuilding the fulltext for a revision involves finding a base 

fulltext then walking the delta chain to apply deltas until you rebuild 

the fulltext. We attempt to zlib compress all deltas today, storing the 

zlib compressed result if it is smaller or plaintext if it isn't.<br><br>In <a href="https://www.mercurial-scm.org/pipermail/mercurial-devel/2017-January/091928.html" target="_blank">https://www.mercurial-scm.org/<wbr>pipermail/mercurial-devel/<wbr>2017-January/091928.html</a>, I'm attempting to replace zlib in revlogs with <span class="gmail-il">zstd</span>. It does "just work." But it isn't without its caveats. Many of them are listed in <a href="https://www.mercurial-scm.org/pipermail/mercurial-devel/2017-January/091982.html" target="_blank">https://www.mercurial-scm.org/<wbr>pipermail/mercurial-devel/<wbr>2017-January/091982.html</a>.<br><div><br></div><div>Here are some of the areas where I could use your expertise:<br><br></div><div>*

 Optimizing read performance. We'd really like changelog reads 

(changeset/commit data) to be as fast as possible. We've already removed

 delta chains there. I've also noticed that decompression with 

dictionaries is substantially faster than without them. Is there further

 tuning of compression parameters(read: not just integer comp level) to 

optimize for read speed? (Note: changelog entries tend to be 100-300 

bytes raw, with some going into the tens of kilobytes or larger range.)<br><br></div><div>* Everything dictionary compression. See <a href="https://www.mercurial-scm.org/pipermail/mercurial-devel/2017-January/091986.html" target="_blank">https://www.mercurial-scm.org/<wbr>pipermail/mercurial-devel/<wbr>2017-January/091986.html</a>.

 It seems like we could get some major compression ratio and speed wins 

with dictionaries. But that requires complications to build, store, and 

possibly transfer dictionaries. At the very least, I'd like your 

thoughts on whether Mercurial should ship pre-built dictionaries for 

specific use cases so people can get *some* dictionary compression 

benefits today without us having to implement complex dictionary 

management. If so, how could you provide guidance on training?<br><br></div><div>* Could we emit/format revlog deltas differently or in a special way to optimize for <span class="gmail-il">zstd</span>?<br></div><div><br>* When should we compress versus store uncompressed? Right now, our 

heuristic for storing a compressed chunk in a revlog is to store 

compressed if it is smaller than the raw input. This feels sub-optimal 

to me because if your compressed data is only 1 byte smaller than raw, 

it feels better to just sacrifice that 1 byte in the name of 

performance. While I'm certainly capable of measuring this and making 

this a configurable parameter, I was wondering if you could provide 

insights for a good heuristic for determining a) should we compress 

string X b) should we store the compressed result.<br><br></div><div>* Frame header overhead. In some cases, the longer <span class="gmail-il">zstd</span>

 frame header is yielding larger storage sizes than zlib. This is mostly

 for very small inputs (<150 bytes). We're storing the content size 

in the frame header (so we can do one-shot decompression without over 

heap allocating). I don't want to reinvent the wheel, but it is 

certainly tempting to construct our own frame header for special cases 

(namely very small inputs).<br></div><div><br></div><div>Those are shorter term questions.<br></div><div><br></div><div>Down

 the road, we're thinking about more radical changes to the storage 

model. We don't want 1 file/revlog per logical tracked entity (due to 

inode scaling issues among other things). We want to put multiple 

logical entities in the same container. There's a ton of possibilities 

for radical compression and encoding techniques here.<br><br></div></div>