Deltas, compression, and zstd

Tue Mar 7 18:36:04 UTC 2017

On Tue, Jan 10, 2017 at 10:34 AM, Gregory Szorc <gregory.szorc at gmail.com>
wrote:

> A few of us started an off-list conversation with Yann Collett (author of
> lz4 and zstd and general compression guru) about deltas, compression, and
> zstd. Most of that conversation is suitable for the public domain since it
> revolves around the open source Mercurial project. So this thread will be
> the continuation of that discussion.
>
> My initial message (with minor modifications) to Yann follows.
>
> ---
>
> My immediate goals for Mercurial are to allow zstd to be used as a
> drop-in replacement for zlib wherever zlib is used, without changing /too/
> much of the architecture.
>
> For streaming compression (producing standalone "bundle" files and
> transferring data over the wire protocol), zstd has been nothing short of
> amazing. https://www.mercurial-scm.org/pipermail/mercurial-devel/2016
> -December/091797.html shows ~60% CPU usage while yielding better
> compression ratios. You can't ask for much better than that!
>
> The other major use of zlib in Mercurial is revlogs. You can read more
> about them at https://www.mercurial-scm.org/repo/hg/help/internals.revlogs.
> In short, revlogs are our storage structure for "blob" data. Each logical
> entity (the set of commits (changelog), the set of files in a commit
> (manifest), and each file/path tracked in history (filelog)) has its own
> revlog (read: data within a specific revlog tends to look very similar).
> Revlogs store data in delta chains. The first entry is stored as fulltext.
> Then we compute and store deltas/diffs for subsequent entries. Rebuilding
> the fulltext for a revision involves finding a base fulltext then walking
> the delta chain to apply deltas until you rebuild the fulltext. We attempt
> to zlib compress all deltas today, storing the zlib compressed result if it
> is smaller or plaintext if it isn't.
>
> In https://www.mercurial-scm.org/pipermail/mercurial-devel/2017
> -January/091928.html, I'm attempting to replace zlib in revlogs with zstd.
> It does "just work." But it isn't without its caveats. Many of them are
> listed in https://www.mercurial-scm.org/pipermail/mercurial-devel/2017
> -January/091982.html.
>
> Here are some of the areas where I could use your expertise:
>
> * Optimizing read performance. We'd really like changelog reads
> (changeset/commit data) to be as fast as possible. We've already removed
> delta chains there. I've also noticed that decompression with dictionaries
> is substantially faster than without them. Is there further tuning of
> compression parameters(read: not just integer comp level) to optimize for
> read speed? (Note: changelog entries tend to be 100-300 bytes raw, with
> some going into the tens of kilobytes or larger range.)
>
> * Everything dictionary compression. See https://www.mercurial-scm.org/
> pipermail/mercurial-devel/2017-January/091986.html. It seems like we
> could get some major compression ratio and speed wins with dictionaries.
> But that requires complications to build, store, and possibly transfer
> dictionaries. At the very least, I'd like your thoughts on whether
> Mercurial should ship pre-built dictionaries for specific use cases so
> people can get *some* dictionary compression benefits today without us
> having to implement complex dictionary management. If so, how could you
> provide guidance on training?
>
> * Could we emit/format revlog deltas differently or in a special way to
> optimize for zstd?
>
> * When should we compress versus store uncompressed? Right now, our
> heuristic for storing a compressed chunk in a revlog is to store compressed
> if it is smaller than the raw input. This feels sub-optimal to me because
> if your compressed data is only 1 byte smaller than raw, it feels better to
> just sacrifice that 1 byte in the name of performance. While I'm certainly
> capable of measuring this and making this a configurable parameter, I was
> wondering if you could provide insights for a good heuristic for
> determining a) should we compress string X b) should we store the
> compressed result.
>
> * Frame header overhead. In some cases, the longer zstd frame header is
> yielding larger storage sizes than zlib. This is mostly for very small
> inputs (<150 bytes). We're storing the content size in the frame header (so
> we can do one-shot decompression without over heap allocating). I don't
> want to reinvent the wheel, but it is certainly tempting to construct our
> own frame header for special cases (namely very small inputs).
>
> Those are shorter term questions.
>
> Down the road, we're thinking about more radical changes to the storage
> model. We don't want 1 file/revlog per logical tracked entity (due to inode
> scaling issues among other things). We want to put multiple logical
> entities in the same container. There's a ton of possibilities for radical
> compression and encoding techniques here.
>
>
I just published a blog post about Zstandard:
http://gregoryszorc.com/blog/2017/03/07/better-compression-with-zstandard/

I used Mercurial data for a lot of the benchmarks. The section on
dictionary compression and the chart within (
http://gregoryszorc.com/images/decompression-dictionary-changeset.png) are
likely significant to many readers of this list. Not in this post are
numbers using python-zstandard's multi-threaded decompression API. With
that, I'm able to decompress changelog data at >2GB/s as measured on the
output end by using 4 threads. That drops changelog revision decompression
from ~1500ms to ~110ms on the mozilla-unified repo. I plan to connect with
people at the sprint about my proposed changes to the revlog API (and
possibly a new revlog format or storage format) to facilitate faster
(de)compression.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.mercurial-scm.org/pipermail/mercurial-devel/attachments/20170307/31b69dbd/attachment.html>