Deltas, compression, and zstd

Tue Jan 10 18:34:47 UTC 2017

A few of us started an off-list conversation with Yann Collett (author of
lz4 and zstd and general compression guru) about deltas, compression, and
zstd. Most of that conversation is suitable for the public domain since it
revolves around the open source Mercurial project. So this thread will be
the continuation of that discussion.

My initial message (with minor modifications) to Yann follows.

---

My immediate goals for Mercurial are to allow zstd to be used as a drop-in
replacement for zlib wherever zlib is used, without changing /too/ much of
the architecture.

For streaming compression (producing standalone "bundle" files and
transferring data over the wire protocol), zstd has been nothing short of
amazing. https://www.mercurial-scm.org/pipermail/mercurial-devel/
2016-December/091797.html shows ~60% CPU usage while yielding better
compression ratios. You can't ask for much better than that!

The other major use of zlib in Mercurial is revlogs. You can read more
about them at https://www.mercurial-scm.org/repo/hg/help/internals.revlogs.
In short, revlogs are our storage structure for "blob" data. Each logical
entity (the set of commits (changelog), the set of files in a commit
(manifest), and each file/path tracked in history (filelog)) has its own
revlog (read: data within a specific revlog tends to look very similar).
Revlogs store data in delta chains. The first entry is stored as fulltext.
Then we compute and store deltas/diffs for subsequent entries. Rebuilding
the fulltext for a revision involves finding a base fulltext then walking
the delta chain to apply deltas until you rebuild the fulltext. We attempt
to zlib compress all deltas today, storing the zlib compressed result if it
is smaller or plaintext if it isn't.

In https://www.mercurial-scm.org/pipermail/mercurial-devel/
2017-January/091928.html, I'm attempting to replace zlib in revlogs with
zstd. It does "just work." But it isn't without its caveats. Many of them
are listed in https://www.mercurial-scm.org/pipermail/mercurial-devel/
2017-January/091982.html.

Here are some of the areas where I could use your expertise:

* Optimizing read performance. We'd really like changelog reads
(changeset/commit data) to be as fast as possible. We've already removed
delta chains there. I've also noticed that decompression with dictionaries
is substantially faster than without them. Is there further tuning of
compression parameters(read: not just integer comp level) to optimize for
read speed? (Note: changelog entries tend to be 100-300 bytes raw, with
some going into the tens of kilobytes or larger range.)

* Everything dictionary compression. See https://www.mercurial-scm.org/
pipermail/mercurial-devel/2017-January/091986.html. It seems like we could
get some major compression ratio and speed wins with dictionaries. But that
requires complications to build, store, and possibly transfer dictionaries.
At the very least, I'd like your thoughts on whether Mercurial should ship
pre-built dictionaries for specific use cases so people can get *some*
dictionary compression benefits today without us having to implement
complex dictionary management. If so, how could you provide guidance on
training?

* Could we emit/format revlog deltas differently or in a special way to
optimize for zstd?

* When should we compress versus store uncompressed? Right now, our
heuristic for storing a compressed chunk in a revlog is to store compressed
if it is smaller than the raw input. This feels sub-optimal to me because
if your compressed data is only 1 byte smaller than raw, it feels better to
just sacrifice that 1 byte in the name of performance. While I'm certainly
capable of measuring this and making this a configurable parameter, I was
wondering if you could provide insights for a good heuristic for
determining a) should we compress string X b) should we store the
compressed result.

* Frame header overhead. In some cases, the longer zstd frame header is
yielding larger storage sizes than zlib. This is mostly for very small
inputs (<150 bytes). We're storing the content size in the frame header (so
we can do one-shot decompression without over heap allocating). I don't
want to reinvent the wheel, but it is certainly tempting to construct our
own frame header for special cases (namely very small inputs).

Those are shorter term questions.

Down the road, we're thinking about more radical changes to the storage
model. We don't want 1 file/revlog per logical tracked entity (due to inode
scaling issues among other things). We want to put multiple logical
entities in the same container. There's a ton of possibilities for radical
compression and encoding techniques here.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.mercurial-scm.org/pipermail/mercurial-devel/attachments/20170110/5507410f/attachment.html>