Deltas, compression, and zstd

Yann Collet cyan at fb.com
Wed Jan 11 09:19:04 EST 2017


You made many excellent points, Gregory, backed by great investigations.
Let me try to answer some of your questions:


>  Optimizing read performance.
---------------------------------------------

In my tests, decompression time for small assets (<= 300 bytes) tends to be dominated by Huffman table construction time.
Zstandard uses Huff0 for this task.
Huff0 tries to deal with small inputs by dynamically downsizing its tables, in a bid to reduce construction time.
This works pretty well, but in spite of it, table construction remains a significant fraction of total decompression time.

So, a next step could be to eliminate the Huffman compression stage entirely.
It’s fairly simple to do, but the knob doesn’t exist, so tests require a source code change.

Example : Sandcastle test set, ~12500 files of average size ~250 bytes
Zstd -3, normal : Decompression Speed : 115 MB/s ; Compression Ratio : 1.39
Zstd -3, no huffman : Decompression Speed : 1180 MB/s ; Compression Ratio : 1.07
(10x faster, but also almost no compression left…)

Unfortunately, it also completely destroys compression ratio, since on small assets, Huffman is generally the only remaining useful compression stage.

This experience explains why decompression with dictionaries is faster:
dictionaries also include “default statistics”, from which entropy tables, Huffman tables included, can be built in advance.
Decompression then just uses them (when instructed to).
Skipping the build stage is a major speed boost for small assets.

Example: same Sandcastle test set, using a 45 KB dictionary
Zstd -3, dictionary : Decompression Speed : 660 MB/s ; Compression Ratio : 5.46
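
To illustrate the pattern, here’s a rough C sketch (the wrapper functions are mine, only the ZSTD_* calls belong to the zstd API): the dictionary is digested once with ZSTD_createDDict(), which builds the entropy tables up front, and every subsequent read re-uses it through ZSTD_decompress_usingDDict().

    #include <stddef.h>
    #include <zstd.h>   /* ZSTD_createDDict, ZSTD_decompress_usingDDict */

    /* Build the digested dictionary once; its entropy tables (Huffman included)
     * are constructed here instead of on every decompression call. */
    ZSTD_DDict* make_ddict(const void* dictBuffer, size_t dictSize)
    {
        return ZSTD_createDDict(dictBuffer, dictSize);
    }

    /* Decompress one small blob, reusing the prepared DDict and a long-lived DCtx.
     * Returns the decompressed size, or 0 on error (simplified for this sketch). */
    size_t decompress_small(ZSTD_DCtx* dctx, const ZSTD_DDict* ddict,
                            void* dst, size_t dstCapacity,
                            const void* src, size_t srcSize)
    {
        size_t const r = ZSTD_decompress_usingDDict(dctx, dst, dstCapacity,
                                                    src, srcSize, ddict);
        return ZSTD_isError(r) ? 0 : r;
    }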

Without dictionaries and their pre-computed tables, it’s quite difficult to come up with a new way to improve speed.
Dynamic downsizing of Huffman tables is already in place, and it has quite an impact.
Maybe it could be even more aggressive, but that’s unclear (it would introduce new side-effects and efficiency losses on the compression side).

I also experimented with a “delta table constructor”, which would only update the parts of a table that have changed.
It would be useful when 2 consecutive tables happen to be “similar”.
But unfortunately, experiments showed negligible gains.
I attributed this to the fact that table construction is already extremely fast (at 115 MB/s, that’s almost ~500,000 table constructions per second),
and the delta check is itself a cost, which ends up costing almost as much as building the table outright.
It could also be that I fumbled the experiment, or that the delta check could be made more efficient.

Anyway, even if there were some potential speed gains left, this would not improve compression ratio.
On the other hand, with dictionaries, gains are quite perceptible for both speed and compression ratio. So it feels like a compelling direction.


>  I'd like your thoughts on whether Mercurial should ship pre-built dictionaries for specific use cases so people can get *some* dictionary compression benefits today without us having to implement complex dictionary management. If so, how could you provide guidance on training?
---------------------------------------------------------------------------------------------------------------------------------------

That’s a difficult one.
I’m not sure. It will need some testing.

A dictionary has a strong impact precisely because the pieces of data to compress share similarities with each other.
Such a definition forbids the existence of a “dictionary of everything”.
The more specialized it is, the more efficient it is.

For source control, I had in mind one dictionary per repository
(for larger projects, it could even be divided into a dictionary per file-type).

A dictionary dedicated for a repository makes sense from a compression perspective, because there are most likely many common elements,
such as a main programming language, a naming scheme, coding style, linter rules, core libraries, etc.

This does require the repository to exist first and to accumulate content, though.
At some point, it becomes large enough to deserve a kind of “optimization pass”.
This would be an offline process, typically during some dead time.
Use it to create a dictionary from existing repository content, then recompress everything with said dictionary.
The dictionary could be periodically refreshed using the same process.
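
As a rough sketch of the training side of such an offline pass (the wrapper and its error convention are mine, assuming the samples were already gathered into one contiguous buffer):

    #include <stddef.h>
    #include <zdict.h>   /* ZDICT_trainFromBuffer, ZDICT_isError */

    /* Train a dictionary from existing repository content.
     * `samples` is the concatenation of individual sample contents (for example
     * revision fulltexts), and sampleSizes[i] is the size of each sample within
     * that concatenation. The resulting dictionary is at most dictCapacity bytes.
     * Returns the dictionary size written into dictBuffer, or 0 on error. */
    size_t build_repo_dictionary(void* dictBuffer, size_t dictCapacity,
                                 const void* samples, const size_t* sampleSizes,
                                 unsigned nbSamples)
    {
        size_t const dictSize = ZDICT_trainFromBuffer(dictBuffer, dictCapacity,
                                                      samples, sampleSizes, nbSamples);
        return ZDICT_isError(dictSize) ? 0 : dictSize;
    }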

That sounds possible on paper, but I can also imagine corner cases introducing side-effects to tame.
At a minimum, it introduces a new “optimization stage” which current source control solutions may not be accustomed to.

So indeed, having some kind of “pre-built” dictionary would potentially eliminate such a step, which sounds nice.

But the pre-built dictionary would need to be efficient on a wider range of possible repository contents.
Finding common elements between different repositories is a much riskier bet.
Maybe there is a possibility to have a dictionary “adapted for a dedicated programming language” (and associated standard lib).
Sure, some keywords and expressions will be present.
But everything else is likely to be different, so the number of shared elements could end up being small.

That’s pure hand-waving on my side, and I could be wrong.
The only way to know would be to test.

To make such a test, one would need a fairly large amount of source data to build the dictionary from.
The risk is to have common elements in samples which are in fact “locally common”, but not beyond the sample set.
For example, a function name might appear often because many samples happen to use the same library, even though that library is not commonly used outside of the sample set.
The best way to limit this risk is to use a lot of samples from many different sources,
then check the resulting dictionary’s efficiency on held-out samples, coming from yet other sources.
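
For what it’s worth, such a check could be as simple as the following sketch (the helper is mine): compress each held-out sample with and without the candidate dictionary, and compare the aggregated sizes.

    #include <stddef.h>
    #include <zstd.h>

    /* Compress one held-out sample (not used for training) with and without a
     * candidate dictionary, and report both compressed sizes. Aggregated over
     * many such samples, this gives a rough measure of how well the dictionary
     * generalizes. `dst` must hold at least ZSTD_compressBound(srcSize) bytes. */
    void measure_dict_benefit(ZSTD_CCtx* cctx, int level,
                              const void* dict, size_t dictSize,
                              const void* src, size_t srcSize,
                              void* dst, size_t dstCapacity,
                              size_t* sizeNoDict, size_t* sizeWithDict)
    {
        *sizeNoDict   = ZSTD_compressCCtx(cctx, dst, dstCapacity, src, srcSize, level);
        *sizeWithDict = ZSTD_compress_usingDict(cctx, dst, dstCapacity, src, srcSize,
                                                dict, dictSize, level);
        /* the caller should check ZSTD_isError() on both results before comparing */
    }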

The result will probably be underwhelming compared to a repository-dedicated dictionary,
but it could still provide some tangible gains compared to no dictionary at all.
That’s precisely what the test is meant to check.


>  Could we emit/format revlog deltas differently or in a special way to optimize for zstd?
----------------------------------------------------------------------------------------------------------------------------

We could have a discussion on how the “diff” stage works, and see if it could be simplified.
Maybe there’s an opportunity to save some CPU cost on one side and transfer it to where it can be done more efficiently.


>  When should we compress versus store uncompressed?
-----------------------------------------------------------------------------------

There is no “single rule”. It depends on the user’s objective.

Currently, Zstandard also has an equivalent mechanism, where it decides to keep a block “uncompressed” if compression does not provide a minimum saving.
See: https://github.com/facebook/zstd/blob/dev/lib/compress/zstd_compress.c#L456
This rule is hardwired, and it’s debatable whether it’s a good fit for all needs. It will therefore probably be made more configurable in the future.

You’ll have the same question on your side.
What’s a good threshold for storing data “compressed”, hence paying a decompression speed cost during reads?
Probably more than 1 byte of saving. It’s up to the application to decide its precise threshold.
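
As a sketch of what such an application-side policy could look like (the threshold and wrapper are illustrative, not a zstd recommendation):

    #include <stddef.h>
    #include <zstd.h>

    /* Keep the compressed form only if it saves at least `minSaving` bytes;
     * otherwise tell the caller to store the data uncompressed.
     * Returns the compressed size to store, or 0 meaning "store raw". */
    size_t compress_if_worth_it(void* dst, size_t dstCapacity,
                                const void* src, size_t srcSize,
                                int level, size_t minSaving)
    {
        size_t const cSize = ZSTD_compress(dst, dstCapacity, src, srcSize, level);
        if (ZSTD_isError(cSize)) return 0;            /* store raw on error */
        if (cSize + minSaving > srcSize) return 0;    /* not enough saving  */
        return cSize;
    }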


>  Frame header overhead
--------------------------------------

Indeed, Frame headers can represent a significant cost when data to compress is really small.
At a bare minimum, the cost is 6+3 = 9 bytes.
In general, it’s higher (typically 5+2+3 = 10 bytes for small data), and it depends on options (a checksum adds 4 bytes, ending after a flush costs 3 bytes, etc.).
It looks small, but when compressed data is < 50 bytes, that’s a significant budget.

When Frame headers start to be a concern, it’s possible to use the underlying “block API”:
https://github.com/facebook/zstd/blob/dev/lib/zstd.h#L655

The block API does not write any metadata. It’s a pure Zstandard compressed data block, as defined in the specification:
https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#the-structure-of-data_block

Using this API hands the user a lot more responsibilities, which is a risk that may or may not be acceptable.
They are listed in the code comments: https://github.com/facebook/zstd/blob/dev/lib/zstd.h#L655
(no version control, no checksum, no support for uncompressed data, no block larger than 128 KB, no source size embedded, etc.)
This means the user will have to design their own metadata scheme.
The frame header is replaced by some other kind of header. It’s not free, but it can be made more compact by relying on limitations “implied” by the environment.

This API was published for closed datasets, such as database engines, where there is no need for external interaction (everything is filtered; the internal representation doesn’t matter).
So, for Mercurial, I’m not sure if it’s a good choice.
But it’s good to know that the option exists.
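
For reference, here’s a minimal sketch of what using the block API looks like (the wrapper and its “return 0 means store raw” convention are mine; keeping each input within the 128 KB block limit is left to the caller):

    #define ZSTD_STATIC_LINKING_ONLY   /* the block API lives in the experimental section */
    #include <stddef.h>
    #include <zstd.h>

    /* Compress one chunk as a raw zstd block: no frame header is written, so the
     * caller must record its own metadata (compressed size, original size, and a
     * "stored raw" flag, since ZSTD_compressBlock() returns 0 when the data is
     * not compressible enough and produces nothing).
     * Returns the compressed size, or 0 meaning "store the chunk uncompressed". */
    size_t compress_raw_block(ZSTD_CCtx* cctx, int level,
                              void* dst, size_t dstCapacity,
                              const void* src, size_t srcSize)
    {
        size_t r = ZSTD_compressBegin(cctx, level);   /* (re)initialize the context */
        if (ZSTD_isError(r)) return 0;
        r = ZSTD_compressBlock(cctx, dst, dstCapacity, src, srcSize);
        if (ZSTD_isError(r) || r == 0) return 0;      /* incompressible: store raw */
        return r;
    }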


>  Reusing zstd compression and decompression "contexts" can make a significant difference to performance. Having a reusable "compressor" object that allows "context" reuse should increase performance for zstd.
----------------------------------------------------------------------------------------------------------------------------------------

That’s a very important observation.
Indeed, keeping a working context “alive” and re-using it across successive operations can make a significant runtime difference when compressing or decompressing multiple small data blobs.
The main idea is that it re-uses existing resources and saves several initialization steps, which can end up costing a significant fraction of the total operation time (for small data).
Re-using a context can be difficult though, because it must be kept around between operations, and sometimes existing code does not accommodate such a concept.
We have observed this issue in some git experiments for example.
Re-starting from scratch at each operation reduces the speed benefits Zstandard could provide compared to zlib on small data.
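
As a sketch of the pattern (destination handling simplified): create the context once, then feed it every blob.

    #include <stddef.h>
    #include <zstd.h>

    /* Compress many small blobs with a single long-lived compression context.
     * Context allocation and initialization is paid once instead of once per
     * blob, which is where much of the small-data overhead goes. */
    void compress_many(const void* const* srcs, const size_t* srcSizes, size_t n,
                       void* dst, size_t dstCapacity, int level)
    {
        ZSTD_CCtx* const cctx = ZSTD_createCCtx();    /* created once, reused below */
        for (size_t i = 0; i < n; i++) {
            size_t const cSize = ZSTD_compressCCtx(cctx, dst, dstCapacity,
                                                   srcs[i], srcSizes[i], level);
            (void)cSize;   /* a real caller would check ZSTD_isError() and store the output */
        }
        ZSTD_freeCCtx(cctx);
    }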


>  I'm considering per-revlog settings for the compressors.
--------------------------------------------------------------------------------

It totally makes sense.
Different data types react differently to the heuristics selected in compression level tables.
Sometimes a higher level is much more efficient, sometimes it’s barely better at all.


>  no matter how many bytes I tell zstd to make the dictionary, it always comes back with something around 100k
---------------------------------------------------------------------------------------------------------------------------------------------------------------

This is the default dictionary size, but a different size can be selected.
On the command line: using `--maxdict #`.
Using the API: have a look at ZDICT_trainFromBuffer_advanced():
https://github.com/facebook/zstd/blob/master/lib/dictBuilder/zdict.h#L82


>  When dictionaries are used, zstd level=1 compresses the changelog considerably faster than level=3. ~160 MB/s vs ~27 MB/s.
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Currently, a big part of dictionary compression cost comes from duplicating the prepared search tables.
Consequently, on small files, total compression time is dominated by the size of these search tables.
This favours levels 1 and 2, which use smaller tables.

In the medium term, we plan to introduce some new dictionary compression modes,
which should soften the blow when increasing compression level.
This is a significant change though, with many code modifications, so it will cost time.
Maybe mid-year…


>  would you mind taking this conversion somewhere public?
-------------------------------------------------------------------------------------

I totally agree. This would be beneficial, giving visibility to the project,
and gathering third-party feedback, which can prove valuable in such a design phase.




From: Gregory Szorc <gregory.szorc at gmail.com>
Date: Tuesday, 10 January 2017 at 19:34
To: mercurial-devel <mercurial-devel at mercurial-scm.org>, Yann Collet <cyan at fb.com>, Jeremy Fitzhardinge <jsgf at fb.com>
Subject: Deltas, compression, and zstd

A few of us started an off-list conversation with Yann Collet (author of lz4 and zstd and general compression guru) about deltas, compression, and zstd. Most of that conversation is suitable for the public domain since it revolves around the open source Mercurial project. So this thread will be the continuation of that discussion.
My initial message (with minor modifications) to Yann follows.

---

My immediate goals for Mercurial are to allow zstd to be used as a drop-in replacement for zlib wherever zlib is used, without changing /too/ much of the architecture.

For streaming compression (producing standalone "bundle" files and transferring data over the wire protocol), zstd has been nothing short of amazing. https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-December/091797.html shows ~60% CPU usage while yielding better compression ratios. You can't ask for much better than that!

The other major use of zlib in Mercurial is revlogs. You can read more about them at https://www.mercurial-scm.org/repo/hg/help/internals.revlogs. In short, revlogs are our storage structure for "blob" data. Each logical entity (the set of commits (changelog), the set of files in a commit (manifest), and each file/path tracked in history (filelog)) has its own revlog (read: data within a specific revlog tends to look very similar). Revlogs store data in delta chains. The first entry is stored as fulltext. Then we compute and store deltas/diffs for subsequent entries. Rebuilding the fulltext for a revision involves finding a base fulltext then walking the delta chain to apply deltas until you rebuild the fulltext. We attempt to zlib compress all deltas today, storing the zlib compressed result if it is smaller or plaintext if it isn't.

In https://www.mercurial-scm.org/pipermail/mercurial-devel/2017-January/091928.html, I'm attempting to replace zlib in revlogs with zstd. It does "just work." But it isn't without its caveats. Many of them are listed in https://www.mercurial-scm.org/pipermail/mercurial-devel/2017-January/091982.html.

Here are some of the areas where I could use your expertise:
* Optimizing read performance. We'd really like changelog reads (changeset/commit data) to be as fast as possible. We've already removed delta chains there. I've also noticed that decompression with dictionaries is substantially faster than without them. Is there further tuning of compression parameters(read: not just integer comp level) to optimize for read speed? (Note: changelog entries tend to be 100-300 bytes raw, with some going into the tens of kilobytes or larger range.)
* Everything dictionary compression. See https://www.mercurial-scm.org/pipermail/mercurial-devel/2017-January/091986.html. It seems like we could get some major compression ratio and speed wins with dictionaries. But that requires complications to build, store, and possibly transfer dictionaries. At the very least, I'd like your thoughts on whether Mercurial should ship pre-built dictionaries for specific use cases so people can get *some* dictionary compression benefits today without us having to implement complex dictionary management. If so, how could you provide guidance on training?
* Could we emit/format revlog deltas differently or in a special way to optimize for zstd?

* When should we compress versus store uncompressed? Right now, our heuristic for storing a compressed chunk in a revlog is to store compressed if it is smaller than the raw input. This feels sub-optimal to me because if your compressed data is only 1 byte smaller than raw, it feels better to just sacrifice that 1 byte in the name of performance. While I'm certainly capable of measuring this and making this a configurable parameter, I was wondering if you could provide insights for a good heuristic for determining a) should we compress string X b) should we store the compressed result.
* Frame header overhead. In some cases, the longer zstd frame header is yielding larger storage sizes than zlib. This is mostly for very small inputs (<150 bytes). We're storing the content size in the frame header (so we can do one-shot decompression without over heap allocating). I don't want to reinvent the wheel, but it is certainly tempting to construct our own frame header for special cases (namely very small inputs).

Those are shorter term questions.

Down the road, we're thinking about more radical changes to the storage model. We don't want 1 file/revlog per logical tracked entity (due to inode scaling issues among other things). We want to put multiple logical entities in the same container. There's a ton of possibilities for radical compression and encoding techniques here.