Deltas, compression, and zstd

Gregory Szorc gregory.szorc at gmail.com
Mon Jan 16 17:07:36 EST 2017


On Sun, Jan 15, 2017 at 10:22 AM, Yann Collet <cyan at fb.com> wrote:

> From Gregory Szorc:
>
> > The wording in this paragraph is confusing. Stated differently: if we
> > have immutable files (e.g. "pack files"), then there are no stream append
> > operations and all the complications around solving that problem go away.
> > So "pack files" allow us to more easily employ the feed-in-fulltexts
> > strategy I outlined.
> >
> > Also, at various points I used "zstd frame" when I meant to say "zstd
> > block." A zstd "frame" is a collection of "blocks" plus a header and
> > epilogue giving instructions on how to decode the block data. Each "frame"
> > is completely independent. This means that the fulltext streaming mechanism
> > I described operates at the zstd "block" level and not the "frame" level.
> >
> > It is unclear whether this means the zstd streaming C APIs are appropriate
> > for this use case or whether we need to leverage the lower-level "block
> > functions" defined by the C API. I /think/ the streaming C APIs operate at
> > block boundaries, meaning that if you call flush() during compression you
> > get an end of block and that if you feed a whole block into a decompressor
> > that output will flush automatically. What I'm not sure of is whether using
> > the C APIs in this manner is supported or condoned. Strictly speaking, we
> > start an operation on a frame then abort midway through. There are some
> > consequences to that, notably not verifying a checksum of the decompressed
> > data (which is in an epilogue at the end of a frame). I'm sure Yann can
> > clarify.
>
> For this usage, another potential suggestion could be to use dictionary
> compression in "content-only" mode.
>
> Here is how it could work:
>
> Presuming you have a "first original file" and you want to send a second
> file as a delta to the first: compress the second file, using the first
> file as the "dictionary" (use ZSTD_compress_usingDict() for this scenario).
>
> This is not a properly formatted zstd dictionary, so it triggers
> "content-only mode", which means it has no statistics, hence no way to
> precompute anything. But it can use the dictionary content for
> back-references, so common parts should be referenced as LZ copy operations.
>
> This process can be chained, so that a file3 could use file2 as its
> dictionary, and so on.
>
> Advantage: each file is compressed into a complete frame, so there is
> nothing "special" to do; the API is already there. Specifically, there is
> no "compression state" to keep in memory between files, and no "unfinished"
> frame is stored.
>
> There are some disadvantages I can think of, but I believe they can be
> worked out:
>
> - In decompression, referencing a "content-only" dictionary is an (almost)
> zero-cost operation. For compression, it's more expensive: the dictionary
> is analyzed in order to properly reference its content for subsequent
> search operations. The speed of this operation depends on both the size of
> the dictionary and the compression level, so it could prove slow (but I
> have no figures to show). If it is slow, an optimization pass for
> content-referencing could be performed. It could also be Mercurial-specific;
> it would not affect the format.
>
> - If the previous file is used as the dictionary, then it's not possible
> to use another "precomputed dictionary" at the same time. It's one or the
> other. Now, this is the current situation, but it could evolve. At the very
> least, some Mercurial-specific extension could solve this issue.

That "content-only" mode is a nifty, under-documented feature!
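
For the archives, here's a minimal sketch of the chaining scheme Yann
describes, expressed with the python-zstandard bindings rather than the C
API. Building a ZstdCompressionDict from raw content (as opposed to a
trained dictionary) is what triggers content-only mode. `fulltexts` is a
hypothetical list of revision fulltexts, oldest first:

    import zstandard as zstd

    def compress_chain(fulltexts, level=3):
        """Compress each fulltext using its predecessor as a content-only
        dictionary, producing one complete zstd frame per revision."""
        frames = []
        prev = None
        for data in fulltexts:
            if prev is None:
                cctx = zstd.ZstdCompressor(level=level,
                                           write_content_size=True)
            else:
                # Raw bytes rather than a trained dictionary: this is
                # what triggers "content-only" mode.
                d = zstd.ZstdCompressionDict(prev)
                cctx = zstd.ZstdCompressor(level=level, dict_data=d,
                                           write_content_size=True)
            frames.append(cctx.compress(data))
            prev = data
        return frames

Since every entry is a complete, self-describing frame, the
checksum/epilogue concerns from the streaming discussion above don't apply.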

I was able to experiment with "content-only" mode without any changes to
the existing Python zstandard bindings since they allow constructing a
dictionary from raw bytes. However, profiling revealed a lot of overhead
from all the Python <-> C interactions. So I added a feature to the Python
bindings (not yet published) that allows passing in a list of zstd frames
produced in "content-only dictionary chaining mode" and it gives you the
final fulltext. This yielded a 3-4x speed boost across the board.
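
For reference, a sketch of how that new API is meant to be called (it shows
up as decompress_content_dict_chain() in the bench output below). `frames`
is a chain like the one sketched above, oldest first, with content sizes
embedded:

    import zstandard as zstd

    dctx = zstd.ZstdDecompressor()

    # frames[0] is an ordinary frame; frames[i] was compressed using the
    # fulltext of frames[i-1] as a content-only dictionary. The whole
    # chain is walked in C, avoiding per-frame Python <-> C overhead.
    fulltext = dctx.decompress_content_dict_chain(frames)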

I have a `bench.py` script in the python-zstandard source repo that lets
you feed in a directory of files and performs compression and decompression
in various ways. Feeding it the fulltexts of various revlog revisions shows
that "content-only" mode looks impressive:

revlog.py fulltexts level 3
$ bench.py data/mercurial/revlog.py.i --level 3  --write-size --stream
--content-dict --discrete --discrete-dict
577 chunks; 24978602 bytes
trained dictionary of size 115667 (wanted 131072)
discrete compressed size: 7206429 (28.85%); smaller: 100.00%
discrete dict compressed size: 3143120 (12.58%); smaller: 100.00%
stream compressed size: 253147 (1.01%)
content dict compressed size: 96386 (0.39%); smaller: 100.00%

compress discrete compress() reuse zctx
0.119908 wall; 0.110000 CPU; 0.110000 user; 0.000000 sys 208.31 MB/s (best
of 26)
compress discrete dict compress() reuse zctx
0.076307 wall; 0.070000 CPU; 0.070000 user; 0.000000 sys 327.34 MB/s (best
of 39)
compress stream compressobj()
0.008314 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 3004.35 MB/s (best
of 337)
compress content dict compress()
0.078073 wall; 0.070000 CPU; 0.070000 user; 0.000000 sys 319.94 MB/s (best
of 36)

decompress discrete decompress() reuse zctx
0.034094 wall; 0.030000 CPU; 0.030000 user; 0.000000 sys 732.64 MB/s (best
of 87)
decompress discrete dict decompress() reuse zctx
0.020177 wall; 0.020000 CPU; 0.020000 user; 0.000000 sys 1237.97 MB/s (best
of 143)
decompress stream decompressobj()
0.004323 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 5778.06 MB/s (best
of 641)
decompress content dict decompress_content_dict_chain()
0.001282 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 19484.44 MB/s
(best of 2204)

revlog.py fulltexts level 1
577 chunks; 24978602 bytes
trained dictionary of size 115667 (wanted 131072)
discrete compressed size: 7586150 (30.37%); smaller: 100.00%
discrete dict compressed size: 4250994 (17.02%); smaller: 100.00%
stream compressed size: 485178 (1.94%)
content dict compressed size: 111679 (0.45%); smaller: 100.00%

compress discrete compress() reuse zctx
0.106496 wall; 0.100000 CPU; 0.100000 user; 0.000000 sys 234.55 MB/s (best
of 28)
compress discrete dict compress() reuse zctx
0.070336 wall; 0.060000 CPU; 0.060000 user; 0.000000 sys 355.13 MB/s (best
of 42)
compress stream compressobj()
0.011805 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 2115.92 MB/s (best
of 244)
compress content dict compress()
0.022679 wall; 0.020000 CPU; 0.020000 user; 0.000000 sys 1101.39 MB/s (best
of 127)

decompress discrete decompress() reuse zctx
0.035123 wall; 0.030000 CPU; 0.030000 user; 0.000000 sys 711.18 MB/s (best
of 82)
decompress discrete dict decompress() reuse zctx
0.025088 wall; 0.020000 CPU; 0.020000 user; 0.000000 sys 995.64 MB/s (best
of 109)
decompress stream decompressobj()
0.005347 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 4671.51 MB/s (best
of 519)
decompress content dict decompress_content_dict_chain()
0.001411 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 17703.25 MB/s
(best of 1960)

The 17+ GB/s decompress speed looks *really* impressive. But I think that
when I initially reported on the benefits of fulltext compression mode, I
let this absolute value get the best of me.

If we operate on the chunks from the revlog (some fulltexts but mostly
bdiffs):

revlog.py chunks level 3
$ bench.py data/mercurial/revlog.py.i --level 3  --write-size --stream
--content-dict --discrete --discrete-dict
561 chunks; 327221 bytes
trained dictionary of size 20960 (wanted 131072)
discrete compressed size: 153109 (46.79%); smaller: 82.89%
discrete dict compressed size: 104603 (31.97%); smaller: 94.65%
stream compressed size: 87628 (26.78%)
content dict compressed size: 141900 (43.37%); smaller: 86.63%

compress discrete compress() reuse zctx
0.003863 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 84.70 MB/s (best
of 756)
compress discrete dict compress() reuse zctx
0.006292 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 52.01 MB/s (best
of 446)
compress stream compressobj()
0.002343 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 139.66 MB/s (best
of 1143)
compress content dict compress()
0.005301 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 61.73 MB/s (best
of 538)

decompress discrete decompress() reuse zctx
0.001761 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 185.82 MB/s (best
of 1573)
decompress discrete dict decompress() reuse zctx
0.000987 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 331.51 MB/s (best
of 2822)
decompress stream decompressobj()
0.000682 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 479.71 MB/s (best
of 4126)
decompress content dict decompress_content_dict_chain()
0.001310 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 249.81 MB/s (best
of 2154)

In terms of absolute ordering:

prediff compress stream compressobj()
0.002343 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 139.66 MB/s (best
of 1143)
prediff compress discrete compress() reuse zctx
0.003863 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 84.70 MB/s (best
of 756)
prediff compress content dict compress()
0.005301 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 61.73 MB/s (best
of 538)
prediff compress discrete dict compress() reuse zctx
0.006292 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 52.01 MB/s (best
of 446)
fulltext compress stream compressobj()
0.008314 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 3004.35 MB/s (best
of 337)
fulltext compress discrete dict compress() reuse zctx
0.076307 wall; 0.070000 CPU; 0.070000 user; 0.000000 sys 327.34 MB/s (best
of 39)
fulltext compress content dict compress()
0.078073 wall; 0.070000 CPU; 0.070000 user; 0.000000 sys 319.94 MB/s (best
of 36)
fulltext compress discrete compress() reuse zctx
0.119908 wall; 0.110000 CPU; 0.110000 user; 0.000000 sys 208.31 MB/s (best
of 26)

prediff decompress stream decompressobj()
0.000682 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 479.71 MB/s (best
of 4126)
prediff decompress discrete dict decompress() reuse zctx
0.000987 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 331.51 MB/s (best
of 2822)
fulltext decompress content dict decompress_content_dict_chain()
0.001282 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 19484.44 MB/s
(best of 2204)
prediff decompress content dict decompress_content_dict_chain()
0.001310 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 249.81 MB/s (best
of 2154)
prediff decompress discrete decompress() reuse zctx
0.001761 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 185.82 MB/s (best
of 1573)
fulltext decompress stream decompressobj()
0.004323 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 5778.06 MB/s (best
of 641)
fulltext decompress discrete dict decompress() reuse zctx
0.020177 wall; 0.020000 CPU; 0.020000 user; 0.000000 sys 1237.97 MB/s (best
of 143)
fulltext decompress discrete decompress() reuse zctx
0.034094 wall; 0.030000 CPU; 0.030000 user; 0.000000 sys 732.64 MB/s (best
of 87)

This doesn't tell the full story because it excludes the time for obtaining
a bdiff and applying patches. However, bdiff patch application is pretty
fast, so the decompression ordering is indicative. There, we see the
ordering as:

1) Streaming mode with prediffed inputs (would not write full frames to
revlog chunks)
2) Dictionary trained on prediffed inputs
3) Content-only dictionary w/ fulltext inputs
4) Content-only dictionary w/ prediffed inputs
5) Vanilla compression on prediffed inputs

It's worth noting that vanilla compression on prediffed inputs (#5) is what
we would use if we were to replace zlib with zstd in revlogs.
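
Concretely, that baseline is just independent per-chunk compression with a
reused context, where `chunks` stands in for the existing revlog entries:

    import zstandard as zstd

    cctx = zstd.ZstdCompressor(level=3)

    # Compress each revlog chunk (mostly bdiff deltas) independently,
    # mirroring how zlib is used today.
    compressed = [cctx.compress(chunk) for chunk in chunks]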

But again, the picture is more complicated than decompression wall times
alone:

* Compressed size of fulltext inputs is often larger than that of prediffed
inputs (especially once fulltexts reach a few hundred KB or more)
* Fulltext discrete dict mode only requires a single decompress operation
whereas all other modes require N decompress operations (see the sketch
below)
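
To illustrate that last point, a sketch of the "discrete dict" mode from
the benchmarks: train a dictionary once, then any single revision
compresses and decompresses independently (`samples` is a hypothetical
list of revision fulltexts):

    import zstandard as zstd

    # bench.py asked for a 131072 byte dictionary trained on the samples.
    d = zstd.train_dictionary(131072, samples)

    cctx = zstd.ZstdCompressor(level=3, dict_data=d,
                               write_content_size=True)
    frames = [cctx.compress(s) for s in samples]

    # Random access: one decompress call per revision, no chain walk.
    dctx = zstd.ZstdDecompressor(dict_data=d)
    revision = dctx.decompress(frames[0])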

There's still a lot of experimentation to be performed. I'd really like to
perform a bulk conversion of real-world repos and measure compressed size
changes, etc. I suspect there are thresholds in fulltext sizes where e.g.
content-only dict mode's performance starts to suffer; something like the
sweep sketched below could probe for them. I'm not sure whether we can work
around such thresholds by tuning zstd compression parameters or not.
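
A crude way to probe for such thresholds, assuming the APIs above behave as
described (the synthetic revisions here are placeholders, not real repo
data):

    import os
    import time

    import zstandard as zstd

    def time_chain(fulltexts, level=3):
        """Compress a content-dict chain, then time its decompression."""
        frames = []
        prev = None
        for data in fulltexts:
            d = zstd.ZstdCompressionDict(prev) if prev is not None else None
            cctx = zstd.ZstdCompressor(level=level, dict_data=d,
                                       write_content_size=True)
            frames.append(cctx.compress(data))
            prev = data
        dctx = zstd.ZstdDecompressor()
        start = time.time()
        dctx.decompress_content_dict_chain(frames)
        return time.time() - start

    # Sweep fulltext sizes: each synthetic revision shares a common base
    # plus a small unique tail, roughly mimicking a revlog delta chain.
    for size in (4096, 65536, 1048576, 16777216):
        base = os.urandom(size)
        texts = [base + os.urandom(256) for _ in range(16)]
        print(size, time_chain(texts))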

FWIW, in content-only dict mode, profiling revealed that >70% of CPU time
during decompression was spent in __memmove_avx_unaligned_erms with
ZSTD_decompressBlock_internal() being on the stack. This is a bit
head-scratching to me, as ZSTD_decompressBlock_internal() doesn't appear to
directly do any major memory operations. But `perf` (and the raw output
from `perf script`) said that this function was the caller of
__memmove_avx_unaligned_erms. I even reproduced this in a debug build to
rule out compiler optimization confusing stack/symbol resolution. Since
content-only dict mode is producing decompressed fulltexts at 17+ GB/s and
an AVX-optimized memory-related function appears to be the hot spot, it
certainly looks like decompression in content-only dict mode is bound by
memory throughput!