Deltas, compression, and zstd

Yann Collet cyan at fb.com
Tue Jan 17 14:50:25 EST 2017


> * the limited zstd window size almost certainly made sqlite3.c's compressed size blow up when using content-only dict compression


sqlite3.c is indeed especially large for a single source file (3.9 MB).

There is a way to work around this, although it’s a bit more complex:
it relies on using the _advanced() variant of the API.

Here, it’s possible to precisely select compression parameters.

One of them is WindowLog.
WindowLog is 19 for levels 1 and 2, which means a 512 KB window.
If a source file is larger than 512 KB, the LZ “diff” against the dictionary can’t reach all of it, so this value needs to be increased accordingly.
For sqlite3.c, for example, it would need to be set to 22 (4 MB).

This is only part of the story though.
The search tables of levels 1 and 2 “forget” more entries as distance increases.
I haven’t tested them at such distances, so maybe they are too small to properly map long-distance areas.
This is all probabilistic though; maybe it will nonetheless work, as there is no “hard limit”.

Here too, there is a way to improve that: increase the value of hashLog.
Obviously, that also translates into higher memory usage, which could be detrimental to speed…
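
Translated to the python-zstandard bindings discussed later in this thread, that could look roughly like the following. This is a sketch assuming the current `ZstdCompressionParameters` API; `previous_fulltext` and `new_fulltext` are hypothetical placeholders:

    import zstandard as zstd

    # Level-1 defaults, but with window_log=22 (4 MB window) so a
    # dictionary the size of sqlite3.c stays fully referenceable, and a
    # larger hash_log so the fast levels "forget" fewer distant matches.
    params = zstd.ZstdCompressionParameters.from_level(
        1, window_log=22, hash_log=22)
    cctx = zstd.ZstdCompressor(
        compression_params=params,
        dict_data=zstd.ZstdCompressionDict(previous_fulltext))
    frame = cctx.compress(new_fulltext)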



From: Gregory Szorc <gregory.szorc at gmail.com>
Date: Monday, 16 January 2017 at 22:04
To: Yann Collet <cyan at fb.com>
Cc: mercurial-devel <mercurial-devel at mercurial-scm.org>, Jeremy Fitzhardinge <jsgf at fb.com>
Subject: Re: Deltas, compression, and zstd

On Mon, Jan 16, 2017 at 2:07 PM, Gregory Szorc <gregory.szorc at gmail.com> wrote:
On Sun, Jan 15, 2017 at 10:22 AM, Yann Collet <cyan at fb.com> wrote:
From Gregory Szorc:

>  The wording in this paragraph is confusing. Stated differently: if we have immutable files (e.g. "pack files"), then there are no stream append operations and all the complications around solving that problem go away. So "pack files" allow us to more easily employ the feed-in-fulltexts strategy I outlined.

>  Also, at various points I used "zstd frame" when I meant to say "zstd block." A zstd "frame" is a collection of "blocks" plus a header and epilogue giving instructions on how to decode the block data. Each "frame" is completely independent. This means that the fulltext streaming mechanism I described operates at the zstd "block" level and not the "frame" level.

It is unclear whether this means the zstd streaming C APIs are appropriate for this use case or whether we need to leverage the lower-level "block functions" defined by the C API. I /think/ the streaming C APIs operate at block boundaries, meaning that if you call flush() during compression you get an end of block and that if you feed a whole block into a decompressor that output will flush automatically. What I'm not sure of is whether using the C APIs in this manner is supported or condoned. Strictly speaking, we start an operation on a frame then abort midway through. There are some consequences to that, notably not verifying a checksum of the decompressed data (which is in an epilogue at the end of a frame). I'm sure Yann can clarify.
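
Concretely, the streaming usage being described maps to something like the following with the python-zstandard bindings (a sketch; `fulltexts` is a hypothetical iterable of revision fulltexts):

    import zstandard as zstd

    cctx = zstd.ZstdCompressor(level=3)
    cobj = cctx.compressobj()

    chunks = []
    for fulltext in fulltexts:
        data = cobj.compress(fulltext)
        # COMPRESSOBJ_FLUSH_BLOCK ends the current zstd *block* without
        # ending the frame, so each revision's compressed bytes can be
        # stored separately and replayed in order into one decompressor.
        data += cobj.flush(zstd.COMPRESSOBJ_FLUSH_BLOCK)
        chunks.append(data)
    # Only this final flush writes the frame epilogue (and the checksum,
    # if enabled), which is exactly what an aborted stream never gets.
    trailer = cobj.flush(zstd.COMPRESSOBJ_FLUSH_FINISH)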


For this usage, another potential suggestion could be to use dictionary compression in “content-only” mode.

Here is how it could work:
Presuming you have a “first original file”, and you want to send a second file as a delta to the first:
compress the second file, using the first file as “dictionary”
(use ZSTD_compress_usingDict() for this scenario).

This is not a properly formatted zstd dictionary, so it triggers “content-only mode”,
which means it has no statistics, hence no way to precompute anything.
But it can use the dictionary content for back-references,
so common parts should be referenced as LZ copy operations.

This process can be chained, so that file3 could use file2 as dictionary, and so on.

Advantage: each file is compressed into a complete frame, so there is nothing “special” to do; the API is already there.
Specifically, there is no “compression state” to keep in memory between files,
and no “unfinished” frame is stored.
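
A minimal sketch of that chaining via the python-zstandard bindings (ZSTD_compress_usingDict() is reached through `ZstdCompressionDict`/`dict_data` here; `fulltexts` is a hypothetical list of revision fulltexts, oldest first):

    import zstandard as zstd

    def compress_chain(fulltexts, level=3):
        # Each revision becomes a complete, standalone zstd frame; the
        # previous revision's raw bytes act as the content-only dictionary.
        frames = []
        prev = None
        for text in fulltexts:
            dict_data = zstd.ZstdCompressionDict(prev) if prev else None
            cctx = zstd.ZstdCompressor(level=level, dict_data=dict_data)
            frames.append(cctx.compress(text))
            prev = text
        return frames

    def decompress_chain(frames):
        # Walk the chain: each frame's dictionary is the previous fulltext.
        prev = None
        for frame in frames:
            dict_data = zstd.ZstdCompressionDict(prev) if prev else None
            dctx = zstd.ZstdDecompressor(dict_data=dict_data)
            prev = dctx.decompress(frame)  # frame must embed content size
        return prev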

There are some disadvantages I can think of, but I believe they can be worked out:


- In decompression, referencing a “content-only” dictionary is an (almost) zero-cost operation.
For compression, it’s more expensive: the dictionary is analyzed in order to properly reference its content for subsequent search operations.
The speed of this operation depends on both the size of the dictionary and the compression level.
So it could prove slow (but I have no figures to show).
If it is slow, an optimization pass for content-referencing could be performed. It could also be Mercurial-specific; it would not affect the format.



- If the previous file is used as dictionary, then it’s not possible to use another “precomputed dictionary” at the same time. It’s one or the other.
That is the current situation, but it could evolve. At the very least, some Mercurial-specific extension could solve this issue.


That "content-only" mode is a nifty, under-documented feature!
I was able to experiment with "content-only" mode without any changes to the existing Python zstandard bindings since they allow constructing a dictionary from raw bytes. However, profiling revealed a lot of overhead from all the Python <-> C interactions. So I added a feature to the Python bindings (not yet published) that allows passing in a list of zstd frames produced in "content-only dictionary chaining mode" and it gives you the final fulltext. This yielded a 3-4x speed boost across the board.
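
A minimal usage sketch of that addition, with names as exposed by current python-zstandard (`frames` is the list of chained frames, oldest first):

    import zstandard as zstd

    dctx = zstd.ZstdDecompressor()
    # frames[0] must be an ordinary zstd frame; each later frame was
    # compressed with the previous frame's decompressed content as its
    # dictionary, and every frame must embed its decompressed size.
    fulltext = dctx.decompress_content_dict_chain(frames)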
I have a `bench.py` script in the python-zstandard source repo that lets you feed in a directory of files and performs compression and decompression in various ways. Feeding it the fulltexts of various revlog revisions, we see that "content-only" mode looks impressive:

revlog.py fulltexts level 3
$ bench.py data/mercurial/revlog.py.i --level 3  --write-size --stream --content-dict --discrete --discrete-dict
577 chunks; 24978602 bytes
trained dictionary of size 115667 (wanted 131072)
discrete compressed size: 7206429 (28.85%); smaller: 100.00%
discrete dict compressed size: 3143120 (12.58%); smaller: 100.00%
stream compressed size: 253147 (1.01%)
content dict compressed size: 96386 (0.39%); smaller: 100.00%

compress discrete compress() reuse zctx
0.119908 wall; 0.110000 CPU; 0.110000 user; 0.000000 sys 208.31 MB/s (best of 26)
compress discrete dict compress() reuse zctx
0.076307 wall; 0.070000 CPU; 0.070000 user; 0.000000 sys 327.34 MB/s (best of 39)
compress stream compressobj()
0.008314 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 3004.35 MB/s (best of 337)
compress content dict compress()
0.078073 wall; 0.070000 CPU; 0.070000 user; 0.000000 sys 319.94 MB/s (best of 36)

decompress discrete decompress() reuse zctx
0.034094 wall; 0.030000 CPU; 0.030000 user; 0.000000 sys 732.64 MB/s (best of 87)
decompress discrete dict decompress() reuse zctx
0.020177 wall; 0.020000 CPU; 0.020000 user; 0.000000 sys 1237.97 MB/s (best of 143)
decompress stream decompressobj()
0.004323 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 5778.06 MB/s (best of 641)
decompress content dict decompress_content_dict_chain()
0.001282 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 19484.44 MB/s (best of 2204)

revlog.py fulltexts level 1
577 chunks; 24978602 bytes
trained dictionary of size 115667 (wanted 131072)
discrete compressed size: 7586150 (30.37%); smaller: 100.00%
discrete dict compressed size: 4250994 (17.02%); smaller: 100.00%
stream compressed size: 485178 (1.94%)
content dict compressed size: 111679 (0.45%); smaller: 100.00%

compress discrete compress() reuse zctx
0.106496 wall; 0.100000 CPU; 0.100000 user; 0.000000 sys 234.55 MB/s (best of 28)
compress discrete dict compress() reuse zctx
0.070336 wall; 0.060000 CPU; 0.060000 user; 0.000000 sys 355.13 MB/s (best of 42)
compress stream compressobj()
0.011805 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 2115.92 MB/s (best of 244)
compress content dict compress()
0.022679 wall; 0.020000 CPU; 0.020000 user; 0.000000 sys 1101.39 MB/s (best of 127)

decompress discrete decompress() reuse zctx
0.035123 wall; 0.030000 CPU; 0.030000 user; 0.000000 sys 711.18 MB/s (best of 82)
decompress discrete dict decompress() reuse zctx
0.025088 wall; 0.020000 CPU; 0.020000 user; 0.000000 sys 995.64 MB/s (best of 109)
decompress stream decompressobj()
0.005347 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 4671.51 MB/s (best of 519)
decompress content dict decompress_content_dict_chain()
0.001411 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 17703.25 MB/s (best of 1960)
The 17+ GB/s decompress speed looks *really* impressive. But I think when I initially reported on the benefits of fulltext compression mode, I let this absolute value get the best of me.
If we operate on the chunks from the revlog (some fulltext but mostly bdiffs):

revlog.py prediffed chunks level 3
$ bench.py data/mercurial/revlog.py.i --level 3  --write-size --stream --content-dict --discrete --discrete-dict
561 chunks; 327221 bytes
trained dictionary of size 20960 (wanted 131072)
discrete compressed size: 153109 (46.79%); smaller: 82.89%
discrete dict compressed size: 104603 (31.97%); smaller: 94.65%
stream compressed size: 87628 (26.78%)
content dict compressed size: 141900 (43.37%); smaller: 86.63%

compress discrete compress() reuse zctx
0.003863 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 84.70 MB/s (best of 756)
compress discrete dict compress() reuse zctx
0.006292 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 52.01 MB/s (best of 446)
compress stream compressobj()
0.002343 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 139.66 MB/s (best of 1143)
compress content dict compress()
0.005301 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 61.73 MB/s (best of 538)

decompress discrete decompress() reuse zctx
0.001761 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 185.82 MB/s (best of 1573)
decompress discrete dict decompress() reuse zctx
0.000987 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 331.51 MB/s (best of 2822)
decompress stream decompressobj()
0.000682 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 479.71 MB/s (best of 4126)
decompress content dict decompress_content_dict_chain()
0.001310 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 249.81 MB/s (best of 2154)
In terms of absolute ordering:

prediff compress stream compressobj()
0.002343 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 139.66 MB/s (best of 1143)
prediff compress discrete compress() reuse zctx
0.003863 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 84.70 MB/s (best of 756)
prediff compress content dict compress()
0.005301 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 61.73 MB/s (best of 538)
prediff compress discrete dict compress() reuse zctx
0.006292 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 52.01 MB/s (best of 446)
fulltext compress stream compressobj()
0.008314 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 3004.35 MB/s (best of 337)
fulltext compress discrete dict compress() reuse zctx
0.076307 wall; 0.070000 CPU; 0.070000 user; 0.000000 sys 327.34 MB/s (best of 39)
fulltext compress content dict compress()
0.078073 wall; 0.070000 CPU; 0.070000 user; 0.000000 sys 319.94 MB/s (best of 36)
fulltext compress discrete compress() reuse zctx
0.119908 wall; 0.110000 CPU; 0.110000 user; 0.000000 sys 208.31 MB/s (best of 26)

prediff decompress stream decompressobj()
0.000682 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 479.71 MB/s (best of 4126)
prediff decompress discrete dict decompress() reuse zctx
0.000987 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 331.51 MB/s (best of 2822)
fulltext decompress content dict decompress_content_dict_chain()
0.001282 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 19484.44 MB/s (best of 2204)
prediff decompress content dict decompress_content_dict_chain()
0.001310 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 249.81 MB/s (best of 2154)
prediff decompress discrete decompress() reuse zctx
0.001761 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 185.82 MB/s (best of 1573)
fulltext decompress stream decompressobj()
0.004323 wall; 0.000000 CPU; 0.000000 user; 0.000000 sys 5778.06 MB/s (best of 641)
fulltext decompress discrete dict decompress() reuse zctx
0.020177 wall; 0.020000 CPU; 0.020000 user; 0.000000 sys 1237.97 MB/s (best of 143)
fulltext decompress discrete decompress() reuse zctx
0.034094 wall; 0.030000 CPU; 0.030000 user; 0.000000 sys 732.64 MB/s (best of 87)

This doesn't tell the full story because it excludes the time for obtaining a bdiff and applying patches. However, bdiff patch application is pretty fast, so the decompression ordering is indicative. There, we see the ordering as:
1) Streaming mode with prediffed inputs (would not write full frames to revlog chunks)
2) Dictionary trained on prediffed inputs
3) Content-only dictionary w/ fulltext inputs
4) Content-only dictionary w/ prediffed inputs
5) Vanilla compression on prediffed inputs

It's worth noting that vanilla compression on prediffed inputs (#5) is what we would use if we were to replace zlib with zstd in revlogs.
But again, it is more complicated than wall times for decompression:
* Compressed size of fulltext input is often larger than compressing prediffed inputs (especially if fulltext is a few hundred KB or larger)
* Fulltext discrete dict only requires a single decompress operation whereas all other modes require N decompress operations
There's still a lot of experimentation to be performed. I'd really like to perform a bulk conversion of real world repos and measure compressed size changes, etc. I suspect there are thresholds in fulltext sizes where e.g. content-only dict mode's performance starts to suffer. I'm not sure if we can work around this by tuning zstd compression parameters or not.
FWIW, in content-only dict mode, profiling revealed that >70% of CPU time during decompression was spent in __memmove_avx_unaligned_erms with ZSTD_decompressBlock_internal() on the stack. This is a bit head-scratching to me, as ZSTD_decompressBlock_internal() doesn't appear to directly perform any major memory operations. But `perf` (and the raw output from `perf script`) said that this function was the caller of __memmove_avx_unaligned_erms. I even reproduced this in a debug build to rule out compiler optimizations confusing stack/symbol resolution. Since content-only dict mode produces decompressed fulltexts at 17+ GB/s and an AVX-optimized memory-move function appears to be the hot spot, it certainly looks like decompression in content-only dict mode is bound by memory throughput!

I found a problem with my methodology: when I was testing with python-zstandard's `bench.py`, I was feeding in revlog entries in the order they exist in the revlog. And unless you have a perfectly linear repo with 1 head, revlog order won't be the same as DAG order.
When I ported the code to a Mercurial perf* command to make it DAG aware, performance improved substantially. It also allowed me to accurately measure the impact of bdiff and patching on writing and reading speeds.
Here are some examples from the Firefox repo with proper DAG ordering using zstd level 3 (smallest files first).

$ perfcompression gfx/2d/ScaledFontDWrite.h
processing 16 revisions with 38057 fulltext size (2378 mean)
first revision size: 1995
total bdiff size: 2483 (6.52%)
! bdiff
! wall 0.000066 comb 0.000000 user 0.000000 sys 0.000000 (best of 40626)
! patch
! wall 0.000002 comb 0.000000 user 0.000000 sys 0.000000 (best of 785340)
! zlib delta chain compress
! wall 0.000179 comb 0.000000 user 0.000000 sys 0.000000 (best of 15118)
! zstd delta chain compress
! wall 0.000073 comb 0.000000 user 0.000000 sys 0.000000 (best of 38138)
! zstd content dict compress
! wall 0.000154 comb 0.000000 user 0.000000 sys 0.000000 (best of 18006)
! zlib delta chain decompress
! wall 0.000021 comb 0.000000 user 0.000000 sys 0.000000 (best of 119387)
! zstd delta chain decompress
! wall 0.000028 comb 0.000000 user 0.000000 sys 0.000000 (best of 91123)
! zstd content dict decompress
! wall 0.000009 comb 0.000000 user 0.000000 sys 0.000000 (best of 268004)
zlib delta chain: 2679 (7.039%)
zstd delta chain: 2860 (7.515%)
zstd content dict fulltext: 2079 (5.463%)

hg perfcompression js/src/jit/x64/CodeGenerator-x64.h
processing 74 revisions with 253324 fulltext size (3423 mean)
first revision size: 2799
total bdiff size: 13303 (5.25%)
! bdiff
! wall 0.000330 comb 0.000000 user 0.000000 sys 0.000000 (best of 8703)
! patch
! wall 0.000010 comb 0.000000 user 0.000000 sys 0.000000 (best of 246119)
! zlib delta chain compress
! wall 0.000603 comb 0.000000 user 0.000000 sys 0.000000 (best of 4723)
! zstd delta chain compress
! wall 0.000205 comb 0.000000 user 0.000000 sys 0.000000 (best of 14072)
! zstd content dict compress
! wall 0.000851 comb 0.000000 user 0.000000 sys 0.000000 (best of 3383)
! zlib delta chain decompress
! wall 0.000090 comb 0.000000 user 0.000000 sys 0.000000 (best of 31690)
! zstd delta chain decompress
! wall 0.000084 comb 0.000000 user 0.000000 sys 0.000000 (best of 32945)
! zstd content dict decompress
! wall 0.000027 comb 0.000000 user 0.000000 sys 0.000000 (best of 98905)
zlib delta chain: 7508 (2.964%)
zstd delta chain: 8375 (3.306%)
zstd content dict fulltext: 5145 (2.031%)

$ hg perfcompression js/src/Makefile.in
processing 99 revisions with 1221713 fulltext size (12340 mean)
first revision size: 18123
total bdiff size: 13419 (1.10%)
! bdiff
! wall 0.002807 comb 0.000000 user 0.000000 sys 0.000000 (best of 1035)
! patch
! wall 0.000014 comb 0.000000 user 0.000000 sys 0.000000 (best of 179513)
! zlib delta chain compress
! wall 0.001031 comb 0.000000 user 0.000000 sys 0.000000 (best of 2802)
! zstd delta chain compress
! wall 0.000368 comb 0.000000 user 0.000000 sys 0.000000 (best of 7822)
! zstd content dict compress
! wall 0.004614 comb 0.010000 user 0.010000 sys 0.000000 (best of 622)
! zlib delta chain decompress
! wall 0.000166 comb 0.000000 user 0.000000 sys 0.000000 (best of 17390)
! zstd delta chain decompress
! wall 0.000144 comb 0.000000 user 0.000000 sys 0.000000 (best of 17740)
! zstd content dict decompress
! wall 0.000067 comb 0.000000 user 0.000000 sys 0.000000 (best of 41588)
zlib delta chain: 15176 (1.242%)
zstd delta chain: 16412 (1.343%)
zstd content dict fulltext: 12193 (0.998%)

$ hg perfcompression toolkit/components/jsdownloads/src/DownloadLegacy.js
processing 44 revisions with 507017 fulltext size (11523 mean)
first revision size: 8417
total bdiff size: 17103 (3.37%)
! bdiff
! wall 0.000841 comb 0.000000 user 0.000000 sys 0.000000 (best of 3436)
! patch
! wall 0.000008 comb 0.000000 user 0.000000 sys 0.000000 (best of 287300)
! zlib delta chain compress
! wall 0.000765 comb 0.000000 user 0.000000 sys 0.000000 (best of 3786)
! zstd delta chain compress
! wall 0.000310 comb 0.000000 user 0.000000 sys 0.000000 (best of 9377)
! zstd content dict compress
! wall 0.001793 comb 0.000000 user 0.000000 sys 0.000000 (best of 1605)
! zlib delta chain decompress
! wall 0.000169 comb 0.000000 user 0.000000 sys 0.000000 (best of 17185)
! zstd delta chain decompress
! wall 0.000126 comb 0.000000 user 0.000000 sys 0.000000 (best of 22515)
! zstd content dict decompress
! wall 0.000056 comb 0.000000 user 0.000000 sys 0.000000 (best of 49115)
zlib delta chain: 12199 (2.406%)
zstd delta chain: 12877 (2.540%)
zstd content dict fulltext: 8469 (1.670%)

$ hg perfcompression js/src/vm/ArgumentsObject.cpp
processing 53 revisions with 1256894 fulltext size (23714 mean)
first revision size: 20775
total bdiff size: 42117 (3.35%)
! bdiff
! wall 0.003441 comb 0.000000 user 0.000000 sys 0.000000 (best of 848)
! patch
! wall 0.000015 comb 0.000000 user 0.000000 sys 0.000000 (best of 174010)
! zlib delta chain compress
! wall 0.001391 comb 0.000000 user 0.000000 sys 0.000000 (best of 2109)
! zstd delta chain compress
! wall 0.000503 comb 0.000000 user 0.000000 sys 0.000000 (best of 5666)
! zstd content dict compress
! wall 0.005379 comb 0.000000 user 0.000000 sys 0.000000 (best of 535)
! zlib delta chain decompress
! wall 0.000257 comb 0.000000 user 0.000000 sys 0.000000 (best of 11316)
! zstd delta chain decompress
! wall 0.000191 comb 0.000000 user 0.000000 sys 0.000000 (best of 14869)
! zstd content dict decompress
! wall 0.000131 comb 0.000000 user 0.000000 sys 0.000000 (best of 21712)
zlib delta chain: 21555 (1.715%)
zstd delta chain: 23163 (1.843%)
zstd content dict fulltext: 14417 (1.147%)

$ hg perfcompression js/src/frontend/TokenStream.cpp
processing 331 revisions with 21145087 fulltext size (63882 mean)
first revision size: 66826
total bdiff size: 218523 (1.03%)
! bdiff
! wall 0.082915 comb 0.080000 user 0.080000 sys 0.000000 (best of 100)
! patch
! wall 0.000143 comb 0.000000 user 0.000000 sys 0.000000 (best of 19676)
! zlib delta chain compress
! wall 0.007803 comb 0.000000 user 0.000000 sys 0.000000 (best of 374)
! zstd delta chain compress
! wall 0.002723 comb 0.000000 user 0.000000 sys 0.000000 (best of 1075)
! zstd content dict compress
! wall 0.097417 comb 0.100000 user 0.050000 sys 0.050000 (best of 97)
! zlib delta chain decompress
! wall 0.001420 comb 0.000000 user 0.000000 sys 0.000000 (best of 2058)
! zstd delta chain decompress
! wall 0.001168 comb 0.010000 user 0.010000 sys 0.000000 (best of 2497)
! zstd content dict decompress
! wall 0.000974 comb 0.000000 user 0.000000 sys 0.000000 (best of 2964)
zlib delta chain: 113045 (0.535%)
zstd delta chain: 121806 (0.576%)
zstd content dict fulltext: 79287 (0.375%)

$ hg perfcompression dom/workers/RuntimeService.cpp
processing 498 revisions with 34560031 fulltext size (69397 mean)
first revision size: 29810
total bdiff size: 308556 (0.89%)
! bdiff
! wall 0.234808 comb 0.230000 user 0.230000 sys 0.000000 (best of 42)
! patch
! wall 0.000198 comb 0.000000 user 0.000000 sys 0.000000 (best of 14129)
! zlib delta chain compress
! wall 0.009020 comb 0.010000 user 0.010000 sys 0.000000 (best of 328)
! zstd delta chain compress
! wall 0.003480 comb 0.010000 user 0.010000 sys 0.000000 (best of 832)
! zstd content dict compress
! wall 0.113726 comb 0.110000 user 0.090000 sys 0.020000 (best of 73)
! zlib delta chain decompress
! wall 0.001686 comb 0.000000 user 0.000000 sys 0.000000 (best of 1735)
! zstd delta chain decompress
! wall 0.001497 comb 0.000000 user 0.000000 sys 0.000000 (best of 1953)
! zstd content dict decompress
! wall 0.001521 comb 0.000000 user 0.000000 sys 0.000000 (best of 1898)
zlib delta chain: 134013 (0.388%)
zstd delta chain: 144284 (0.417%)
zstd content dict fulltext: 103471 (0.299%)

$ hg perfcompression layout/reftests/bugs/reftest.list
processing 1650 revisions with 91315481 fulltext size (55342 mean)
first revision size: 7937
total bdiff size: 346056 (0.38%)
! bdiff
! wall 0.077027 comb 0.070000 user 0.070000 sys 0.000000 (best of 100)
! patch
! wall 0.000379 comb 0.000000 user 0.000000 sys 0.000000 (best of 6760)
! zlib delta chain compress
! wall 0.014117 comb 0.020000 user 0.020000 sys 0.000000 (best of 197)
! zstd delta chain compress
! wall 0.005047 comb 0.010000 user 0.010000 sys 0.000000 (best of 559)
! zstd content dict compress
! wall 0.378143 comb 0.380000 user 0.240000 sys 0.140000 (best of 26)
! zlib delta chain decompress
! wall 0.001846 comb 0.000000 user 0.000000 sys 0.000000 (best of 1558)
! zstd delta chain decompress
! wall 0.001932 comb 0.000000 user 0.000000 sys 0.000000 (best of 1481)
! zstd content dict decompress
! wall 0.003821 comb 0.000000 user 0.000000 sys 0.000000 (best of 713)
zlib delta chain: 154599 (0.169%)
zstd delta chain: 170156 (0.186%)
zstd content dict fulltext: 106446 (0.117%)

$ hg perfcompression layout/mathml/nsMathMLChar.cpp
processing 205 revisions with 18945384 fulltext size (92416 mean)
first revision size: 97894
total bdiff size: 160387 (0.85%)
! bdiff
! wall 0.069841 comb 0.070000 user 0.070000 sys 0.000000 (best of 100)
! patch
! wall 0.000097 comb 0.000000 user 0.000000 sys 0.000000 (best of 27067)
! zlib delta chain compress
! wall 0.006613 comb 0.000000 user 0.000000 sys 0.000000 (best of 435)
! zstd delta chain compress
! wall 0.001993 comb 0.000000 user 0.000000 sys 0.000000 (best of 1434)
! zstd content dict compress
! wall 0.051319 comb 0.040000 user 0.040000 sys 0.000000 (best of 100)
! zlib delta chain decompress
! wall 0.001048 comb 0.000000 user 0.000000 sys 0.000000 (best of 2744)
! zstd delta chain decompress
! wall 0.000772 comb 0.000000 user 0.000000 sys 0.000000 (best of 3735)
! zstd content dict decompress
! wall 0.000840 comb 0.000000 user 0.000000 sys 0.000000 (best of 3470)
zlib delta chain: 86184 (0.455%)
zstd delta chain: 92706 (0.489%)
zstd content dict fulltext: 73206 (0.386%)

$ hg perfcompression parser/html/javasrc/AttributeName.java
processing 25 revisions with 3401095 fulltext size (136043 mean)
first revision size: 128653
total bdiff size: 139174 (4.09%)
! bdiff
! wall 0.004712 comb 0.000000 user 0.000000 sys 0.000000 (best of 627)
! patch
! wall 0.000010 comb 0.000000 user 0.000000 sys 0.000000 (best of 235518)
! zlib delta chain compress
! wall 0.002996 comb 0.010000 user 0.010000 sys 0.000000 (best of 985)
! zstd delta chain compress
! wall 0.000730 comb 0.000000 user 0.000000 sys 0.000000 (best of 3974)
! zstd content dict compress
! wall 0.008503 comb 0.000000 user 0.000000 sys 0.000000 (best of 341)
! zlib delta chain decompress
! wall 0.000492 comb 0.000000 user 0.000000 sys 0.000000 (best of 5814)
! zstd delta chain decompress
! wall 0.000230 comb 0.000000 user 0.000000 sys 0.000000 (best of 11604)
! zstd content dict decompress
! wall 0.000309 comb 0.000000 user 0.000000 sys 0.000000 (best of 8966)
zlib delta chain: 41601 (1.223%)
zstd delta chain: 43867 (1.290%)
zstd content dict fulltext: 37583 (1.105%)

$ hg perfcompression js/src/jsapi.h
processing 449 revisions with 93510685 fulltext size (208264 mean)
first revision size: 181383
total bdiff size: 316006 (0.34%)
! bdiff
! wall 0.287000 comb 0.290000 user 0.280000 sys 0.010000 (best of 35)
! patch
! wall 0.000175 comb 0.000000 user 0.000000 sys 0.000000 (best of 14603)
! zlib delta chain compress
! wall 0.012882 comb 0.010000 user 0.010000 sys 0.000000 (best of 228)
! zstd delta chain compress
! wall 0.004024 comb 0.000000 user 0.000000 sys 0.000000 (best of 724)
! zstd content dict compress
! wall 0.381624 comb 0.380000 user 0.280000 sys 0.100000 (best of 26)
! zlib delta chain decompress
! wall 0.002110 comb 0.000000 user 0.000000 sys 0.000000 (best of 1345)
! zstd delta chain decompress
! wall 0.001515 comb 0.010000 user 0.010000 sys 0.000000 (best of 1917)
! zstd content dict decompress
! wall 0.004198 comb 0.000000 user 0.000000 sys 0.000000 (best of 698)
zlib delta chain: 170232 (0.182%)
zstd delta chain: 182294 (0.195%)
zstd content dict fulltext: 138261 (0.148%)

$ hg perfcompression toolkit/mozapps/extensions/internal/XPIProvider.jsm
processing 299 revisions with 83047277 fulltext size (277750 mean)
first revision size: 257535
total bdiff size: 537092 (0.65%)
! bdiff
! wall 0.289529 comb 0.290000 user 0.290000 sys 0.000000 (best of 34)
! patch
! wall 0.000194 comb 0.000000 user 0.000000 sys 0.000000 (best of 14683)
! zlib delta chain compress
! wall 0.018923 comb 0.020000 user 0.020000 sys 0.000000 (best of 153)
! zstd delta chain compress
! wall 0.005258 comb 0.000000 user 0.000000 sys 0.000000 (best of 545)
! zstd content dict compress
! wall 0.331014 comb 0.340000 user 0.240000 sys 0.100000 (best of 30)
! zlib delta chain decompress
! wall 0.003091 comb 0.000000 user 0.000000 sys 0.000000 (best of 952)
! zstd delta chain decompress
! wall 0.001859 comb 0.000000 user 0.000000 sys 0.000000 (best of 1568)
! zstd content dict decompress
! wall 0.003986 comb 0.010000 user 0.010000 sys 0.000000 (best of 716)
zlib delta chain: 250522 (0.302%)
zstd delta chain: 268601 (0.323%)
zstd content dict fulltext: 220039 (0.265%)

$ hg perfcompression db/sqlite3/src/sqlite3.c
processing 80 revisions with 387843202 fulltext size (4848040 mean)
first revision size: 3264462
total bdiff size: 21853263 (5.63%)
! bdiff
! wall 2.125062 comb 2.130000 user 2.110000 sys 0.020000 (best of 5)
! patch
! wall 0.003611 comb 0.010000 user 0.010000 sys 0.000000 (best of 799)
! zlib delta chain compress
! wall 0.734215 comb 0.740000 user 0.740000 sys 0.000000 (best of 14)
! zstd delta chain compress
! wall 0.119384 comb 0.120000 user 0.120000 sys 0.000000 (best of 83)
! zstd content dict compress
! wall 2.628571 comb 2.630000 user 2.610000 sys 0.020000 (best of 4)
! zlib delta chain decompress
! wall 0.075535 comb 0.070000 user 0.070000 sys 0.000000 (best of 100)
! zstd delta chain decompress
! wall 0.029052 comb 0.030000 user 0.030000 sys 0.000000 (best of 100)
! zstd content dict decompress
! wall 0.409130 comb 0.410000 user 0.410000 sys 0.000000 (best of 25)
zlib delta chain: 7279707 (1.877%)
zstd delta chain: 6915081 (1.783%)
zstd content dict fulltext: 82936027 (21.384%)
I need to perform an exhaustive test, but I think I'm starting to see some patterns here:
* bdiff patch times are so fast that read times are almost always dominated by decompression
* bdiff generation often dwarfs the time spent compressing bdiff output (and we made bdiff ~2x faster in Mercurial 4.1, so that will likely do more for performance than zstd will)
* content-only dict compression frequently yields smaller output than compressed bdiffs
* content-only dict compression feels about the same speed as or slower than bdiff+zlib on average
* content-only dict decompression seems sensitive to chain length and raw content size; it outperforms bdiff delta chains for small chains but is slower for longer/larger chains (this might be the presumed memory I/O limitation hitting us)
* the limited zstd window size almost certainly made sqlite3.c's compressed size blow up when using content-only dict compression
* we might be seeing smaller content-only dict sizes because our current bdiff algorithm is line-based, which means there could be lots of extra content in the diff chunks

If anyone wants to play around, code lives in the zstd bookmark of https://hg.mozilla.org/users/gszorc_mozilla.com/hg, e.g. https://hg.mozilla.org/users/gszorc_mozilla.com/hg/rev/zstd
Also, since I haven't mentioned it yet, one downside to content-only dict chaining is that it precludes parallel decompression.
So many variables...