Deltas, compression, and zstd

Yann Collet cyan at fb.com
Sun Jan 15 13:22:58 EST 2017


From Gregory Szorc:

> The wording in this paragraph is confusing. Stated differently: if we have immutable files (e.g. "pack files"), then there are no stream append operations and all the complications around solving that problem go away. So "pack files" allow us to more easily employ the feed-in-fulltexts strategy I outlined.
>
> Also, at various points I used "zstd frame" when I meant to say "zstd block." A zstd "frame" is a collection of "blocks" plus a header and epilogue giving instructions on how to decode the block data. Each "frame" is completely independent. This means that the fulltext streaming mechanism I described operates at the zstd "block" level and not the "frame" level.
>
> It is unclear whether this means the zstd streaming C APIs are appropriate for this use case or whether we need to leverage the lower-level "block functions" defined by the C API. I /think/ the streaming C APIs operate at block boundaries, meaning that if you call flush() during compression you get an end of block and that if you feed a whole block into a decompressor that output will flush automatically. What I'm not sure of is whether using the C APIs in this manner is supported or condoned. Strictly speaking, we start an operation on a frame then abort midway through. There are some consequences to that, notably not verifying a checksum of the decompressed data (which is in an epilogue at the end of a frame). I'm sure Yann can clarify.


For this usage, another potential suggestion could be to use dictionary compression in “content-only” mode.

Here is how it could work:
Presuming you have a “first original file”, and you want to send a second file as a delta to the first:
Compress the second file, using the first file as “dictionary”
(use ZSTD_compress_usingDict() for this scenario).

This is not a properly formatted zstd dictionary, so it triggers “content-only mode”,
which means it has no statistics, hence no way to precompute anything.
But it can use the dictionary content for back-references,
so common parts should be referenced as LZ copy operations.
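
To make this concrete, here is a minimal sketch using the existing stable API (error handling and buffer management are left out; dstCapacity would typically come from ZSTD_compressBound(), and the decompressed size is assumed to be stored elsewhere, e.g. by Mercurial):

#include <zstd.h>

/* Compress `file2` using `file1` as a content-only dictionary.
   Returns the compressed size, or an error code (check with ZSTD_isError()). */
size_t delta_compress(void* dst, size_t dstCapacity,
                      const void* file2, size_t file2Size,
                      const void* file1, size_t file1Size,
                      int level)
{
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    size_t const cSize = ZSTD_compress_usingDict(cctx,
                            dst, dstCapacity,
                            file2, file2Size,
                            file1, file1Size,   /* raw content => “content-only” dictionary */
                            level);
    ZSTD_freeCCtx(cctx);
    return cSize;
}

/* Decompression references the same dictionary (file1).
   file2Size, the original size, is assumed to be known by the caller. */
size_t delta_decompress(void* dst, size_t file2Size,
                        const void* compressed, size_t cSize,
                        const void* file1, size_t file1Size)
{
    ZSTD_DCtx* const dctx = ZSTD_createDCtx();
    size_t const dSize = ZSTD_decompress_usingDict(dctx,
                            dst, file2Size,
                            compressed, cSize,
                            file1, file1Size);
    ZSTD_freeDCtx(dctx);
    return dSize;
}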

This process can be chained, so that file3 could use file2 as dictionary, and so on.
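
On the read side, chaining means a given revision can only be reconstructed after all its predecessors. A rough sketch of that walk, assuming the compressed revisions and their original sizes are available (names are illustrative only, error handling omitted):

#include <stdlib.h>
#include <zstd.h>

/* Rebuild revision n (n >= 1) by walking the delta chain from the full base (revision 0).
   comp[i]/compSize[i] hold the compressed data of revision i,
   origSize[i] its decompressed size. Caller frees the returned buffer. */
void* rebuild_chain(ZSTD_DCtx* dctx,
                    const void* base, size_t baseSize,
                    const void* const comp[], const size_t compSize[],
                    const size_t origSize[], int n)
{
    const void* dict = base;              /* dictionary = previously rebuilt revision */
    size_t dictSize = baseSize;
    void* cur = NULL;
    for (int i = 1; i <= n; i++) {
        cur = malloc(origSize[i]);
        ZSTD_decompress_usingDict(dctx, cur, origSize[i],
                                  comp[i], compSize[i],
                                  dict, dictSize);
        if (dict != base) free((void*)dict);   /* intermediate revision no longer needed */
        dict = cur;
        dictSize = origSize[i];
    }
    return cur;
}

So read cost grows with chain length, which is the usual trade-off of delta chains (typically bounded by storing periodic full snapshots).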

Advantage: each file is compressed into a complete frame, so there is nothing “special” to do; the API is already there.
Specifically, there is no “compression state” to keep in memory between files,
and no “unfinished” frame is stored.

There are some disadvantages I can think of, but I believe they can be worked out:


-          In decompression, referencing a “content-only” dictionary is an (almost) zero-cost operation.
For compression, it’s more expensive: the dictionary is analyzed in order to properly reference its content for subsequent search operations.
The speed of this operation depends on both the size of the dictionary and the compression level,
so it could prove slow (but I have no figures to show).
If it is slow, an optimization pass for content-referencing could be performed. It could also be Mercurial-specific, so it would not affect the format (see the sketch after this list).



-          If the previous file is used as dictionary, then it’s not possible to use another “precomputed dictionary” at the same time. It’s one or the other.
Now, this is the current situation, but it could evolve. At the very least, some Mercurial-specific extension could solve this issue.
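
Regarding the compression-side cost mentioned in the first point: when the same parent revision serves as dictionary for several children, the analysis can at least be paid once by digesting the dictionary up front. A sketch with the bulk-processing API (whether such reuse happens often enough in Mercurial to matter is an assumption on my side; names are placeholders, error handling omitted):

#include <zstd.h>

/* Digest the parent once, then compress several children against it.
   The dictionary analysis cost is paid a single time instead of per child. */
void compress_children(void* dst[], const size_t dstCapacity[], size_t written[],
                       const void* const child[], const size_t childSize[],
                       int nbChildren,
                       const void* parent, size_t parentSize, int level)
{
    ZSTD_CDict* const cdict = ZSTD_createCDict(parent, parentSize, level);
    ZSTD_CCtx*  const cctx  = ZSTD_createCCtx();
    for (int i = 0; i < nbChildren; i++) {
        written[i] = ZSTD_compress_usingCDict(cctx,
                         dst[i], dstCapacity[i],
                         child[i], childSize[i],
                         cdict);
    }
    ZSTD_freeCCtx(cctx);
    ZSTD_freeCDict(cdict);
}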

