[PATCH 8 of 8 zstd-revlogs] [RFC] localrepo: support non-zlib compression engines in revlogs

Gregory Szorc gregory.szorc at gmail.com
Thu Jan 5 02:51:18 EST 2017


On Wed, Jan 4, 2017 at 11:27 PM, Mike Hommey <mh at glandium.org> wrote:

> On Wed, Jan 04, 2017 at 11:18:21PM -0800, Gregory Szorc wrote:
> > * The lz4 performance note in the commit message isn't very accurate.
> > There is a small subset of operations where the zstd python bindings are
> > as fast as lz4. I'll strike the comment from the next version.
> >
> > * zlib has checksums built into the compression format as it is used in
> > hg today. The patches as written do not have zstd writing checksums.
> >
> > * Enabling checksums in zstd appears to have a negligible impact on
> > performance.
> >
> > * Reusing zstd compression and decompression "contexts" can make a
> > significant difference to performance. Having a reusable "compressor"
> > object that allows "context" reuse should increase performance for zstd.
> >
> > * For the changelog, zstd level=1 versus level=3 makes almost no
> > difference in compression ratio but does speed up compression a bit.
> > Now I'm considering per-revlog settings for the compressors.
> >
> > * zstd compression dictionaries speed up *both* compression and
> > decompression. On changelog chunks, dictionaries improve decompression
> > throughput from ~180 MB/s to ~300 MB/s. That's nothing to sneeze at.
> >
> > * When dictionaries are used, zstd level=1 compresses the changelog
> > considerably faster than level=3: ~160 MB/s vs. ~27 MB/s.
> >
> > * I was going to hold off on seriously investigating compression
> > dictionaries, but since there are massive potential perf wins, I think
> > it should be done sooner rather than later.
>
> All this perf information w.r.t. dictionaries makes me wonder if there is
> a corpus of non-English changesets that could be used for some different
> performance measurements. It's nice that we know things are better for
> English content, but version control is not exclusive to people writing
> everything in English.
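
As an aside on the "context" reuse and checksum settings mentioned in the
bullets above: with the python-zstandard bindings that looks roughly like the
sketch below (just an illustration; the exact arguments may change as the
bindings evolve):

    import zstandard

    # Build the compression/decompression contexts once and reuse them for
    # every chunk; constructing a fresh context per chunk is the slow path.
    cctx = zstandard.ZstdCompressor(level=3,
                                    write_checksum=True,     # embed frame checksums
                                    write_content_size=True)
    dctx = zstandard.ZstdDecompressor()

    chunks = [b'revlog chunk 1', b'revlog chunk 2']
    compressed = [cctx.compress(c) for c in chunks]
    assert [dctx.decompress(c) for c in compressed] == chunks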


Ideally, I think dictionaries would be per-repository, computed from the data
within. That would yield the best compression ratios. As long as we're not
transferring compressed frames tied to a specific dictionary across the wire
[to a peer without that dictionary], we should be fine. Of course, that
introduces the complication of when exactly, and how, to compute the
dictionary. You need data to seed it, which gets weird for things like clones,
since you start from nothing. And dictionaries can change over time if the
underlying data changes. So now you are in rewriting-revlogs or
maintaining-multiple-dictionaries territory. The zstd frame contains the
dictionary ID, so supporting multiple dictionaries is something we can do.
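
To make that concrete, a per-repository dictionary could be trained from
sampled revlog chunks, and the dictionary ID recorded in each frame would tell
a reader which dictionary to load. A rough sketch with python-zstandard (the
helper names and the chunk sampling are hypothetical, not anything in these
patches):

    import zstandard

    def build_repo_dictionary(chunks, dict_size=131072):
        # `chunks` would be fulltext revlog entries sampled from the
        # repository; zstd needs a reasonably large, varied sample set.
        return zstandard.train_dictionary(dict_size, chunks)

    def make_contexts(dict_data, level=1):
        # The zstd frame header records dict_data.dict_id(), so a reader can
        # pick the right dictionary when a repository accumulates several.
        cctx = zstandard.ZstdCompressor(level=level, dict_data=dict_data)
        dctx = zstandard.ZstdDecompressor(dict_data=dict_data)
        return cctx, dctx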

Other ideas and complications:

* The server hands out a precomputed dictionary at clone time, and the client
applies that dictionary when compressing received changegroup data. In this
scenario, we can instruct server operators to set up a cron job to
periodically regenerate the dictionary data.
* We seed a dictionary from N popular public VCS repos, bake it into the
Mercurial distribution, and defer dynamic dictionary complexities to later.
* Computing a dictionary is very CPU intensive. I think it is an order of
magnitude slower than compression. It's not something we can just do
whenever we feel like it.

Also, no matter how large I tell zstd to make the dictionary, it always comes
back at around 100 KB (at least for mozilla-unified's changelog and manifest).
We're not talking about a lot of data for the gain it yields. It's the
dictionary management I worry about.
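
For what it's worth, checking the requested versus actual dictionary size is
trivial (sketch; `chunks` again stands in for real revlog data):

    import zstandard

    def trained_dict_size(chunks, requested_size=1048576):
        # Even when asking for ~1 MB, training on changelog/manifest chunks
        # comes back with a dictionary of roughly 100 KB.
        dict_data = zstandard.train_dictionary(requested_size, chunks)
        return len(dict_data.as_bytes())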