[PATCH 8 of 8 zstd-revlogs] [RFC] localrepo: support non-zlib compression engines in revlogs
Pierre-Yves David
pierre-yves.david at ens-lyon.org
Fri Jan 13 09:38:09 EST 2017
On 01/03/2017 12:57 AM, Gregory Szorc wrote:
> # HG changeset patch
> # User Gregory Szorc <gregory.szorc at gmail.com>
> # Date 1483401236 28800
> # Mon Jan 02 15:53:56 2017 -0800
> # Node ID cd8f8b1b8dbae0cf2be79a732a936f4105881672
> # Parent 96837453d041fe7289dd3e6450aee4961517c5b5
> [RFC] localrepo: support non-zlib compression engines in revlogs
Overall the code and the changes looks good to me. I gave my view on
most of the question you raised and made a small proposal to rename the
config from 'format.compengine' to 'format.compression'.
I'm looking forward to see a non-RFC version!
>
> The final part of integrating the compression manager APIs into
> revlog storage is the plumbing for repositories to advertise they
> are using non-zlib storage and for revlogs to instantiate a non-zlib
> engine.
>
> Introduced is a format.compengine config option to control the
> compression engine to use. It defaults to zlib, which is backwards
> compatible.
>
> localrepo.newreporequirements() looks at this config option and adds
> a "compengine*" requirement on the named compression engine if it isn't
> "zlib." Doing it at this layer means that existing repositories will
> forever create new data as the engine they were initially created
> with: only new repositories get the new compression engine.
>
> The svfs opener options now define a "compengine" key denoting the
> preferred compression engine. This comes from the repo's active
> requirements.
>
> A localrepository instance now updates its supported requirements
> to include "compengine*" values for compression engines registered
> with the compression engine manager.
>
> Finally, revlog instances have been taught to instantiate non-zlib
> compression options if the opener options instruct it to.
>
> This patch means that we can use zstd compression in revlogs!
>
> On an `hg pull` conversion of the mozilla-unified repo with no extra
> redelta settings (like aggressivemergedeltas), we can see the impact
> of zstd vs zlib in revlogs:
>
> $ hg perfrevlogchunks -c
> ! chunk
> ! wall 2.032052 comb 2.040000 user 1.990000 sys 0.050000 (best of 5)
> ! wall 1.866360 comb 1.860000 user 1.820000 sys 0.040000 (best of 6)
>
> ! chunk batch
> ! wall 1.877261 comb 1.870000 user 1.860000 sys 0.010000 (best of 6)
> ! wall 1.705410 comb 1.710000 user 1.690000 sys 0.020000 (best of 6)
>
> $ hg perfrevlogchunks -m
> ! chunk
> ! wall 2.721427 comb 2.720000 user 2.640000 sys 0.080000 (best of 4)
> ! wall 2.035076 comb 2.030000 user 1.950000 sys 0.080000 (best of 5)
>
> ! chunk batch
> ! wall 2.614561 comb 2.620000 user 2.580000 sys 0.040000 (best of 4)
> ! wall 1.910252 comb 1.910000 user 1.880000 sys 0.030000 (best of 6)
>
> $ hg perfrevlog -c -d 1
> ! wall 4.812885 comb 4.820000 user 4.800000 sys 0.020000 (best of 3)
> ! wall 4.699621 comb 4.710000 user 4.700000 sys 0.010000 (best of 3)
>
> $ hg perfrevlog -m -d 1000
> ! wall 34.252800 comb 34.250000 user 33.730000 sys 0.520000 (best of 3)
> ! wall 24.094999 comb 24.090000 user 23.320000 sys 0.770000 (best of 3)
>
> Only modest wins for the changelog. But manifest reading is
> significantly faster. I presume this is because manifest chunks tend
> to be larger and zstd's performance advantage is more apparent the
> longer it runs.
>
> debugcreatestreamclonebundle (size in bytes)
> zlib: 1,638,852,492
> zstd: 1,680,601,332
>
> I'm not sure why zstd on disk is larger. It could be overhead of its
> framing headers. It could also be poor delta parent choices, as I
> didn't force redelta on the zstd conversion of the repo.
>
> $ hg bundle -a -t none-v2 (total CPU time)
> zlib: 102.79s
> zstd: 97.75s
>
> So, marginal CPU decrease for reading all chunks in all revlogs
> (this is somewhat disappointing).
>
> $ hg bundle -a -t <engine>-v2 (total CPU time)
> zlib: 191.59s
> zstd: 115.36s
>
> This last test effectively measures the difference between zlib->zlib
> and zstd->zstd for revlogs to bundle. This is a rough approximation of
> what a server does during `hg clone`.
>
> Also, I haven't exhaustively tested and profiled various zstd APIs
> for decompression. The C bindings expose a handful of ways to perform
> decompression and it's quite possible there are some performance
> optimizations on the table there.
>
> FWIW, I have a local patch series where I enable support for lz4
> revlogs using the same lz4 C extension used by Facebook's lz4revlogs
> extension and surprisingly zstd's decompress speed is faster than
> lz4! This shouldn't be true because, well, it is assumed lz4
> decompress is insanely fast. I suspect this has more to do with the
> lz4 C extension doing something silly, like reinitializing contexts.
> But that's a rabbit hole for another day.
>
> ---
>
> Things to hash out making this an RFC quality patch:
>
> * Is using a wildcard in the requirement name acceptable? Perhaps we
> should use e.g. compengine=zstd instead?
Explicit name of the compengine in the requirement seems clearer to me
and better.
> * Does the requirement need to be specific to revlogs?
I'm not sure what you mean here. Are you thinking about compression
applied to non-revlogs file? Do we have any of that right now? What
would be the benefit to not use the same compression for all?
>
> * Should the requirement name be explicitly declared by the engine
> instead of inferred by the engine name? (A extra layer of
> abstraction may help here if a single engine ever supports multiple
> requirements, like the zstd one will if dictionary compression is
> added.)
Having explicit declaration of the requirement name seems clearer to me
(and therefor better).
> * Do more data due diligence and record in commit message.
As long as we are not moving the default to zstd, I don't think we need
the full data set that would be necessary to
> * stream clone bundle requirements testing
>
> * More tests around revlog format (verification of zstd frame header)
>
> diff --git a/mercurial/help/config.txt b/mercurial/help/config.txt
> --- a/mercurial/help/config.txt
> +++ b/mercurial/help/config.txt
> @@ -687,6 +687,11 @@ Example for ``~/.hgrc``::
> ``format``
> ----------
>
> +``compengine``
> + Compression engine to use for storage.
> +
> + Default: ``zlib``
Any reason to use "compengine" as the name it read a bit obscure to me.
What about "compression" ?
> +
> ``usegeneraldelta``
> Enable or disable the "generaldelta" repository format which improves
> repository compression by allowing "revlog" to store delta against arbitrary
> diff --git a/mercurial/help/internals/requirements.txt b/mercurial/help/internals/requirements.txt
> --- a/mercurial/help/internals/requirements.txt
> +++ b/mercurial/help/internals/requirements.txt
> @@ -106,3 +106,12 @@ one manifest per directory (as opposed t
> Support for this requirement was added in Mercurial 3.4 (released
> August 2015). The requirement is currently experimental and is
> disabled by default.
> +
> +compengine*
> +===========
> +
> +Denotes a compression engine required to open files in the repository.
> +``*`` is a compression engine name, such as ``zstd``.
> +
> +Support for this requirement was added in Mercurial 4.1 (released
> +February 2017).
Don't forget to also update the wiki page.
> diff --git a/mercurial/localrepo.py b/mercurial/localrepo.py
> --- a/mercurial/localrepo.py
> +++ b/mercurial/localrepo.py
> @@ -284,6 +284,12 @@ class localrepository(object):
> else:
> self.supported = self._basesupported
>
> + # Add compression engines.
> + for name in util.compengines:
> + engine = util.compengines[name]
> + if engine.revlogheader:
> + self.supported.add('compengine%s' % name)
> +
> if not self.vfs.isdir():
> if create:
> self.requirements = newreporequirements(self)
> @@ -397,6 +403,10 @@ class localrepository(object):
> self.svfs.options['aggressivemergedeltas'] = aggressivemergedeltas
> self.svfs.options['lazydeltabase'] = not scmutil.gddeltaconfig(self.ui)
>
> + for r in self.requirements:
> + if r.startswith('compengine'):
> + self.svfs.options['compengine'] = r[len('compengine'):]
> +
> def _writerequirements(self):
> scmutil.writerequires(self.vfs, self.requirements)
>
> @@ -1994,6 +2004,17 @@ def newreporequirements(repo):
> if ui.configbool('format', 'dotencode', True):
> requirements.add('dotencode')
>
> + compengine = ui.config('format', 'compengine', 'zlib')
> + if compengine not in util.compengines:
> + raise error.Abort(_('compression engine %s defined by '
> + 'format.compengine not available') % compengine,
> + hint=_('run "hg debuginstall" to list available '
> + 'compression engines'))
> +
> + # zlib is the historical default and doesn't need an explicit requirement.
> + if compengine != 'zlib':
> + requirements.add('compengine%s' % compengine)
> +
> if scmutil.gdinitconfig(ui):
> requirements.add('generaldelta')
> if ui.configbool('experimental', 'treemanifest', False):
> diff --git a/mercurial/revlog.py b/mercurial/revlog.py
> --- a/mercurial/revlog.py
> +++ b/mercurial/revlog.py
> @@ -228,6 +228,7 @@ class revlog(object):
> # Mapping of revision integer to full node.
> self._nodecache = {nullid: nullrev}
> self._nodepos = None
> + self._compengine = 'zlib'
>
> v = REVLOG_DEFAULT_VERSION
> opts = getattr(opener, 'options', None)
> @@ -244,6 +245,8 @@ class revlog(object):
> if 'aggressivemergedeltas' in opts:
> self._aggressivemergedeltas = opts['aggressivemergedeltas']
> self._lazydeltabase = bool(opts.get('lazydeltabase', False))
> + if 'compengine' in opts:
> + self._compengine = opts['compengine']
>
> if self._chunkcachesize <= 0:
> raise RevlogError(_('revlog chunk cache size %r is not greater '
> @@ -301,7 +304,7 @@ class revlog(object):
>
> @util.propertycache
> def _compressor(self):
> - return util.compengines['zlib'].revlogcompressor()
> + return util.compengines[self._compengine].revlogcompressor()
>
> def tip(self):
> return self.node(len(self.index) - 2)
> diff --git a/tests/test-help.t b/tests/test-help.t
> --- a/tests/test-help.t
> +++ b/tests/test-help.t
> @@ -1223,6 +1223,8 @@ Separate sections from subsections
> "format"
> --------
>
> + "compengine"
> +
> "usegeneraldelta"
>
> "dotencode"
> diff --git a/tests/test-repo-compengines.t b/tests/test-repo-compengines.t
> new file mode 100644
> --- /dev/null
> +++ b/tests/test-repo-compengines.t
> @@ -0,0 +1,78 @@
> +A new repository uses zlib storage, which doesn't require a requirement
> +
> + $ hg init default
> + $ cd default
> + $ cat .hg/requires
> + dotencode
> + fncache
> + generaldelta
> + revlogv1
> + store
> +
> + $ touch foo
> + $ hg -q commit -A -m 'initial commit with a lot of repeated repeated repeated text to trigger compression'
> + $ hg debugrevlog -c | grep 0x78
> + 0x78 (x) : 1 (100.00%)
> + 0x78 (x) : 110 (100.00%)
> +
> + $ cd ..
> +
> +Unknown compression engine to format.compengine aborts
> +
> + $ hg --config format.compengine=unknown init unknown
> + abort: compression engine unknown defined by format.compengine not available
> + (run "hg debuginstall" to list available compression engines)
> + [255]
> +
> +A requirement specifying an unknown compression engine results in bail
> +
> + $ hg init unknownrequirement
> + $ cd unknownrequirement
> + $ echo compengineunknown >> .hg/requires
> + $ hg log
> + abort: repository requires features unknown to this Mercurial: compengineunknown!
> + (see https://mercurial-scm.org/wiki/MissingRequirement for more information)
> + [255]
> +
> + $ cd ..
> +
> +#if zstd
> +
> + $ hg --config format.compengine=zstd init zstd
> + $ cd zstd
> + $ cat .hg/requires
> + compenginezstd
> + dotencode
> + fncache
> + generaldelta
> + revlogv1
> + store
> +
> + $ touch foo
> + $ hg -q commit -A -m 'initial commit with a lot of repeated repeated repeated text'
> +
> + $ hg debugrevlog -c | grep 0x28
> + 0x28 : 1 (100.00%)
> + 0x28 : 98 (100.00%)
> +
> + $ cd ..
> +
> +Specifying a new format.compengine on an existing repo won't introduce data
> +with that engine or a requirement
> +
> + $ cd default
> + $ touch bar
> + $ hg --config format.compengine=zstd -q commit -A -m 'add bar with a lot of repeated repeated repeated text'
> +
> + $ cat .hg/requires
> + dotencode
> + fncache
> + generaldelta
> + revlogv1
> + store
> +
> + $ hg debugrevlog -c | grep 0x78
> + 0x78 (x) : 2 (100.00%)
> + 0x78 (x) : 199 (100.00%)
> +
> +#endif
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel at mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>
--
Pierre-Yves David
More information about the Mercurial-devel
mailing list