[PATCH 8 of 8 zstd-revlogs] [RFC] localrepo: support non-zlib compression engines in revlogs

Pierre-Yves David pierre-yves.david at ens-lyon.org
Fri Jan 13 09:38:09 EST 2017



On 01/03/2017 12:57 AM, Gregory Szorc wrote:
> # HG changeset patch
> # User Gregory Szorc <gregory.szorc at gmail.com>
> # Date 1483401236 28800
> #      Mon Jan 02 15:53:56 2017 -0800
> # Node ID cd8f8b1b8dbae0cf2be79a732a936f4105881672
> # Parent  96837453d041fe7289dd3e6450aee4961517c5b5
> [RFC] localrepo: support non-zlib compression engines in revlogs

Overall the code and the changes looks good to me. I gave my view on 
most of the question you raised and made a small proposal to rename the 
config from 'format.compengine' to 'format.compression'.

I'm looking forward to see a non-RFC version!

>
> The final part of integrating the compression manager APIs into
> revlog storage is the plumbing for repositories to advertise they
> are using non-zlib storage and for revlogs to instantiate a non-zlib
> engine.
>
> Introduced is a format.compengine config option to control the
> compression engine to use. It defaults to zlib, which is backwards
> compatible.
>
> localrepo.newreporequirements() looks at this config option and adds
> a "compengine*" requirement on the named compression engine if it isn't
> "zlib." Doing it at this layer means that existing repositories will
> forever create new data as the engine they were initially created
> with: only new repositories get the new compression engine.
>
> The svfs opener options now define a "compengine" key denoting the
> preferred compression engine. This comes from the repo's active
> requirements.
>
> A localrepository instance now updates its supported requirements
> to include "compengine*" values for compression engines registered
> with the compression engine manager.
>
> Finally, revlog instances have been taught to instantiate non-zlib
> compression options if the opener options instruct it to.
>
> This patch means that we can use zstd compression in revlogs!
>
> On an `hg pull` conversion of the mozilla-unified repo with no extra
> redelta settings (like aggressivemergedeltas), we can see the impact
> of zstd vs zlib in revlogs:
>
> $ hg perfrevlogchunks -c
> ! chunk
> ! wall 2.032052 comb 2.040000 user 1.990000 sys 0.050000 (best of 5)
> ! wall 1.866360 comb 1.860000 user 1.820000 sys 0.040000 (best of 6)
>
> ! chunk batch
> ! wall 1.877261 comb 1.870000 user 1.860000 sys 0.010000 (best of 6)
> ! wall 1.705410 comb 1.710000 user 1.690000 sys 0.020000 (best of 6)
>
> $ hg perfrevlogchunks -m
> ! chunk
> ! wall 2.721427 comb 2.720000 user 2.640000 sys 0.080000 (best of 4)
> ! wall 2.035076 comb 2.030000 user 1.950000 sys 0.080000 (best of 5)
>
> ! chunk batch
> ! wall 2.614561 comb 2.620000 user 2.580000 sys 0.040000 (best of 4)
> ! wall 1.910252 comb 1.910000 user 1.880000 sys 0.030000 (best of 6)
>
> $ hg perfrevlog -c -d 1
> ! wall 4.812885 comb 4.820000 user 4.800000 sys 0.020000 (best of 3)
> ! wall 4.699621 comb 4.710000 user 4.700000 sys 0.010000 (best of 3)
>
> $ hg perfrevlog -m -d 1000
> ! wall 34.252800 comb 34.250000 user 33.730000 sys 0.520000 (best of 3)
> ! wall 24.094999 comb 24.090000 user 23.320000 sys 0.770000 (best of 3)
>
> Only modest wins for the changelog. But manifest reading is
> significantly faster. I presume this is because manifest chunks tend
> to be larger and zstd's performance advantage is more apparent the
> longer it runs.
>
> debugcreatestreamclonebundle (size in bytes)
> zlib: 1,638,852,492
> zstd: 1,680,601,332
>
> I'm not sure why zstd on disk is larger. It could be overhead of its
> framing headers. It could also be poor delta parent choices, as I
> didn't force redelta on the zstd conversion of the repo.
>
> $ hg bundle -a -t none-v2 (total CPU time)
> zlib: 102.79s
> zstd:  97.75s
>
> So, marginal CPU decrease for reading all chunks in all revlogs
> (this is somewhat disappointing).
>
> $ hg bundle -a -t <engine>-v2 (total CPU time)
> zlib: 191.59s
> zstd: 115.36s
>
> This last test effectively measures the difference between zlib->zlib
> and zstd->zstd for revlogs to bundle. This is a rough approximation of
> what a server does during `hg clone`.
>
> Also, I haven't exhaustively tested and profiled various zstd APIs
> for decompression. The C bindings expose a handful of ways to perform
> decompression and it's quite possible there are some performance
> optimizations on the table there.
>
> FWIW, I have a local patch series where I enable support for lz4
> revlogs using the same lz4 C extension used by Facebook's lz4revlogs
> extension and surprisingly zstd's decompress speed is faster than
> lz4! This shouldn't be true because, well, it is assumed lz4
> decompress is insanely fast. I suspect this has more to do with the
> lz4 C extension doing something silly, like reinitializing contexts.
> But that's a rabbit hole for another day.
>
> ---
>
> Things to hash out making this an RFC quality patch:
>
> * Is using a wildcard in the requirement name acceptable? Perhaps we
>   should use e.g. compengine=zstd instead?

Explicit name of the compengine in the requirement seems clearer to me 
and better.

> * Does the requirement need to be specific to revlogs?

I'm not sure what you mean here. Are you thinking about compression 
applied to non-revlogs file? Do we have any of that right now? What 
would be the benefit to not use the same compression for all?

>
> * Should the requirement name be explicitly declared by the engine
>   instead of inferred by the engine name? (A extra layer of
>   abstraction may help here if a single engine ever supports multiple
>   requirements, like the zstd one will if dictionary compression is
>   added.)

Having explicit declaration of the requirement name seems clearer to me 
(and therefor better).

> * Do more data due diligence and record in commit message.

As long as we are not moving the default to zstd, I don't think we need 
the full data set that would be necessary to

> * stream clone bundle requirements testing
>
> * More tests around revlog format (verification of zstd frame header)
>
> diff --git a/mercurial/help/config.txt b/mercurial/help/config.txt
> --- a/mercurial/help/config.txt
> +++ b/mercurial/help/config.txt
> @@ -687,6 +687,11 @@ Example for ``~/.hgrc``::
>  ``format``
>  ----------
>
> +``compengine``
> +    Compression engine to use for storage.
> +
> +    Default: ``zlib``

Any reason to use "compengine" as the name it read a bit obscure to me. 
What about "compression" ?

> +
>  ``usegeneraldelta``
>      Enable or disable the "generaldelta" repository format which improves
>      repository compression by allowing "revlog" to store delta against arbitrary
> diff --git a/mercurial/help/internals/requirements.txt b/mercurial/help/internals/requirements.txt
> --- a/mercurial/help/internals/requirements.txt
> +++ b/mercurial/help/internals/requirements.txt
> @@ -106,3 +106,12 @@ one manifest per directory (as opposed t
>  Support for this requirement was added in Mercurial 3.4 (released
>  August 2015). The requirement is currently experimental and is
>  disabled by default.
> +
> +compengine*
> +===========
> +
> +Denotes a compression engine required to open files in the repository.
> +``*`` is a compression engine name, such as ``zstd``.
> +
> +Support for this requirement was added in Mercurial 4.1 (released
> +February 2017).


Don't forget to also update the wiki page.

> diff --git a/mercurial/localrepo.py b/mercurial/localrepo.py
> --- a/mercurial/localrepo.py
> +++ b/mercurial/localrepo.py
> @@ -284,6 +284,12 @@ class localrepository(object):
>          else:
>              self.supported = self._basesupported
>
> +        # Add compression engines.
> +        for name in util.compengines:
> +            engine = util.compengines[name]
> +            if engine.revlogheader:
> +                self.supported.add('compengine%s' % name)
> +
>          if not self.vfs.isdir():
>              if create:
>                  self.requirements = newreporequirements(self)
> @@ -397,6 +403,10 @@ class localrepository(object):
>          self.svfs.options['aggressivemergedeltas'] = aggressivemergedeltas
>          self.svfs.options['lazydeltabase'] = not scmutil.gddeltaconfig(self.ui)
>
> +        for r in self.requirements:
> +            if r.startswith('compengine'):
> +                self.svfs.options['compengine'] = r[len('compengine'):]
> +
>      def _writerequirements(self):
>          scmutil.writerequires(self.vfs, self.requirements)
>
> @@ -1994,6 +2004,17 @@ def newreporequirements(repo):
>              if ui.configbool('format', 'dotencode', True):
>                  requirements.add('dotencode')
>
> +    compengine = ui.config('format', 'compengine', 'zlib')
> +    if compengine not in util.compengines:
> +        raise error.Abort(_('compression engine %s defined by '
> +                            'format.compengine not available') % compengine,
> +                          hint=_('run "hg debuginstall" to list available '
> +                                 'compression engines'))
> +
> +    # zlib is the historical default and doesn't need an explicit requirement.
> +    if compengine != 'zlib':
> +        requirements.add('compengine%s' % compengine)
> +
>      if scmutil.gdinitconfig(ui):
>          requirements.add('generaldelta')
>      if ui.configbool('experimental', 'treemanifest', False):
> diff --git a/mercurial/revlog.py b/mercurial/revlog.py
> --- a/mercurial/revlog.py
> +++ b/mercurial/revlog.py
> @@ -228,6 +228,7 @@ class revlog(object):
>          # Mapping of revision integer to full node.
>          self._nodecache = {nullid: nullrev}
>          self._nodepos = None
> +        self._compengine = 'zlib'
>
>          v = REVLOG_DEFAULT_VERSION
>          opts = getattr(opener, 'options', None)
> @@ -244,6 +245,8 @@ class revlog(object):
>              if 'aggressivemergedeltas' in opts:
>                  self._aggressivemergedeltas = opts['aggressivemergedeltas']
>              self._lazydeltabase = bool(opts.get('lazydeltabase', False))
> +            if 'compengine' in opts:
> +                self._compengine = opts['compengine']
>
>          if self._chunkcachesize <= 0:
>              raise RevlogError(_('revlog chunk cache size %r is not greater '
> @@ -301,7 +304,7 @@ class revlog(object):
>
>      @util.propertycache
>      def _compressor(self):
> -        return util.compengines['zlib'].revlogcompressor()
> +        return util.compengines[self._compengine].revlogcompressor()
>
>      def tip(self):
>          return self.node(len(self.index) - 2)
> diff --git a/tests/test-help.t b/tests/test-help.t
> --- a/tests/test-help.t
> +++ b/tests/test-help.t
> @@ -1223,6 +1223,8 @@ Separate sections from subsections
>        "format"
>        --------
>
> +      "compengine"
> +
>        "usegeneraldelta"
>
>        "dotencode"
> diff --git a/tests/test-repo-compengines.t b/tests/test-repo-compengines.t
> new file mode 100644
> --- /dev/null
> +++ b/tests/test-repo-compengines.t
> @@ -0,0 +1,78 @@
> +A new repository uses zlib storage, which doesn't require a requirement
> +
> +  $ hg init default
> +  $ cd default
> +  $ cat .hg/requires
> +  dotencode
> +  fncache
> +  generaldelta
> +  revlogv1
> +  store
> +
> +  $ touch foo
> +  $ hg -q commit -A -m 'initial commit with a lot of repeated repeated repeated text to trigger compression'
> +  $ hg debugrevlog -c | grep 0x78
> +      0x78 (x)  :   1 (100.00%)
> +      0x78 (x)  : 110 (100.00%)
> +
> +  $ cd ..
> +
> +Unknown compression engine to format.compengine aborts
> +
> +  $ hg --config format.compengine=unknown init unknown
> +  abort: compression engine unknown defined by format.compengine not available
> +  (run "hg debuginstall" to list available compression engines)
> +  [255]
> +
> +A requirement specifying an unknown compression engine results in bail
> +
> +  $ hg init unknownrequirement
> +  $ cd unknownrequirement
> +  $ echo compengineunknown >> .hg/requires
> +  $ hg log
> +  abort: repository requires features unknown to this Mercurial: compengineunknown!
> +  (see https://mercurial-scm.org/wiki/MissingRequirement for more information)
> +  [255]
> +
> +  $ cd ..
> +
> +#if zstd
> +
> +  $ hg --config format.compengine=zstd init zstd
> +  $ cd zstd
> +  $ cat .hg/requires
> +  compenginezstd
> +  dotencode
> +  fncache
> +  generaldelta
> +  revlogv1
> +  store
> +
> +  $ touch foo
> +  $ hg -q commit -A -m 'initial commit with a lot of repeated repeated repeated text'
> +
> +  $ hg debugrevlog -c | grep 0x28
> +      0x28      :  1 (100.00%)
> +      0x28      : 98 (100.00%)
> +
> +  $ cd ..
> +
> +Specifying a new format.compengine on an existing repo won't introduce data
> +with that engine or a requirement
> +
> +  $ cd default
> +  $ touch bar
> +  $ hg --config format.compengine=zstd -q commit -A -m 'add bar with a lot of repeated repeated repeated text'
> +
> +  $ cat .hg/requires
> +  dotencode
> +  fncache
> +  generaldelta
> +  revlogv1
> +  store
> +
> +  $ hg debugrevlog -c | grep 0x78
> +      0x78 (x)  :   2 (100.00%)
> +      0x78 (x)  : 199 (100.00%)
> +
> +#endif
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel at mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>

-- 
Pierre-Yves David


More information about the Mercurial-devel mailing list