[PATCH 8 of 8 zstd-revlogs] [RFC] localrepo: support non-zlib compression engines in revlogs

Gregory Szorc gregory.szorc at gmail.com
Mon Jan 2 18:57:59 EST 2017


# HG changeset patch
# User Gregory Szorc <gregory.szorc at gmail.com>
# Date 1483401236 28800
#      Mon Jan 02 15:53:56 2017 -0800
# Node ID cd8f8b1b8dbae0cf2be79a732a936f4105881672
# Parent  96837453d041fe7289dd3e6450aee4961517c5b5
[RFC] localrepo: support non-zlib compression engines in revlogs

The final part of integrating the compression manager APIs into
revlog storage is the plumbing for repositories to advertise they
are using non-zlib storage and for revlogs to instantiate a non-zlib
engine.

Introduced is a format.compengine config option to control the
compression engine to use. It defaults to zlib, which is backwards
compatible.

localrepo.newreporequirements() looks at this config option and adds
a "compengine*" requirement on the named compression engine if it isn't
"zlib." Doing it at this layer means that existing repositories will
forever create new data as the engine they were initially created
with: only new repositories get the new compression engine.

The svfs opener options now define a "compengine" key denoting the
preferred compression engine. This comes from the repo's active
requirements.

A localrepository instance now updates its supported requirements
to include "compengine*" values for compression engines registered
with the compression engine manager.

Finally, revlog instances have been taught to instantiate non-zlib
compression options if the opener options instruct it to.

This patch means that we can use zstd compression in revlogs!

On an `hg pull` conversion of the mozilla-unified repo with no extra
redelta settings (like aggressivemergedeltas), we can see the impact
of zstd vs zlib in revlogs:

$ hg perfrevlogchunks -c
! chunk
! wall 2.032052 comb 2.040000 user 1.990000 sys 0.050000 (best of 5)
! wall 1.866360 comb 1.860000 user 1.820000 sys 0.040000 (best of 6)

! chunk batch
! wall 1.877261 comb 1.870000 user 1.860000 sys 0.010000 (best of 6)
! wall 1.705410 comb 1.710000 user 1.690000 sys 0.020000 (best of 6)

$ hg perfrevlogchunks -m
! chunk
! wall 2.721427 comb 2.720000 user 2.640000 sys 0.080000 (best of 4)
! wall 2.035076 comb 2.030000 user 1.950000 sys 0.080000 (best of 5)

! chunk batch
! wall 2.614561 comb 2.620000 user 2.580000 sys 0.040000 (best of 4)
! wall 1.910252 comb 1.910000 user 1.880000 sys 0.030000 (best of 6)

$ hg perfrevlog -c -d 1
! wall 4.812885 comb 4.820000 user 4.800000 sys 0.020000 (best of 3)
! wall 4.699621 comb 4.710000 user 4.700000 sys 0.010000 (best of 3)

$ hg perfrevlog -m -d 1000
! wall 34.252800 comb 34.250000 user 33.730000 sys 0.520000 (best of 3)
! wall 24.094999 comb 24.090000 user 23.320000 sys 0.770000 (best of 3)

Only modest wins for the changelog. But manifest reading is
significantly faster. I presume this is because manifest chunks tend
to be larger and zstd's performance advantage is more apparent the
longer it runs.

debugcreatestreamclonebundle (size in bytes)
zlib: 1,638,852,492
zstd: 1,680,601,332

I'm not sure why zstd on disk is larger. It could be overhead of its
framing headers. It could also be poor delta parent choices, as I
didn't force redelta on the zstd conversion of the repo.

$ hg bundle -a -t none-v2 (total CPU time)
zlib: 102.79s
zstd:  97.75s

So, marginal CPU decrease for reading all chunks in all revlogs
(this is somewhat disappointing).

$ hg bundle -a -t <engine>-v2 (total CPU time)
zlib: 191.59s
zstd: 115.36s

This last test effectively measures the difference between zlib->zlib
and zstd->zstd for revlogs to bundle. This is a rough approximation of
what a server does during `hg clone`.

Also, I haven't exhaustively tested and profiled various zstd APIs
for decompression. The C bindings expose a handful of ways to perform
decompression and it's quite possible there are some performance
optimizations on the table there.

FWIW, I have a local patch series where I enable support for lz4
revlogs using the same lz4 C extension used by Facebook's lz4revlogs
extension and surprisingly zstd's decompress speed is faster than
lz4! This shouldn't be true because, well, it is assumed lz4
decompress is insanely fast. I suspect this has more to do with the
lz4 C extension doing something silly, like reinitializing contexts.
But that's a rabbit hole for another day.

---

Things to hash out making this an RFC quality patch:

* Is using a wildcard in the requirement name acceptable? Perhaps we
  should use e.g. compengine=zstd instead?

* Does the requirement need to be specific to revlogs?

* Should the requirement name be explicitly declared by the engine
  instead of inferred by the engine name? (A extra layer of
  abstraction may help here if a single engine ever supports multiple
  requirements, like the zstd one will if dictionary compression is
  added.)

* Do more data due diligence and record in commit message.

* stream clone bundle requirements testing

* More tests around revlog format (verification of zstd frame header)

diff --git a/mercurial/help/config.txt b/mercurial/help/config.txt
--- a/mercurial/help/config.txt
+++ b/mercurial/help/config.txt
@@ -687,6 +687,11 @@ Example for ``~/.hgrc``::
 ``format``
 ----------
 
+``compengine``
+    Compression engine to use for storage.
+
+    Default: ``zlib``
+
 ``usegeneraldelta``
     Enable or disable the "generaldelta" repository format which improves
     repository compression by allowing "revlog" to store delta against arbitrary
diff --git a/mercurial/help/internals/requirements.txt b/mercurial/help/internals/requirements.txt
--- a/mercurial/help/internals/requirements.txt
+++ b/mercurial/help/internals/requirements.txt
@@ -106,3 +106,12 @@ one manifest per directory (as opposed t
 Support for this requirement was added in Mercurial 3.4 (released
 August 2015). The requirement is currently experimental and is
 disabled by default.
+
+compengine*
+===========
+
+Denotes a compression engine required to open files in the repository.
+``*`` is a compression engine name, such as ``zstd``.
+
+Support for this requirement was added in Mercurial 4.1 (released
+February 2017).
diff --git a/mercurial/localrepo.py b/mercurial/localrepo.py
--- a/mercurial/localrepo.py
+++ b/mercurial/localrepo.py
@@ -284,6 +284,12 @@ class localrepository(object):
         else:
             self.supported = self._basesupported
 
+        # Add compression engines.
+        for name in util.compengines:
+            engine = util.compengines[name]
+            if engine.revlogheader:
+                self.supported.add('compengine%s' % name)
+
         if not self.vfs.isdir():
             if create:
                 self.requirements = newreporequirements(self)
@@ -397,6 +403,10 @@ class localrepository(object):
         self.svfs.options['aggressivemergedeltas'] = aggressivemergedeltas
         self.svfs.options['lazydeltabase'] = not scmutil.gddeltaconfig(self.ui)
 
+        for r in self.requirements:
+            if r.startswith('compengine'):
+                self.svfs.options['compengine'] = r[len('compengine'):]
+
     def _writerequirements(self):
         scmutil.writerequires(self.vfs, self.requirements)
 
@@ -1994,6 +2004,17 @@ def newreporequirements(repo):
             if ui.configbool('format', 'dotencode', True):
                 requirements.add('dotencode')
 
+    compengine = ui.config('format', 'compengine', 'zlib')
+    if compengine not in util.compengines:
+        raise error.Abort(_('compression engine %s defined by '
+                            'format.compengine not available') % compengine,
+                          hint=_('run "hg debuginstall" to list available '
+                                 'compression engines'))
+
+    # zlib is the historical default and doesn't need an explicit requirement.
+    if compengine != 'zlib':
+        requirements.add('compengine%s' % compengine)
+
     if scmutil.gdinitconfig(ui):
         requirements.add('generaldelta')
     if ui.configbool('experimental', 'treemanifest', False):
diff --git a/mercurial/revlog.py b/mercurial/revlog.py
--- a/mercurial/revlog.py
+++ b/mercurial/revlog.py
@@ -228,6 +228,7 @@ class revlog(object):
         # Mapping of revision integer to full node.
         self._nodecache = {nullid: nullrev}
         self._nodepos = None
+        self._compengine = 'zlib'
 
         v = REVLOG_DEFAULT_VERSION
         opts = getattr(opener, 'options', None)
@@ -244,6 +245,8 @@ class revlog(object):
             if 'aggressivemergedeltas' in opts:
                 self._aggressivemergedeltas = opts['aggressivemergedeltas']
             self._lazydeltabase = bool(opts.get('lazydeltabase', False))
+            if 'compengine' in opts:
+                self._compengine = opts['compengine']
 
         if self._chunkcachesize <= 0:
             raise RevlogError(_('revlog chunk cache size %r is not greater '
@@ -301,7 +304,7 @@ class revlog(object):
 
     @util.propertycache
     def _compressor(self):
-        return util.compengines['zlib'].revlogcompressor()
+        return util.compengines[self._compengine].revlogcompressor()
 
     def tip(self):
         return self.node(len(self.index) - 2)
diff --git a/tests/test-help.t b/tests/test-help.t
--- a/tests/test-help.t
+++ b/tests/test-help.t
@@ -1223,6 +1223,8 @@ Separate sections from subsections
       "format"
       --------
   
+      "compengine"
+  
       "usegeneraldelta"
   
       "dotencode"
diff --git a/tests/test-repo-compengines.t b/tests/test-repo-compengines.t
new file mode 100644
--- /dev/null
+++ b/tests/test-repo-compengines.t
@@ -0,0 +1,78 @@
+A new repository uses zlib storage, which doesn't require a requirement
+
+  $ hg init default
+  $ cd default
+  $ cat .hg/requires
+  dotencode
+  fncache
+  generaldelta
+  revlogv1
+  store
+
+  $ touch foo
+  $ hg -q commit -A -m 'initial commit with a lot of repeated repeated repeated text to trigger compression'
+  $ hg debugrevlog -c | grep 0x78
+      0x78 (x)  :   1 (100.00%)
+      0x78 (x)  : 110 (100.00%)
+
+  $ cd ..
+
+Unknown compression engine to format.compengine aborts
+
+  $ hg --config format.compengine=unknown init unknown
+  abort: compression engine unknown defined by format.compengine not available
+  (run "hg debuginstall" to list available compression engines)
+  [255]
+
+A requirement specifying an unknown compression engine results in bail
+
+  $ hg init unknownrequirement
+  $ cd unknownrequirement
+  $ echo compengineunknown >> .hg/requires
+  $ hg log
+  abort: repository requires features unknown to this Mercurial: compengineunknown!
+  (see https://mercurial-scm.org/wiki/MissingRequirement for more information)
+  [255]
+
+  $ cd ..
+
+#if zstd
+
+  $ hg --config format.compengine=zstd init zstd
+  $ cd zstd
+  $ cat .hg/requires
+  compenginezstd
+  dotencode
+  fncache
+  generaldelta
+  revlogv1
+  store
+
+  $ touch foo
+  $ hg -q commit -A -m 'initial commit with a lot of repeated repeated repeated text'
+
+  $ hg debugrevlog -c | grep 0x28
+      0x28      :  1 (100.00%)
+      0x28      : 98 (100.00%)
+
+  $ cd ..
+
+Specifying a new format.compengine on an existing repo won't introduce data
+with that engine or a requirement
+
+  $ cd default
+  $ touch bar
+  $ hg --config format.compengine=zstd -q commit -A -m 'add bar with a lot of repeated repeated repeated text'
+
+  $ cat .hg/requires
+  dotencode
+  fncache
+  generaldelta
+  revlogv1
+  store
+
+  $ hg debugrevlog -c | grep 0x78
+      0x78 (x)  :   2 (100.00%)
+      0x78 (x)  : 199 (100.00%)
+
+#endif


More information about the Mercurial-devel mailing list