[PATCH 2 of 8 zstd-revlogs] revlog: move decompress() from module to revlog class (API)

Mon Jan 2 18:57:53 EST 2017

# HG changeset patch
# User Gregory Szorc <gregory.szorc at gmail.com>
# Date 1483390816 28800
#      Mon Jan 02 13:00:16 2017 -0800
# Node ID 346b798126c521eb44fe480ddd25e2779df1b39b
# Parent  6740dc7106f3ce8aec48c6ebb67153753ac88aac
revlog: move decompress() from module to revlog class (API)

Upcoming patches will convert revlogs to use the compression engine
APIs for all compression operations. The yet-to-be-introduced
APIs support a persistent "compressor" object so the same object
can be reused for multiple compression operations, leading to
better performance. In addition, compression engines like zstd
may wish to tweak compression engine state based on the revlog
(e.g. per-revlog compression dictionaries).

A global and shared decompress() function will shortly no longer
make much sense. So, we move decompress() to be a method of the
revlog class. It joins compress() there.
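
For readers who want to see the header-byte routing in isolation, here is
a minimal standalone sketch of what the new method does. It is not the
Mercurial code: SketchRevlog and the local RevlogError are hypothetical
stand-ins, it targets Python 3 bytes rather than the Python 2 str used in
this patch, and it returns a slice where the real method returns
util.buffer(data, 1).

```python
import zlib


class RevlogError(Exception):
    """Hypothetical stand-in for Mercurial's RevlogError."""


class SketchRevlog(object):
    """Toy class mirroring the dispatch in revlog.decompress()."""

    def decompress(self, data):
        # Empty chunks are returned unchanged.
        if not data:
            return data
        # Slice instead of indexing so the comparison works on bytes.
        t = data[0:1]
        if t == b'\0':
            # Leading NUL: stored as-is, header byte is part of the data.
            return data
        if t == b'x':
            # 'x' (0x78) is the first byte of a default zlib stream.
            try:
                return zlib.decompress(data)
            except zlib.error as e:
                raise RevlogError('revlog decompress error: %s' % e)
        if t == b'u':
            # 'u' marks uncompressed data; strip the marker byte.
            return data[1:]
        raise RevlogError('unknown compression type %r' % t)


# Callers now go through an instance instead of a module function.
rl = SketchRevlog()
roundtrip = rl.decompress(zlib.compress(b'hello revlog'))
```

Because the method hangs off the instance, a later patch can attach
per-revlog state (such as a reusable decompressor) without changing
callers again.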

On the mozilla-unified repo, we can measure the impact of this change
on reading performance:

$ hg perfrevlogchunks -c
! chunk
! wall 1.932573 comb 1.930000 user 1.900000 sys 0.030000 (best of 6)
! wall 1.955183 comb 1.960000 user 1.930000 sys 0.030000 (best of 6)
! chunk batch
! wall 1.787879 comb 1.780000 user 1.770000 sys 0.010000 (best of 6)
! wall 1.774444 comb 1.770000 user 1.750000 sys 0.020000 (best of 6)

In each pair above, the first line is before this patch and the second
is after. "chunk" appeared to become slower but "chunk batch" got
faster. After running both sets multiple times, the numbers converge
across all runs. This tells me that this refactor has no perceptible
performance impact.

diff --git a/contrib/perf.py b/contrib/perf.py
--- a/contrib/perf.py
+++ b/contrib/perf.py
@@ -989,7 +989,7 @@ def perfrevlogrevision(ui, repo, file_, 
                 chunkstart += (rev + 1) * iosize
             chunklength = length(rev)
             b = buffer(data, chunkstart - offset, chunklength)
-            revlog.decompress(b)
+            r.decompress(b)
 
     def dopatch(text, bins):
         if not cache:
diff --git a/mercurial/revlog.py b/mercurial/revlog.py
--- a/mercurial/revlog.py
+++ b/mercurial/revlog.py
@@ -99,22 +99,6 @@ def hash(text, p1, p2):
     s.update(text)
     return s.digest()
 
-def decompress(bin):
-    """ decompress the given input """
-    if not bin:
-        return bin
-    t = bin[0]
-    if t == '\0':
-        return bin
-    if t == 'x':
-        try:
-            return _decompress(bin)
-        except zlib.error as e:
-            raise RevlogError(_("revlog decompress error: %s") % str(e))
-    if t == 'u':
-        return util.buffer(bin, 1)
-    raise RevlogError(_("unknown compression type %r") % t)
-
 # index v0:
 #  4 bytes: offset
 #  4 bytes: compressed length
@@ -1138,7 +1122,7 @@ class revlog(object):
 
         Returns a str holding uncompressed data for the requested revision.
         """
-        return decompress(self._chunkraw(rev, rev, df=df)[1])
+        return self.decompress(self._chunkraw(rev, rev, df=df)[1])
 
     def _chunks(self, revs, df=None):
         """Obtain decompressed chunks for the specified revisions.
@@ -1171,12 +1155,13 @@ class revlog(object):
             # 2G on Windows
             return [self._chunk(rev, df=df) for rev in revs]
 
+        decomp = self.decompress
         for rev in revs:
             chunkstart = start(rev)
             if inline:
                 chunkstart += (rev + 1) * iosize
             chunklength = length(rev)
-            ladd(decompress(buffer(data, chunkstart - offset, chunklength)))
+            ladd(decomp(buffer(data, chunkstart - offset, chunklength)))
 
         return l
 
@@ -1392,6 +1377,26 @@ class revlog(object):
             return ('u', text)
         return ("", bin)
 
+    def decompress(self, data):
+        """Decompress a revlog chunk.
+
+        The chunk is expected to begin with a header identifying the
+        format type so it can be routed to an appropriate decompressor.
+        """
+        if not data:
+            return data
+        t = data[0]
+        if t == '\0':
+            return data
+        if t == 'x':
+            try:
+                return _decompress(data)
+            except zlib.error as e:
+                raise RevlogError(_('revlog decompress error: %s') % str(e))
+        if t == 'u':
+            return util.buffer(data, 1)
+        raise RevlogError(_('unknown compression type %r') % t)
+
     def _isgooddelta(self, d, textlen):
         """Returns True if the given delta is good. Good means that it is within
         the disk span, disk size, and chain length bounds that we know to be