[PATCH RFC] internals: define revlog version 2

Gregory Szorc gregory.szorc at gmail.com
Tue May 16 05:15:49 UTC 2017


# HG changeset patch
# User Gregory Szorc <gregory.szorc at gmail.com>
# Date 1494911192 25200
#      Mon May 15 22:06:32 2017 -0700
# Node ID 36035e27aeba572c06d450310ac6b041169b3f5e
# Parent  49621512d133723fcf54d9c5cd865a40a81446e9
internals: define revlog version 2

(This is an RFC. I think I have a compelling enough reason to justify
revlog v2 for zstd. But if we go this route, we should - as Augie
says - drive a truck through the opening.)

Today, revlogs define the encoding of revision data using an inline
1 byte header on each revision. This solution is nice because it is
extensible. But that extensibility comes at a cost: performance.

In order to decode the raw revision data, you need to first read it.
The way revlog reading is implemented (in Python) today is we iterate
through raw chunks, peek at the first byte, then invoke a decompressor
(if necessary). For a single threaded reader implemented in Python,
this is arguably sufficient. But in a future where we have
multi-threaded readers - possibly implemented in other languages - this
can introduce problems.

There are 2 intertwined problems: practically interfacing with a
multi-threaded reading API and performance. As a thought experiment,
how would you implement an efficient multi-threaded revlog reader
today?

To implement a high performance multi-threaded reader, you need to
perform CPU bound work outside of Python and outside of the GIL.
The CPU bound work for revlog reading is decompression. So, you
need a way to perform multi-threaded decompression outside the GIL
and Python.

If we have compression stored inline in revision data, you have 2
choices:

1) Teach the non-Python code about revision headers and how to
   handle them.
2) Load revision data and peek at the first byte *before* you call
   into the non-Python/multi-threaded code so the decoding API only
   sees data in a specific format. (This allows e.g. calling a
   multi-decompress function for zlib, zstd, etc.)

In #1, we end up duplicating revlog "parsing" logic in our non-Python
API. If you are implementing a full revlog reader, fine. But if all
you are trying to do is something like offload decompression to a
specialized multi-threaded API, this is unwanted complexity.

#2 gets the job done, but at a cost of sacrificed performance potential.
Today, by the time Python code calls into a decompressor, revlog data
is already read() from a file handle into memory. So, peeking at the
first byte is pretty cheap. But this I/O pattern undermines a future
optimization: memory mapped files. Instead of pre-buffering a file
handle into memory, what if we instead memory mapped the revlog. In
this world, actual I/O is deferred until first access. If encoding
is stored inline, peeking at the first byte will sparsely page the
revlog into memory. This is sub-optimal because the loading is
sparse (as opposed to sequential reads for the full data you want)
and is likely limited to a single thread (Python).

A solution to these deficiencies is to store the encoding state of
each revision out-of-band with the revision data. If you know the
encoding of each revision's data before actually reading the
revision data, you can:

* Filter/group data before loading it then call Call domain-specific,
  multi-decode/decompress APIs for specific formats. (This makes
  implementations easier since they don't need to "parse" data headers.)
* Memory map revlogs and use multi-threaded I/O *as part of the
  multi-threaded decode API* using optimal I/O access patterns.

You may think that optimizing for memory-mapped based multi-threaded
I/O is premature optimization. But consider that modern decompression
formats like lz4 and zstd can decode at several hundred MB/s at the
input end. And further consider that modern SSDs only achieve their
"sticker" read performance at queue depths greater than 1. In other
words, you need to have multiple read requests to the storage layer
in flight in order to achieve peak read throughput. And a single thread
threading a file descriptor won't be able to do that like multiple
threads triggering paging from a memory mapped file will. So
deferring I/O to a multi-threaded reader will make I/O less of a
bottleneck.

And to prove these kinds of performance optimizations are possible,
python-zstandard has a multi-threaded decompress API. You pass it
a special data structure representing a memory segment and an array
of (offset, length) tuples and it fans out to multiple threads to
do compression. Mercurial "just" needs to ensure it is passing
zstd data.

With this thought experiment completed and us having arrived at
a justification for declaring the encoding of revision data out-of-band
from the data itself, this commit introduces a new revlog format:
version 2.

The highlight of revlog v2 (so far) is a per-revlog feature flag denoting
the default revision encoding format and per-revision flags denoting the
encoding strategy used. The per-revision flags are stored in the index,
which (at least in current implementations) is already loaded into
memory at revlog open time. In revlog v2, there is no need to perform
I/O or peek at revision data to determine which encoding a specific
revision is using. Instead, when you want to decode multiple revisions
you can scan the already-loaded index and group revisions by encoding
then dispatch to a multi-threaded reader (which does I/O and decoding)
*without having to implement revlog-specific knowledge in the reader*.

The design of the flags assumes that individual revlogs will be
homogonous in terms of revision encoding: a revision will either be
encoded with the default encoding or not encoded. This reflects the
reality that some revisions should be stored raw because it makes
no sense to encode them. For example, some small inputs are larger
compressed than raw. We shouldn't waste bytes to store these
compressed and burn CPU to decompress them. So, we have a per-revision
flag indicating indicating whether the default encoding is used to
facilitate this common case.

And because extensibility is nice, we also have a per-revision flag
indicating whether the encoding is stored inline (like in existing
revlog formats). This encoding is not ideal from a performance
standpoint, but it does facilitate experimentation in extensions
land.

For default encoding formats, I added "not encoded" to facilitate
some special cases. For example, a server operator may wish to disable
revlog compression to facilitate faster reads. We could do this without
a revlog flag. But I think representing "I created this revlog with
the intent to never compress its entries" is something worth expressing
in the header, as this this tells all future consumers to write raw
data without relying on a config option to force that behavior.

I added zstd as an encoding. This should not be too controversial.
We know we want to support zstd in revlogs. zstd revlogs necessitate
a new repo requirement and new repo requirements mean you can invent
new formats since you are locking out legacy readers. So it seems
acceptable to invent a more optimal revlog version 2 while we're here.

I also added lz4. Facebook has an lz4 revlog extension. And lz4 in
revlogs makes a *lot* of sense. While I have no concrete plans to
add lz4 support to core Mercurial at this time (like I do with zstd),
I think we should reserve lz4 a seat at the table.

Note that in its current design, we /could/ shoehorn encoding related
per-revision flags into revlog version 1. However, it is a bit
constraining. You would need to design this such that a per-revision
flag indicated the new behavior (such as to be compatible with the
existing format). Then, you would need to define the encoding format
outside of the revlog, such as in a repo requirements file. While
certainly possible, I think this places too many constraints on
operations. For example, we may want to encode changelogs differently
from manifests differently from filelogs. We may even wish to encode
different filelogs differently! Having the default encoding in the
revlog itself buys you the flexibility to experiment with these
changes easily.

TODO

* We need to consider how zstd dictionaries (or other per-encoding
  supplementary data) can be indicated or stored within the revlog.
  This /could/ be a default encoding feature flag value indicating
  "zstd w/ dictionary".
* Since we're inventing a new revlog format, we should also support
  hashes that aren't SHA-1.
* Nuke the (complex) optimization that the 4 byte header is shared with
  first revision's index?
* Store integers in index as little endian (to facilitate 0-copy
  parsers)?
* Record raw chunk size (so we can pre-allocate exact-sized buffer
  for decoded/decompressed chunk)?
* Record delta chain base explicitly? (Minor perf wins for insertion.)
* Since we're inventing a new revlog format, we can also do anything
  else we want to do (I think marmoute wants to store ancestors
  count in changelog or something)

diff --git a/mercurial/help/internals/revlogs.txt b/mercurial/help/internals/revlogs.txt
--- a/mercurial/help/internals/revlogs.txt
+++ b/mercurial/help/internals/revlogs.txt
@@ -45,6 +45,9 @@ 0
 1
    RevlogNG (*next generation*). It replaced version 0 when it was
    implemented in 2006.
+2
+   Support for default encoding feature flag and per-revision encoding flags.
+   Available in Mercurial 4.3 (released August 2017).
 
 The feature flags short consists of bit flags. Where 0 is the least
 significant bit, the following bit offsets define flags:
@@ -53,8 +56,18 @@ 0
    Store revision data inline.
 1
    Generaldelta encoding.
+2-4
+   Default revision data encoding (version 2 and newer only).
 
-2-15
+   0 Inline encoding. The encoding of each revision is defined inline
+     in the data using a 1 byte header.
+   1 No encoding / raw data.
+   2 zlib compression.
+   3 zstd compression (includes zstd frame header).
+   4 lz4 compression (includes lz4 frame header). (Not supported by
+   official Mercurial distribution.)
+   5-7 Reserved for future use.
+5-15
    Reserved for future use.
 
 The following header values are common:
@@ -99,6 +112,10 @@ 6-7 (2 bytes)
 
    2: REVIDX_EXTSTORED revision data is stored externally.
 
+   3: REVIDX_DEFAULT_ENCODING (revlog v2 and later only) revision data is
+   encoded using the default as specified by the feature flag in the revlog
+   header. If unset, data is stored raw, without encoding.
+
 8-11 (4 bytes)
    Compressed length of revision data / chunk as stored in revlog.
 
@@ -185,11 +202,15 @@ The actual layout of revlog files on dis
 *store format*. Typically, a ``.i`` file represents the index revlog
 (possibly containing inline data) and a ``.d`` file holds the revision data.
 
-Revision Entries
-================
+Encoding of Revision Entries
+============================
 
-Revision entries consist of an optional 1 byte header followed by an
-encoding of the revision data. The headers are as follows:
+Revision entries are stored either as raw data or are encoded (often with
+compression).
+
+Revlog versions ``0`` and ``1`` store the encoding format inline with
+revision data via a 1 byte header. The recognized header values are as
+follows:
 
 \0 (0x00)
    Revision data is the entirety of the entry, including this header.
@@ -200,6 +221,17 @@ x (0x78)
 
    The 0x78 value is actually the first byte of the zlib header (CMF byte).
 
+Revlog version ``2`` introduces a revlog feature flag to denote the default
+encoding for the revlog as well as per-revision flags defining encoding for
+each revision.
+
+Version ``2`` readers **must** consult the ``REVIDX_DEFAULT_ENCODING`` flag
+for each revision. If ``REVIDX_DEFAULT_ENCODING`` is set, the default encoding
+as specified by the revlog feature flag is used. Otherwise, revision data is
+stored raw, sans encoding. If the default encoding of the revlog is ``0``
+(encoding defined inline), revision data is interpreted the same as it is for
+revlog versions ``0`` and ``1``.
+
 Hash Computation
 ================
 


More information about the Mercurial-devel mailing list