[PATCH 3 of 3] help: document known deficiencies with revlog format

Gregory Szorc gregory.szorc at gmail.com
Mon Feb 27 15:54:02 EST 2017

# HG changeset patch
# User Gregory Szorc <gregory.szorc at gmail.com>
# Date 1488228075 28800
#      Mon Feb 27 12:41:15 2017 -0800
# Node ID 4ecfe89fa8c1477ab54fa8ef271589a8b20c5497
# Parent  33cdab923127f930357032db7eb6c4c9441e73ae
help: document known deficiencies with revlog format

I've been doing a lot of work to properly integrate zstd into revlogs.
Related to this, I've been hacking on multi-threaded APIs for
python-zstandard with the intent that they be usable by Mercurial
to facilitate multi-threaded decompression of delta chains. The
performance of the python-zstandard APIs is *very* promising
(10x read speedup over zlib). However, the existing revlog format
won't be able to take full advantage of the speed potential because
of its current design.

In other news, a SHA-1 collision now exists and people are starting
to think more seriously about our migration strategy.

The common theme for these items is the existing revlog format.
As good as it is, it has some limitations.

The patch introduces documentation on known revlog deficiencies
and areas for improvement. I think I captured most of the low-level
problems. Some higher-level problems (such as filelog data for
copies/renames) or linkrev adjustment aren't captured. Someone
whose thought more about those can contribute write-ups of those.

I think documenting these deficiencies in the spec (as opposed to
say the wiki) is important because the problems reside with the
specification. Putting the deficiencies in the spec maximizes their
visibility and increases the chances that mistakes won't be repeated.

diff --git a/mercurial/help/internals/revlogs.txt b/mercurial/help/internals/revlogs.txt
--- a/mercurial/help/internals/revlogs.txt
+++ b/mercurial/help/internals/revlogs.txt
@@ -217,3 +217,183 @@ 1. Hash the parent nodes
 2. Hash the fulltext of the revision
 The 20 byte node ids of the parents are fed into the hasher in ascending order.
+Deficiencies and Areas for Improvement
+This section documents known problems with the revlog format and ways it can
+be improved.
+4 byte header is shared with first revision
+   The first 4 bytes of the revlog are shared with the *absolute offset*
+   field of the first revision's index entry. This is a clever optimization.
+   But it requires one-off handling when parsing index entries.
+   Dropping the optimization and having full separation between the revlog
+   header and index data would make parsing simpler at the cost of 4 extra
+   bytes per revlog.
+Integers stored as big-endian
+   Integers in revlog index entries are stored as big-endian. Most systems
+   that Mercurial runs on are little-endian.
+   Having a bit ordering mismatch between index entries and the running
+   machine means that integers must be converted before they can be
+   used. This means that a revlog implementation in a systems language on
+   little-endian machines can't simply cast a memory address as a struct to
+   read index entries. This adds overhead in the form of data conversion
+   and extra allocations.
+   Changing revlog data to little-endian would facilitate 0-copy operations
+   on most machines that Mercurial runs on and would make things faster.
+Delta chain base revision not explicitly recorded for generaldelta
+   For generaldelta revlogs, the index entry stores the revision for the
+   previous revision in the delta chain. To find the first or base revision
+   of the delta chain, one must walk index entries.
+   Revlogs limit the linear distance between chunk data in delta chains.
+   This requires resolving the base revision for a delta chain during
+   insertions. Insertion-heavy operations like changegroup application
+   thus require resolving several delta chain bases. This can lead to a
+   performance bottleneck resolving delta chains. That's why the
+   Python layer implements an LRU cache of delta chain bases. See
+   92ac2baaea86 for more.
+   Adding the delta chain base revision to index entries should be considered
+   as a way to mitigate this performance issue. Note: existing performance
+   issues with delta chain computation could be the result of delta chain
+   traversal being implemented in Python. It is possible an efficient
+   implementation in C would make delta chain base resolution "fast enough."
+Raw chunk size not explicitly recorded
+   Most revlog chunks are compressed. The revlog records the size of each
+   each chunk as physically stored in the revlog and the size of the revision
+   fulltext. However, it does not record the size of the raw chunk value
+   (chunks are frequently compressed).
+   If the compression format does not record the uncompressed size (such as
+   zlib), then decompression incurs extra memory reallocations. This could
+   involve continuously growing a buffer to hold output, shrinking a buffer
+   once decompression is finished, or having an over-allocated buffer hold
+   the result.
+   If the uncompressed size is stored in the chunk, this allows decompression
+   to allocate an exact-sized buffer. However, this isn't ideal for optimal
+   performance in some scenarios because the chunk must be read to obtain
+   the uncompressed size.
+   Operations like delta chain application need N raw chunks in memory
+   simultaneously. Because memory allocations can be expensive, it may be
+   desirable to allocate 1 output buffer instead of N. Ideally, this output
+   buffer would be exact sized (to avoid reallocations). However, this
+   requires a pre-scan over the chunks to determine the decompressed sizes.
+   This incurs extra work. And depending on how the chunk data is loaded
+   into memory, it could create extra work. For example, if chunk data is
+   in a mmap() address, then performing a pre-scan to resolve just the
+   decompressed size followed by a decompression scan could result in
+   significant extra work for the I/O layer.
+   This problem is compounded for multi-threaded operations. It is possible
+   for revlog chunk decompression to be performed in a thread pool. Again,
+   limiting the amount of allocations for output buffers is desired. But
+   this requires scanning. Having multiple threads perform a pre-scan and
+   decompression pass could sap performance, especially if the chunks are
+   backed by a mmap() that isn't fully paged in from the I/O layer.
+   Storing the raw chunk size in the index would prevent excessive memory
+   allocations and would facilitate use of shared output buffers without
+   having to scan chunk data.
+Chunk format / compression stored in chunk
+   Revision chunks have a 1 byte header describing the encoding of the
+   data that follows. This is typically used to identify the compression
+   format.
+   Having the format in the chunk itself means that the chunk needs to
+   accessed in order to determine what to do with it. In other words,
+   in order to know what decompressor to dispatch to, the first byte
+   of chunk data has to be available. If wanting to perform decompression
+   into a shared buffer or using a multi-decompress API, a pre-scan is
+   needed to discover which chunks are relevant. This has similar
+   disadvantages to reading the raw chunk size as
+   described above.
+   Having the chunk format self-describing is convenient from an
+   extensibility standpoint, as new formats can be added without changing
+   the revlog specification. However, it comes at a cost of requiring
+   reading the first byte of each chunk before dispatching to a decoder.
+   This can sap performance, especially for multi-threaded reading.
+   Storing the chunk format in the index would allow chunk data to be
+   dispatched to a decoder without scanning it first. A special chunk
+   format value could be reserved for *look at first byte of chunk*
+   to preserve the existing behavior and allow new data encodings to be
+   used without reserving a value for them in the revlog specification.
+Hash field has static size
+   The hash field in the index is a static 32 bytes. Currently, only SHA-1
+   (20 bytes) is supported. The field was made to be 32 bytes in order to
+   support other (longer) hashes without having to change the revlog
+   format.
+   There are a number of issues with the current approach.
+   First, only SHA-1 is currently supported. That means 12 of the 32 hash
+   bytes are unused. That's 20% of the 64-byte index entry size. Allocating
+   space that is never used is wasteful.
+   Second, while 32 bytes can hold a 256 bit hash, there is no storage
+   left over to specify which hash is used. The revlog entry needs to
+   specify which hash is used so the consumer knows how to invoke it.
+   This means that hash identity needs to be stored elsewhere (such as in
+   a revision flag) or the hash value needs to be truncated to make room
+   for the hash identity. Similar to this problem, hashes larger than
+   256 bits are obviously not supported.
+   Third, having a single hash field isn't friendly to multi-hashing or
+   hash transitions. Cryptographic hashes only get weaker over time. This
+   means that the hash function used by revlogs needs to be replaced
+   periodically. However, transitioning hashes is a massive challenge
+   because it requires that clients support the new hash mechanism. One
+   solution to this is multi-hashing, where a single revision can have
+   multiple hashes associated with it. A repository could be canonically
+   using SHA-256 hashes while still storing and serving SHA-1 hashes to
+   supported clients (via the wire protocol). But a static hash field
+   size only provides storage for a single hash value.
+   These problems go away if the hash field could be variable length. For
+   example, an entry using a SHA-1 hash would have a 20 byte hash
+   field. An entry using only SHA-256 would have a 32 byte hash field.
+   An entry using both SHA-1 and SHA-256 would have a 52 byte hash field
+   (or longer if hash identifiers were stored in the field).
+   Per-revision variable length hash fields aren't great for performance
+   because index entries would no longer be at easily-computed offsets.
+   Instead, there would need to be an index to index entries. This is
+   likely a non-starter. However, one could imagine a *hash field length*
+   field in the revlog header that declared the static length of the
+   hash field for all index entries. This would be long enough to hold
+   the longest hash field value. It would allow supporting new, longer
+   hashes or multi-hashing without having to change the revlog specification.
+No clear facility for storing arbitrary-length blob data
+   Revlogs consist of a header, index entries, and revision chunks. That's
+   it. There are extensible fixed-width fields reserved for future use.
+   But there isn't a mechanism for storing extra, arbitrary-length,
+   non-revision blob data.
+   In theory, arbitrary-length blob data could potentially be shoehorned
+   into a revision entry by setting a per-revision flag. But this requires
+   all entry consumers have one-off code to look for this flag and treat
+   the revision specially. It might also be possible to use the first N
+   revisions in the revlog for arbitrary data. These revisions could be
+   scanned quickly or denoted by a value in the revlog header.
+   One potential use of arbitrary-length non-revision data is compression
+   dictionaries. Zstd supports compression dictionaries. When used,
+   decompression is significantly faster. That makes them attractive
+   to Mercurial. However, they introduce a new problem: where to store
+   them. Having compression dictionaries embedded within revlogs is an
+   attractive solution because it retains the property that revlogs are
+   self-contained.

More information about the Mercurial-devel mailing list