This document represents a proposal only. This proposal can also be considered only applicable for repositories that require it for performance/scaling reasons or for server environments that desire some of the side-effects.

See also AtomicRepositoryLayoutPlan and ReadLockPlan for related topics.

Deficiencies in Existing Store Format

Mercurial's existing store model is based on separate file(s) per logical tracked path. If you have 100,000 files under version control, there will be at least 100,000 files (plus directory entries) in the .hg/store directory. The unbound growth of number of files in the .hg directory poses a number of problems:

It's worth explicitly calling out performance problems on Windows. On that platform, unbundling mozilla-central is several minutes slower than on other platforms because closing file handles that have been appended to is slow on NTFS. This can be mitigated by doing I/O on separate threads, but the underlying problem is still there.

There are also related problems from the separate-file-per-logical-tracked-path storage model and the (append-only) store format of today:

The goal of this plan is to establish a new repository store format that relies on fewer files and doesn't suffer from some of the inefficiencies and scaling concerns of the existing model.

Advantages of Existing Store Format

Before we design a new store format, it's worth calling out what the existing store model gets right:

Desirable Properties

Now that we've outlined the advantages and disadvantages of the existing store format, let's piece together a list of desirable properties for the new store format:

Brainstorming Store Formats

Before we dive into the new proposed store, it is important to confront some realities.

First, obsolete/hidden changesets will grow unbounded and add unbounded overhead unless there is some form of garbage collection. There is overhead for reading the changelog and performing linkrev adjustment. Solving garbage collection almost certainly means reading all data and rewriting it out again. For very large repositories, these operations will require gigabytes of I/O and likely a few minutes of wall time to perform (roughly equivalent to producing a bundle of the entire repo). These can certainly be performed out-of-band and periodically. But performance will degrade over time until the effects of obsolete/hidden changesets are rectified.

Second, preventing storage redundancy for file copies and moves requires reading from multiple store paths / revlogs. It also may require some form of additional indexing or optimization pass (which resembles what garbage collection does) in order to preserve efficient read and write characteristics.

Third, we have to trade off among reducing total file count, bounded data lookup times, and simple/append-only storage/indexing. Let's elaborate.

Our smallest fundamental unit of storage is a delta. It is applied against a parent, which is applied against its parent, which is applied against its parent, and so on, to obtain the fulltext of something. Deltas are grouped by store path and collected into revlogs. To facilitate rapid lookup of specific deltas and rapid revision resolution, revlogs have an index containing offsets to individual deltas. An index into deltas must exist for performance reasons and is non-negotiable. A revlog as a collection of deltas for a single logical entity should almost certainly exist because it provides compelling advantages (notably performance) over a more generic key-value store (such as Git's model). You should at least agree that spatial proximity of data for a store path is important (data for a particular store path can be written/read using mostly sequential I/O).
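
The delta-chain mechanism described above can be illustrated with a minimal sketch. This is not Mercurial's actual delta format or code; the `(start, end, data)` patch representation and function names are invented for illustration. The point is that resolving a fulltext means walking base pointers back to a snapshot and replaying patches forward, which is why per-revision index lookups must be fast.

```python
def apply_delta(base, delta):
    """Apply a single (start, end, data) patch; a toy stand-in for
    Mercurial's binary delta format."""
    start, end, data = delta
    return base[:start] + data + base[end:]

def resolve_fulltext(bases, deltas, rev):
    """Resolve rev's fulltext. bases[rev] is the revision rev deltas
    against (None for a full snapshot); deltas[rev] is rev's patch."""
    chain = []
    while rev is not None:          # walk back to the chain's base
        chain.append(rev)
        rev = bases[rev]
    text = b""
    for r in reversed(chain):       # replay patches, base to tip
        text = apply_delta(text, deltas[r])
    return text
```

The chain walk is why an index matters: each step needs the base pointer and delta offset of an arbitrary revision, and scanning a file to find them would make fulltext resolution quadratic.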

An unbounded number of filesystem files is bad, and preventing it is a major goal of the new store format. This necessitates writing data for multiple store paths (revlogs) into the same filesystem file.

Let's imagine a new entity called a "packed revlog." It is a single file holding the data of multiple store paths / revlogs. There has to be an index containing the offsets of individual entries inside the "packed revlog" so readers can find data without full scans. Just like indexes into deltas in revlogs are necessary for performance reasons, so are these indexes. For exploratory purposes, let's assume "packed revlogs" may or may not be immutable (but always append only) and that the index can exist in a separate location or at the beginning of the file. Let's also ignore the performance concerns with stripping a repo with "packed revlog" data for now.
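
To make the "packed revlog" index concrete, here is a hedged sketch of the minimal functionality such an index must provide: given a store path, return where that path's data lives inside the pack file, without scanning. The class name, entry layout, and in-memory representation are all assumptions for illustration, not a proposed on-disk format.

```python
import bisect

class PackedIndex:
    """Toy index over a "packed revlog": a sorted table of
    (store path, offset, length) records."""

    def __init__(self, entries):
        # entries: iterable of (path, offset, length), any order
        self._entries = sorted(entries)
        self._paths = [e[0] for e in self._entries]

    def locate(self, path):
        """Return (offset, length) of path's data in the pack file,
        or None if the path isn't in this pack."""
        i = bisect.bisect_left(self._paths, path)
        if i < len(self._paths) and self._paths[i] == path:
            return self._entries[i][1], self._entries[i][2]
        return None
```

A reader would then `seek(offset)` and `read(length)` in the pack file, giving bounded lookup cost per path regardless of how many paths the pack holds.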

Real quick, it's worth pointing out that a "packed revlog" - holding nearly the complete state of a repository - could be used to efficiently clone repositories. The existing "stream clone" feature is conceptually very similar to "packed revlogs." But a "packed revlog" takes it to the next level: with "packed revlogs," hg clone would effectively be piping the server response to a local file with very minimal processing. Combine this with the bundle clone feature, and clones could perform at 100+MB/s by streaming a static file from a HTTP server to a local filesystem - basically as fast as your network and local I/O allow. You could imagine performing a clone of a 1 GB repo in under 10s if on a 1 Gbps network!

OK, we said we have to trade off among reducing total file count, bounded data lookup times, and simple/append-only storage/indexing. With this new "packed revlog" concept in mind, let's examine potential store formats and their properties:

  1. The existing store format
  2. A store format based on creating multiple immutable "packed revlog" files. Each "packed revlog" file has its own (also immutable) index. A new "packed revlog" is produced for every store write.
    • Append-only files (good)
    • Far fewer files than a separate file per store path. Still technically unbounded, but the total file count is likely 1-2 orders of magnitude lower and thus manageable. (good)
    • Unbounded lookup time because you have to read N indexes to find data for a store path (bad)
      • This is Git's problem with regards to multiple packfiles and store alternates
  3. A store format based on creating multiple immutable "packed revlog" files but with a single, mutable index
    • (Variation on #2)
    • Append-only files for delta/revlog data (good)
    • File count remains in check (good)
    • Lookup time remains in check because there is a single, magical index (good)
    • Multiple files may need to be opened to read data for a particular store path (not great but likely tolerable)
    • Requires a magical unified index. This is a difficult problem because building indexes with good performance and consistent views that respect transactions is hard. We'd likely be reinventing the internals of <insert popular database>. (bad)

How do we fare?

No pure store format achieves all 3. #3 gets us the closest. But a unified magical index feels very complicated and like a lot of work to design and maintain. While we could explore using something like SQLite for the index (Python has SQLite support built-in), let's consider a hybrid solution.

We're going to design a hybrid store format that initially is based on a single "packed revlog" but then falls back to the existing store format for subsequent writes. When we clone a repository, we produce a single, immutable "packed revlog" and index. We've got low file count, bounded lookup times, and simple/sequential storage/indexes. So far so good. Now, when we perform a write on top of the pristine store (which consists of a single "packed revlog"), we adhere to the existing store format and write separate files per store path. These individual files are continuations/extents of revlog data from the "packed revlog." Let's call them "revlog extent files." (A "revlog extent file" for a store path not present in the "packed revlog" is essentially a revlog file of today.) Subsequent writes result in further creations/appends of "revlog extent files" as necessary. This store format is conceptually similar to the "union repo" and "bundle repo" entities already in core, where we are combining the contents of one repo with another or with a bundle. But in this hybrid proposal, we are combining the contents of a base "packed revlog" with data from "revlog extent files."
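
The hybrid read path described above can be sketched as follows. This is an assumed layout, not actual Mercurial code: a logical revlog's bytes are the slice from the base "packed revlog" (if the path existed at clone time) followed by the contents of its "revlog extent file" (if any writes happened after the clone). The function and parameter names are invented for illustration.

```python
import os

def read_logical_revlog(packfile, index, extent_path, path):
    """Return the full byte stream of one logical revlog by stacking
    the base pack slice and the extent-file continuation."""
    chunks = []
    loc = index.get(path)            # (offset, length) or None
    if loc is not None:              # path was present at clone time
        with open(packfile, "rb") as f:
            f.seek(loc[0])
            chunks.append(f.read(loc[1]))
    if os.path.exists(extent_path):  # continuation written post-clone
        with open(extent_path, "rb") as f:
            chunks.append(f.read())
    return b"".join(chunks)
```

Either source may be absent: a path never written since clone has only the pack slice, and a path created after clone has only an extent file - which is exactly today's per-path revlog.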

This hybrid store has decent properties:

We're a solid 2 of 3. What about unbound file count? One of the realities explained above is that some form of garbage collection / scan / rewrite is necessary in order to scale obsolete/hidden changesets and/or keep file copy/move data size in check. If we need to perform these actions anyway, then it makes sense to consolidate "revlog extent files" into "packed revlog" instances as part of that! We could devise a "packing" operation that:

There are two basic approaches and outputs to such a "pack" operation:

  1. A single "packed revlog" + index
    1. Just one entity/file with data (no file count issues); resembles the "pure" state of the store right after a clone (good)
    2. Requires reading and rewriting entire store to produce. Cost is proportional to repo size and is therefore unbounded (bad)
  2. A new "packed revlog" + unified index (1 index, N "packed revlogs")
    1. No "revlog extent files" and total file count is mostly in check (good)
    2. Only requires reading and writing data that wasn't in "packed revlogs" before + index data (good)
    3. Reading data for a store path requires up to N file opens (not great but tolerable for certain values of N)
    4. Unified index must store which revlog offsets are in which file (likely tolerable)
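
Approach #2 (the cheaper, "minor" pack) can be sketched as follows. This is a simplified model, not a proposed implementation: extent contents are taken as opaque byte strings, and the unified index is modeled as tuples recording which pack file holds each path's data. All names are invented for illustration.

```python
def minor_compact(extents, new_pack_name):
    """Fold "revlog extent files" into one new "packed revlog".

    extents: dict mapping store path -> bytes (extent file contents).
    Returns (pack_bytes, index_entries), where each index entry is
    (path, pack_name, offset, length) for a unified index that must
    record which pack file holds which data."""
    pack = bytearray()
    index_entries = []
    for path in sorted(extents):   # sorted for determinism/locality
        data = extents[path]
        index_entries.append((path, new_pack_name, len(pack), len(data)))
        pack.extend(data)
    return bytes(pack), index_entries
```

Note the cost profile matches the text: only extent data is read and rewritten, so the work is proportional to what changed since the last pack, not to the whole repository.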

In garbage collection terms, you can think of these operations as a "major compaction" and a "minor compaction," respectively. You could envision a store that can perform both types of packing operations depending on what the user is willing to tolerate and what is recommended given the state of the repo. (Although we need to think long and hard about whether allowing repos to get into a state with hundreds of "packed revlogs" is a good idea. We're in a conundrum where a full pack is obviously the best and most needed for large repositories that would benefit from it, but is also most expensive on those very repositories.)

If we performed store "packing," there are implications for locking and reader views. You can't just delete revlog files or "packed revlog" files from disk because a reader may need them. You could implement reader locks to prevent reading during the deletion phase of "packs," but then the repository wouldn't be readable when a "pack" is occurring. For servers, this would be devastating, especially where "packs" for large repositories could lock out readers for minutes. Locking out readers for an unbound duration is potentially a non-starter.

If we can't lock out readers, what do we do? A tried and true idea is "generations." Multiple versions of the store exist on disk, corresponding to states between "pack" operations. These are typically modeled as a monotonically increasing integer: generation 0 is the initial data, generation 1 is the data after the first "pack," generation 2 the data after the pack after that, and so on. "Packing" would take a writer lock and copy/write store data to a new "generation." When complete, the repo's "current generation" pointer would be updated to the just-produced data, and subsequent readers would consume data from the new generation. Old/active readers would continue reading data from the previous generation. Once all readers of old generations have gone away, the old generation can be deleted. This could be implemented as a side-effect of the final reader "releasing" its "handle" on the repo (a concept that doesn't exist in Mercurial today - see ReadLockPlan for something similar) or could be performed by a cron or other periodic task (even randomly invoked by hg commands, although that's dangerous because random pauses are bad for user experience - see Git and its automatic repacking).
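
The generation lifecycle above can be sketched as a toy reference-counting model: a superseded generation's files may only be deleted once no active reader still holds it. The class, its bookkeeping structures, and the `deleted` set (a stand-in for actually removing files) are all invented for illustration; Mercurial has no such mechanism today.

```python
class GenerationManager:
    """Toy model of generation lifecycle with reader refcounts."""

    def __init__(self):
        self.current = 0
        self.readers = {0: 0}      # generation -> active reader count
        self.deleted = set()       # stand-in for removing files

    def open_reader(self):
        """A reader pins whatever generation is current when it opens."""
        gen = self.current
        self.readers[gen] += 1
        return gen

    def close_reader(self, gen):
        """Reader releases its handle; superseded gens may now go away."""
        self.readers[gen] -= 1
        self._reap()

    def publish(self, gen):
        """Point new readers at a freshly packed generation."""
        self.readers.setdefault(gen, 0)
        self.current = gen
        self._reap()

    def _reap(self):
        # delete superseded generations with no remaining readers
        for gen, count in list(self.readers.items()):
            if gen != self.current and count == 0:
                del self.readers[gen]
                self.deleted.add(gen)
```

The key property is that `publish()` never blocks on readers: an old reader simply keeps the old generation alive until it closes.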

A major downside of "packs" and generations is extra storage and I/O cost. If producing a full new generation, you'll need 2x storage for at least the duration of the pack. If only packing the "revlog extent files" (assuming we allow that mode), the extra cost is the size of the data in those files (which should be small compared to the size of the overall repo, but it grows without bound if a pack is never performed).

It's worth noting that a writer lock may not need to be taken when packing to a new generation! Assume generation N+1 is a deterministic result of combining the "packed revlogs" and "revlog extent files" from generation N, and initially has no "revlog extent files" of its own (only the single "packed revlog" entity). As generation N+1 is being written, it could exist in a special "not ready, see generation N" state. Writers could read from generation N and then write to revlog extent files in the not-yet-fully-written generation N+1. Subsequent writers could combine generation N's state with N+1's revlog extent files. This is basically adding an extra layer to the revlog stacking that already exists within a generation in this hybrid store approach. As soon as the single "packed revlog" for generation N+1 is fully written, generation N becomes irrelevant and writers use N+1's "packed revlog" instead of constructing the same state from N. In terms of implementation, this is somewhat complicated. And there may be some issues with copied data deduplication that rule it out. But it's a clever way for servers to perform low-priority, frequent major packing without requiring a writer lock.

Implementing the Hybrid Store

(This section is far from complete.)

In the above section, we devised a hybrid store format consisting of "packed revlogs" and "revlog extent files". There were also multiple "generations" of the store to facilitate consistent views during packing. How would this look on disk and what would the commands to administer the repository look like?

A new format requirement would be required for the .hg/requires file. Let's call it packedstore.

The filesystem layout for .hg would look like:

The storestate file contains information about the generation state of the store. It at least contains a pointer to the active generation. If a pack is being performed, it contains metadata for that as well. The file is atomically replaced when changed.
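
The atomic replacement of storestate could follow the standard write-to-temp-then-rename pattern, sketched below. The file name, its line-based contents, and the function names are assumptions for illustration; the proposal doesn't specify a storestate format. The property that matters is that a reader opening storestate sees either the old pointer or the new one, never a partially written file.

```python
import os

def activate_generation(repo_dir, gen):
    """Atomically point the store at generation `gen`."""
    tmp = os.path.join(repo_dir, "storestate.tmp")
    with open(tmp, "w") as f:
        f.write("generation=%d\n" % gen)
        f.flush()
        os.fsync(f.fileno())       # data durable before the rename
    # atomic replace: no reader ever observes a half-written file
    os.replace(tmp, os.path.join(repo_dir, "storestate"))

def current_generation(repo_dir):
    """Read the active generation pointer."""
    with open(os.path.join(repo_dir, "storestate")) as f:
        for line in f:
            if line.startswith("generation="):
                return int(line.split("=", 1)[1])
```

On POSIX filesystems `os.replace` maps to `rename(2)`, which atomically swaps the directory entry; this is the same trick Mercurial already uses elsewhere for files that must be updated atomically.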

When the repository is opened, the storestate file is read to obtain the current generation number. Then, the gen-N directory is used as the base for interactions with the store. If it exists, the 00packed.idx will be read into memory (or at least the file opened so a binary search can be performed when needed). An individual revlog is represented as data from the base 00packedN.data file(s) (if available) concatenated with the data from data/<path>.i or data/<path>.d (if available). From the revision access layer in Python, there is a single revlog: the existence of multiple backing files is abstracted away.

Commands

hg optimize would be a user-facing command used to perform optimizations of the repository/store, including "packing." Let's assume for now that hg optimize would perform a full pack by default (most repositories are small so it can't hurt to do a full pack).

Unaddressed Concerns


CategoryNewFeatures

PackedRepoPlan (last edited 2015-12-04 18:15:57 by KyleLippincott)