This document represents a proposal only. The proposal should be considered applicable only to repositories that require it for performance/scaling reasons, or to server environments that desire some of its side-effects.
Deficiencies in Existing Store Format
Mercurial's existing store model is based on separate file(s) per logical tracked path. If you have 100,000 files under version control, there will be at least 100,000 files (plus directory entries) in the .hg/store directory. The unbounded growth in the number of files in the .hg directory poses a number of problems:
- higher probability of stat cache eviction -> lowered performance
- various filesystems don't efficiently create thousands of files (most notably NTFS - see issue4889)
- various filesystems don't efficiently delete thousands of files
- In server environments where dozens, hundreds, or even thousands of repos are stored, extreme inode count can be problematic
It's worth explicitly calling out performance problems on Windows. On that platform, unbundling mozilla-central is several minutes slower than on other platforms because closing file handles that have been appended to is slow on NTFS. This can be mitigated by doing I/O on separate threads, but the underlying problem is still there.
There are also related problems stemming from the separate-file-per-logical-tracked-path storage model and the (append-only) store format of today:
- redundancy in storage for copied/moved files
- causes store size to bloat if there are excessive copies or moves
- may restrict people from performing copies/moves because of scaling concerns
- breaks an important rule of developer tools which is that they shouldn't dictate less-than-desirable workflows
- obtaining a consistent view/snapshot of the store (e.g. for stream bundle generation) requires walking and stat()ing the entire store while a write lock is held. More files -> longer lock
- this is related to read locks
- this is related to how the changelog files are updated last as part of the transaction
- data from obsolete/hidden changesets lingers forever
- wastes space
- provides little to no value (value reaches 0 over time)
- adds overhead because repository filter needs to track more and more omitted revisions
- requires linkrev adjustment, which adds overhead
- transaction close overhead is proportional to number of files being modified
The goal of this plan is to establish a new repository store format that relies on fewer files and doesn't suffer from some of the inefficiencies and scaling concerns of the existing model.
Advantages of Existing Store Format
Before we design a new store format, it's worth calling out what the existing store model gets right:
- Constant time lookup
- We know which file(s) data is in because the tracked path under version control maps to an explicit filename in the store. There is little to no cost to finding where data is located.
- Cost today is resolve file name (essentially free), file open(s) (cheap), read index (moderate), seek (cheap), read (moderate)
- Bounded lookup time has performance implications for nearly every repository operation. This is especially true for operations that read the entire repository, such as clones or heavy pulls.
- Data from the changelog and manifests is generally more important to performance than files. But files are important for operations like diff generation and bundling (which is required server side for servicing pulls).
- Git's storage model (loose object files + multiple packfiles) has scaling problems because data for a specific object could be in an unbounded number of locations. As the number of storage locations grows, performance worsens. Periodic repacks consolidate the storage into packfiles or even a single packfile. But on large repositories, repacks are time and resource prohibitive. They can be extremely painful for server operators especially.
- Histories for tracked paths are independent (related to the above point about lookup time)
- Loading history for a particular tracked path doesn't require loading data for unrelated tracked paths
- Contrast with Git's generic key-value storage model and complicated pack heuristics, where data for a tracked path can be scattered all over the place
- Append only files and sequential I/O
- Append only writes are good for performance
- Sequential reads are good for performance (especially on magnetic disks) and are good for caching
- Append only writes make transaction rollback simpler (just truncate the file)
- Append only writes make read locks easier (establish read barrier that you can't read past)
- Append only files make incremental backups much easier since backup time/cost is proportional to new repository data since last backup
- No garbage collection
- Scaling garbage collection is hard and resource intensive. See Git.
- Per-file ACLs are possible
- Google has a FUSE filesystem called CitC that can enforce per-file ACLs, including in the store directory. We can't do this for packed files.
- As long as the packed repo is optional (even if it becomes the default, that's fine; we can just opt out), we should be fine.
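The "append-only writes make transaction rollback simpler" point above is worth a concrete sketch. This snippet is illustrative only (Mercurial's real transaction machinery is more involved): record the file's size when the transaction starts, and rollback becomes a single truncate.

```python
import os
import tempfile

def begin_transaction(path):
    """Record the current size of an append-only file so a failed
    transaction can be rolled back by truncating to this offset."""
    return os.path.getsize(path)

def rollback(path, offset):
    """Undo all appends made since begin_transaction()."""
    with open(path, "r+b") as f:
        f.truncate(offset)

# Demo: append some data inside a "transaction", then roll it back.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(b"committed data")

marker = begin_transaction(path)
with open(path, "ab") as f:
    f.write(b"uncommitted appends")
rollback(path, marker)

with open(path, "rb") as f:
    data = f.read()
os.unlink(path)
```

This is also why append-only files make read barriers cheap: a reader that noted the file size at open time can simply refuse to read past that offset.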
Now that we've outlined the advantages and disadvantages of the existing store format, let's piece together a list of desirable properties for the new store format:
- Requires as few files as possible and/or allows for compacting the store into fewer files
- Bounded lookup time for finding data
- Only store data for copied and moved files once
- Consistent read-only views/snapshots of repositories can be obtained efficiently
- Obsoleting/hiding changesets doesn't introduce unbounded overhead
- Transaction rollback is as efficient as possible
- Append-only files and sequential I/O is used heavily
- Per-file log/blame is fast
- No garbage collection operations randomly performed during commands
Brainstorming Store Formats
Before we dive into the new proposed store, it is important to confront some realities.
First, obsolete/hidden changesets will grow unbounded and add unbounded overhead unless there is some form of garbage collection. There is overhead for reading the changelog and performing linkrev adjustment. Solving garbage collection almost certainly means reading all data and rewriting it out again. For very large repositories, these operations will require gigabytes of I/O and likely a few minutes of wall time to perform (roughly equivalent to producing a bundle of the entire repo). These can certainly be performed out-of-band and periodically. But performance will degrade over time until the effects of obsolete/hidden changesets are rectified.
Second, preventing storage redundancy for file copies and moves requires reading from multiple store paths / revlogs. It also may require some form of additional indexing or optimization pass (which resembles what garbage collection does) in order to preserve efficient read and write characteristics.
Third, we have to pick from reducing total file count, bounded data lookup times, and simple/append-only storage/indexing. Let's elaborate.
Our smallest fundamental unit of storage is a delta. It is applied against a parent, which is applied against its parent, which is applied against its parent, etc to obtain the fulltext of something. Deltas are grouped by store path and collected into revlogs. To facilitate rapid lookup of specific deltas and rapid revision resolution, revlogs have an index containing offsets to individual deltas. An index into deltas must exist for performance reasons and is non-negotiable. The existence of a revlog as a collection of deltas for a single logical entity should almost certainly exist because it provides compelling advantages (notably performance) over a more generic key-value store (such as Git's model). You should at least agree that space proximity for a store path is important (data for a particular store path can be written/read using mostly sequential I/O).
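To make the delta-chain idea concrete, here is a toy in-memory model. This is not Mercurial's actual revlog format; the index structure and delta encoding here are invented for illustration. The index maps each revision to its parent and a delta, and resolving a fulltext means walking the chain back to the base snapshot, then replaying deltas forward.

```python
def apply_delta(parent_text, delta):
    """Apply a delta (list of (start, end, replacement) patches) to a
    parent fulltext. Patches are applied right-to-left so earlier
    offsets stay valid."""
    text = parent_text
    for start, end, repl in sorted(delta, reverse=True):
        text = text[:start] + repl + text[end:]
    return text

# Toy index: revision number -> (parent revision or None, delta).
# Real revlog indexes also store offsets, lengths, hashes, linkrevs, etc.
index = {
    0: (None, [(0, 0, b"hello world\n")]),  # full snapshot as delta vs empty
    1: (0, [(0, 5, b"goodbye")]),           # -> b"goodbye world\n"
    2: (1, [(8, 13, b"moon")]),             # -> b"goodbye moon\n"
}

def fulltext(rev):
    """Walk the delta chain back to the base, then replay forward."""
    chain = []
    while rev is not None:
        parent, delta = index[rev]
        chain.append(delta)
        rev = parent
    text = b""
    for delta in reversed(chain):
        text = apply_delta(text, delta)
    return text
```

Note how reconstruction cost grows with chain length; this is why real revlogs periodically store full snapshots to bound the number of deltas that must be replayed.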
An unbounded number of filesystem files is bad, and preventing this is a major goal of the new store format. This necessitates writing data for multiple store paths (revlogs) into the same filesystem file.
Let's imagine a new entity called a "packed revlog." It is a single file holding the data of multiple store paths / revlogs. There has to be an index containing the offsets of individual entries inside the "packed revlog" so readers can find data without full scans. Just like indexes into deltas in revlogs are necessary for performance reasons, so are these indexes. For exploratory purposes, let's assume "packed revlogs" may or may not be immutable (but always append only) and that the index can exist in a separate location or at the beginning of the file. Let's also ignore the performance concerns with stripping a repo with "packed revlog" data for now.
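A minimal sketch of the "packed revlog" idea follows, assuming a separate index mapping store path -> (offset, length). All names and layouts here are hypothetical, not a proposed on-disk format; the point is that the index makes lookup possible without scanning the data file.

```python
import io

def write_packed(revlogs):
    """Concatenate raw revlog payloads for many store paths into one
    data blob. revlogs: dict of store path -> raw revlog bytes.
    Returns (index, data) where index maps path -> (offset, length)."""
    index, data = {}, io.BytesIO()
    for path, payload in sorted(revlogs.items()):
        index[path] = (data.tell(), len(payload))
        data.write(payload)
    return index, data.getvalue()

def read_entry(index, data, path):
    """O(1) index lookup followed by a single sequential read;
    no scan of the packed file is ever needed."""
    offset, length = index[path]
    return data[offset:offset + length]

index, data = write_packed({
    b"data/foo.i": b"<revlog for foo>",
    b"data/bar.i": b"<revlog for bar>",
})
```

Writing entries in sorted path order (as above) is one way to preserve the space-proximity property: related store paths end up adjacent, so bulk reads remain mostly sequential.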
Real quick, it's worth pointing out that a "packed revlog" - holding nearly the complete state of a repository - could be used to efficiently clone repositories. The existing "stream clone" feature is conceptually very similar to "packed revlogs." But a "packed revlog" takes it to the next level: with "packed revlogs," hg clone would effectively be piping the server response to a local file with very minimal processing. Combine this with the bundle clone feature, and clones could perform at 100+MB/s by streaming a static file from a HTTP server to a local filesystem - basically as fast as your network and local I/O allow. You could imagine performing a clone of a 1 GB repo in under 10s if on a 1 Gbps network!
OK, we said we have to pick from reducing total file count, bounded data lookup times, and simple/append-only storage/indexing. With this new "packed revlog" concept in mind, let's examine potential store formats and their properties:
- The existing store format
- A store format based on creating multiple immutable "packed revlog" files. Each "packed revlog" file has its own (also immutable) index. A new "packed revlog" is produced for every store write.
- Append-only files (good)
- Far fewer files than a separate file per store path. Still technically unbounded, but the total file count is likely 1-2 orders of magnitude lower and thus manageable. (good)
- Unbounded lookup time because you have to read N indexes to find data for a store path (bad)
- This is Git's problem with regards to multiple packfiles and store alternates
- A store format based on creating multiple immutable "packed revlog" files but with a single, mutable index
- (Variation on #2)
- Append-only files for delta/revlog data (good)
- File count remains in check (good)
- Lookup time remains in check because there is a single, magical index (good)
- Multiple files may need to be opened to read data for a particular store path (not great but likely tolerable)
- Requires a magical unified index. This is a difficult problem because building indexes with good performance and consistent views that respect transactions is hard. We'd likely be reinventing the internals of <insert popular database>. (bad)
How do we fare?
- reducing total file count: #2 and #3
- bounded data lookup times: #1 and #3
- simple/append-only storage/indexing: #1 and #2 across the board, #3 on the non-index pieces
No pure store format achieves all 3. #3 gets us the closest. But a unified magical index feels very complicated and like a lot of work to design and maintain. While we could explore using something like SQLite for the index (Python has SQLite support built-in), let's consider a hybrid solution.
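For the record, the SQLite route mentioned above could look roughly like this. The schema and file names are purely exploratory, not a proposed format: one row per (path, packfile) span, so locating a store path's data is a single indexed query and inserts are transactional for free.

```python
import sqlite3

# Exploratory sketch of a "unified index" backed by SQLite.
# Table and column names are hypothetical.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE packindex (
        path     TEXT NOT NULL,
        packfile TEXT NOT NULL,
        offset   INTEGER NOT NULL,
        length   INTEGER NOT NULL
    )
""")
db.execute("CREATE INDEX idx_path ON packindex(path)")

# Transactional insert: a reader never observes a half-written index.
with db:
    db.executemany(
        "INSERT INTO packindex VALUES (?, ?, ?, ?)",
        [
            ("data/foo.i", "00packed0.data", 0, 4096),
            ("data/foo.i", "00packed1.data", 0, 512),  # data from a newer pack
        ],
    )

def locate(path):
    """Return all (packfile, offset, length) spans holding data for a path."""
    return db.execute(
        "SELECT packfile, offset, length FROM packindex WHERE path = ?",
        (path,),
    ).fetchall()
```

The catch the document identifies still applies: SQLite's transaction semantics would have to be reconciled with Mercurial's own locking and transaction model, which is where the real complexity lives.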
We're going to design a hybrid store format that initially is based on a single "packed revlog" but then falls back to the existing store format for subsequent writes. When we clone a repository, we produce a single, immutable "packed revlog" and index. We've got low file count, bounded lookup times, and simple/sequential storage/indexes. So far so good. Now, when we perform a write on top of the pristine store (which consists of a single "packed revlog"), we adhere to the existing store format and write separate files per store path. These individual files are continuations/extents of revlog data from the "packed revlog." Let's call them "revlog extent files." (A "revlog extent file" for a store path not present in the "packed revlog" is essentially a revlog file of today.) Subsequent writes result in further creations/appends of "revlog extent files" as necessary. This store format is conceptually similar to the "union repo" and "bundle repo" entities already in core, where we are combining the contents of one repo with another or with a bundle. But in this hybrid proposal, we are combining the contents of a base "packed revlog" with data from "revlog extent files."
This hybrid store has decent properties:
- File count is initially small but is unbounded and approaches infinity over time
- Data lookup time is bounded because a store path can exist in at most 2 locations (the base "packed revlog" and a "revlog extent file"). Assuming the "packed revlog" index is loaded into memory, we are performing an extra Python dict lookup to locate data and an extra open() + seek() to retrieve it. This minor penalty is only present for newly-written data.
- Simple/append-only storage/indexing. All non-initial repo writes conform to the existing store format. Reads are sequential and come from at most 2 files.
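The bounded two-location lookup described above can be modeled in a few lines of Python (all names and data here are illustrative): check the base "packed revlog" index and the "revlog extent files", concatenate whatever exists, and present the result as a single logical revlog.

```python
# Illustrative stand-ins for the two storage locations in the hybrid store.
packed_index = {"data/foo.i": b"<base revlog data for foo>"}
extent_files = {"data/foo.i": b"<appended revlog data for foo>",
                "data/new.i": b"<revlog data for a new file>"}

def read_revlog(path):
    """Data for a store path lives in at most 2 places, so lookup cost
    is bounded: one dict lookup per location, then concatenation."""
    base = packed_index.get(path, b"")      # location 1: the packed revlog
    extent = extent_files.get(path, b"")    # location 2: a revlog extent file
    if not base and not extent:
        raise KeyError(path)
    # The revision access layer sees one logical revlog.
    return base + extent
```

A path present only in the packed revlog pays no extra cost at all; the extra open() + seek() penalty applies only to paths with post-clone writes.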
We're a solid 2 of 3. What about unbounded file count? One of the realities explained above is that some form of garbage collection / scan / rewrite is necessary in order to scale obsolete/hidden changesets and/or keep file copy/move data size in check. If we need to perform these actions anyway, then it makes sense to consolidate "revlog extent files" into "packed revlog" instances as part of that! We could devise a "packing" operation that:
- removes data related to obsolete/hidden changesets
- consolidates revlog data from a "packed revlog" and "revlog extent files" into a single entry inside a "packed revlog"
- consolidates duplicate revlog data that resulted from copies and moves
There are two basic approaches and outputs to such a "pack" operation:
- A single "packed revlog" + index
- Just one entity/file with data (no file count issues); resembles the "pure" state of the store right after a clone (good)
- Requires reading and rewriting entire store to produce. Cost is proportional to repo size and is therefore unbounded (bad)
- A new "packed revlog" + unified index (1 index, N "packed revlogs")
- No "revlog extent files" and total file count is mostly in check (good)
- Only requires reading and writing data that wasn't in "packed revlogs" before + index data (good)
- Reading data for a store path requires up to N file opens (not great but tolerable for certain values of N)
- Unified index must store which revlog offsets are in which file (likely tolerable)
In garbage collection terms, you can think of these operations as a "major compaction" and "minor compaction," respectively. You could envision a store that can perform both types of packing operations depending on what the user is willing to tolerate and what is recommended given the state of the repo. (Although we need to think long and hard about whether allowing repos to get into a state with hundreds of "packed revlogs" is a good idea. We're in a conundrum where a full pack is obviously the best and most needed for large repositories that would benefit from it, but is also expensive on those very repositories.)
If we performed store "packing," there are implications for locking and reader views. You can't just delete revlog files or "packed revlog" files from disk because a reader may need them. You could implement reader locks to prevent reading during the deletion phase of "packs," but then the repository wouldn't be readable when a "pack" is occurring. For servers, this would be devastating, especially where "packs" for large repositories could lock out readers for minutes. Locking out readers for an unbounded duration is potentially a non-starter.
If we can't lock out readers, what do we do? A tried and true idea is "generations." There exists multiple different versions of the store on disk corresponding to states between "pack" operations. These are typically modeled as a monotonically increasing integer. Generation 0 is the initial data. Generation 1 is the data after the first "pack." Generation 2 the "pack" after that. etc. "Packing" would take a writer lock and copy/write store data to a new "generation." When complete, the repo's "current generation" pointer would be updated to the just-produced data and subsequent readers would consume data from the new generation. Old/active readers would continue reading data from the previous generation. Once all readers of old generations have gone away, the old generation can be deleted. This could be implemented as a side-effect of the final reader "releasing" its "handle" on the repo (a concept that doesn't exist in Mercurial today - see ReadLockPlan for something similar) or could be performed by a CRON or other periodic task (even randomly invoked by hg commands, although that's dangerous because random pauses are bad for user experience - see Git and its random repacking).
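The generation pointer swap can be sketched as follows, assuming a storestate file holding the current generation number (that file name comes from the proposal's own layout sketch; everything else here is illustrative). An atomic rename guarantees that readers see either the old pointer or the new one, never a partially written file.

```python
import os
import tempfile

def read_generation(store):
    """Return the current generation number a reader should use."""
    with open(os.path.join(store, "storestate"), "rb") as f:
        return int(f.read())

def publish_generation(store, gen):
    """Atomically repoint readers at generation `gen`.

    Write the new pointer to a temp file, fsync it, then rename over
    the old one. rename/replace is atomic, so a concurrent reader sees
    either the old or the new generation, never a torn write."""
    fd, tmp = tempfile.mkstemp(dir=store)
    with os.fdopen(fd, "wb") as f:
        f.write(b"%d" % gen)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, os.path.join(store, "storestate"))

store = tempfile.mkdtemp()
publish_generation(store, 0)   # initial clone data lives in gen-0
publish_generation(store, 1)   # after a "pack", flip readers to gen-1
```

Deleting the gen-0 directory would then wait until all readers holding the old pointer have gone away, which is exactly the "handle release" problem the text notes Mercurial doesn't solve today.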
A major downside of "packs" and generations is extra storage and I/O cost. If producing a full new generation, you'll need 2x storage for at least the duration of the pack. If only packing the "revlog extent files" (assuming we allow that mode), the extra cost is the size of data in those files (which should be small compared to the size of the overall repo, but would approach infinity if a pack is never performed).
It's worth noting that a writer lock may not need to be taken when packing to a new generation! Assume generation N+1 is a deterministic result of the combination of "packed revlogs" and "revlog extent files" from generation N, and initially has no "revlog extent files" of its own (only the single "packed revlog" entity). As generation N+1 is being written, it could exist in a special "not ready, see generation N" state. Writers could read from generation N and then write to revlog extent files in the not-yet-fully-written generation N+1. Subsequent writers could combine generation N's state with N+1's revlog extent files. This is basically adding an extra layer to the magical revlog stacking that normally exists within a generation with its hybrid store approach. As soon as the single "packed revlog" for generation N+1 is fully written, generation N becomes irrelevant and writers use N+1's "packed revlog" instead of constructing the same state from N. In terms of implementation, this is somewhat complicated. And there may be some issues with copied data deduplication that rule it out. But it's a clever way for servers to perform low priority, frequent major packing without requiring a writer lock.
Implementing the Hybrid Store
(This section is far from complete.)
In the above section, we devised a hybrid store format consisting of "packed revlogs" and "revlog extent files". There were also multiple "generations" for the store to facilitate consistent views during packing. How would this look on disk and what would the commands to administer the repository look like?
A new format requirement would be required for the .hg/requires file. Let's call it packedstore.
The filesystem layout for .hg would look like:
The storestate file contains information about the generation state of the store. It at least contains a pointer to the active generation. If a pack is being performed, it contains metadata for that as well. The file is atomically replaced when changed.
When the repository is opened, the storestate file is read to obtain the current generation number. Then, the gen-N directory is used as the base for interactions with the store. If it exists, the 00packed.idx will be read into memory (or at least the file opened to perform a binary search when needed). An individual revlog is represented as data from the base 00packedN.data file(s) (if available) concatenated with the data from data/<path>.i or data/<path>.d (if available). From the revision access layer in Python, there is a single revlog: the existence of multiple backing files is abstracted away.
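As a rough illustration of the open path just described (the file names follow the proposal's own sketch, but the exact layout is hypothetical): read storestate, derive the gen-N directory, and compute which backing files exist for a given store path.

```python
import os
import tempfile

def backing_files(hgdir, path):
    """Return the existing backing files for one store path under the
    proposed layout: packed index/data plus optional extent files."""
    store = os.path.join(hgdir, "store")
    with open(os.path.join(store, "storestate"), "rb") as f:
        gen = int(f.read())
    gendir = os.path.join(store, "gen-%d" % gen)
    candidates = [
        os.path.join(gendir, "00packed.idx"),       # packed revlog index
        os.path.join(gendir, "00packed0.data"),     # packed revlog data
        os.path.join(gendir, "data", path + ".i"),  # revlog extent index
        os.path.join(gendir, "data", path + ".d"),  # revlog extent data
    ]
    return [p for p in candidates if os.path.exists(p)]

# Demo: fabricate a minimal store with a packed index and one extent file.
hgdir = tempfile.mkdtemp()
gendir = os.path.join(hgdir, "store", "gen-0")
os.makedirs(os.path.join(gendir, "data"))
with open(os.path.join(hgdir, "store", "storestate"), "wb") as f:
    f.write(b"0")
open(os.path.join(gendir, "00packed.idx"), "wb").close()
open(os.path.join(gendir, "data", "foo.i"), "wb").close()
```

Everything above this function (the storestate read and gen-N resolution) happens once per repository open; only the candidate probing is per-path.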
hg optimize would be a user-facing command used to perform optimizations of the repository/store, including "packing." Let's assume for now that hg optimize would perform a full pack by default (most repositories are small so it can't hurt to do a full pack).
- Stripping from repos with "packed revlogs"
- Is this just producing a new generation but with a reader lock?
- Integration with advanced reader/writer/transaction plan that's been architected
- This is a new store format. There are no backwards compatibility guarantees. How much do we want to scope creep it?
- New revlog format?
- File size limits
- Pretty sure filesystems and even Python have limits of 2GB or 4GB
- Surely this will require supporting multiple "packed revlog" files?