Note:

This page is primarily intended for developers of Mercurial.

Adding a cache for dirstate

Status: Project

Main proponents: Raphaël Gomès

/!\ This is a speculative project and does not represent any firm decisions on future behavior.

This page describes the plan and status for adding a new caching mechanism for dirstate.status.

1. Goal

Doing the opendir/readdir/closedir loop on the tracked files and the rest of the working directory is the most expensive operation in status. Reducing the number of readdirs has a huge potential for speed improvement.

In addition, the underlying status operation is used by various core commands like hg commit or hg diff, so any performance improvement to dirstate.status will have a wide impact.

Case in point, Valentin Gatien-Baron introduced the first prototype for hg status in Rust in early 2019, which had such a cache: it removed about 80% of the status time in his use-case.

2. Detailed description

2.1. General principle

We cache all traversed directories' mtimes at a nanosecond precision and whether they are cache-able (see "What to cache") to avoid having to check them twice, once during dirstate stat'ing which is inevitable, and once during the unknown phase. See "Caveats" for its limitations.

2.2. Expected performance gains

The difference can already be seen on a clean repo (Netbeans' at 100k files):

This was run on a laptop with 4 physical cores, which is a good representation of the average user's configuration.

On a different repository with a lot of ignored files and a beefier 16-core machine:

This is a massive improvement that scales better the larger the repository. It should also scale better with repos that have more files, especially many files in few directories.

2.3. Overview

2.3.1. Caveats

This cache is of limited portability since it relies on the mtime of directories changing when files are added/removed in them, which does not work on all pairs of OS/filesystem.

It also relies on fs-level nanosecond precision of mtime which - while not fundamental - makes it possible to assume that every change results in a different mtime (which is not the case at second granularity). This simplifies the code, and makes the performance simpler to understand.

We need to have a reliable (and hopefully fast) way of checking those rules. Merely checking the mtime is not sufficient because the OS may have nanosecond precision in the disk cache irrespective of the precision on disk.

The simplest solution might be to just keep an array of compatible filesystems (for now we only support Linux in Rust anyway) and make a statvfs call to check.

2.3.2. What to cache

This cache will store directories that only contain files that are:

We cannot cache files in "added" state since we can just hg forget them, they disappear from the dirstate without invalidating the cache.

Caching "removed" files is fine, since running hg add on a removed file will only put it back in its previous state in the dirstate.

Caching modified (merged), and normal files is fine, since any update to them has to touch the mtime.

We *could* cache the unknown files, but it would make the format more complicated, and probably make the performance worse overall for a limited interest. Keep a clean working directory and your .hgignore up-to-date and you'll be fine.

To recap: we are caching directories containing only ignored, normal, removed, or merged (modified) files.

2.3.3. Operation

This cache is used when listing unknown files (bare hg status) and when *not* listing ignored files. It is updated during the traversal recursively

Creating/updating a directory cache entry is storing:

The cache is updated on-disk all at once, this is not an append-only cache.

During traversal, each directory is looked up in the cache. If its mtime is unchanged and it only contains cache-able files, it has already been stat'ed during the dirstate phase. Otherwise fall back to stat'ing and update the cache entry.

2.4. Implementation

For now, I intend for this cache to be Rust-only. It's an optimization, only has local impact and is not extremely portable (see above). Porting it to the Python implementation later might be worth the effort.

2.4.1. Cache integrity and invalidation

The cache should be still valid if:

We can check this with a (probably SHA2-blake2) hash of everything in this list, although it might be interesting to store the fs separately.

2.4.2. Proposed binary format

Everything in big-endian.

3. Roadmap

4. See Also


CategoryDeveloper CategoryNewFeatures

DirsCachePlan (last edited 2021-05-27 08:25:10 by RaphaelGomes)