Mercurial manifest sharding

Problem statement: imagine a repository with 1M to 1B files.

Somewhere in this range, the RAM overhead of a single manifest becomes a problem.

Checkout: we don't want to materialize the working copy on the machine, and we don't want the whole manifest on the local machine.

Limitations of large manifests/repos:

manifest too large for RAM
checkout too large for local disk
clone size too large for local disk
manifest resolution too much CPU
100k+ files on HFS+ perform badly

Two possible paths forward: explicit shard boundaries, or a tree-state hash that lets a client elide the subdirectories it is not interested in.

A sample repository with 1M files is hosted on Google Drive (441MB).

0.1. Tree-state hash

The current plan for making the manifest hash something that clients with only a partial checkout can compute is a per-directory hash that bubbles up, with entries for those directory nodes stored in their parent with a d in the flags entry. We considered taking a hash of the filename and starting a shard wherever hash mod == 0, but decided that this would probably lead to lots of churn and would also bake the sharding scheme into the manifest hash (which might be suboptimal).
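A minimal sketch of the per-directory hash, to make the "bubbles up" idea concrete. The line layout (name, NUL, hex node, flags), the use of SHA-1, and the helper names build_tree and hash_dir are assumptions made for illustration, not the actual on-disk format. A narrow client replaces a subtree it does not have with a single (hash, "d") entry and still arrives at the same root hash.

```python
import hashlib


def build_tree(files):
    """Turn {"src/main.c": (node, flags)} into nested dicts, one per directory."""
    root = {}
    for path, entry in files.items():
        *dirs, name = path.split("/")
        node = root
        for part in dirs:
            node = node.setdefault(part, {})
        node[name] = entry
    return root


def hash_dir(tree):
    """Hash one directory; each subdirectory contributes its own hash, flagged 'd'."""
    lines = []
    for name in sorted(tree):
        child = tree[name]
        if isinstance(child, dict):
            # Full subtree available: recurse and let its hash bubble up.
            node, flags = hash_dir(child), "d"
        else:
            # A file, or an elided directory already stored as (hash, "d").
            node, flags = child
        lines.append(f"{name}\0{node}{flags}\n")
    return hashlib.sha1("".join(lines).encode("utf-8")).hexdigest()


# Hypothetical repository contents: path -> (hex node, flags).
full = build_tree({
    "README": ("11" * 20, ""),
    "docs/guide.txt": ("44" * 20, ""),
    "src/main.c": ("22" * 20, ""),
    "src/util.c": ("33" * 20, "x"),
})

# A narrow client that elides docs/ keeps only that directory's hash.
narrow = dict(full)
narrow["docs"] = (hash_dir(full["docs"]), "d")

assert hash_dir(narrow) == hash_dir(full)
```

Because a directory entry looks the same whether its hash was computed locally or merely recorded, the root hash does not depend on which subtrees a given client chose to elide.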

Pulling sharded manifests will require client support; that is the second step.

Challenges:

actually breaking out a shard becomes expensive, as you have to split/join for network traffic
pushing means the server has to produce a matching-spec narrow manifest to apply the delta
- Durham proposes storing in the manifest the number of bytes the client has elided. That would let us produce the same delta as though we were operating on the full manifest, and apply full-manifest deltas even when they contain parts we don't care about.
need out-of-manifest management of some sort?
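
The elided-byte-count proposal above can be sketched as follows. This is an illustration under stated assumptions, not the real delta format: a narrow manifest is modeled as a list of segments, each either kept bytes or an integer count of elided bytes, and a delta is a sorted list of non-overlapping (start, stop, data) hunks with offsets into the full manifest. The names apply_full_delta and patch_full are hypothetical. Hunks that straddle a kept/elided boundary are rejected rather than handled, and an insertion exactly on a boundary is attributed to the earlier segment.

```python
def patch_full(text, hunks):
    """Apply (start, stop, data) hunks to the complete manifest text."""
    out, pos = [], 0
    for start, stop, data in hunks:
        out += [text[pos:start], data]
        pos = stop
    out.append(text[pos:])
    return b"".join(out)


def apply_full_delta(segments, hunks):
    """Apply a full-manifest delta to a narrow manifest.

    Kept segments are patched; elided segments only have their byte count
    adjusted, so later offsets still line up with the full manifest.
    """
    out, pos, i = [], 0, 0
    for seg in segments:
        size = seg if isinstance(seg, int) else len(seg)
        end = pos + size
        new, shift = seg, 0
        while i < len(hunks) and hunks[i][0] >= pos and hunks[i][1] <= end:
            start, stop, data = hunks[i]
            growth = len(data) - (stop - start)
            if isinstance(seg, int):
                new += growth  # content is not ours; track its size only
            else:
                a, b = start - pos + shift, stop - pos + shift
                new = new[:a] + data + new[b:]
                shift += growth
            i += 1
        if i < len(hunks) and hunks[i][0] < end:
            raise ValueError("hunk straddles a kept/elided boundary")
        out.append(new)
        pos = end
    if i != len(hunks):
        raise ValueError("hunk lies beyond the end of the manifest")
    return out


# Hypothetical three-line manifest; the client has elided the docs/ line.
kept1 = b"README\0" + b"1" * 40 + b"\n"
docs = b"docs/guide.txt\0" + b"2" * 40 + b"\n"
kept2 = b"src/main.c\0" + b"3" * 40 + b"\n"
full = kept1 + docs + kept2
narrow = [kept1, len(docs), kept2]

new_docs = b"docs/guide.txt\0" + b"4" * 40 + b"\ndocs/zeta.txt\0" + b"6" * 40 + b"\n"
src_hash = len(kept1) + len(docs) + len(b"src/main.c\0")
hunks = [
    (len(kept1), len(kept1) + len(docs), new_docs),  # touches only elided bytes
    (src_hash, src_hash + 40, b"5" * 40),            # touches kept bytes
]

new_full = patch_full(full, hunks)
new_narrow = apply_full_delta(narrow, hunks)
assert new_narrow == [kept1, len(new_docs), new_full[-len(kept2):]]
```

The first hunk changes content the client never sees, yet the narrow manifest still ends up consistent with the patched full manifest because the elided count absorbs the size change.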

0.2. Explicit shards

A user

TreeManifestPlan
