Mercurial manifest sharding

Problem statement: imagine you have 1M to 1B files.

Individual manifest RAM overhead is a problem somewhere in this range.

Checkout: we don't want to materialize the working copy on the machine, and we don't want the whole manifest on the local machine.

Limitations of large manifests/repos:

Two possible paths forward: explicit shard boundaries, or a tree-state hash that lets a client elide subdirectories it is not interested in.

A sample repository with 1M files is hosted on Google Drive (441MB).

0.1. Tree-state hash

The current plan for making the manifest hash something that clients with only a partial checkout can compute is a per-directory hash that bubbles up: each directory node gets an entry in its parent, marked with a d in the flags entry. We considered hashing each filename and starting a new shard wherever that hash is zero modulo some value, but decided that this would probably lead to lots of churn. It also bakes the sharding scheme into the manifest hash, which might be suboptimal.
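The per-directory hash can be sketched as follows. This is only an illustration of the idea, not Mercurial's actual manifest format: the entry layout, the use of SHA-1, and the names build_tree, tree_hash and Elided are all assumptions made for the example. It shows that a client holding only the hash of an elided subdirectory still arrives at the same root hash as a client with the full tree.

```python
import hashlib


class Elided:
    """Hypothetical placeholder for a subdirectory the client did not fetch.

    Only the directory's hash is known, not its contents.
    """

    def __init__(self, node):
        self.node = node


def build_tree(files):
    """Turn {'a/b.txt': hexnode} into nested dicts; leaves are hexnode strings."""
    root = {}
    for path, node in files.items():
        parts = path.split("/")
        cur = root
        for name in parts[:-1]:
            cur = cur.setdefault(name, {})
        cur[parts[-1]] = node
    return root


def tree_hash(tree):
    """Hash a directory from its sorted entries.

    Assumed entry layout: name, NUL, hex node, optional flags, newline.
    Subdirectory entries carry a 'd' flag and the subdirectory's own hash,
    so a change deep in the tree bubbles up to the root.
    """
    lines = []
    for name in sorted(tree):
        entry = tree[name]
        if isinstance(entry, Elided):
            lines.append(f"{name}\0{entry.node}d\n")
        elif isinstance(entry, dict):
            lines.append(f"{name}\0{tree_hash(entry)}d\n")
        else:
            lines.append(f"{name}\0{entry}\n")
    return hashlib.sha1("".join(lines).encode("utf-8")).hexdigest()


# Full tree, as a server would see it (file nodes are made-up values).
files = {
    "docs/readme.txt": "a" * 40,
    "src/main.c": "b" * 40,
    "src/lib/util.c": "c" * 40,
}
full = build_tree(files)

# Partial tree: the client has docs/ but only the hash of src/.
partial = build_tree({"docs/readme.txt": "a" * 40})
partial["src"] = Elided(tree_hash(full["src"]))

print(tree_hash(full) == tree_hash(partial))
```

Because each parent stores only the hash of a child directory, the sharding scheme itself stays out of the hash: how the directory contents are split up for storage or transfer does not affect the result.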

Pulling sharded manifests will require client support; that is the second step.

Challenges:

0.2. Explicit shards

A user