Mercurial manifest sharding

Dealing with big repos.

1. Introduction

Problem statement: imagine you have 1M to 1B files.

Somewhere in this range, the RAM overhead of an individual manifest becomes a problem.

Checkout: we don't want to materialize the working copy on the machine, and we don't want the whole manifest on the local machine.

Limitations of large and linear manifests:

A sample repository with 1M files is hosted on Google Drive (441MB).

2. Proposed Solution

Every directory would have its own manifest revlog, and thus its own nodeid. This addresses the issues above:
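One way to see why per-directory nodeids help: two trees can be compared top-down, and any subtree whose nodeid is unchanged can be skipped without being read. The sketch below is illustrative only. The Dir class, the changed_files helper and the hashing recipe are assumptions made for this example, not Mercurial's actual implementation (real nodeids also incorporate parent revisions).

```python
import hashlib


class Dir:
    """Hypothetical in-memory directory manifest (not Mercurial's class)."""

    def __init__(self, files=None, dirs=None):
        self.files = files or {}  # file name -> file nodeid (hex string)
        self.dirs = dirs or {}    # subdirectory name -> Dir
        self.nodeid = self._hash()

    def _hash(self):
        # Assumed recipe: hash the sorted entries. Subdirectories contribute
        # their own nodeid, so a change deep in the tree bubbles up to the root.
        h = hashlib.sha1()
        for name, node in sorted(self.files.items()):
            h.update(f"{name}\0{node}\n".encode())
        for name, sub in sorted(self.dirs.items()):
            h.update(f"{name}/\0{sub.nodeid}\n".encode())
        return h.hexdigest()


def changed_files(old, new, prefix=""):
    """Return paths of files that differ between two trees."""
    if old.nodeid == new.nodeid:
        return []  # identical subtree: nothing below needs to be visited
    changed = [
        prefix + name
        for name in sorted(set(old.files) | set(new.files))
        if old.files.get(name) != new.files.get(name)
    ]
    empty = Dir()
    for name in sorted(set(old.dirs) | set(new.dirs)):
        changed += changed_files(
            old.dirs.get(name, empty), new.dirs.get(name, empty), f"{prefix}{name}/"
        )
    return changed


old = Dir(dirs={"python": Dir(files={"a.py": "1" * 40}),
                "mobile": Dir(files={"b.txt": "2" * 40})})
new = Dir(dirs={"python": Dir(files={"a.py": "3" * 40}),
                "mobile": Dir(files={"b.txt": "2" * 40})})
print(changed_files(old, new))  # -> ['python/a.py']
```

In this example the mobile/ subtree has the same nodeid in both trees, so its contents are never inspected.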

Some further benefits:

Costs:

3. Current Plan

martinvonz to follow these steps:

  1. Parse manifests into a tree data structure when the config option is set
  2. When the config option is set, store one manifest revlog per directory
  3. Compare performance on a large repository (e.g. Firefox) and improve it
  4. Hack together some experimental narrow functionality in order to see how things work with push/pull over the network
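Step 1 above can be sketched roughly as follows. A flat manifest stores one entry per line in the form path, NUL byte, 40 hex digits of nodeid, then optional flags. The function name parse_flat_manifest and the nested-dict representation are assumptions chosen for illustration, and the sketch works on str rather than bytes for brevity.

```python
def parse_flat_manifest(text):
    """Parse flat manifest text into nested dicts (illustrative sketch).

    Directories become dicts; files become (nodeid_hex, flags) tuples.
    """
    root = {}
    for line in text.split("\n"):
        if not line:
            continue  # skip the empty string after the final newline
        path, rest = line.split("\0", 1)
        node, flags = rest[:40], rest[40:]
        *dirs, name = path.split("/")
        cur = root
        for d in dirs:
            # Walk down, creating intermediate directories as needed.
            cur = cur.setdefault(d, {})
        cur[name] = (node, flags)
    return root


flat = (
    "README\0" + "a" * 40 + "\n"
    + "python/mach/main.py\0" + "b" * 40 + "x\n"
)
tree = parse_flat_manifest(flat)
print(sorted(tree))  # -> ['README', 'python']
```

Each nested dict here corresponds to what step 2 would store in its own per-directory revlog.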

4. Performance

The following are best-of-5 timings on the Mozilla repo converted to GeneralDelta. ~28k recent revisions were rewritten so that tree manifests would not get an unfair advantage from the short delta chains that would result if only a few revisions had been rewritten. We will try to keep these numbers up to date as we work on tree manifests.

Command                                      v1, flat  v2, flat  v2, tree  Comments
-------------------------------------------  --------  --------  --------  --------
hg files -r .                                0.791     1.174     2.318     Particularly parsing is not yet optimized, but tree still expected to be slower than flat
hg files -r . python/                        0.379     0.743     0.149     ~800 files out of 115k files
hg diff --change .                           0.624     1.456     0.618     A ~7k-line diff
hg status --rev .~1000 --rev .               0.411     1.236     1.185     ~43k differing files
hg status --rev .~10000 --rev .              1.080     1.634     2.362     ~8k differing files
hg status --rev .~10000 --rev . -C python    1.825     2.224     1.251
hg rebase --keep -d new-tip~10 -r new-tip~8  -         4.573     3.653     ~60% spent in dirstate.status
hg log --limit 10 -p python/                 3.437     10.999    0.526

5. Alternatives considered

5.1. Sub-manifests at custom positions in tree

A user splits a shard out using a command like hg debugmarkshard foo/bar/baz, which is then stored as a sub-manifest in a different revlog.

Challenges:

Objections:

5.2. Tree-state hash, but flat manifest

To make the manifest hash something that clients with only a partial checkout can compute, do a per-directory hash that bubbles up, and store entries for those directory nodes in their parent with a d in the flags entry. We considered hashing the filename and starting a shard wherever hash mod == 0, but decided that this would probably lead to lots of churn, and that it also bakes the sharding scheme into the manifest hash (which might be suboptimal).

This will require client support for pulling sharded manifests; that is the second step.
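A minimal sketch of the bubbling per-directory hash, starting from flat path-to-nodeid entries. The function name directory_entries, the exact bytes fed to the hash and the returned layout are all assumptions for illustration; the only details taken from the text are that each directory gets a hash derived from its children and that directory entries carry a d flag.

```python
import hashlib
from collections import defaultdict


def directory_entries(files):
    """Map each directory path to (hash, 'd'); '' is the repository root.

    files: dict of file path -> file nodeid (hex string).
    """
    children = defaultdict(dict)  # dir path -> {child name: nodeid or None}
    for path, node in files.items():
        parent, _, name = path.rpartition("/")
        children[parent][name] = node
        # Register every ancestor; subdirectories are keyed with a trailing
        # slash and a None placeholder until their hash is known.
        while parent:
            grand, _, dname = parent.rpartition("/")
            children[grand].setdefault(dname + "/", None)
            parent = grand

    def depth(p):
        return p.count("/") + 1 if p else 0

    hashes = {}
    # Deepest directories first, so child hashes exist when parents need them.
    for d in sorted(children, key=depth, reverse=True):
        h = hashlib.sha1()
        for name in sorted(children[d]):
            child = children[d][name]
            if child is None:
                child = hashes[(d + "/" if d else "") + name[:-1]]
            h.update(f"{name}\0{child}\n".encode())
        hashes[d] = h.hexdigest()
    return {d: (digest, "d") for d, digest in hashes.items()}


before = directory_entries({"a/b/f": "1" * 40, "c/k": "2" * 40})
after = directory_entries({"a/b/f": "9" * 40, "c/k": "2" * 40})
print(before["c"] == after["c"])  # -> True
```

Changing a/b/f alters the hashes of a/b, a and the root, while the entry for c stays the same, so a client holding only c can still verify its part of the tree.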

Challenges:

6. Related: sparse checkouts

Currently: hg sparse --include mobile/

It doesn't matter if the repo has other stuff; you only get the mobile directory.

hg sparse --enable-profile mobile

Profiles live in the repo (.hgsparse).

Proposal: have team-specific .hgsparse files in directories. This allows changes without contention. (hg sparse --enable-profile mobile[/.hgsparse])

Future magic: hg clone --sparse mobile (to avoid the initial full clone)

Merges get a little complicated with the regexp matching used now. The proposal is to use directories for includes, but allow regex/glob for excludes (so that certain types of files, like Photoshop files, are not written).
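The proposed matching rule (directories for includes, globs for excludes) could look roughly like this. The make_sparse_matcher helper is hypothetical and is not code from the sparse extension; it only illustrates the rule described above.

```python
from fnmatch import fnmatchcase


def make_sparse_matcher(include_dirs, exclude_globs=()):
    """Build a predicate: path is under an included dir and not excluded."""
    # Normalize to a trailing slash so "mobile" does not match "mobiles/".
    prefixes = [d.rstrip("/") + "/" for d in include_dirs]

    def matches(path):
        if not any(path.startswith(p) for p in prefixes):
            return False
        # fnmatchcase's "*" also matches "/", so "*.psd" applies at any depth.
        return not any(fnmatchcase(path, g) for g in exclude_globs)

    return matches


match = make_sparse_matcher(["mobile/"], ["*.psd"])
print(match("mobile/app/main.c"))     # -> True
print(match("mobile/art/logo.psd"))   # -> False
print(match("desktop/main.c"))        # -> False
```

Keeping includes as plain directory prefixes means two include sets can be merged by taking a union of directories, while excludes remain pattern based.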

7. Narrow changelog

See NarrowClonePlan
