RFC: Implementing shallow clones (cloning only a subset of history)

Peter Arrenbrecht peter.arrenbrecht at gmail.com
Thu Jun 5 03:04:17 CDT 2008


Hi all

I'm working on adding what I call shallow clones to Hg. These are
clones that do not pull the entire history from a server, but only a
subset starting at a given revision. The idea is that one often does
not need the entire history to work on new features, but only a recent
part of it. So shallow clones can save bandwidth and disk space.

I know this overlaps with a GSoC project. But the main focus of that
project is narrow clones, meaning cloning a subset of files, not a
subset of history.

The work is related to TrimmingHistory[1] and OverlayRepository[2],
but the approach is different. In contrast to the punching in
TrimmingHistory[1], I do not want to have to keep the entire index
around. Only the part of it that we're interested in. This is done by
calling localrepo.changegroup() with a suitable base revision list and
a new flag to make it return the initial revisions in full (not as a
delta). For the moment I do not attempt to dynamically pull missing
data as needed. A shallow clone is a shallow clone.

This means we shall have missing parent revs. These I currently simply
set to nullrev. The main problem will then be to ensure we don't
bungle merges because of missing revs.
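
To make the parent handling concrete, here is a minimal sketch in Python.
It is a toy model, not code from the patch queue: the function name
shallow_parents and the "kept" set are hypothetical, and revisions are
plain integers rather than real revlog entries.

```python
NULLREV = -1  # Mercurial's conventional "no parent" revision number


def shallow_parents(parents, kept):
    """Replace parents that were not pulled with NULLREV.

    parents: (p1, p2) tuple of revision numbers in the source repo
    kept:    set of revisions that are present in the shallow clone
    Returns (mapped_parents, is_shallow), where is_shallow is True when
    at least one real parent had to be dropped.
    """
    mapped = tuple(p if p in kept else NULLREV for p in parents)
    is_shallow = any(p != NULLREV and p not in kept for p in parents)
    return mapped, is_shallow
```

A rev whose parent is outside the kept set comes back with that parent
set to NULLREV and is_shallow set, which is the condition the proposal
has to track to keep merges safe.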

The new aspect with shallow clones is what I call "disconnected"
heads. These are heads that would normally be related, but their
common ancestor is missing in the shallow clone. I propose that merge
no longer accept unrelated heads, that is, heads whose common ancestor
is nullrev, unless --force is specified. This ensures we do not
accidentally merge disconnected heads, as they will appear unrelated
to the shallow clone. If necessary for backwards compatibility, we can
make merge behave in this way only for shallow clones. They can be
identified by their changelog having shallow revs. We might also issue
warnings or abort if the ancestor-detecting code in merge touches
shallow revs (I have not experimented with this yet).
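
The proposed merge rule can be sketched as follows. This is a toy model
over a dict of parents, not Mercurial's actual ancestor code; the names
ancestors and check_merge are hypothetical, and picking the highest
common rev is a simplification of real ancestor selection.

```python
NULLREV = -1


def ancestors(parents, rev):
    """Return rev and all of its ancestors, excluding NULLREV."""
    seen = set()
    stack = [rev]
    while stack:
        r = stack.pop()
        if r == NULLREV or r in seen:
            continue
        seen.add(r)
        stack.extend(parents[r])
    return seen


def check_merge(parents, a, b, force=False):
    """Refuse to merge heads whose only common ancestor is NULLREV.

    Returns a common ancestor rev, or NULLREV when force is given and
    the heads are unrelated (or disconnected in a shallow clone).
    """
    common = ancestors(parents, a) & ancestors(parents, b)
    if not common:
        if not force:
            raise ValueError(
                "heads %d and %d have no common ancestor; "
                "use force to merge anyway" % (a, b))
        return NULLREV
    # Revs are numbered after their parents, so the highest common rev
    # is not an ancestor of any other common rev.
    return max(common)
```

In a shallow clone, disconnected heads look exactly like unrelated ones
to this check, which is why the default has to be refusal.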

It may be necessary to record the base rev used when cloning in
.hg/hgrc so subsequent pulls can specify it again. This is to avoid
pulling undesired history along with new heads that reference it.
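
As an illustration only, reading such a stored base rev back could look
like the sketch below. The [shallow] section and its "base" key are
hypothetical names, not an existing Mercurial setting, and the sketch
uses Python's configparser rather than Mercurial's own config code.

```python
import configparser


def read_shallow_base(hgrc_text):
    """Return the recorded base rev, or None if the repo is not shallow.

    Assumes a hypothetical section of the form:
        [shallow]
        base = <changeset id>
    """
    parser = configparser.ConfigParser()
    parser.read_string(hgrc_text)
    return parser.get("shallow", "base", fallback=None)
```

A pull would then pass the returned value as the base again, so that
new heads do not drag in the history below it.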

I'm using bit 0 of the revlog index entry flags to flag shallow revs,
that is revs with a missing parent rev. This is currently used by `hg
verify` to flag such revs as errors with a meaningful message, and by
revlog.revision() to skip the hash check.
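
The reason the hash check has to be skipped can be shown with a small
sketch. It assumes the node hash is the SHA-1 of the two sorted parent
ids followed by the revision text; the names SHALLOW_FLAG and
check_revision are hypothetical and not taken from the patch queue.

```python
import hashlib

NULLID = b"\0" * 20
SHALLOW_FLAG = 1 << 0  # bit 0 of the index entry flags, as proposed


def node_hash(text, p1, p2):
    """SHA-1 over the sorted parent ids followed by the text."""
    a, b = sorted([p1, p2])
    s = hashlib.sha1(a)
    s.update(b)
    s.update(text)
    return s.digest()


def check_revision(text, p1, p2, node, flags):
    """Verify the node hash unless the rev is flagged as shallow."""
    if flags & SHALLOW_FLAG:
        # The stored parents were replaced by NULLID, so the hash
        # computed from them cannot match the original node id.
        return True
    return node_hash(text, p1, p2) == node
```

With the real parent replaced by NULLID the recomputed hash differs from
the node id, so an unflagged shallow rev would be reported as corrupt.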

So far, I have a very basic test scenario working: linear history, do
a shallow clone, log, update, verify. Other scenarios I am going to
test include:

	* pulling does not pull formerly ignored revs in same head
	* pulling does not pull entire history because of merge with formerly
disconnected heads
	* allow to pull disconnected heads
	* don't allow to merge disconnected heads by default
	* allow to merge heads with disconnected ancestry, but a known common ancestor
	* should pulling new disconnected history be possible?
	* can we bungle a merge of two connected heads because a nearer common
ancestor is missing?
	* can push back to original repo
	* can clone shallow repo; must again be shallow
	* can pull from shallow repo; must again be shallow
	* can push back to shallow repo
	* can bundle and unbundle shallow revs; must again be shallow
	* can email shallow revs

The patch queue (still very much a work in progress) can be obtained from

	http://freehg.org/u/parren/hg-shallow-clone-queue

and is currently based on 1603bba96411 from crew.

Comments welcome!
-parren

[1] http://www.selenic.com/mercurial/wiki/index.cgi/TrimmingHistory
[2] http://www.selenic.com/mercurial/wiki/index.cgi/OverlayRepository

