RFC: Updated plan for adding shallow clones to Mercurial (partial history)

Peter Arrenbrecht peter.arrenbrecht at gmail.com
Thu Apr 2 02:19:04 CDT 2009


On Wed, Apr 1, 2009 at 9:23 PM, Matt Mackall <mpm at selenic.com> wrote:
> On Wed, 2009-04-01 at 14:40 +0200, Peter Arrenbrecht wrote:
>> Hi all
>>
>> I have uploaded a new, much updated plan for adding shallow clones to Mercurial:
>>
>>   http://www.selenic.com/mercurial/wiki/index.cgi/ShallowClone
>>
>> Feedback is very welcome.
>>
>> The plan is the product of quite a bit of prototyping which tells me
>> the basic ideas should be fairly sound. The last prototype which
>> already works to a large degree over the wire is at
>> http://bitbucket.org/parren/hg-shallow/. It differs markedly from the
>> proposed plan in how unwanted changesets are negotiated, though.
>>
>> Matt, can you imagine something like this plan being accepted into Hg
>> at some point? We might get a student tackling it in the current
>> Summer of Code.
>
> My intuition is that this is more complicated than necessary.
>
> In particular, there's a fair amount of trickery here with trimming and
> hiding various dangling links outside of the wanted history. If instead,
> you were to keep full revlog indexes around, you wouldn't have this
> problem. You would be able to trivially identify precisely what was a
> dangling (but otherwise valid) link and what was a corrupt link.

True. This would be the Punching approach, I think. But I don't
believe it would really reduce complexity. The bit about faking
parentids is pretty simple and works well enough in the prototype
already.

One worry I have with Punching is that it will expose all code
accessing historical data (as opposed to just cset descriptions) to
these punched revs. Maybe it's sufficient for the most part if revlog
just raises an N/A error when these are accessed, which would normally
abort the operation, and the few other locations either do an explicit
test first or catch the exception.
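
Purely as an illustration (this is not revlog's actual API; the names
are made up), I picture something along these lines:

    class PunchedRevisionError(Exception):
        """Raised when the data of a punched-out revision is requested."""

    class ShallowRevlogSketch(object):
        def __init__(self, index, data, punched):
            self.index = index           # full index entries for every rev
            self.data = data             # rev -> text, only for revs we hold
            self.punched = set(punched)  # revs whose data was never fetched

        def revision(self, rev):
            if rev in self.punched:
                # most callers just let this abort the operation; the few
                # that can cope catch it or do an explicit test beforehand
                raise PunchedRevisionError("rev %d is punched out" % rev)
            return self.data[rev]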

Going forward, this approach could mean that new code would probably
not handle shallow repos properly until someone first notices the
problem, tests are added, and the code is taught about punched revs.
Not necessarily a big problem, but something I guess we would mostly
avoid with my approach.

My other misgiving is that it would still keep a lot of unwanted data
around on the client (and over the wire initially). For instance, if a
project went through a few reorgs, we'd end up with tons of pointless
files in the store.

Mainly, though, you're not solving the problem of formerly unwanted
filerevs suddenly becoming necessary after pulling a merge with a
formerly unwanted branch (my _possibly absent filerevs_). And you'd
still need a means to get them across the wire once they become
wanted. In my plan, the server needs to know what the client is sure
to have present so it can send the missing ones in the bundle. We'd
still need that, or we could, as in my current prototype, just query
them on demand when the client detects their absence. However, I
believe this has two problems:

 * Consider a pull where your first cset is the merge introducing the
absent filerev, followed by another one modifying said file. The
normal protocol would send the modification as a diff against the
absent filerev, meaning the client code would have to request the
absent filerev out-of-band while streaming in the bundle. Or else we'd
have to add some serious addrev-postponing logic so we can gather all
the queries into a single second request (see the sketch after this
list).

 * These queries mean the bundle created by incoming --bundle, and
others, might not be complete. This could break a few workflows. This
is true even if we gather all the queries into a single second
request.
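
To illustrate the batching idea (all names here are hypothetical, not
the actual bundle code): chunks whose delta base is absent get parked
and resolved with one follow-up request instead of an out-of-band
query per chunk.

    def apply_bundle(chunks, store, fetch_filerevs):
        """chunks: (base, node, patch) triples; store: node -> full text;
        fetch_filerevs: batched request returning {node: full text}."""
        postponed = []
        for base, node, patch in chunks:
            if base is None or base in store:
                store[node] = patch(store.get(base, ""))
            else:
                postponed.append((base, node, patch))  # base absent locally
        missing = set(base for base, node, patch in postponed)
        if missing:
            store.update(fetch_filerevs(missing))  # the single second request
            for base, node, patch in postponed:
                store[node] = patch(store[base])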

>
> I'm also a little concerned by things like dangling merges looking like
> linear changesets and having only one root. The restrictions on push are
> a little confusing.

The "only one root" part could probably be lifted. But why is it worrying you?

And what about the linearization of the merges? Just a hunch (which I
share to some degree), or do you already see some more concrete
problems?

> Some silly ideas: what if we introduced a new pseudo-changeset named
> "prehistory" that all dangling links could point to?

Yes, I thought about this too. It would certainly help with clearly
identifying dangling merges (for glog, for instance). And its
changelist could introduce all the initial filerevs not introduced by
the root's changelist, so they don't appear out of nowhere. But the
latter would break for formerly absent filerevs, since we would then
have to _update_ its changelist.
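
For illustration, pointing the dangling links at such a sentinel would
be trivial on the client side; a made-up sketch (the sentinel value is
invented):

    PREHISTORY = "f" * 40  # made-up sentinel node id meaning "in prehistory"

    def shallow_parents(parents, have):
        """Point parent node ids that fall outside the shallow clone at
        the prehistory sentinel instead of leaving them dangling."""
        return tuple(p if p in have else PREHISTORY for p in parents)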

In the end, I had a feeling it wouldn't buy us much, given that we
have to be sure never to send this pseudo-changeset out over the wire
or into a bundle, except maybe when cloning a shallow repo.

But it's clearly debatable.

> What if we keep a complete index for just the changelog? This solves
> most of the linkrev and merge pointer issues.

Yep, it would help with merges in that we can safely discover whether
the closest common ancestor (CCA) is present. In my approach, we just
have to forbid situations where we have a CA but there _might_ be a
CCA that is absent.
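
Roughly, with the full changelog index that check becomes a plain DAG
walk; a naive sketch (not hg's ancestor code, using rev numbers and
made-up inputs):

    def ancestors(rev, parents):
        """parents: rev -> list of parent revs (roots have [])."""
        seen, stack = set(), [rev]
        while stack:
            r = stack.pop()
            if r not in seen:
                seen.add(r)
                stack.extend(parents[r])
        return seen

    def cca_is_present(a, b, parents, have_data):
        """Use the highest-numbered common ancestor as a stand-in for the
        CCA and check whether we actually hold its data."""
        common = ancestors(a, parents) & ancestors(b, parents)
        return bool(common) and max(common) in have_data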

It might also simplify negotiation of incoming nodes, but I fear that
since we'd still need detection of possibly absent filerevs, it
wouldn't buy much. I'll think about this a bit more, though.

But what about the linkrevs? They would point to their proper node,
but would we keep that node's changelist around? More unwanted data.
And even if we do, the node's manifest will be empty (nullid?). So the
picture presented is not really much more consistent than with
redirected linkrevs, I think.

And the complexity of finding the rev to redirect to in my current
approach is limited, as it's part of finding possibly absent filerevs,
which we can't avoid anyway. Having the full changelog would avoid the
need to put this redirection information on the wire, though.

A full changelog would also give misleading information about what
history we have available locally, unless we teach all
history-displaying code to either flag shallow nodes visually or skip
them.

> What if we add something like 'archive' to the wire protocol so that a
> client can make a 'synthetic' repo up to a given rev from a flat set of
> files, then do a normal pull on top of it?

Isn't this 'archive' basically what my initial shallow clone does by
sending along all filerevs mentioned in the root node's manifest?
Support for sending these filerevs is fairly trivial, and some form of
it is needed later on anyway for possibly absent filerevs (where it is
a bit less trivial).
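
In sketch form (all names invented), that initial step just walks the
root's manifest and ships one full file revision per entry:

    def initial_shallow_filerevs(root_manifest, read_filerev):
        """root_manifest: {path: filenode}; read_filerev: (path, node) -> text."""
        for path, filenode in sorted(root_manifest.items()):
            yield path, filenode, read_filerev(path, filenode)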

What if I base this 'archive' on node A, which is on a separate branch
from node B? What happens when I pull? Do I get all of B with its
history way into the past? If not, then how is this different from the
approaches discussed above?

Thanks for your feedback,
-Peter (parren)


