[GSoC-2009] Shallow Clones

Fri Apr 3 14:32:58 CDT 2009

Hi Matt,

  First of all sorry for the late reply. I had to give a
presentation in college today.

On Thu, Apr 2, 2009 at 1:26 AM, Matt Mackall <mpm at selenic.com> wrote:

> On Wed, 2009-04-01 at 20:54 +0530, Madhusudan C.S wrote:
> > Hi all,
> >      This is a rough draft proposal I have come up with for Shallow
> > clones based on the Plans that Peter just sent in to the list. I
> > request you all to review it for me. Since I am running out of time
> > and djc urged all students to submit it to Google by today, I am
> > submitting it there too. I continue editing it and making changes as
> > per community suggestions and opinion. (Please point out if there are
> > any kind of mistakes, including technical, spelling and grammatical.
> > Since this is the first version, I am doing all the work to revise the
> > proposal to correct such mistakes, your suggestions will help me make
> > it faster.)
>
> Given the complexity and number of possible design trade-offs here,
> I consider this particular project to still be in the research stage.
>
> In some ways the most important part of this whole project will be
> engaging with the community to settle on a good design. That process
> will likely not happen 'all on paper' - it's going to need some
> experimentation, possibly multiple proofs-of-concept, and lots of
> discussion. It may in fact take the bulk of the GSoC time, with a
> successful GSoc result being a prototype that's a solid foundation for a
> complete feature.
>
> On the other hand, it's also quite possible that we'll settle on a good
> path early on (maybe Peter will convince me that his is the right one).
>
> My point is: the schedule and the breakdown of the current design is not
> very interesting to me - it might all be irrelevant next week. It's much
> more important to me that you demonstrate understanding of the
> complexities of the problem. A detailed breakdown of the problem with
> attention to the 'traps' of particular approaches and some weighing of
> the trade-offs would be much more useful.
>

Let me make an attempt to explain the problem I have understood.
These are the problems that we face, in no particular order.

Firstly since Mercurial is a Distributed RCS, it has multiple lines of
development at any given point in time. Even though the history
is too big and some of them may become irrelevant after sometime,
there may be portions of history (may be on a different branch)
which may not be merged yet to the other branches.

Say I am working on a project and branch it at some point. The
default branch undergoes changes rapidly because of more manpower
and other branch is pretty slow. Say after 2-3 years, whatever I
did in default branch before 1 year is irrelevant, but the it may not
be the same on other branch. (This is one case which we need to
consider). When we want to merge this with the shallow clone of
default branch, we have a problem here, since we don't have
its common parent for a 3-way merge.

This is the case with every merge. Also when there are branches
which have multiple common ancestor, as in the case of one of the
examples you mailed the other day,
a - b - c
 \    X
 d - e - f

In this case if we shallow clone from b, and when we want to
merge c and f or those branches later, we miss e which is actually
the parent Mercurial chooses for Merge. We need to take care of this.
The solution proposed now doesn't allow such merges at all, making the
merge node partial node. Further the problem is, we also need to make
node c partial in this case, since we won't have e which is the parent
of that merge. Again this is a partial node.

Having partial node has a lot of problems which we need to solve. First
of all hash is calculated using full data in changeset. When we have
partial data we may have to recompute the ID. This causes a problem
though, if we use those changesets later for other push and pull operations
This also requires a lot of change in Mercurial code. So we plan to make
this as transparent as possible to the code. But again we must note that
hg verify has a problem here. If we have partial data, we cannot verify
it properly. During PULL from this shallow clone we get partial nodes and
the original ID. If verify just uses this ID to verify, then partial data
can
contain any malicious code. This is the problem of biggest concern, because
security is always a threat. We need to implement a mechanism for
hg verify to adapt to this change. This essentially means the partial
node mechanism changes are not fully transparent. So the plan now
is not to send partial information, but always send full info on the wire
and take only whatever is needed for the revlog locally. However we
are not discarding the real full data. We plan to keep it separately
in a separate file at the moment. We can as well keep it in revlog
itself. But that needs a lot of changes to revlog itself and huge lot
of code in mercurial not to consider parts of data that are no more
relevant.

Also the problems involved during push and pull are fairly complex too.
With the current implementation, all the nodes that are unknown
are sent. We must first take care to see to it that, nodes on unwanted
branches are not sent, since they are not known to the shallow clone.
Further bundle and unbundle may now contain partial nodes. The security
issue has already been said above. Apart from that, bundle should have
some mechanism of telling that it is from/for a shallow repo. Unbundle
should have a mechanism to know about the changed bundle format
(obvious ;-) ). Since we said above that we always send full information
only, unbundle should be changed a lot. We need to put in the partial
required data to revlog and non-required data to the other file we
plan to maintain. It also requires a lot of thought and discussion with the
community to decide upon the extra file format, changes in the code
to accommodate all these stuff etc.

All the above changes essentially require a change in Wire Protocol.
heads() must only return wanted heads, likewise branches() too.
Especially branches() need a good number of changes. Since branches
return tip, base, p1 and p2, it should return only wanted tips. base
must be obtained by checking if it is one of the children of the
shallow clone root, if not we must change branches() to adapt to
this. This means this branch has another sister branch that is an unwanted
branch, we must have to record this information while constructing bundle.
It may also be needed for unbundling. Further even if the base is part
of the shallow repo, if one of the parents is missing, i.e an unwanted
node, base now becomes a partial node. This is again an issue. We
must tackle this problem to tell that the one of the branches is
an unwanted branch and build the bundle such that is a partial node
after accepting the incoming node. Same problems extend to p1 and p2.
I am still not sure about this, but as I see now, I don't think we
may require considerable changes in between(). From what I said
above it is seems quite obvious that changegroup() requires a lot of
changes. Most of the changes it may require have already been
discussed with "base" when discussing branches. But that problem
is now extended to every node from the root of the branch upto
the tip. We must take care of all unwanted branches merging,
partial merge nodes and so on while creating changegroup.

Now coming to the problem of filerevs, it may so happen that
we clone at a single node root of the shallow clone. And a file
was not changed at that revision, so gets no mention in the revlog.
But sometime later that file might be changed in the history and
it seems to have come from no where. Also if we want to change
a file on our local branch which is not mentioned in the root, we
might not be able to get the changeset for the file. It might all
together look like a new file. Same thing is true when deleting.
If we attempt to do so, it will be like attempting to delete a non-
existing file. So our clone must have the knowledge of all the files.
We try to eliminate this problem by telling the manifest of the shallow
clone root about all the files even though they were not changed
in that changeset. This means that we may be listing all those files
that are no more relevant to the development at all. The manifest
of the root may now list lots of unwanted file names.

Also while performing merge with an unwanted branch, the unwanted
branch may have some new files added. We need to tell the partial
merge node about these filerevs, otherwise the problems said above
for root node crops up. These things must be done, even if those
files were not changed in the parent of the merge node in the
unwanted branch.

The problem of legacy ghosts us too. We may have a client that
is aware of shallow cloning, but the server might not be aware.
But client doesn't know this. Since so many changes are being
introduced all of a sudden, the problem of backwards compatibility
may arise. This is an issue because, Revlog format might be
changed a bit which means the entire code base of Mercurial
must adapt to this. But server may not know this. So if at all
we are going to make changes to Revlog format I need to involve
myself in a lot of community discussion and getting consent to
go ahead from the community, since changing revlog is an issue
of no smaller consequence. In this light, I also want to discuss with
the community to make changes to revlog or any other format that
requires a lot of changes later to make changes to accommodate
narrow clones too. Then I can start implementing narrow clones
after shallow clones without much trouble of changing all those
format related aspects again.

(I am just opening up a discussion by describing the problems I
have understood about Shallow Cloning. This is by no means fully
comprehensive. I would like to add upon more as the discussion
proceeds.)

-- 
Thanks and regards,
 Madhusudan.C.S

Blogs at: www.madhusudancs.info
Official Email ID: madhusudan at madhusudancs.info
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://selenic.com/pipermail/mercurial-devel/attachments/20090404/5e5209c7/attachment.htm