[GSoC-2009] Shallow Clones

Madhusudan C.S madhusudancs at gmail.com
Wed Apr 1 10:24:43 CDT 2009


Hi all,
     This is a rough draft proposal I have come up with for Shallow clones
based on the Plans that Peter just sent in to the list. I request you all to
review it for me. Since I am running out of time and djc urged all students
to submit it to Google by today, I am submitting it there too. I continue
editing it and making changes as per community suggestions and opinion.
(Please point out if there are any kind of mistakes, including technical,
spelling and grammatical. Since this is the first version, I am doing all
the work to revise the proposal to correct such mistakes, your suggestions
will help me make it faster.)

~~~~~~~~~~~~~~~~~~~~~~~
 Title: Shallow clones
~~~~~~~~~~~~~~~~~~~~~~~

~~~~~~~~~
 Abstract
~~~~~~~~~

Greetings!

I wish to complete the implementation of Shallow clones in Mercurial
which is already a work in progress. I wish to implement shallow
clones for repos with multiple branch histories. I also re-implement
push, pull, merge, branch, verify, log and update to work with shallow
clones whereever necessary. I will make shallow cloning as transparent
as possible to the existing hg code without changing it much whereever
feasible. If not re-implement those parts where a change is required
without breaking any existing code.  I will also add a good number of
test cases to cover the smaller issues that may arise from this
re-implementation.


NOTE: The terminologies used in this proposal are suggested by Peter.
      I will upload the suggested terminology in my blog. Ref-[8]
~~~~~~~
 Why?
~~~~~~~

  Mercurial so far doesn't have any support for Partial Clones.
Partial Clone includes two aspects, they are:
1. Narrow Clones: Cloning only a subset of files and directories in
   the repo.
2. Shallow Clones: Cloning only the most recent history, i.e the
   history that is only relevant to us or the one that is most
   important.

The reason why Shallow Cloning is required is that, Mercurial stores
entire history as a collection of changesets between revsions of the
repo. As the project grows big and few major architectural changes
have been made in the course of its development some of the previous
history may become irrelevant to at least some people for further
development. It sometimes so happens that the history grows to
hundreds of Megabytes which is clearly a waste of bandwidth to clone
it fully when everything is not required. So this project attempts
to tackle this issue by providing an option for shallow cloning the
Mercurial repositories.

~~~~~~~
 How?
~~~~~~~

  The project begins with making small but useful changes to the
existing RevlogNG format to accomodate Shallow Clones, if possible
changes are made to accomodate Narrow Clones too, since I am
considering Narrow clones too if time permits in the GSoC time line
or will take it up post the GSoC. Peter currently uses bit 0 of
revlog which is used by hg verify to flag as error about missing
parent revs. This also includes making the existing code to adapt
to the changes to the revlog format.

Shallow Clone:
  The project then proceeds by implementing cloning of repositories
by allowing shallow clone to happen at a single rooted node(Changeset).
This is done by user as follows:

hg clone [--root <rev>] src target
(The client support is added at this stage itself to facilitate
further testing)

  Since shallow cloning repos with linear history has already been
dealt, the project implements the cloning of repos with non-linear
history. It implements the handling of merges where one of the parent
of the branch is not part of the shallow clone. The plan is not to
allow such merges and unwanted parent will be set to Null.

  Thus the merge node becomes a partial node. Even though it is
partial, we do not re-hash to re-compute the node ID, but retain the
previous one. This is to make the partial node transparent to existing
Mercurial code. But the code that tests for Revision ID over the data
will be changed to adapt to this scenario. Further we keep the
unwanted data corresponding to the partial node in a separate file
whose format is similar to the revlog format.

  Also during such a merge all the filerevs(including unchanged ones)
in the manifest of the unwanted parent node are added to the manifest
of the merged node.

  Coming back to the cloning itself, whenever we shallow clone a repo
rooted at a single node we include all the filerevs in the manifest of
the root node even though they are not present in the root's changeset.

Branch and Merge:
  The project then proceeds by implementing "branch"-ing and
"merge"-ing. During merges any reference to unwanted nodes will be
handled by:
- If Second Parent is unwanted, it is made Null
- If First Parent is unwanted, it is set to the Second Parent and
  Second Parent is set to Null.
- If it is some other changeset, then the reference is forwarded to
  Shallow clones root.

  Also, during merging, as mentioned above we don't allow merges from
full repo if the branch is an unwanted branch, thus again making the
merge node partial node. All other rules as said before for root node
apply. Further if we encounter any other merges which has the partial
node in the path to its ancestor we don't allow such merges. However
an option to forcibly overcome this behaviour will be provided and
implemented. Branching will be quite straight forward as long as the
branch nodes are not partial nodes, which will mostly be the case.
However if it is a partial node, the above said rules will be used
during merging.

Push and Pull:
  The project then proceeds implementing Push and Pull between shallow
repos and full repos and between shallow repos themselves. This phase
starts by determining the shallower root of the two repos, which is
implemented using the algorigthm suggested by Peter (Ref-[9] - under
Determining Shared Root). Now I change the implementation of
repo.findincoming() a bit so that it filters out all the unwanted
nodes before sending. This is done by using the shallower root passed
as the argument to findincoming().

  Further while sending nodes during push, the header of the bundle
is changed to include the target's root. Also while sending we will
send the full real data even though it is not needed. This is because
we have an additional file where we store the unwanted data. The
unwanted data is however not written to the actual revlog. We also
add in a flag for each target shallow revision to be sent.

  We also further implement sending deltas against the nodes that the
Target knows only. These deltas must also be only against Source known
nodes only. In case there is a node one of whose parents in unknown,
delta is calculated against other parent. Also I implement sending
all the filerevs that are unknown in the target at this stage. This
essentially means we send all the filerevs that are in the source's
wanted nodes manifest but not in target's root manifest. In case
of bringing in a merge I implement sending only those filerevs in
the unwanted parent's manifest and not in wanted parent's manifest.

  The project then proceeds implementing how the incoming nodes must
be added in the target. For this, whenever we have unknown Parent ID
mentioned we see if the bundle has the same root as the target or
target shallow revision flag is set for every revision in the bundle.
If so we abort the pull operation or else unwanted Parent IDs are set
to Null and the real information is added to the above mentioned
supplementary file. It also includes implementing check for unknown
filerevs, use them and put the previous linkrev to the supplementary
file.

Verify:

  For this whenever we encounter a partial node we check for original
information stored in another file. However the complete algorithm for
this needs to be worked out.

  Following are the basic test scenarios I am going to consider during
the due course of the project(This is in no way complete and
comprehensive. I will add the corner cases to this list as and when
I indentify them).

- Shallow clone a full repo.
- Clone a Shallow repo.
- Shallow Clone a shallow repo.
  Each of the above three for cases of linear history and multiple
  branch histories.
- Push to full repo.
- Pull from the full repo.
- Push to the Shallow repo.
- Pull from the shallow repo.
- Branch and merge within Shallow repos and its child repos.
- Merge a branch from the full repo in the shallow repo and vice-versa.
- Merge a branch from the full repo in the shallow clone of a shallow
  repo and vice-versa.
- Push to the full repo from the shallow clone of a shallow repo.
- Pull from the full repo to the shallow clone of a shallow repo.
- Merge a branch from the full repo in the complete clone of a shallow
  repo and vice-versa.
- Push to the full repo from the complete clone of a shallow repo.
- Pull from the full repo to the complete clone of a shallow repo.
- hg verify on above said type of repos, shallow, shallow clone of a
  shallow and complete clone of a shallow.


~~~~~~~~~~~~~~~~~~~~~~~
 Benefits to Mercurial
~~~~~~~~~~~~~~~~~~~~~~~

  At the end of this project Mercurial users will be able to shallow
clone from the full repo, shallow clone the shallow repos, thereby
saving a lot of bandwidth and also get rid of the history that they
don't want any more. They will also be able to continue working on
these shallow repos as if they were full repos. They will be able to
branch, merge, push, pull and do other common operations on this repo.
They will also be able to bundle and unbundle the shallow repos.

~~~~~~~~~~~~~~
 Deliverables
~~~~~~~~~~~~~~
  1. Code to perform Shallow clones on both full repos and other
     shallow repo.
  2. Code to perform other operations like push, pull, merge, verify,
     bundle and unbundle
  3. Test cases, which covers good number of corner cases that might
     arise from shallow cloning.

  Non-Code deliverables include testing performed at various phases to
verify the correctness of the code and features implemented in
general. Also detailed user and development documentation for using
using Shallow clones in Mercurial will be added.


~~~~~~~~
 When?
~~~~~~~~

  The project is planned to be completed in 7 phases. Every phase
includes documenting the progress during that phase. Initially I put
my design decisions and planning documentation in the shallow clone
repo. The timeline for each of these phases is given below:
  1. Design Decisions and Initial preparation(Community Bonding Period
     : Already started - May 22nd )
        Closely working with Mercurial community to learn more about
        Mercurial code base in depth, Mercurial's approaches to
        different types of problems I may encounter in the due course,
        reading internals documentations. Communicating and discussing
        with the Mercurial community and mentor to resolve the
        outstanding issues and potential blockages for the progress.
        Design decisions are discussed and finalized. Identifying
        corner cases. Also I will submit as many patches as possible
        to gain a better understanding of the Mercurial code base.

  2. Coding and Testing Phase I (May 22th – June 4th )
        Implement shallow clones from repos(Shallow and Full) with
        non-linear history. Implement unwanted merges and construction
        of Partial nodes. Add and verify test cases.

  3. Coding and Testing Phase II (June 5th – June 21st )
        Implement Branch and Merge in Shallow repos. Add test cases
        and verify them.

  4. Coding and Testing Phase III (June 22nd – July 12th )
        Implement Push and Pull, Bundle and Unbundle. Add test cases
        and verify.

  5. Coding and Testing Phase IV (June 13th – July 20th )
        Implement Verify. Add test cases for them. Add all possible
        combinations of test cases for all the above introduced
        features.

  6. Requesting for community wide Reviews, testing and evaluation
    (July 21st – July 27th )
        Final phase of testing of the overall project, obtaining and
        consolidating the results and evaluation of the results.
        Requesting community to help me in final testing and closely
        working with testers.

  7. Scrubbing Code, Wrap-Up, Documentation
     (July 28th – August 3rd )
        Fixing major and minor bugs if any and merging the project
        with the Mercurial crew if possible. Writing User and
        Developer documentations and finalization. This documentation
        will be written as per the requirements of Mercurial
        Docuementation Guidelines to put it in to the docs of the
        Mercurial crew.

  Since the above project is scheduled to be completed in 11 weeks,
and we have 12+1 weeks in hand for GSoC, I will use the remaining 2
weeks to start working on the design of narrow clones, come up with a
rough plan and start implementing it. However I would continue to work
on it outside GSoC. In case I run out of time, I will make sure I
complete the work of Coding and Testing Phase III atleast and finish
Reviews, Testing and Documentation for whatever I have completed and
continue working on the left out phases after GSoC.

~~~~~~~~
 Where?
~~~~~~~~

   I am already comfortable with using the mercurial-devel mailing
list and IRC channel #mercurial at freenode.net. I will be able to
contact my mentor in both of the above two forms and will also be
available through google-talk(jabber). I am also comfortable with
using mercurial too since I have been using it from around 6-8 months

~~~~~~~~~~
 Why Me?
~~~~~~~~~~

  I am a 4th Year undergraduate student pursuing Information Science
and Engineering as a major at BMSCE, Bangalore, India(IST). Have been
using and advocating Free and Open Source Softwares from past 5 years.
Have been one of the main coordinators of BMSLUG. Have given talks and
conducted workshops on various FOSS tools:
- Most recently I conducted a Python and Django workshop at National
  Institute of Technology, Calicut, a premium Insititution around.
- How to contribute to FOSS? - A Hands-On hackathon using GNUSim8085
  as example.

http://groups.google.com/group/bms-lug/browse_thread/thread/0c9ca2367966727a
- Have been actively participating in various FOSS Communities by
  reporting bugs to communities like Ubuntu, GNOME, RTEMS, KDE.
- I was a major contributor and writer of the KDE's first-ever
  Handbook.
  http://img518.imageshack.us/img518/9796/hb1o.png
  http://img518.imageshack.us/img518/4296/hb2.png

  I have been contributing patches and code to various FOSS
communities, major ones being:
- GNUSim8085 (http://is.gd/p5wZ , http://is.gd/p5xK)
- KDE Step (http://is.gd/oci7)
- RTEMS
- Have been involved in Django community, but haven't written any
  code yet.
- Melange (The GSoC Web App.
  http://code.google.com/p/soc/source/browse/trunk/AUTHORS)

My Mercurial Work:
  I have been working and discussing with Peter Arrenbrecht about
implementing Shallow clones in Mercurial from some time. Have also
been working on the exercises he is giving me for implementing Shallow
clones. Ref-[7]. I have also sent a patch to Mercurial. Ref-[4][5]. I
have also read a fair amount of mercurial internals code while working
on those exercises Peter gave me. I have also helped to solve another
problem. Ref-[6].

  I have a fair understanding of concepts of Python and have One and
half years of Python experience. I have a fair understanding of
mercurial internals and its working and can follow its code without
any problem. I am also using Mercurial for some project from around 6
months.

  Since I have been working with FOSS communities I have a good
understanding of FOSS Development methodologies of communicating with
people, using issue tracker at code.google.com, coding and testing.
I have been using code.google.com for one my academic projects.

  Lastly I want to express my deep commitment for this project and
Mercurial in general. I'm fully available this summer without any other
commitments and will be able to tune my day/night rhythm as per my
mentor's requirement and assure a dedicated work of 35-40 hours/week.
Also I will assure that I will continue my commitments with Mercurial
well after GSoC. If you find any part of this proposal is not clear
please contact me.

~~~~~~~~~~~~~~~~~~~~~~~~~~
 Important Links and URLs
~~~~~~~~~~~~~~~~~~~~~~~~~~
[1]  My Blog: http://madhusudancs.info
[2]  My CV :
http://www.madhusudancs.info/sites/default/files/madhusudancsCV.pdf
[3]  IRC Nick: madrazr Server: irc.freenode.net
[4]  My patch to Mercurial:
http://selenic.com/pipermail/mercurial-devel/2009-March/010954.html
[5]  Commit to Crew:
http://hg.intevation.org/mercurial/crew/rev/7bcce39e8f07
[6]  My other work(not a patch):
http://www.selenic.com/mercurial/bts/issue1544
[7]  Discussion about Shallow clones -
http://selenic.com/pipermail/mercurial-devel/2009-March/010880.html
[8]  Suggested terminology - http://madhusudancs.info/shallow-terminology
[9]  Peter's Plan - http://bitbucket.org/parren/hg-shallow/wiki/Plan


-- 
Thanks and regards,
 Madhusudan.C.S
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://selenic.com/pipermail/mercurial-devel/attachments/20090401/0adcef09/attachment.htm 


More information about the Mercurial-devel mailing list