Repository Design Question

Wed Jul 18 15:41:55 CDT 2007

Greetings, all.

My professor and I are attempting to design a repository for storing the
code for the local operating systems class.  We have been using mercurial
repositories for about a year and so have some experience with the system,
but it is entirely possible that (potentially quite useful) features have
remained outside our experience envelope.

Some Background
---------------

In order to understand our requirements, some (simplified) overview of the
course, from the student's perspective, will probably be helpful.

Project 1 is writing device drivers for a PS/2 keyboard and text-mode VGA
console and exercising those drivers with a minimal kernel-mode game.  The
project tree looks like

 410kern/ -- support code we provide (malloc(), sprintf(), etc)
 kern/ -- code students write (device drivers & game)

Project 2 is writing a user-land thread library which runs on a kernel we
provide in binary form.  The project tree looks like

 410kern/
  kernel.o
 410user/
  lib/ -- user-land library support code (a different malloc(), sprintf())
  progs/ -- thread-library test programs we provide
 user/
  lib/ -- student-written thread library code
  progs/ -- student-written test programs

Project 3 is writing a kernel to replace the binary blob in Project 2.  The
project tree looks like

 410kern/ -- support code we provide (malloc(), sprintf(), etc)
 410user/
  lib/ -- user-land library support code (a different malloc(), sprintf())
  progs/ -- kernel test programs we provide
 user/
  lib/ -- student-written thread-library code
  progs/ -- student-written test programs

Internally, there are many commonalities between parts of these project
trees:
 * 410kern/ is almost identical between P1 and P3
   [but totally different in P2]
 * 410user/ is similar but not identical between P2 and P3.  Notably, the
   set of test programs differs dramatically.
 * Some code (e.g. sprintf()) is identical between 410kern and 410user.
 * Other similar relations exist (trimmed from this example).

The student's perspective on assignment life-cycle is that we ship a tarball
(generated by cloning our repository and deleting the .hg directory) for
each project.  The students add code and ship back an augmented tarball.
(The reality is more complex, but the point is that the students get an
un-annotated directory tree, not a full repository.  Many students
internally use hg, svn, or another revision control system.)
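
As a rough sketch of that packaging step: the function below assumes a fresh
clone of the project repository already exists at the given path. The name
make_handout and its arguments are hypothetical, chosen only for illustration.

```shell
# make_handout CLONE OUT: strip Mercurial metadata from CLONE (assumed to be
# a throwaway clone, since .hg is deleted in place) and pack the remaining
# tree into the gzip-compressed tarball OUT.
make_handout() {
    clone=$1
    out=$2
    rm -rf "$clone/.hg"
    # The redirection is opened before the subshell changes directory, so a
    # relative OUT is resolved against the caller's working directory.
    (cd "$clone" && tar -czf - .) > "$out"
}
```

For example, `make_handout /tmp/p1-clone p1.tar.gz` would leave an
un-annotated tree in the tarball, with member names relative to the clone.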

Our Current Plan
----------------

The current situation is a bit of a disaster: each project has its own
repository, and changes are copied between them by hand, usually whenever
we remember or, somewhat lazily, as each project approaches.

What We'd Like
--------------

What we would like to do is treat each project as a branch in a single
repository.  For example, on the P2 branch, most of 410kern/ is deleted and
a kernel.o is added.  410user/progs originally contains all test programs
but on the P2 branch the P3 tests are deleted and on the P3 branch the P2
tests are deleted. [We do not require students to fix bugs in their thread
library in order to do well on the kernel assignment, so we remove the
library-specific tests from the P3 tarball.]

In this model, there will be recurring merges between the P2 and P3 branches
as bugs are discovered and fixed.  For example, if we fix a bug in common
code, e.g., malloc(), the result will be an undesirable divergence between
the branches, which we would like to resolve by a merge.  Since such
divergences will occur over time, such merges will occur multiple times.
However, some divergences are permanent -- for example, tests which are
deleted on one branch should stay deleted on that branch.  The situation is
repeated at the level of individual files: on the P3 branch, malloc.c needs
locking hooks which should be absent from the P1 branch, but any other
change made to the kernel malloc() should be reflected to the other branch
via merges.

In short: when we merge the P2 branch against the P3 branch (which is
different from "merge P2 with P3"), deleted files on each branch should stay
deleted, meaning mercurial should neither ask about them nor recreate or
delete files inappropriately.  Each time the P1 and P3 branches are merged,
malloc.c should be considered for merging only if it has changed on either
branch since the last merge, and ideally only the part of the file that
changed should be viewed as conflicting.  That is, the locking code in P3
should be seen as persistent on the P3 branch.

Mercurial seems to handle situations like malloc.c reasonably well, but its
file deletion behavior, in this circumstance, leaves something to be desired:

## [A] Initialize repository and import the full test suite
$ hg init
$ touch p2only p3only
$ hg commit -A

## [B] Specialize the P2 branch
$ hg branch P2
$ hg rm p3only
$ hg commit

## [C] Go back to common and specialize the P3 branch
$ hg update -C 0
$ hg branch P3
$ hg rm p2only
$ hg commit

## [D] Let's see what happens if we merge
$ hg merge P2
$ ls
## Because we are on the P3 branch, we expect to see p3only and have this
## merge be a no-operation.  However, the result is an empty directory.
## Unfortunately, we must restore p3only from the P3 branch and commit
## that as the merged state.
$ hg revert -r P3 p3only
$ hg commit

## [E] What happens if we merge P3 into P2 now?
$ hg update -C P2
$ hg merge P3
## We should have a working directory containing only p2only, and this merge
## should again have been a no-operation.  What we actually have is only
## p3only; p2only is gone.  Again, we must revert to put things back to
## where they should be.
$ hg revert -r P2 p2only p3only && rm p3only.orig
$ hg commit

## The repository history now looks (works best with a fixed-width font) like
##
## A-B---E (P2)
##  \ \ /
##   C-D (P3)

## If a bugfix is made on P2...
$ echo foo > p2only
$ hg commit
$ hg update -C P3
$ hg merge P2
## Again, this merge should have been a no-operation, as no files on the P3
## branch were affected by the changes made on the P2 branch.  Instead, the
## result is that p3only is deleted and the updated p2only is imported.

Fundamentally, the design assumption behind this behavior seems to be that
branches are ephemeral and intended to be merged only once.

Sad Alternative
---------------

An alternative approach, which we like less, would be to encode the
branching in a directory tree, by, e.g., splitting 410user/ into 410user.P2,
410user.P3, and 410user.shared/.  To avoid a combinatorial explosion for our
students, a level of indirection is necessary on checkout: a "generate P2"
script would create a merged 410user from 410user.shared and 410user.P2,
and a "generate P3" script would do likewise with 410user.P3.  We would
also need a "propagate
change" script which would know that a change to 410user/progs/testfoo.c
should propagate back to 410user.P2/progs/testfoo.c while a change to
410user/lib/sprintf.c should be copied back to 410user.shared/lib/sprintf.c.
This approach has some difficulties with files like malloc.c; the tool would
probably have to ask us if a diff were intended to be common or specific to
the current composed view.
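
To make the indirection concrete, here is a minimal sketch of the two
scripts under some loudly stated assumptions: the function names
generate_view and propagate_change are hypothetical, both run from the
directory holding 410user.shared/ and 410user.P2/ (or 410user.P3/), and a
file belongs wholly to one source tree.  It therefore does not address the
malloc.c case, where a single diff mixes common and project-specific
changes.

```shell
# Hypothetical helpers; names and layout are illustrative assumptions.

# generate_view PROJECT: compose 410user/ from 410user.shared/ plus
# 410user.PROJECT/.  Project-specific files win on a name clash.
generate_view() {
    proj=$1
    rm -rf 410user
    mkdir 410user
    cp -R 410user.shared/. 410user/
    cp -R "410user.$proj/." 410user/
}

# propagate_change PROJECT PATH: copy 410user/PATH back to whichever source
# tree already holds PATH, preferring the project-specific tree.  Prints
# the destination, or fails if PATH exists in neither source tree.
propagate_change() {
    proj=$1
    path=$2
    if [ -e "410user.$proj/$path" ]; then
        dest="410user.$proj/$path"
    elif [ -e "410user.shared/$path" ]; then
        dest="410user.shared/$path"
    else
        echo "unknown origin: $path" >&2
        return 1
    fi
    cp "410user/$path" "$dest"
    echo "$dest"
}
```

With this layout, `generate_view P2` followed by
`propagate_change P2 lib/sprintf.c` would copy an edited sprintf.c back to
410user.shared/lib/sprintf.c, provided 410user.P2/ has no file of that name.
Brand-new files are rejected because the sketch cannot tell which tree they
belong to.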

At present, manually propagating changes between repositories is painful,
error-prone, and diminishes the history-preserving properties of the
revision control system.  The merge-script/propagate-script approach seems
rather painful to think about, document, and implement.  Are we overlooking
a way to make the desired mechanism, namely branching, work?

Thanks much in advance.
--nwf;