Turning cvs2{svn,git} into cvs2hg

Greg Ward greg at gerg.ca
Fri Jul 31 15:45:35 CDT 2009

[Background, if you had forgotten about this thread: on July 17, I
crossposted to mercurial-devel and dev at cvs2svn, looking for anyone to
talk me out of adding a new backend to cvs2svn that writes directly to a
Mercurial repository.  Nobody did, so I've been busy working away on
it.]

This post is partly status report, and partly reply to Michael
Haggerty's post.  The status is pretty good: as of yesterday, my cvs2hg
successfully converts cvs2svn's test-data/main-cvsrepos (one particular
test input); as of today, it passes a bunch of automated tests based on
that conversion.  There are still 77 other CVS repositories in that
test-data directory, though, and I'm willing to bet that every one of
them exhibits some novel usage of CVS that has been seen somewhere in
the wild.

On Sun, Jul 19, 2009 at 2:41 PM, Michael Haggerty <mhagger at alum.mit.edu> wrote:
> IMHO the impedance mismatch has nothing to do with cvs2git vs
> hg-fastimport.  It is between CVS/Subversion's model of branches and
> tags vs git/Mercurial's.  (At my basic level of understanding of
> Mercurial, its branching model seems quite similar to git's.  Maybe
> I'm wrong.)

Fundamentally true: git and Mercurial both really only work on a whole
tree.  That's a fundamental design feature, and probably the biggest
difference between the git/hg/bzr camp and cvs/svn.  (Well, that and the
centralized vs. distributed thing.)

> Now, the most minimal, unambitious, sine qua non requirement for a
> cvs2hg conversion tool is that the results of checking out a tag or the
> tip of a branch in CVS and Mercurial should be identical.

Agreed.  Which is why I'm surprised that I haven't stumbled across a
tool that will validate a CVS->{svn,git,hg} conversion for me by
comparing checkouts.  Does everyone who hacks on this stuff just write
their own?  (I have, but I didn't really want to. And I don't like it.)
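
For illustration, a minimal sketch of such a checkout comparison in
Python.  This is not the script mentioned above; the names
compare_checkouts, list_files and METADATA_DIRS are made up for the
example, and it assumes both checkouts already exist on disk as plain
directories.

```python
import filecmp
import os

# Directories holding VCS metadata, ignored when comparing working copies.
METADATA_DIRS = {"CVS", ".hg", ".git", ".svn"}


def list_files(root):
    """Return the set of relative file paths under root, skipping VCS metadata."""
    found = set()
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune metadata directories in place so os.walk does not descend.
        dirnames[:] = [d for d in dirnames if d not in METADATA_DIRS]
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def compare_checkouts(cvs_dir, hg_dir):
    """Return (only_in_cvs, only_in_hg, differing) as sorted lists."""
    cvs_files = list_files(cvs_dir)
    hg_files = list_files(hg_dir)
    differing = sorted(
        path for path in cvs_files & hg_files
        # shallow=False forces a byte-by-byte content comparison.
        if not filecmp.cmp(os.path.join(cvs_dir, path),
                           os.path.join(hg_dir, path), shallow=False)
    )
    return sorted(cvs_files - hg_files), sorted(hg_files - cvs_files), differing
```

Three empty lists mean the two trees match.  A real validator would
also have to deal with things this sketch ignores, such as CVS keyword
expansion and the .hgtags file in the Mercurial working copy.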

> CVS and Subversion allow branches and tags to be used in ways that would
> be considered blasphemous in the DVCS world, and people really use these
> features as important parts of their workflow.  For example, in CVS you
> can do things like
> - Tag (or branch) a subset of files from the source branch.
> - Add some files from branchA and some from branchB to a tag or branch.
[...several other CVS peculiarities elided...]
> Therefore,
> regardless of the tool, we need a way to represent all of the above
> situations in Mercurial.  Decide that and the rest is a simple matter of
> programming.

Not *that* simple.  Of course, I'm still finding my way around the
internals of cvs2svn.  Thank you for the extensive and copious

Anyways, my basic technique for fixup commits is:

  * pick a source revision (using the logic already
    implemented in GitOutputOption._get_source_groups())
  * use that as the first parent of the fixup commit
  * in the fixup, explicitly delete any files that do not have the tag /
    are not on the branch
  * pick and choose file revisions to match the tagged/branched

This is a pragmatic approach designed around the following constraints:

  * CVS checkout must equal hg checkout (Michael's "sine qua non")
  * trying not to abuse the mechanics of merge changesets: in
    particular, I do *not* claim that a fixup commit is a merge because
    it includes file contents from >1 revision.  First, unlike git,
    Mercurial has a hard limit of 2 parents.  Second, such a fixup isn't
    really a merge, so I don't want to make it look like one.
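
The core of the fixup technique described above can be sketched as a
diff between two manifests.  This is an illustration only, not
cvs2hg's code: plan_fixup is a hypothetical name, and it assumes a
manifest is simply a dict mapping file path to CVS revision string.

```python
def plan_fixup(parent_manifest, target_manifest):
    """Describe the fixup commit that turns parent_manifest into target_manifest.

    Both arguments map file path -> CVS revision string (e.g. "1.4").
    Returns (deletes, updates): the paths to remove in the fixup, and a
    dict of path -> revision for files that must be added or replaced.
    """
    # Files in the chosen parent that do not carry the tag / are not
    # on the branch get deleted explicitly.
    deletes = sorted(p for p in parent_manifest if p not in target_manifest)
    # Files whose tagged revision differs from the parent's (or that the
    # parent lacks entirely) are picked from the tagged revision.
    updates = {
        path: rev
        for path, rev in target_manifest.items()
        if parent_manifest.get(path) != rev
    }
    return deletes, updates


parent = {"a.c": "1.3", "b.c": "1.2", "old.c": "1.1"}
tagged = {"a.c": "1.3", "b.c": "1.2.2.1", "new.c": "1.1"}
deletes, updates = plan_fixup(parent, tagged)
```

The resulting commit has exactly one parent, however many source
revisions the updated files come from, so it never looks like a merge.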

> Another issue that has not been resolved satisfactorily is what to
> record in the DAG for CVS branches that do not start as 1:1 copies of a
> source branch.  Should a source branch be chosen to be the parent of the
> new branch anyway?

I use the same trick for tag fixups and branch fixups.  (After all,
branches are just weird tags in CVS.  And Subversion tags are the same
as branches.)  Right now, I'm arbitrarily picking one source revision as
the branch parent.  Would be nice if I could select a "best fit" source
revision rather than an arbitrary one.
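
One possible meaning of "best fit" is the candidate that needs the
smallest fixup.  A minimal sketch under that assumption, with
hypothetical names (fixup_size, best_fit_source) and manifests again
modelled as dicts of path -> CVS revision string:

```python
def fixup_size(source_manifest, target_manifest):
    """Count the files a fixup would have to add, replace or delete."""
    paths = set(source_manifest) | set(target_manifest)
    return sum(
        1 for p in paths if source_manifest.get(p) != target_manifest.get(p)
    )


def best_fit_source(candidates, target_manifest):
    """Pick the candidate changeset whose manifest is closest to the target.

    candidates maps a changeset id to its manifest.  Ties are broken by
    the sorted order of the ids so the choice is deterministic.
    """
    return min(
        sorted(candidates),
        key=lambda cid: fixup_size(candidates[cid], target_manifest),
    )


candidates = {
    "rev10": {"a": "1.1", "b": "1.1"},
    "rev11": {"a": "1.2", "b": "1.1"},
    "rev12": {"a": "1.2", "b": "1.2", "c": "1.1"},
}
branch_start = {"a": "1.2", "b": "1.1"}
best = best_fit_source(candidates, branch_start)
```

Other metrics are conceivable (weighting deletions less than content
changes, say); counting differing files is just the simplest one.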

> And what should happen to the DAG if files are added from a source
> branch to an existing branch?  Should the commit be considered a merge
> with the new source branch as the second parent, even though the source
> branch might have other content that wasn't merged over?

I don't think so.  It's not really a merge; it's just copying content
from another branch.

> Currently,
> cvs2* always creates a merge commit whenever any content is added from
> one branch to another, but I am skeptical that this is the best behavior.

Me too, which is why my cvs2hg does not represent fixup commits as
merges.

> It could very well be that Mercurial has no way of representing a
> general CVS repository, or that the only way is prohibitively
> inefficient.

I don't think things are *that* bad.  If the goal is "checkout parity",
then fixup commits are a great hack to work around weird CVS tags.  If
the goal is nice-looking Mercurial history ...

> In that case it would be nice if
> the tool offered the user a way to selectively discard information in
> such a way as to maximize the value of the resulting repository.

...exactly.  You got there ahead of me, but I got there in the end.

Anyway: IMHO the default behaviour should be "checkout parity",
i.e. checking out the same tag/branch in CVS and Mercurial should give
the same tree.  That's what I'm working to implement now.  Then I'm
going to work on options to selectively discard information to make a
prettier history.  (E.g. "if a fixup commit only deletes files because
this was just a CVS partial tag, don't bother with the fixup: let the
Mercurial tag be a superset of the CVS tag".)
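
The example option in the parenthesis above could be decided roughly
like this.  It is a sketch only: needs_fixup and allow_superset_tags
are invented names, not existing cvs2svn options, and manifests are
assumed to be dicts of path -> CVS revision string.

```python
def needs_fixup(parent_manifest, tag_manifest, allow_superset_tags=False):
    """Decide whether a tag needs a fixup commit on top of its parent.

    With allow_superset_tags=True, a fixup that would do nothing but
    delete untagged files is skipped, so the Mercurial tag ends up as a
    superset of the CVS tag.
    """
    for path, rev in tag_manifest.items():
        if parent_manifest.get(path) != rev:
            # A tagged file is missing from the parent or has different
            # content: the fixup is required for checkout parity.
            return True
    # Only untagged extras can remain at this point.
    has_extra = any(p not in tag_manifest for p in parent_manifest)
    return has_extra and not allow_superset_tags


parent = {"a": "1.1", "b": "1.2", "c": "1.1"}
partial_tag = {"a": "1.1", "b": "1.2"}
```

With the default, the partial tag still gets its fixup and checkout
parity holds; with the option on, parity is traded for a cleaner
history.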

> It's premature to think about technical solutions before the conceptual
> decisions have been made.  But all else being equal, I think that having
> a lingua franca for DVCSs would be a big advantage, and git-fast-import
> format is the only obvious contender right now.

I still think fastimport is a very good thing for git<->bzr<->hg
conversions.  I just think that adding an extra layer between CVS and
Mercurial makes things harder.  I can make a better Mercurial repository
if I have direct access to cvs2svn's internal data structures, since
they tell me more about the original CVS repo than a fastimport dump
ever would.

