Turning cvs2{svn,git} into cvs2hg

Fri Jul 17 08:47:55 CDT 2009

Hi all --

[sorry for the crosspost, but I think this requires input from both
the cvs2svn and Mercurial development communities]

I long ago came to the conclusion that Mercurial's convert extension
is inadequate for handling real-world, industrial-scale CVS
repositories -- typically anything with branches or tags.  (I gather
git-cvsimport has the same problem, because it is based on the same
naive interpretation of CVS repositories.)  It appears that precisely
one tool is capable of comprehending the fearsome complexity of
real-world CVS repositories: cvs2svn.

So, I need to find a way to use all that existing code to convert CVS
to Mercurial.  The traditional answer has been a two-step conversion:
CVS to Subversion, Subversion to Mercurial.  OK, fine, I guess I can
do that if I *have* to.  But surely there is a better way.

And it looks like cvs2git + hg-fastimport should be the better way.
But there is a pretty big impedance mismatch between those two right
now, particularly in the way that cvs2git generates "fixup" commits to
turn CVS tags and branch points into something sensible for svn or
git.  (In a nutshell, my "small" CVS test repo with ~40 branches
becomes a 4000-head monster when I fastimport it into Mercurial.  If
you want to make Mercurial crawl, give it 4000 heads.  Not pretty.
And keep in mind that my real repository is ~8x bigger than my
small test repo.)

I can see three ways to fix that mismatch:

1) modify the way cvs2git generates fixup commits so that
hg-fastimport does not create pathological Mercurial repos
2) write a filter that turns the fastimport dump created by cvs2git
into something that hg-fastimport + Mercurial handle nicely
3) modify hg-fastimport to handle those fixup commits directly

I think #1 benefits the most people, since it could potentially make
life simpler for git fast-import as well.  (The git folks could, in theory,
eventually drop support for the implementation quirk that cvs2git
takes advantage of.  Unfortunately, they promoted that implementation
quirk to a documented part of the syntax when cvs2git started using
it, so that seems unlikely.)  (Michael H.: this is my brief summary of
a thread on the git mailing list that you pointed out to me a few
months ago; if I'm summarizing inaccurately, my apologies.)

#2 stinks, because it benefits only me (and my employer).  But it's
probably the quickest/easiest way, unless I can bribe someone else to
do #1 for me.  ;-)  It's also tricky because it requires combining
adjacent (and possibly non-adjacent) commits in the stream, which is
hard to do without reading the whole stream into memory.  And I'm
trying to avoid that for obvious scalability reasons.  (Another little
problem with "hg convert" is that it keeps too much in memory, which
makes large conversions hard.)
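
To make the streaming constraint in #2 concrete, here is a rough
sketch of folding *adjacent* commits while holding only one pending
commit in memory.  It is not code from cvs2git or hg-fastimport: the
Commit record, the fold_adjacent() helper and the "FIXUP:" message
marker are all made up for illustration, and it assumes the stream
has already been parsed into commit records.

```python
from typing import Callable, Dict, Iterable, Iterator, NamedTuple, Optional


class Commit(NamedTuple):
    """Hypothetical, simplified view of one commit in a fastimport stream."""
    ref: str               # branch ref the commit lands on
    message: str           # commit message
    files: Dict[str, str]  # path -> blob mark


def same_ref_fixup(prev: Commit, cur: Commit) -> bool:
    """Example predicate; the 'FIXUP:' prefix is an assumed marker."""
    return prev.ref == cur.ref and cur.message.startswith("FIXUP:")


def fold_adjacent(
    commits: Iterable[Commit],
    should_fold: Callable[[Commit, Commit], bool],
) -> Iterator[Commit]:
    """Yield commits, folding each one into its predecessor when asked to.

    Only a single pending commit is buffered, so memory use does not
    grow with the length of the stream.
    """
    pending: Optional[Commit] = None
    for commit in commits:
        if pending is not None and should_fold(pending, commit):
            # Later file changes win over earlier ones for the same path.
            merged_files = dict(pending.files)
            merged_files.update(commit.files)
            pending = Commit(pending.ref, pending.message, merged_files)
        else:
            if pending is not None:
                yield pending
            pending = commit
    if pending is not None:
        yield pending
```

Combining non-adjacent commits does not fit this shape: the one-commit
buffer would have to become an unbounded lookahead window, which is
exactly the memory problem described above.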

#3 might be slightly easier for me, since I already know hg-fastimport
better than I know cvs2git.  But it seems like a kludge: why implement
a complicated workaround for an upstream quirk when upstream is an
actively maintained open source project?

Now, here's where things get tricky.  No matter what I do, I need some
custom processing between cvs2git and our final Hg repository.  For
example, I want to parse our CVS commit messages and turn forward CVS
merges (from branch n to n+1) into Mercurial merges.  And I also want
to parse out our bug IDs and put them into the database table that
we're going to use to associate Mercurial changesets with bug IDs.
And maybe more.  These are company-specific policies that have no
place in either cvs2git or hg-fastimport.
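
For a flavour of what that custom processing might look like, here is
a small sketch of the message-parsing step.  The message conventions
("bug 1234", "merged from <branch>") are invented for the example;
the real patterns depend entirely on local commit-message policy.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed conventions, purely illustrative:
#   "bug 1234" or "bug #1234"   -> a bug ID
#   "merged from release-3"     -> a forward merge from that branch
BUG_RE = re.compile(r"\bbug\s*#?(\d+)\b", re.IGNORECASE)
MERGE_RE = re.compile(r"\bmerged?\s+from\s+([\w-]+)", re.IGNORECASE)


class MessageInfo(NamedTuple):
    bug_ids: List[int]          # every bug ID mentioned, in order
    merged_from: Optional[str]  # source branch of a merge, if any


def parse_message(message: str) -> MessageInfo:
    """Extract bug IDs and an optional merge source from a commit message."""
    bug_ids = [int(number) for number in BUG_RE.findall(message)]
    merge = MERGE_RE.search(message)
    return MessageInfo(bug_ids, merge.group(1) if merge else None)
```

The bug IDs would feed the changeset-to-bug table, and a non-empty
merged_from would tell the converter to record a second parent instead
of a plain commit.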

So my desire to do custom post-processing on the fastimport dump, and
the difficulty of doing some of those steps in a stream-ish way, got
me thinking.  Maybe I should convert the dump to a hash file (e.g.
Berkeley DB) so I can do random-access stuff, delete/combine commits,
etc.  So I started looking into cvs2git's output code, and it looks
like that's doable by subclassing GitOutputOption.
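
As a rough illustration of the hash-file idea (independent of
cvs2git's actual output classes, which are not shown here), Python's
standard shelve module gives a dbm-backed, random-access store.  The
record layout, the mark-style keys and both helper names below are
assumptions made up for the sketch.

```python
import shelve
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: {"message": str, "files": {path: blob_mark}}
CommitRecord = Dict[str, object]


def store_commits(path: str, commits: Iterable[Tuple[str, CommitRecord]]) -> None:
    """Write (mark, record) pairs to an on-disk shelf keyed by mark."""
    with shelve.open(path) as db:
        for mark, record in commits:
            db[mark] = record


def combine(path: str, keep: str, drop: str) -> None:
    """Fold the file changes of commit `drop` into `keep`, then delete `drop`.

    The two commits need not be adjacent in the original stream, which
    is what random access buys over a purely streaming filter.
    """
    with shelve.open(path) as db:
        merged = dict(db[keep])
        files = dict(merged["files"])
        files.update(db[drop]["files"])
        merged["files"] = files
        db[keep] = merged  # reassign so the shelf persists the change
        del db[drop]
```

Only the commits being touched are loaded at any one time, so memory
use stays flat regardless of how large the dump is.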

But wait... there are already two perfectly good ways to save a
fastimport dump to a random-access binary format: a git repo and an hg
repo.  Duh.  I don't really want to learn enough about git internals
to use its repo as an intermediate form.  But cvs2git + hg-fastimport
right now produce a pathological Mercurial repo.  And hg-fastimport
takes ages and vast amounts of memory to do it.

So now I'm thinking, to heck with fastimport.  Just write a new
backend ("output option") for cvs2svn that directly populates a
Mercurial repo.  Then I can use existing Mercurial tools and APIs to
turn that intermediate repo into my final product.

The benefits of this are fairly obvious:
  * no more awkward 2-step conversions for cvs->hg
  * conversion should run faster (and use less memory)

The drawbacks are more subtle:
  * not using hg's convert extension means my proposed cvs2hg would not
    benefit from one key feature of it, namely the toposort that produces
    a space-optimal hg repo (OTOH, my hg-writing backend would certainly
    depend on Mercurial's API, so I could hook in toposort somehow)
  * cvs2svn maintainers now have to worry about maintaining 3 personalities
    (OTOH, they already have a not-very-functional cvs2hg script + sample
    config in their source tree)
  * hg-fastimport might go back to being a neglected and unloved extension
    if I concentrate on cvs2hg

Sorry for the long email.  I tend to ramble when I'm thinking out
loud.  Curious to know what your reactions are.

Greg