Turning cvs2{svn,git} into cvs2hg

Michael Haggerty mhagger at alum.mit.edu
Fri Jul 31 17:16:40 CDT 2009


Greg Ward wrote:
> On Sun, Jul 19, 2009 at 2:41 PM, Michael Haggerty<mhagger at alum.mit.edu> wrote:
>> Now, the most minimal, unambitious, sine qua non requirement for a
>> cvs2hg conversion tool is that the results of checking out a tag or the
>> tip of a branch in CVS and Mercurial should be identical.
> 
> Agreed.  Which is why I'm surprised that I haven't stumbled across a
> tool that will validate a CVS->{svn,git,hg} conversion for me by
> comparing checkouts.  Does everyone who hacks on this stuff just write
> their own?  (I have, but I didn't really want to. And I don't like it.)

You are quite right that cvs2git has a burning need for such a
verification tool.  I built some (very!) minimal tools for verifying
CVS -> git conversions and contributed them to the git project for
their tests of git-cvsimport.  They can be found in the git "next"
branch under t/lib-cvs.sh and t/t96*.

It would be great to have a generic basic "sine qua non" verifier that
could be applied to any of the outputs.  Even if it is less
comprehensive than our test suite, it would give a lot of confidence
about conversions.  Or, alternatively, it would reveal a lot of bugs
that we could fix :-)  The lack of such a tool has been a burden on my
conscience for quite a while :-/
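
To make the idea concrete, here is a minimal sketch of such a "sine
qua non" check.  It is not one of the existing tools; the function
names and the set of metadata directories are assumptions made for
illustration.  It compares two checked-out working trees (one from CVS,
one from the converted repository) file by file:

```python
import hashlib
import os

# Version-control metadata directories to skip (an assumed set; extend as needed).
METADATA_DIRS = {"CVS", ".hg", ".git", ".svn", ".bzr"}


def tree_digest(root):
    """Map each relative file path under root to the SHA-1 of its contents."""
    digests = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune metadata directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in METADATA_DIRS]
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha1(f.read()).hexdigest()
            digests[os.path.relpath(path, root)] = digest
    return digests


def compare_checkouts(cvs_root, other_root):
    """Return (only_in_cvs, only_in_other, differing) as sorted path lists."""
    a, b = tree_digest(cvs_root), tree_digest(other_root)
    only_in_cvs = sorted(set(a) - set(b))
    only_in_other = sorted(set(b) - set(a))
    differing = sorted(p for p in set(a) & set(b) if a[p] != b[p])
    return only_in_cvs, only_in_other, differing
```

Three empty lists would mean the two checkouts are identical.  A real
tool would also have to decide how to treat files that exist only on
one side for legitimate reasons (for example .hgtags in a Mercurial
working copy) and differences caused by keyword expansion.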

>> [...]
>> Therefore,
>> regardless of the tool, we need a way to represent all of the above
>> situations in Mercurial.  Decide that and the rest is a simple matter of
>> programming.
> 
> Not *that* simple.  Of course, I'm still finding my way around the
> internals of cvs2svn.  Thank you for the extensive and copious
> docstrings!
> 
> Anyways, my basic technique for fixup commits is:
> 
>   * pick a source revision (using the logic already
>     implemented in GitOutputOption._get_source_groups())
>   * use that as the first parent of the fixup commit
>   * in the fixup, explicitly delete any files that do not have the tag /
>     are not on the branch
>   * pick and choose file revisions to match the tagged/branched
>     revisions

How does this differ from what we currently do?  I see:

1. Don't clear the "first parent" tree and start from scratch; rather,
start from the "first parent" contents and fix it up.  (This was not
done for git because the current approach is a little bit easier and the
git importer didn't care one way or the other.)

2. Don't consider the sources of other file revisions that have to be
added to the branch as merge parents; instead just copy the file
revisions into the branch without adding another parent.

Given the data structures available in cvs2svn, both of these should be
quite doable, I would hope.
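
As a rough illustration of point 1, a fixup that starts from the "first
parent" contents boils down to a tree diff.  The sketch below is not
cvs2svn code; representing a tree as a dict from path to CVS revision
number is an assumption made purely for the example:

```python
def plan_fixup(parent_tree, target_tree):
    """Compute the changes that turn parent_tree into target_tree.

    Both arguments map file path -> revision identifier (a hypothetical
    representation).  Returns (deletes, updates): the paths to remove and
    the path -> revision entries to add or change.
    """
    # Files in the parent that do not carry the tag / are not on the branch.
    deletes = sorted(p for p in parent_tree if p not in target_tree)
    # Files that are missing from the parent or present at the wrong revision.
    updates = {
        p: rev for p, rev in target_tree.items() if parent_tree.get(p) != rev
    }
    return deletes, updates


parent = {"a.c": "1.3", "b.c": "1.2", "c.c": "1.1"}
tagged = {"a.c": "1.3", "b.c": "1.1", "d.c": "1.4"}
deletes, updates = plan_fixup(parent, tagged)
print(deletes)  # ['c.c']
print(updates)  # {'b.c': '1.1', 'd.c': '1.4'}
```

Nothing here records where the updated revisions came from, which is
point 2: the fixup commit has a single parent.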

>> Another issue that has not been resolved satisfactorily is what to
>> record in the DAG for CVS branches that do not start as 1:1 copies of a
>> source branch.  Should a source branch be chosen to be the parent of the
>> new branch anyway?
> 
> I use the same trick for tag fixups and branch fixups.  (After all,
> branches are just weird tags in CVS.  And Subversion tags are the same
> as branches.)  Right now, I'm arbitrarily picking one source revision as
> the branch parent.  Would be nice if I could select a "best fit" source
> revision rather than an arbitrary one.

If you are using the first value returned by
GitOutputOption._get_source_groups() as the "arbitrary" parent, then you
are already using the "best fit" in the sense that it is the one
requiring the fewest add/change fixups.  (IIRC, the code is a little
weak about optimizing the number of deletes for purely technical reasons.)
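
For illustration only, "best fit" in that sense could be sketched as
follows.  This is not the code in _get_source_groups(); the tree
representation (path -> revision) and the function names are
hypothetical, and using the number of deletes as a tie-breaker is an
assumption of the sketch:

```python
def fixup_cost(source_tree, target_tree):
    """Return (add/change count, delete count) for using source_tree as parent."""
    changes = sum(
        1 for p, rev in target_tree.items() if source_tree.get(p) != rev
    )
    deletes = sum(1 for p in source_tree if p not in target_tree)
    return (changes, deletes)


def best_fit_source(candidates, target_tree):
    """Pick the candidate name whose tree needs the fewest add/change fixups.

    candidates maps a source name to its tree.  Tuples compare
    element-wise, so deletes only break ties between equal change counts.
    """
    return min(candidates, key=lambda name: fixup_cost(candidates[name], target_tree))


candidates = {
    "trunk": {"a.c": "1.3", "b.c": "1.2"},
    "BRANCH_A": {"a.c": "1.3", "b.c": "1.1.2.1", "c.c": "1.1"},
}
target = {"a.c": "1.3", "b.c": "1.1.2.1"}
print(best_fit_source(candidates, target))  # BRANCH_A
```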

>> And what should happen to the DAG if files are added from a source
>> branch to an existing branch?  Should the commit be considered a merge
>> with the new source branch as the second parent, even though the source
>> branch might have other content that wasn't merged over?
> 
> I don't think so.  It's not really a merge; it's just copying content
> from another branch.

Well, it's not *necessarily* a merge.  But if the user had done a real
merge in CVS, it would also look like this.

But I agree with you that it is presumptuous of cvs2svn to declare that
such events are all merges.  For git, I think that your choices are the
best compromise, with the addition that it would be friendly for cvs2git
to output an extra "grafts" file containing all potential merges.  This
file would be a good starting point for users who want to reconstruct
merges by hand or using other criteria (for example, maybe by matching
log messages against a regexp).
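
A git grafts file (.git/info/grafts) has one line per commit: the
commit id followed by the ids of the parents it should be treated as
having, separated by spaces.  A minimal sketch of producing such lines
from a list of potential merges, optionally filtered by a log-message
regexp, might look like this.  The input structure and the function
name are hypothetical, and the short ids are placeholders for full
commit ids:

```python
import re


def graft_lines(potential_merges, log_pattern=None):
    """Build grafts-file lines for potential merges.

    potential_merges is a list of (commit_id, parent_ids, log_message)
    tuples (a hypothetical structure).  If log_pattern is given, only
    commits whose log message matches it are included.
    """
    regex = re.compile(log_pattern, re.IGNORECASE) if log_pattern else None
    lines = []
    for commit_id, parent_ids, log_message in potential_merges:
        if regex is None or regex.search(log_message):
            # Grafts format: "<commit> <parent1> <parent2> ..."
            lines.append(" ".join([commit_id] + list(parent_ids)))
    return lines


potential = [
    ("c3", ["c2", "b7"], "Merge from BRANCH_B"),
    ("c5", ["c4", "b9"], "Add new files"),
]
print(graft_lines(potential, r"\bmerge"))  # ['c3 c2 b7']
```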

> Anyway: IMHO the default behaviour should be "checkout parity"
> i.e. checking out the same tag/branch in CVS and Mercurial should give
> the same tree.  That's what I'm working to implement now.

I agree that this should be the default.

> Then I'm
> going to work on options to selectively discard information to make a
> prettier history.  (E.g. "if a fixup commit only deletes files because
> this was just a CVS partial tag, don't bother with the fixup: let the
> Mercurial tag be a superset of the CVS tag".)

Yes, this sounds like a good idea that would cover many cases.  I sure
hope that you implement this in a way that it can be used for git and
bzr as well as hg...

>> It's premature to think about technical solutions before the conceptual
>> decisions have been made.  But all else being equal, I think that having
>> a lingua franca for DVCSs would be a big advantage, and git-fast-import
>> format is the only obvious contender right now.
> 
> I still think fastimport is a very good thing for git<->bzr<->hg
> conversions.  I just think that adding an extra layer between CVS and
> Mercurial makes things harder.  I can make a better Mercurial repository
> if I have direct access to cvs2svn's internal data structures, since
> they tell me more about the original CVS repo than a fastimport dump
> ever would.

The question is not whether the *current* cvs2git output in
git-fast-import format is optimal for a cvs2hg conversion.  The question
is whether *it is possible* to represent an optimal cvs2hg conversion
within git-fast-import format and whether that intermediate format
imposes too high an overhead on either cvs2hg or hg-fastimport.  I still
don't see any reason that the changes you want cannot be represented in
git-fast-import format.  (Do you?)  Nor do I see that you plan any
changes that wouldn't also be beneficial to cvs2git.  (I don't know
enough about bzr to speak about that.)  Therefore, I still think that it
would be very advantageous to continue using the git-fast-import format
as the lingua franca.  And I sure hope that you plan to submit patches
back to the cvs2svn project.
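
As an illustration that such a fixup fits the format, here is a rough
sketch that emits a single-parent fixup commit as a git-fast-import
"commit" command with "D" (delete) and "M" (modify) file commands.  It
is not cvs2git's output code; the function name and arguments are
hypothetical, blobs are assumed to have been emitted earlier under the
given marks, and paths are assumed to need no quoting:

```python
def fixup_commit(ref, mark, parent_mark, committer, when, message,
                 deletes, updates):
    """Return a fast-import 'commit' command for a single-parent fixup.

    deletes is a list of paths to remove; updates maps path -> mark of a
    previously emitted blob.  All files are written with mode 100644.
    """
    lines = [
        "commit %s" % ref,
        "mark :%d" % mark,
        "committer %s %d +0000" % (committer, when),
        # 'data' takes the byte length of the commit message that follows.
        "data %d" % len(message.encode("utf-8")),
        message,
        # A single 'from' and no 'merge' line: exactly one parent.
        "from :%d" % parent_mark,
    ]
    lines += ["D %s" % path for path in deletes]
    lines += [
        "M 100644 :%d %s" % (blob_mark, path)
        for path, blob_mark in sorted(updates.items())
    ]
    return "\n".join(lines) + "\n"


print(fixup_commit("refs/heads/BRANCH_1", 5, 3,
                   "cvs2hg <cvs2hg@example.com>", 1249078600,
                   "fixup for BRANCH_1", ["old.c"], {"new.c": 4}))
```

A potential merge would differ only by an additional "merge :N" line
after "from", so the choice between the two policies does not depend on
the format.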

By the way, there is also a possible problem related to the license of
cvs2svn vs. those of the DVCSs.  IANAL but it seems to me that the
CollabNet license is *not* GPLv2-compatible.  Therefore it might not
even be possible to create a tight integration of cvs2svn and hg code.
It might be possible to get CollabNet to change the license that is
applied to cvs2svn (for example, I would certainly support a change to
GPLv2), but those discussions haven't even been started.

Michael


