[PATCH] convert: make convert from cvs recognise branch points correctly [revised, 3]

Michael Haggerty mhagger at alum.mit.edu
Tue Apr 21 08:43:13 CDT 2009

Patrick Mézard wrote:
> Henrik Stuart wrote:
>> # HG changeset patch
>> # User Henrik Stuart <henrik.stuart at edlund.dk>
>> # Date 1239706830 -7200
>> # Node ID f8fd162954633e67b27605b2541eb46e155d4780
>> # Parent  aece3c9e62f1a685e4cebd3ecf0b22e1c8101ae1
>> convert: make convert from cvs recognise branch points correctly
>> This patch fixes issue 1447 by using the symbolic names for branches
>> in the rlog output to determine the branch point for each file and
>> then finding the latest possible branch point in the given changesets
>> such that all the file revisions match and the date is earlier than
>> the branch commit. For commits on non-branches, the current
>> functionality is maintained.
>> Co-contributor: Greg Ward <greg-hg at gerg.ca>
> [snip]
> My opinion for today (or for this morning, let's not be too optimistic)
> is that the patch looks correct but relies on too many side effects and
> implicit assumptions. To answer my own question about the reuse of "e" in:
> +                        if [e.file for e in qcs.entries if 
> +                                e.file in seen_branched_files]:
> +                            break # we cannot progress past this place
> +
> +                        branchesto = getattr(e, 'branchesto', None)
> I think the behaviour is correct because:
> - changesets always have at least one log entry
> - you make the implicit assumption later that the "branchesto" values of
> all log entries of a given changeset are equal. I think we should expect
> this; it would fail only because of some issue with the changeset per-date
> clustering code, or because of the possibility of branching a subset of
> the working directory. I don't know if we should be concerned with the
> latter or not, you tell me.

I'm not so familiar with the cvsps.py code, but I have been working on
cvs2svn/cvs2git/cvs2hg [1] for years so I know a lot of things that can
go wrong with a CVS conversion.

Date clustering doesn't work robustly because CVS fundamentally has a
file-by-file view of the world.  It is quite common to see things like
the following:


a.txt                             b.txt
                                   1.2 <- BRANCH1
 1.2                                |
  |                                 |
 1.1 <- BRANCH1                     |

Note that a.txt:1.1 and b.txt:1.2 are both in BRANCH1, even though they
never coexisted on the main line of development (a.txt:1.1 was
overwritten by a.txt:1.2 before b.txt:1.2 was created).
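
A minimal sketch of that check, assuming each trunk revision is "live" from
its own commit time until the next revision of the same file is committed.
The commit times, the b.txt:1.1 entry, and the helper names are hypothetical
and only illustrate the diagram above; this is not code from cvsps.py or
cvs2svn.

```python
import math

# Hypothetical trunk histories: (revision, commit time), oldest first.
trunk = {
    "a.txt": [("1.1", 10), ("1.2", 20)],
    "b.txt": [("1.1", 5), ("1.2", 30)],
}


def live_interval(revisions, rev):
    """Return (start, end) of the period when `rev` was the trunk head."""
    for index, (name, committed) in enumerate(revisions):
        if name == rev:
            if index + 1 < len(revisions):
                return committed, revisions[index + 1][1]
            return committed, math.inf
    raise KeyError(rev)


def coexisted(trunk, branch_points):
    """True if all branch-point revisions were trunk heads at the same time."""
    intervals = [live_interval(trunk[path], rev)
                 for path, rev in branch_points.items()]
    latest_start = max(start for start, _ in intervals)
    earliest_end = min(end for _, end in intervals)
    return latest_start < earliest_end


# BRANCH1 as drawn above: a.txt:1.1 together with b.txt:1.2.
branch1 = {"a.txt": "1.1", "b.txt": "1.2"}
print(coexisted(trunk, branch1))
```

With these assumed times the two live intervals do not overlap, so no single
trunk changeset can serve as the branch point for BRANCH1.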


a.txt                             b.txt
 1.2 log message "foo"
  |                                1.2 log message "bar"
  |                                 |
 1.1 log message "bar"              |
                                   1.1 log message "foo"

Note that this often happens within the small window of time used to
collect file commits into changesets.  There is no way to topologically
sort the two changesets without breaking up either the "foo" or the
"bar" changeset.
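
The cycle can be shown with a small sketch, assuming changesets are formed
naively by grouping file revisions with the same log message. The data
mirrors the diagram above; the function names are hypothetical, and
graphlib needs Python 3.9 or later.

```python
from graphlib import CycleError, TopologicalSorter

# Per-file revisions, oldest first, with the log message of each commit.
file_logs = {
    "a.txt": [("1.1", "bar"), ("1.2", "foo")],
    "b.txt": [("1.1", "foo"), ("1.2", "bar")],
}


def changeset_dependencies(file_logs):
    """Map each naive changeset (keyed by log message) to its predecessors."""
    deps = {}
    for revisions in file_logs.values():
        for _, message in revisions:
            deps.setdefault(message, set())
        for (_, older), (_, newer) in zip(revisions, revisions[1:]):
            if older != newer:
                deps[newer].add(older)
    return deps


def has_cycle(deps):
    """True if the changesets cannot be topologically sorted."""
    try:
        tuple(TopologicalSorter(deps).static_order())
    except CycleError:
        return True
    return False


deps = changeset_dependencies(file_logs)
print(deps, has_cycle(deps))
```

Here "foo" must come after "bar" because of a.txt, and "bar" must come after
"foo" because of b.txt, so the sort fails.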

a.txt                             b.txt

                   BRANCH2                           BRANCH1
                      |                                 |
         BRANCH1      |                    BRANCH2      |          
            |                                 |
 1.1--------/                      1.1--------/

Note that BRANCH1 and BRANCH2 have different topologies in the two
files, so there is no way to choose which one "branches off" of the other.
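
A sketch of how such a conflict could be detected, assuming the diagram is
read as follows: in a.txt BRANCH2 sprouts from BRANCH1, and in b.txt BRANCH1
sprouts from BRANCH2. The "trunk" label and the function name are
hypothetical.

```python
# For each file, the line of development that each branch sprouts from.
branch_parents = {
    "a.txt": {"BRANCH1": "trunk", "BRANCH2": "BRANCH1"},
    "b.txt": {"BRANCH2": "trunk", "BRANCH1": "BRANCH2"},
}


def conflicting_branches(branch_parents):
    """Return the branches whose parent differs from one file to another."""
    seen = {}
    for parents in branch_parents.values():
        for branch, parent in parents.items():
            seen.setdefault(branch, set()).add(parent)
    return {branch: parents for branch, parents in seen.items()
            if len(parents) > 1}


conflicts = conflicting_branches(branch_parents)
for branch in sorted(conflicts):
    print(branch, "could branch from any of", sorted(conflicts[branch]))
```

Both branches end up with two candidate parents, so a converter has to pick
one by some heuristic rather than read it off the repository.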


It is also very common for CVS repositories to have incorrect
timestamps, even timestamps that are out-of-order within a single file.
This can be caused, for example, by CVS *clients'* clocks being
inaccurate or (as in the case of GCC's repository) because the server's
CMOS battery was dead and its clock reset to 1970 after each reboot.
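
A minimal sketch of a check for out-of-order timestamps within one file,
assuming the revisions are listed in revision order with Unix timestamps.
The sample data and the function name are hypothetical.

```python
def out_of_order(revisions):
    """Return (older, newer) revision pairs where the newer one has an
    earlier timestamp than its predecessor.

    `revisions` is a list of (revision, timestamp) in revision order.
    """
    return [
        (older, newer)
        for (older, older_time), (newer, newer_time)
        in zip(revisions, revisions[1:])
        if newer_time < older_time
    ]


# Hypothetical file history; 1.2 was committed while the clock read 1970.
history = [("1.1", 1000000000), ("1.2", 0), ("1.3", 1000000500)]
print(out_of_order(history))
```

A converter that orders commits purely by date would place 1.2 before 1.1
here, which contradicts the file's own revision order.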

These are just a small sample of the kind of problems that are *often*
seen in CVS repositories, except of course that even these simple types
of pathologies often involve hundreds of files.

> What still worries me is that we scan the [p, i) revision index range
> without checking the revision branches. I think the code mostly works by
> relying on strong implicit graph properties about the revision ordering.
> Here is a case I failed to test because my cvs-fu is too weak. I tried to
> make the parent selection code pick up the wrong branch by playing on the
> updated file dependencies (nodes are changesets, annotated with changed
> file versions):
> . b,
> |
> |
> |                   . c,1.4
> |                   |
> |                   |
> \_____. a,   |
>       |             |
>       \_____________.
>                     |
>                     |
>                     . a,1.3
>                     |
>                     |
>                     . b,1.3
>                     |
>                     |
>                     . c,1.3
>                     |             
> Why should the revision containing b, pick the revision
> containing a, as parent instead of the one containing c,1.4
> (assuming the initial wrong parent selected is the revision containing
> b,1.3)?

There is no way to know without more information.  It is also possible
that a and/or c were never even added to the branch.

I encourage you to read the cvs2svn design documentation [1] and
features list [2] to see what we finally had to do to get a conversion
that is consistent with the CVS history.  The key is that the original,
naive changesets (i.e., equivalent to cvsps's *final* changesets :-) )
often have to be broken up in order to break cycles in the dependency
graph and allow the topological sort to succeed.  Otherwise the
conversion will be *objectively wrong* in the sense that the content
checked out of hg simply doesn't agree with what is checked out of CVS.
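
The idea of breaking up a changeset can be sketched as follows. This is only
an illustration under simplifying assumptions, not cvs2svn's actual
algorithm: revisions are reduced to integers (1.1 becomes 1, 1.2 becomes 2),
the changeset to split is chosen by hand, and it is split into one piece per
file. All names are hypothetical, and graphlib needs Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Naive changesets from the "foo"/"bar" example: (file, revision number).
changesets = {
    "foo": [("a.txt", 2), ("b.txt", 1)],
    "bar": [("a.txt", 1), ("b.txt", 2)],
}


def dependencies(changesets):
    """A changeset depends on whichever changeset holds the previous
    revision of any file it touches."""
    owner = {item: name
             for name, items in changesets.items() for item in items}
    deps = {name: set() for name in changesets}
    for (path, number), name in owner.items():
        previous = owner.get((path, number - 1))
        if previous is not None and previous != name:
            deps[name].add(previous)
    return deps


def split(changesets, name):
    """Replace one changeset with one single-file changeset per entry."""
    result = {key: items for key, items in changesets.items() if key != name}
    for path, number in changesets[name]:
        result[f"{name}/{path}"] = [(path, number)]
    return result


broken_up = split(changesets, "foo")
order = list(TopologicalSorter(dependencies(broken_up)).static_order())
print(order)
```

After the split, "foo" is committed in two pieces around "bar", and every
file's revisions appear in their original order.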

I would also like to remind Mercurial users that cvs2svn [3] has a
"cvs2hg" mode (please use the current SVN version) that outputs data
that can be loaded by Mercurial's fastimport extension.  I could use a
little help from the Mercurial community to make sure that the
conversion is "idiomatic" for the hg world.  But the reconstruction of
the CVS history is very robust, as it shares (literally, I just measured
it) 97% of its code with cvs2svn.  And it's even written in Python :-)

If it would help convince people to help develop and use cvs2hg instead
of working on Mercurial's CVS convert extension, I am confident that I
can quickly cook up some test cases that cause the hg convert extension
to break in ways that are nearly impossible to fix.  (I recently "helped
out" the git project this way [4,5].)


[2] http://cvs2svn.tigris.org/features.html
[3] http://cvs2svn.tigris.org/
[4] http://marc.info/?l=git&m=123536574807253 etc.
[5] http://marc.info/?l=git&m=123761411215255
