[PATCH] Built-in cvsps for hg cvsimport

Sat Apr 5 00:22:55 CDT 2008

Matt Mackall wrote:
> On Fri, 2008-04-04 at 12:12 +0200, Michael Haggerty wrote:
>> Frank Kingswood wrote:
>>> The built-in cvsps uses cvs rlog on the repository (it does not do
>>> direct cvs server calls, it runs the cvs executable), sorts the commit
>>> log messages, and merges commits with identical messages, author and
>>> branch name and a date within the 60-second fuzz window.
>>>
>>> This builtin cvsps code has been found to work successfully in cases
>>> where the traditional external cvsps program generates incorrect
>>> changesets.
>> Please just be aware that the algorithm that you describe is still not
>> nearly robust enough to handle nontrivial CVS repositories.  There are
>> lots of strange things that can happen in a CVS repository history that
>> confuse cvsps and will confuse your algorithm as well.
>>
>> Trying to make a cvsps-like algorithm more robust is the way to madness,
>> because the fundamental algorithm is far too naive.  I learned that
>> lesson very painfully because that was the approach of cvs2svn 1.x,
>> before I completely rewrote the changeset-deduction code for cvs2svn 2.x.
> 
> We all know that cvsps is crap. But it does make a lot of tasks (like
> writing a custom incremental converter) much easier. If there were a
> drop-in replacement for cvsps that did a better job, the world would be
> a better place and the old cvsps would die overnight (hint hint).

I considered that [1], but the output format of cvsps is also too naive.
For example, it has no place to put the additional information that
would be needed to reconstruct branches and tags with the correct
contents.  It's like George Orwell said: limiting the language that
cvsps speaks limits the complexity of its thoughts.  And since the
cvsps-based importers understand the same limited language, they have no
idea of the seditious things that can happen in a real-life CVS repository.

> Frank's tool is certainly a step in that direction as it's at least an
> order of magnitude less code than the original. So it'll probably be an
> order of magnitude less work to get it to do the right thing.

My point wasn't to criticize Frank's embedded cvsps, which I assume is
an improvement on the original cvsps.  I only wanted to warn against
investing too much effort in an approach that is IMHO doomed to lose
data in many common circumstances.  And the lesson from cvs2svn 1.x is
that cvsps's algorithm cannot be incrementally improved into a robust
tool; a new approach is needed.
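
For concreteness, the grouping heuristic Frank describes (same message,
author and branch, dates within the fuzz window) can be sketched roughly
as below.  This is only an illustration, not Frank's actual code: the
Rev record, the group_revisions name, integer-second dates, and the
choice to measure the window against the previous revision in the group
are all my assumptions.

```python
from collections import namedtuple

# Hypothetical record for one file revision as parsed from `cvs rlog`
# output; the field names are assumptions made for this sketch.
Rev = namedtuple("Rev", "author branch message date path")

FUZZ = 60  # seconds; the fuzz window mentioned in the description


def group_revisions(revs, fuzz=FUZZ):
    """Merge file revisions into changesets, cvsps-style.

    Revisions are merged when author, branch and log message are
    identical and each date is within `fuzz` seconds of the previous
    revision in the same group (an assumption; the window could also
    be measured from the first revision of the group).
    """
    # Sort so that candidates for merging are adjacent and date-ordered.
    ordered = sorted(
        revs, key=lambda r: (r.author, r.branch, r.message, r.date)
    )
    changesets = []
    for rev in ordered:
        prev = changesets[-1][-1] if changesets else None
        if (prev is not None
                and prev.author == rev.author
                and prev.branch == rev.branch
                and prev.message == rev.message
                and rev.date - prev.date <= fuzz):
            changesets[-1].append(rev)
        else:
            changesets.append([rev])
    # Present the changesets in chronological order of their first revision.
    changesets.sort(key=lambda cs: cs[0].date)
    return changesets
```

Note that nothing in this sketch prevents two revisions of the same file
from ending up in one changeset, and it looks only at the four fields
above; it knows nothing about tags or branch points.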

OTOH, I understand the demand for incremental conversion and the
willingness of people to sacrifice even accuracy to get it.  If anybody
wants to work on adding incrementalism to cvs2svn, I would be happy to
help.  One could have robustness and incrementalism at the same time and
make cvsps-based conversion tools obsolete.  It's not an easy job, but
the cvs2svn code base is clean and already uses databases to store
intermediate results (these could be the basis of the checkpointing system).

Michael

[1] http://marc.info/?l=mercurial-devel&m=120363111232106&w=2