CVS conversion improvement

Lloyd Parkes lloyd at must-have-coffee.gen.nz
Sun Oct 7 16:51:50 CDT 2012


Hi all,
I want to change the way "hg convert" handles CVS log messages, and I would like some feedback on possible ways to improve the parsing of those messages.

The root problem is that CVS log messages are free-form text, so it's not possible to parse a stream of them in the way cvsps.py tries to, because you can never guarantee that you have correctly determined where the boundaries between log messages are. By definition, any syntactic element you choose as a log message separator can also appear inside a log message. The current code does well and it beats the IT industry's holy grail of five nines by a long way, but I need perfect. I have 10GB of data in half a dozen CVS repositories, and an inconvenient number of log messages contain copies of logs that have been cut and pasted from other branches.
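To illustrate the failure mode, here is a minimal sketch in Python. It is not the actual cvsps.py logic: naive_split and the sample data are made up for illustration, and the only thing taken from CVS is the line of 28 hyphens that "cvs log" prints before each revision.

```python
# Illustrative only: a naive splitter, not the code in cvsps.py.
SEP = "-" * 28  # line of hyphens that "cvs log" prints before each revision

def naive_split(log_text):
    """Split a stream of log entries on separator lines."""
    entries, current = [], []
    for line in log_text.splitlines():
        if line == SEP:
            if current:
                entries.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if current:
        entries.append("\n".join(current))
    return entries

# Two real revisions. The message of 1.2 quotes a log entry that was
# pasted in from another branch, separator line included.
sample = "\n".join([
    SEP,
    "revision 1.2",
    "date: 2012/10/01 10:00:00;  author: alice;  state: Exp;",
    "Merge fix from branch. Original log:",
    SEP,
    "revision 1.1.2.1",
    "date: 2012/09/30 09:00:00;  author: bob;  state: Exp;",
    "Fix crash",
    SEP,
    "revision 1.1",
    "date: 2012/09/01 08:00:00;  author: alice;  state: Exp;",
    "Initial revision",
])

entries = naive_split(sample)
print(len(entries))  # the file has 2 revisions, but the splitter finds 3
```

Because the pasted text also carries its own "revision" and "date:" lines, checking the line after the separator would presumably not help in this case either.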

The only solution that I can see to this is to request each log message individually, which means we effectively use end-of-file as the boundary between log messages rather than some syntactic element. I can see two ways to implement this.

1) Change cvsps.py so that it only asks CVS for the log headers and, as it finds each revision of each file in the stream of log headers, uses a second popen to extract the log message. That's a lot of extra popens, so I thought of using the "cvs server" mechanism from cvs.py to get the log messages within cvsps.py. That made me think of a second option.

2) Don't read log messages in cvsps.py at all. Instead, read them in cvs.py when the revision itself is being read, i.e. read both types of free-form data at the same point in the conversion. We could reuse the CVS server connection and it shouldn't be much extra code. We would have to change some of the UI code though, because it currently prints out bits of the log messages as progress information. That could be changed to something like file names and revision numbers, which doesn't quite have the same human feel-good factor, but I expect it'll be fine. I don't know if the log messages are used anywhere else though. Maybe to help detect patch sets?
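The idea common to both options, using the end of the output as the message boundary, could look roughly like the sketch below. Everything here is an assumption for illustration: fetch_log_output and single_message are hypothetical names, the command line is just one way to ask for a single revision (via the -N and -r options of "cvs log"), and the output layout is simplified. In particular, the optional "branches:" header line is not handled, and the file description is assumed not to contain a separator line itself.

```python
import subprocess

SEP = "-" * 28   # line "cvs log" prints before each revision
END = "=" * 77   # line "cvs log" prints at the end of a file's log

def fetch_log_output(path, rev):
    """Hypothetical helper: ask CVS for the log of exactly one revision."""
    result = subprocess.run(
        ["cvs", "log", "-N", "-r" + rev, path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def single_message(output):
    """Return the log message from output that holds exactly one revision.

    Only the first separator is treated as syntax. Everything after the
    revision header, up to the terminator on the last line, is message
    text, so separators quoted inside the message are harmless.
    """
    lines = output.rstrip("\n").split("\n")
    start = lines.index(SEP) + 1           # first separator opens the revision
    if not lines[start].startswith("revision "):
        raise ValueError("unexpected log layout")
    start += 2                             # skip the "revision" and "date:" lines
    end = len(lines) - 1 if lines[-1] == END else len(lines)
    return "\n".join(lines[start:end])

# Made-up output for one revision whose message quotes another log entry.
sample_output = "\n".join([
    "RCS file: /cvsroot/proj/file.c,v",
    "description:",
    SEP,
    "revision 1.2",
    "date: 2012/10/01 10:00:00;  author: alice;  state: Exp;  lines: +1 -1",
    "Merge fix from branch. Original log:",
    SEP,
    "revision 1.1.2.1",
    "Fix crash",
    END,
]) + "\n"

message = single_message(sample_output)
print(message)
```

With one request per revision, the quoted separator and "revision" line stay inside the message instead of starting a new entry. The cost is one round trip per revision, which is why reusing the existing server connection looks attractive.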

What are people's opinions on these two ideas? 1) seems to involve more localised changes (and would therefore be easier for me to implement), but 2) looks as if it may perform much better.

Cheers,
Lloyd


