speed up relink script

Tue Mar 20 10:32:20 CDT 2007

On Tue, Mar 20, 2007 at 06:11:26AM -0500, TK Soh wrote:
> On 3/19/07, Brendan Cully <brendan at kublai.com> wrote:
> >If you want to speed it up, you might try searching from the back to
> >the front (differences should show up faster that way), or perhaps
> >forking off md5sum for the candidate lists and comparing by that
> >(possibly hand-checking matches for md5 collisions). I can't convince
> >myself that it's safe to assume that a match in the last chunk is
> >sufficient.
> 
> Come to think of it again, perhaps reading from back to front
> isn't going to help much either. Apart from the fact that it will be
> slow to read that way as pointed out by Bryan, most files in the repos
> are likely to stay unchanged over time. So it may actually slow down
> the comparison.
> 
> I wonder if we can somehow compare the latest chunk of the index or
> data files checked in. Or, perhaps the last rev data in the index file
> will be representative? I'm not confident enough in my understanding
> of hg's internals to decide. Any input?

Some notes:

Two revlogs are identical if their indices are the same.
Two revlogs don't match if they have different numbers of entries.
Two revlogs are identical if their heads are the same.
Two revlogs may still be identical if their sizes are different, if
their last records are different, etc.

The first observation says we can avoid ever reading .d files.

This suggests the following approach:

For each .i file in repo A:
   record size, MD5 hash, number of entries, and sorted list of heads

For each .i file in repo B:
   if sizes match and hashes match:
     relink files
   else:
     read index
     if counts don't match:
       continue
     find heads
     if heads match:
       relink files
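
A rough Python sketch of the comparison above, for illustration only.
It assumes the index has already been parsed into (node, p1, p2)
entries where p1 and p2 are revision numbers and -1 means "no parent".
The names Entry, Summary, find_heads, summarize, should_relink, and the
parse_index callable are all hypothetical and not part of hg. It also
assumes that a size match with a hash mismatch should fall through to
the head comparison.

```python
import hashlib
from typing import Callable, List, NamedTuple, Sequence, Tuple

NULLREV = -1  # assumed marker for "no parent"


class Entry(NamedTuple):
    node: bytes  # node id of this revision
    p1: int      # revision number of the first parent, or NULLREV
    p2: int      # revision number of the second parent, or NULLREV


class Summary(NamedTuple):
    size: int
    md5: str
    count: int
    heads: Tuple[bytes, ...]


def find_heads(entries: Sequence[Entry]) -> List[bytes]:
    """Return the sorted node ids of revisions that have no children."""
    has_child = set()
    for e in entries:
        for p in (e.p1, e.p2):
            if p != NULLREV:
                has_child.add(p)
    return sorted(e.node for rev, e in enumerate(entries)
                  if rev not in has_child)


def summarize(raw: bytes, entries: Sequence[Entry]) -> Summary:
    """Record size, MD5 hash, entry count, and sorted heads for one .i file."""
    return Summary(len(raw), hashlib.md5(raw).hexdigest(),
                   len(entries), tuple(find_heads(entries)))


def should_relink(a: Summary, raw_b: bytes,
                  parse_index: Callable[[bytes], Sequence[Entry]]) -> bool:
    """Decide whether repo B's index matches the recorded summary from repo A."""
    # Cheap check first: same size and same hash means byte-identical indices.
    if len(raw_b) == a.size and hashlib.md5(raw_b).hexdigest() == a.md5:
        return True
    # Otherwise parse B's index and compare entry counts, then heads.
    entries_b = parse_index(raw_b)
    if len(entries_b) != a.count:
        return False
    return tuple(find_heads(entries_b)) == a.heads
```

The index for repo B is only parsed when the size and hash check
fails, so byte-identical files never pay for the head computation.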

-- 
Mathematics is the supreme nostalgia of our time.