[PATCH 0 of 5] RFC: Implement O(n + m) algorithm for findrenames()

Sun Mar 7 18:19:12 CST 2010

On Sun, Mar 07, 2010 at 06:01:06PM +0100, Mads Kiilerich wrote:
> David Greenaway wrote, On 03/07/2010 05:12 AM:
> >I maintain a mercurial repository that tracks my company's local changes
> >to the Linux sources. Attempting to upgrade from Linux 2.6.24 to the
> >latest mainline, I tried the following:
> >
> >     hg update -C linux-2.6.24-mainline
> >     rm -rf *
> >     tar -xzvf ~/linux-2.6.33.tar.gz
> >     hg addremove -s75
> 
> So all the files are "renamed" from linux-2.6.24/* to
> linux-2.6.33/*? That makes it a worst case (which indeed should be
> handled better), but it is also a case which Mercurial won't handle
> efficient in the storage anyway (so far), so it would be better to
> avoid such massive renamings.

No, that is just me being sloppy and not testing my commands before
posting, sorry. Files are extracted in the sensible way, without the
version number part of the path component.

A more precise testcase is:

    wget 'http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.24.tar.bz2'
    wget 'http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.33.tar.bz2'
    mkdir repository; cd repository
    hg init
    tar --strip-components=1 -xjvf ../linux-2.6.24.tar.bz2
    hg addremove
    hg commit -m A
    rm -rf *
    tar --strip-components=1 -xjvf ../linux-2.6.33.tar.bz2

Followed by:

    hg --time addremove --dry-run -s75

The "--dry-run" allows multiple benchmark runs and also removes the time
taken to record the moves in the revlog (which is independent of the
similarity algorithm).

> How many files are unchanged and indentical in 2.6.24 and 2.6.33?
> The linux code churn is impressive, but I think that very often in
> similar cases many of the files will be moved/renamed but not
> modified.

Linux 2.6.33 has 31565 files.

Of those:

    16400 were added since 2.6.24;
    7897 were removed from 2.6.24.

Of the files added and removed, patched mercurial detects:

    1609 being identical to a file in 2.6.24;
    3407 having similarity >= 75% to a file in 2.6.24.

> Would it make sense to make a first cheap scan only
> looking for 100% match? Perhaps by using the cheap "len" as a coarse
> hash value? How much improvement will this give in your example?

Patch 4 in the series attempts to  do this using SHA1 hashes.

Cheers,
David