[PATCH 0 of 5] RFC: Implement O(n + m) algorithm for findrenames()

David Greenaway hg-dev at davidgreenaway.com
Sun Mar 7 04:12:47 UTC 2010


Hi all,

I maintain a mercurial repository that tracks my company's local changes
to the Linux sources. Attempting to upgrade from Linux 2.6.24 to the
latest mainline, I tried the following:

    hg update -C linux-2.6.24-mainline
    rm -rf *
    tar -xzvf ~/linux-2.6.33.tar.gz
    hg addremove -s75

After leaving that last hg command running for twelve hours on my
2.4GHz Core 2 Duo, I decided that it was probably time to give up.

The problem is that the algorithm used in findrenames() requires
approximately O(m * n) file comparisons, making it infeasible to use
once 'm' and 'n' become large.

This series of patches changes the findrenames() algorithm so that it
uses something closer to O(m + n) file comparisons. With all patches
applied, the above sequence of commands completes on my wimpy 1.2GHz
Atom box in less than 10 minutes.

The algorithm used (inspired by Andrew Tridgell's "rsync" algorithm)
splits the source files into k-byte blocks, and places a 32-bit hash of
each block into a dictionary. We then see if any of the target files
contain any matching hashes from the source files. The more blocks
a particular target file has in common with a particular source file,
the more similar those files are considered. Full details of the
algorithm are below.

There are a few caveats:

    * These patches redefine the meaning of "similarity" between two
      files. Instead of being based on the length of the longest common
      subsequence of the two files, it is now based on the number of
      blocks the two files have in common (see the sketch after this
      list).

      The former will claim a file with two large functions is 50%
      similar to a file with those two large functions in reverse order,
      while the latter will claim it is close to 100% similar.

      The former will claim a shuffled dictionary is 50% similar to an
      un-shuffled dictionary, while the latter will claim it is 0%
      similar (unless the words are quite large).

      The former tends to work better with small files, while the latter
      tends to work better with binary files.

    * The new algorithm can use large amounts of RAM in the worst case.
      I suspect this will only tend to occur when the former algorithm
      would have taken too long anyway; it effectively changes waiting
      forever into swap death.

    * The amount of code used for the new algorithm is significantly
      larger than that of the old. It may not be worth having so much
      code to optimise a small part of mercurial's functionality.
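
As a toy illustration of the block-based measure mentioned in the
first caveat (a self-contained sketch, not code from these patches):

    # Block-based similarity: the fraction of a's aligned k-byte
    # blocks that occur anywhere in b, whether aligned or not.
    def block_similarity(a, b, k=64):
        if len(a) < k:
            return 1.0 if a in b else 0.0
        blocks = [a[i:i + k] for i in range(0, len(a) - k + 1, k)]
        windows = set(b[i:i + k] for i in range(len(b) - k + 1))
        return sum(blk in windows for blk in blocks) / float(len(blocks))

    # Swapping the two halves of a file barely changes block
    # similarity, while an LCS-based measure would report roughly 50%:
    import random
    random.seed(0)
    half1 = bytes(random.randrange(256) for _ in range(1024))
    half2 = bytes(random.randrange(256) for _ in range(1024))
    print(block_similarity(half1 + half2, half2 + half1))  # close to 1.0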

I would be keen to know if anyone would be interested in incorporating
such changes in mercurial, and, if so, what changes would need to be
made to the following set of patches before they would be accepted.

Benchmarks
----------

Similar matches (using crew repository):

    rm -rf *; hg up -C; for i in `find . -name "*.py"` ; do mv $i $i.new; echo x >> $i.new; done
    hg --time addremove -s 75

    before:      Time: real 54.740 secs (user 50.020+0.000 sys 4.440+0.000)
    python-hash: Time: real 42.930 secs (user 42.440+0.000 sys 0.470+0.000)
    c-hash:      Time: real 13.080 secs (user 12.630+0.000 sys 0.420+0.000)
    exact-match: Time: real 13.880 secs (user 13.400+0.000 sys 0.450+0.000)

Exact matches (using crew repository):

    rm -rf *; hg up -C; for i in `find . -name "*.py"` ; do mv $i $i.new; done
    hg --time addremove -s100

    before:      Time: real 16.730 secs (user 13.560+0.000 sys 3.030+0.000)
    python-hash: Time: real 43.370 secs (user 42.840+0.000 sys 0.530+0.000)
    c-hash:      Time: real 13.040 secs (user 12.630+0.000 sys 0.390+0.000)
    exact-match: Time: real 4.580 secs (user 4.010+0.000 sys 0.450+0.000)

For a medium-sized repository such as "crew" (with 300 renames), these
patches give a modest 4x speed-up. The performance win becomes more
convincing as the number of (potential) renames increases.

Patches
-------

The changes are split into five patches:

1: Moves repository-reading code out of findrenames() back into
   addremove(), allowing findrenames() to be agnostic about where the
   data it uses comes from.

2: Implements the above partial match-by-hash algorithm in pure Python.

3: Adds an optimised C implementation of the hashing functions used by
   the above algorithm, giving a significant performance boost to the
   algorithm.

   Please review carefully, as I don't have any experience writing
   Python modules in C.

4: Optimises findrenames() for the case when the desired similarity is
   100% by using Python's SHA1 hash functionality to find duplicate
   files (a sketch of the idea follows this list).

5: Updates the mercurial unit tests.
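
The fast path in patch 4 is easy to sketch. The following is a minimal
illustration of the idea, not the patch itself; read(), removed and
added are hypothetical names standing in for mercurial's actual
plumbing:

    import hashlib

    def find_exact_renames(read, removed, added):
        # Hash every removed file once, then look each added file up
        # by digest: O(n + m) file reads instead of O(n * m) compares.
        bysha = {}
        for old in removed:
            digest = hashlib.sha1(read(old)).digest()
            bysha.setdefault(digest, []).append(old)
        for new in added:
            candidates = bysha.get(hashlib.sha1(read(new)).digest())
            if candidates:
                yield candidates[0], new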

Algorithm Details
-----------------

The details of the algorithm are as follows:

We split all of the source files into k-byte blocks and calculate
a hash for each block. (A file consisting of 'n' bytes will have
'n / k' hashes.) We then store each hash in a dictionary, mapping
it to the name of the file its block came from:

            +------------+------------+------------+-->
   foo.txt: |To be, or not to be: that is the question> ...
            +------------+------------+------------+-->
             \          / \          / \          /
              0xaabc30aa   0xbb9b0fbb   0xcc94bbcc

   hashes:  { 0xaabc30aa : "foo.txt",
              0xbb9b0fbb : "foo.txt",
              0xcc94bbcc : "foo.txt",
              ... }
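
In Python, the indexing stage might look like the following sketch
(using zlib.adler32 as a stand-in 32-bit hash; the actual patches may
use a different hash function, and read() is again a hypothetical
content-reading callback):

    import zlib

    BLOCKSIZE = 64  # 'k' in the description above

    def index_sources(read, sources):
        # Map the hash of each aligned k-byte block to the source
        # file it came from. Collisions between files are ignored
        # here (first file wins); see catch 1 below.
        hashes = {}
        for name in sources:
            data = read(name)
            for i in range(0, len(data) - BLOCKSIZE + 1, BLOCKSIZE):
                hashes.setdefault(zlib.adler32(data[i:i + BLOCKSIZE]), name)
        return hashes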

For each file in the target list, we then calculate hashes for
each k-byte substring. (A file consisting of 'n' bytes will have
'n - k + 1' hashes). We search the dictionary of source-file
hashes and track which source files we have common blocks with:

            +------------+---------------+------------->
   baz.txt: |To be, or not to be --- That is the questi> ...
            +------------+---------------+------------->
             \          /                 \          /
              0xaabc30aa                   0xcc94bbcc

Because we calculate the hash for every k-byte substring (and not
just at block boundaries, like we did in the first stage), we
still find matching blocks, even if the matches are not aligned
to block boundaries.
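
A corresponding sketch of the matching stage follows. This version
rehashes all k bytes at every offset, costing O(k) per position;
a serious implementation (like the rsync scheme that inspired this
one) would use a rolling checksum so each window costs O(1):

    import zlib

    def match_target(data, hashes, k=64):
        # Tally, per source file, how many of this target's k-byte
        # windows hash to a known source block. Over-counting caused
        # by overlapping or duplicate matches is ignored here; see
        # catch 3 below.
        counts = {}
        for i in range(len(data) - k + 1):
            source = hashes.get(zlib.adler32(data[i:i + k]))
            if source is not None:
                counts[source] = counts.get(source, 0) + 1
        return counts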

Once all matching blocks have been found, we can determine which
source file we have the most blocks in common with, and hence which
file we are most "similar" to.

There are a few catches:

1. We assume that if hash(x) == hash(y), then x == y.
   This assumption will not always be true; we rely on false
   matches being uncommon. One or two collisions between two
   unrelated files should not matter, but if unrelated files share
   many colliding hashes, incorrect matches may result.

2. Permutations of blocks will not be detected. We may have two files
   that are "100% similar" which are not in fact the same.

3. The algorithm, as stated, will misbehave when either the source
   or target file contains duplicate blocks. We perform some trickery
   to avoid such problems (one possibility is sketched below).
   Pathological cases such as large files filled with zeroes should
   be handled correctly.
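
The exact trickery isn't detailed in this cover letter, but one
plausible approach (an assumption on my part, not necessarily what the
patches do) is to budget each hash by the number of times its block
occurs in the source, and to skip a whole block after each match so
that overlapping windows of the same run are not credited repeatedly:

    import zlib

    def count_matches(data, budget, k=64):
        # budget: {block hash: occurrences of that block in the source}.
        # Copied so that each target file starts with a fresh budget.
        budget = dict(budget)
        matched, i = 0, 0
        while i <= len(data) - k:
            h = zlib.adler32(data[i:i + k])
            if budget.get(h, 0) > 0:
                budget[h] -= 1
                matched += 1
                i += k  # consume the whole matched block
            else:
                i += 1  # otherwise slide the window by one byte
        return matched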

Final Notes
-----------

Sorry for the long post :)

Cheers,
David

