[PATCH 0 of 5] RFC: Implement O(n + m) algorithm for findrenames()
Augie Fackler
lists at durin42.com
Sat Mar 6 23:03:57 CST 2010
OOC, have you seen http://bitbucket.org/cbarrett/guess-renames/?
Specifically, Aaron Bentley's algorithm for this?
http://osdir.com/ml/bazaar/2009-03/msg01300.html
Everything I've heard about it suggests it does very well.
On Sat, Mar 6, 2010 at 10:12 PM, David Greenaway
<hg-dev at davidgreenaway.com> wrote:
>
> Hi all,
>
> I maintain a mercurial repository that tracks my company's local changes
> to the Linux sources. Attempting to upgrade from Linux 2.6.24 to the
> latest mainline, I tried the following:
>
> hg update -C linux-2.6.24-mainline
> rm -rf *
> tar -xzvf ~/linux-2.6.33.tar.gz
> hg addremove -s75
>
> After leaving that last hg command running for twelve hours on my
> 2.4GHz Core 2 Duo, I decided that it was probably time to give up.
>
> The problem is that the algorithm used in findrenames() requires
> approximately O(m * n) file comparisons, making it infeasible to use
> once 'm' and 'n' become large.
>
> This series of patches changes the findrenames() algorithm so that it
> uses something closer to O(m + n) file comparisons. With all patches
> applied, the above sequence of commands completes on my wimpy 1.2GHz
> Atom box in less than 10 minutes.
>
> The algorithm used (inspired by Andrew Tridgell's "rsync" algorithm)
> splits the source files into k-byte blocks, and places a 32-bit hash of
> each block into a dictionary. We then see if any of the target files
> contain any matching hashes from the source files. The more blocks
> a particular target file has in common with a particular source file,
> the more similar those files are considered. Full details of the
> algorithm are below.
>
> There are a few caveats:
>
> * These patches redefine the meaning of "similarity" between two
> files. Instead of being based on the length of the maximum common
> subsequence of two files, it is based on the number of common
> blocks two files share.
>
> The former will claim a file with two large functions is 50%
> similar to a file with those two large functions in reverse order,
> while the latter will claim it is close to 100% similar.
>
> The former will claim a shuffled dictionary is 50% similar to an
> un-shuffled dictionary, while the latter will claim it is 0%
> similar (unless the words are quite large).
>
> The former tends to work better with small files, while the latter
> tends to work better with binary files.
>
> * The new algorithm can use large amounts of RAM in the worst case.
> I suspect this will only tend to occur when the former algorithm
> would have taken too long anyway; it effectively changes waiting
> forever into swap death.
>
> * The amount of code used for the new algorithm is significantly
> larger than that of the old. It may not be worth having so much
> code to optimise a small part of mercurial's functionality.
>
> I would be keen to know if anyone would be interested in incorporating
> such changes in mercurial, and, if so, what changes would need to be
> made to the following set of patches before they would be accepted.
>
> Benchmarks
> ----------
>
> Similar matches (using crew repository):
>
> rm -rf *; hg up -C; for i in `find . -name "*.py"` ; do mv $i $i.new; echo x >> $i.new; done
> hg --time addremove -s 75
>
> before: Time: real 54.740 secs (user 50.020+0.000 sys 4.440+0.000)
> python-hash: Time: real 42.930 secs (user 42.440+0.000 sys 0.470+0.000)
> c-hash: Time: real 13.080 secs (user 12.630+0.000 sys 0.420+0.000)
> exact-match: Time: real 13.880 secs (user 13.400+0.000 sys 0.450+0.000)
>
> Exact matches (using crew repository):
>
> rm -rf *; hg up -C; for i in `find . -name "*.py"` ; do mv $i $i.new; done
> hg --time addremove -s100
>
> before: Time: real 16.730 secs (user 13.560+0.000 sys 3.030+0.000)
> python-hash: Time: real 43.370 secs (user 42.840+0.000 sys 0.530+0.000)
> c-hash: Time: real 13.040 secs (user 12.630+0.000 sys 0.390+0.000)
> exact-match: Time: real 4.580 secs (user 4.010+0.000 sys 0.450+0.000)
>
> For a medium-sized repository such as "crew" (with 300 renames) these
> patches give a modest 4x speed increase. The performance win becomes
> more convincing as the number of (potential) renames increases.
>
> Patches
> -------
>
> The changes are split into five patches:
>
> 1: Moves repository-reading code out of findrenames() back into
>    addremove(), allowing findrenames() to be agnostic about where
>    the data it uses comes from.
>
> 2: Implements the above partial match-by-hash algorithm in pure Python.
>
> 3: Adds an optimised C implementation of the hashing functions used
>    by the above algorithm, giving it a significant performance
>    boost.
>
> Please review carefully, as I don't have any experience writing
> Python modules in C.
>
> 4: Optimises findrenames() for the case when the desired similarity is
> 100% by using Python's SHA1 hash functionality for finding duplicate
> files.
>
> 5: Updates the mercurial unit tests.
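>
> The patch-4 idea of detecting exact renames with SHA-1 can be
> sketched as follows. This is a hedged illustration, not the patch
> itself: the function name, signature, and in-memory dicts are made
> up for the example (the real code works against the dirstate).

```python
import hashlib

def exact_renames(removed, added):
    """Pair each added file with a removed file of identical contents.

    `removed` and `added` map filename -> contents (bytes).  One SHA-1
    per file replaces pairwise comparison, so the 100%-similarity case
    costs O(m + n) hashes instead of O(m * n) comparisons.
    """
    # Index removed files by content digest; on a digest tie the first
    # filename seen wins (a simplification for this sketch).
    by_digest = {}
    for name, data in removed.items():
        by_digest.setdefault(hashlib.sha1(data).hexdigest(), name)

    renames = {}
    for name, data in added.items():
        src = by_digest.get(hashlib.sha1(data).hexdigest())
        if src is not None:
            renames[name] = src
    return renames
```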
>
> Algorithm Details
> -----------------
>
> The details of the algorithm are as follows:
>
> We split each of the source files into k-byte blocks and calculate
> a hash for each block. (A file consisting of 'n' bytes will have
> 'n / k' hashes.) We then store each hash in a dictionary, mapping
> it to the name of the file the block came from:
>
> +------------+------------+------------+-->
> foo.txt: |To be, or not to be: that is the question> ...
> +------------+------------+------------+-->
> \ / \ / \ /
> 0xaabc30aa 0xbb9b0fbb 0xcc94bbcc
>
> hashes: { 0xaabc30aa : "foo.txt",
> 0xbb9b0fbb : "foo.txt",
> 0xcc94bbcc : "foo.txt",
> ... }
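>
> In Python, this first stage might look something like the following.
> A rough sketch only: the block size, the use of zlib.adler32 as the
> 32-bit hash, and the helper name are illustrative assumptions, not
> necessarily what the patches use (patch 3 supplies its own C hash).

```python
import zlib

K = 64  # block size in bytes; illustrative value only

def source_hashes(sources):
    """Map each k-byte block hash to the source file(s) containing it.

    `sources` maps filename -> contents (bytes).  Hashes are taken at
    block boundaries only, so an n-byte file yields ~n/K entries.
    Mapping to a set of names (rather than one name) tolerates the
    same block appearing in several source files.
    """
    hashes = {}
    for name, data in sources.items():
        for off in range(0, len(data) - K + 1, K):
            h = zlib.adler32(data[off:off + K])
            hashes.setdefault(h, set()).add(name)
    return hashes
```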
>
> For each file in the target list, we then calculate hashes for
> each k-byte substring. (A file consisting of 'n' bytes will have
> 'n - k + 1' hashes). We search the dictionary of source-file
> hashes and track which source files we have common blocks with:
>
> +------------+---------------+------------->
> baz.txt: |To be, or not to be --- That is the questi> ...
> +------------+---------------+------------->
> \ / \ /
> 0xaabc30aa 0xcc94bbcc
>
> Because we calculate the hash for every k-byte substring (and not
> just at block boundaries, like we did in the first stage), we
> still find matching blocks, even if the matches are not aligned
> to block boundaries.
>
> Once all matching blocks have been found, we can determine which
> source file each target has the most blocks in common with, and
> hence which it is most "similar" to.
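>
> The matching stage might be sketched like so. Again illustrative:
> zlib.adler32 and the names are assumptions, and a real
> implementation would use a rolling checksum (as rsync does) so each
> window costs O(1) to hash instead of rehashing k bytes per offset.

```python
import zlib

K = 64  # block size in bytes; must match the source-hashing stage

def best_match(target_data, hashes):
    """Score a target file against a dict of {block_hash: source_name}.

    Every k-byte window of the target is hashed (not just aligned
    blocks), so matches are found even when they are not aligned to
    block boundaries.  Returns (source_name, n_matching_windows) for
    the source with the most blocks in common, or None if nothing
    matches.  The real patches also normalise by file size to turn
    this count into a similarity percentage.
    """
    counts = {}
    for off in range(len(target_data) - K + 1):
        h = zlib.adler32(target_data[off:off + K])
        src = hashes.get(h)
        if src is not None:
            counts[src] = counts.get(src, 0) + 1
    if not counts:
        return None
    return max(counts.items(), key=lambda kv: kv[1])
```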
>
> There are a few catches:
>
> 1. We assume that if hash(x) == hash(y), then x == y.
>    This assumption will not always be true. We rely on false
>    matches being uncommon. Two unrelated files sharing one
>    or two hash collisions should not matter; if unrelated files
>    share many colliding hashes, incorrect matches may ensue.
>
> 2. Permutations of blocks will not be detected. We may have two files
> that are "100% similar" which are not in fact the same.
>
> 3. The algorithm, as stated, will misbehave when either the source
> or target file contains duplicate blocks. We perform some trickery to
> avoid such problems. Pathological cases such as large files filled
> with zeroes should be handled correctly.
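>
> One simple way to implement that trickery is to count each distinct
> block hash at most once per target, so a file full of one repeated
> block (e.g. all zeroes) is not credited with thousands of matches.
> This is my own guess at the approach, sketched with illustrative
> names and zlib.adler32; the patches' actual handling may differ.

```python
import zlib

K = 64  # block size in bytes; illustrative value only

def match_counts(target_data, hashes):
    """Count distinct matching block hashes per source file.

    `hashes` maps block_hash -> source filename.  A hash seen earlier
    in this target is skipped, so a target consisting of one repeated
    block credits each source block only once.
    """
    seen = set()
    counts = {}
    for off in range(len(target_data) - K + 1):
        h = zlib.adler32(target_data[off:off + K])
        if h in seen:
            continue
        seen.add(h)
        src = hashes.get(h)
        if src is not None:
            counts[src] = counts.get(src, 0) + 1
    return counts
```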
>
> Final Notes
> -----------
>
> Sorry for the long post :)
>
> Cheers,
> David
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel at selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel
>