[PATCH 0 of 5] RFC: Implement O(n + m) algorithm for findrenames()
Augie Fackler
lists at durin42.com
Sat Mar 6 23:03:57 CST 2010
OOC, have you seen http://bitbucket.org/cbarrett/guess-renames/?
Specifically, Aaron Bentley's algorithm for this?
http://osdir.com/ml/bazaar/2009-03/msg01300.html
Everything I've heard about it suggests it does very well.
On Sat, Mar 6, 2010 at 10:12 PM, David Greenaway
<hg-dev at davidgreenaway.com> wrote:
>
> Hi all,
>
> I maintain a mercurial repository that tracks my company's local changes
> to the Linux sources. Attempting to upgrade from Linux 2.6.24 to the
> latest mainline, I tried the following:
>
> hg update -C linux-2.6.24-mainline
> rm -rf *
> tar -xzvf ~/linux-2.6.33.tar.gz
> hg addremove -s75
>
> After leaving that last hg command running for twelve hours on my
> 2.4GHz Core 2 Duo, I decided that it was probably time to give up.
>
> The problem is that the algorithm used in findrenames() requires
> approximately O(m * n) file comparisons, making it infeasible to use
> once 'm' and 'n' become large.
>
> This series of patches changes the findrenames() algorithm so that it
> uses something closer to O(m + n) file comparisons. With all patches
> applied, the above sequence of commands completes on my wimpy 1.2GHz
> Atom box in less than 10 minutes.
>
> The algorithm used (inspired by Andrew Tridgell's "rsync" algorithm)
> splits the source files into k-byte blocks, and places a 32-bit hash of
> each block into a dictionary. We then see if any of the target files
> contain any matching hashes from the source files. The more blocks
> a particular target file has in common with a particular source file,
> the more similar those files are considered. Full details of the
> algorithm are below.
>
> There are a few caveats:
>
> * These patches redefine the meaning of "similarity" between two
> files. Instead of being based on the length of the maximum common
> subsequence of two files, it is based on the number of common
> blocks two files share.
>
> The former will claim a file with two large functions is 50%
> similar to a file with those two large functions in reverse order,
> while the latter will claim it is close to 100% similar.
>
> The former will claim a shuffled dictionary is 50% similar to an
> un-shuffled dictionary, while the latter will claim it is 0%
> similar (unless the words are quite large).
>
> The former tends to work better with small files, while the latter
> tends to work better with binary files.
>
> * The new algorithm can use large amounts of RAM in the worst case.
> I suspect this will only tend to occur when the former algorithm
> would have taken too long anyway; it effectively changes waiting
> forever into swap death.
>
> * The amount of code used for the new algorithm is significantly
> larger than that of the old. It may not be worth having so much
> code to optimise a small part of mercurial's functionality.
>
> I would be keen to know if anyone would be interested in incorporating
> such changes in mercurial, and, if so, what changes would need to be
> made to the following set of patches before they would be accepted.
>
> Benchmarks
> ----------
>
> Similar matches (using crew repository):
>
> rm -rf *; hg up -C; for i in `find . -name "*.py"` ; do mv $i $i.new; echo x >> $i.new; done
> hg --time addremove -s 75
>
> before: Time: real 54.740 secs (user 50.020+0.000 sys 4.440+0.000)
> python-hash: Time: real 42.930 secs (user 42.440+0.000 sys 0.470+0.000)
> c-hash: Time: real 13.080 secs (user 12.630+0.000 sys 0.420+0.000)
> exact-match: Time: real 13.880 secs (user 13.400+0.000 sys 0.450+0.000)
>
> Exact matches (using crew repository):
>
> rm -rf *; hg up -C; for i in `find . -name "*.py"` ; do mv $i $i.new; done
> hg --time addremove -s100
>
> before: Time: real 16.730 secs (user 13.560+0.000 sys 3.030+0.000)
> python-hash: Time: real 43.370 secs (user 42.840+0.000 sys 0.530+0.000)
> c-hash: Time: real 13.040 secs (user 12.630+0.000 sys 0.390+0.000)
> exact-match: Time: real 4.580 secs (user 4.010+0.000 sys 0.450+0.000)
>
> For a medium-sized repository such as "crew" (with 300 renames) these
> patches give a modest 4x speed increase. The performance win becomes
> more convincing as the number of (potential) renames increases.
>
> Patches
> -------
>
> The changes are split into five patches:
>
> 1: Moves repository-reading code out of findrenames() back into
>    addremove(), allowing findrenames() to be agnostic about where
>    the data it uses comes from.
>
> 2: Implements the above partial match-by-hash algorithm in pure Python.
>
> 3: Adds an optimised C implementation of the hashing functions used
>    by the above algorithm, giving it a significant performance
>    boost.
>
> Please review carefully, as I don't have any experience writing
> Python modules in C.
>
> 4: Optimises findrenames() for the case when the desired similarity is
> 100% by using Python's SHA1 hash functionality for finding duplicate
> files.
>
> 5: Updates the mercurial unit tests.
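>
> The patch-4 idea of detecting exact renames with SHA-1 can be
> sketched as follows. This is a hedged illustration, not the patch
> itself: the function name, signature, and in-memory dicts are made
> up for the example (the real code works against the dirstate).

```python
import hashlib

def exact_renames(removed, added):
    """Pair each added file with a removed file of identical contents.

    `removed` and `added` map filename -> contents (bytes).  One SHA-1
    per file replaces pairwise comparison, so the 100%-similarity case
    costs O(m + n) hashes instead of O(m * n) comparisons.
    """
    # Index removed files by content digest; on a digest tie the first
    # filename seen wins (a simplification for this sketch).
    by_digest = {}
    for name, data in removed.items():
        by_digest.setdefault(hashlib.sha1(data).hexdigest(), name)

    renames = {}
    for name, data in added.items():
        src = by_digest.get(hashlib.sha1(data).hexdigest())
        if src is not None:
            renames[name] = src
    return renames
```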
>
> Algorithm Details
> -----------------
>
> The details of the algorithm are as follows:
>
> We split each of the source files into k-byte blocks and calculate
> a hash for each block. (A file consisting of 'n' bytes will have
> 'n / k' hashes.) We then store each hash in a dictionary, mapping
> it to the name of the file the block came from:
>
> +------------+------------+------------+-->
> foo.txt: |To be, or not to be: that is the question> ...
> +------------+------------+------------+-->
> \ / \ / \ /
> 0xaabc30aa 0xbb9b0fbb 0xcc94bbcc
>
> hashes: { 0xaabc30aa : "foo.txt",
> 0xbb9b0fbb : "foo.txt",
> 0xcc94bbcc : "foo.txt",
> ... }
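>
> In Python, this first stage might look something like the following.
> A rough sketch only: the block size, the use of zlib.adler32 as the
> 32-bit hash, and the helper name are illustrative assumptions, not
> necessarily what the patches use (patch 3 supplies its own C hash).

```python
import zlib

K = 64  # block size in bytes; illustrative value only

def source_hashes(sources):
    """Map each k-byte block hash to the source file(s) containing it.

    `sources` maps filename -> contents (bytes).  Hashes are taken at
    block boundaries only, so an n-byte file yields ~n/K entries.
    Mapping to a set of names (rather than one name) tolerates the
    same block appearing in several source files.
    """
    hashes = {}
    for name, data in sources.items():
        for off in range(0, len(data) - K + 1, K):
            h = zlib.adler32(data[off:off + K])
            hashes.setdefault(h, set()).add(name)
    return hashes
```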
>
> For each file in the target list, we then calculate hashes for
> each k-byte substring. (A file consisting of 'n' bytes will have
> 'n - k + 1' hashes). We search the dictionary of source-file
> hashes and track which source files we have common blocks with:
>
> +------------+---------------+------------->
> baz.txt: |To be, or not to be --- That is the questi> ...
> +------------+---------------+------------->
> \ / \ /
> 0xaabc30aa 0xcc94bbcc
>
> Because we calculate the hash for every k-byte substring (and not
> just at block boundaries, like we did in the first stage), we
> still find matching blocks, even if the matches are not aligned
> to block boundaries.
>
> Once all matching blocks have been found, we can determine which
> source file each target has the most blocks in common with, and
> hence which it is most "similar" to.
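>
> The matching stage might be sketched like so. Again illustrative:
> zlib.adler32 and the names are assumptions, and a real
> implementation would use a rolling checksum (as rsync does) so each
> window costs O(1) to hash instead of rehashing k bytes per offset.

```python
import zlib

K = 64  # block size in bytes; must match the source-hashing stage

def best_match(target_data, hashes):
    """Score a target file against a dict of {block_hash: source_name}.

    Every k-byte window of the target is hashed (not just aligned
    blocks), so matches are found even when they are not aligned to
    block boundaries.  Returns (source_name, n_matching_windows) for
    the source with the most blocks in common, or None if nothing
    matches.  The real patches also normalise by file size to turn
    this count into a similarity percentage.
    """
    counts = {}
    for off in range(len(target_data) - K + 1):
        h = zlib.adler32(target_data[off:off + K])
        src = hashes.get(h)
        if src is not None:
            counts[src] = counts.get(src, 0) + 1
    if not counts:
        return None
    return max(counts.items(), key=lambda kv: kv[1])
```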
>
> There are a few catches:
>
> 1. We assume that if hash(x) == hash(y), then x == y.
>    This assumption will not always be true. We rely on false
>    matches being uncommon. Two unrelated files sharing one
>    or two hash collisions should not matter; if unrelated files
>    share many colliding hashes, incorrect matches may ensue.
>
> 2. Permutations of blocks will not be detected. We may have two files
> that are "100% similar" which are not in fact the same.
>
> 3. The algorithm, as stated, will misbehave when either the source
> or target file contains duplicate blocks. We perform some trickery to
> avoid such problems. Pathological cases such as large files filled
> with zeroes should be handled correctly.
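>
> One simple way to implement that trickery is to count each distinct
> block hash at most once per target, so a file full of one repeated
> block (e.g. all zeroes) is not credited with thousands of matches.
> This is my own guess at the approach, sketched with illustrative
> names and zlib.adler32; the patches' actual handling may differ.

```python
import zlib

K = 64  # block size in bytes; illustrative value only

def match_counts(target_data, hashes):
    """Count distinct matching block hashes per source file.

    `hashes` maps block_hash -> source filename.  A hash seen earlier
    in this target is skipped, so a target consisting of one repeated
    block credits each source block only once.
    """
    seen = set()
    counts = {}
    for off in range(len(target_data) - K + 1):
        h = zlib.adler32(target_data[off:off + K])
        if h in seen:
            continue
        seen.add(h)
        src = hashes.get(h)
        if src is not None:
            counts[src] = counts.get(src, 0) + 1
    return counts
```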
>
> Final Notes
> -----------
>
> Sorry for the long post :)
>
> Cheers,
> David
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel at selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel
>