[PATCH 0 of 5] Implement O(n) algorithm for findrenames() [version 2]
David Greenaway
hg-dev at davidgreenaway.com
Sun Mar 21 02:33:25 CDT 2010
Hi all,
The current findrenames() algorithm runs in O(n^2) time (where 'n' is
the number of files to be added or removed). This is problematic on
large projects (for example, attempting to move a repository from Linux
2.6.24 to 2.6.33 using 'hg addremove -s75'), as the number of
comparisons that need to take place reaches into the tens of millions.
This series of patches modifies findrenames() to use an O(n) time
complexity algorithm, significantly improving its performance on larger
projects.
The algorithm used (inspired by Andrew Tridgell's "rsync" algorithm)
splits the source files into k-byte blocks, and places a 32-bit hash of
each block into a dictionary. We then see if any of the target files
contain any matching hashes from the source files. The more blocks
a particular target file has in common with a particular source file,
the more similar those files are considered. Full details of the
algorithm are below.
Patches should apply to the current crew head (#16b9aa398c28).
Changelog
---------
The following changes have been made since the first submission of
these patches two weeks ago:
- Patches have been re-ordered and re-mixed to something more
sensible (I hope).
- Patches modified to be more in line with Mercurial's coding
style.
- File contexts are now used to access file data. The previous
version of these patches attempted to introduce a new layer of
abstraction that was confusing.
- Support for the progress bar extension has been merged in.
- The kind reviews given by Benoit Boissinot and others have
(hopefully) been incorporated.
Benchmarks
----------
(Benchmarks run on the current crew head.)
addremove 100% similarity:
# rm -rf *; hg up -C; mv tests tests.new
# hg --time addremove -s100 --dry-run
before : real 176.350 secs (user 128.890+0.000 sys 47.430+0.000)
exact-match (2) : real 2.130 secs (user 1.890+0.000 sys 0.240+0.000)
partial-match (4) : real 2.120 secs (user 1.930+0.000 sys 0.190+0.000)
c-hash-code (5) : real 2.020 secs (user 1.740+0.000 sys 0.270+0.000)
addremove 75% similarity:
# rm -rf *; hg up -C; mv tests tests.new; \
for i in tests.new/*; do echo x >> $i; done
# hg --time addremove -s75 --dry-run
before : real 264.560 secs (user 215.130+0.000 sys 49.410+0.000)
exact-match (2) : real 218.710 secs (user 172.790+0.000 sys 45.870+0.000)
partial-match (4) : real 34.790 secs (user 34.010+0.000 sys 0.780+0.000)
c-hash-code (5) : real 10.300 secs (user 9.690+0.000 sys 0.590+0.000)
These improvements become more significant on larger projects.
Algorithm Details
-----------------
The algorithm we use here (inspired by Andrew Tridgell's "rsync"
algorithm) attempts to find "similar" files between a list of 'n' files
and a list of 'm' target files in O(n + m) time.
The algorithm works by splitting all of the removed files into k-byte
blocks and calculating a hash for each. (A file consisting of 'n' bytes
will have 'n / k' hashes). We then store each hash in a dictionary,
mapping to the filename of where the block came from:
+------------+------------+------------+-->
foo.txt: |To be, or not to be: that is the question> ...
+------------+------------+------------+-->
\ / \ / \ /
0xaabc30aa 0xbb9b0fbb 0xcc94bbcc
hashes: { 0xaabc30aa : "foo.txt",
0xbb9b0fbb : "foo.txt",
0xcc94bbcc : "foo.txt",
... }
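The first stage above can be sketched roughly as follows. This is an
illustrative sketch only, not the code in the patches: the helper name,
the block size, and the choice of zlib.adler32 as the 32-bit hash are
all assumptions for the example.

```python
import zlib

BLOCKSIZE = 64  # hypothetical block size 'k'

def blockhashes(sources):
    """Map each k-byte block hash to the source files containing it.

    'sources' maps filename -> file contents (bytes).
    """
    hashes = {}
    for fname, data in sources.items():
        # Hash non-overlapping k-byte blocks, as in the first stage.
        for offset in range(0, len(data), BLOCKSIZE):
            block = data[offset:offset + BLOCKSIZE]
            h = zlib.adler32(block) & 0xffffffff
            hashes.setdefault(h, set()).add(fname)
    return hashes
```

Building this dictionary touches each source byte once, so the stage is
linear in the total size of the removed files.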
For each file in the list of added files, we then calculate hashes for
each k-byte substring. (A file consisting of 'n' bytes will have 'n
- k + 1' hashes). We search the dictionary of source-file hashes and
track which source files we have common blocks with:
+------------+---------------+------------->
baz.txt: |To be, or not to be --- That is the questi> ...
+------------+---------------+------------->
\ / \ /
0xaabc30aa 0xcc94bbcc
Because we calculate the hash for every k-byte substring (and not just
at block boundaries, like we did in the first stage), we find all
matching blocks (even if the matches are not aligned to block
boundaries).
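The second stage might look like the sketch below (again illustrative,
with assumed names). For brevity it recomputes the hash at every
offset, which is O(n * k); the real point of an rsync-style hash is
that it can be "rolled" forward one byte at a time to keep this stage
O(n).

```python
import zlib

BLOCKSIZE = 64  # hypothetical block size 'k'

def matchcounts(data, hashes):
    """Tally, per source file, how many k-byte substrings of 'data'
    match a source block, using the dictionary from the first stage.
    """
    counts = {}
    for offset in range(len(data) - BLOCKSIZE + 1):
        window = data[offset:offset + BLOCKSIZE]
        h = zlib.adler32(window) & 0xffffffff
        for fname in hashes.get(h, ()):
            counts[fname] = counts.get(fname, 0) + 1
    return counts
```

Because every substring is hashed, a source block that survives in the
target is found regardless of how insertions or deletions have shifted
its alignment.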
Once all matching blocks have been found, we can determine which source
file we have the most in common with, and hence which we are most
"similar" to.
The algorithm, as stated, will misbehave when either the source or
target file contains duplicate blocks. We perform some trickery to avoid
such problems. Pathological cases such as large files filled with zeroes
should be handled correctly.
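A hypothetical sketch of the final scoring step, purely to illustrate
the idea: score each candidate source by the fraction of the target
covered by matched blocks, and take the best one. The exact score
formula, the names, and the duplicate handling in the patches may
differ.

```python
BLOCKSIZE = 64  # hypothetical block size 'k'

def similarity(nblocks, targetlen):
    """Estimate similarity as matched bytes over target length,
    capped at 1.0 (repeated blocks could otherwise over-count)."""
    return min(1.0, (nblocks * BLOCKSIZE) / float(targetlen))

def bestmatch(counts, targetlen):
    """Pick the source file with the highest similarity score.

    'counts' maps source filename -> number of matched blocks,
    as produced by the second stage.
    """
    best, bestscore = None, 0.0
    for fname, nblocks in counts.items():
        score = similarity(nblocks, targetlen)
        if score > bestscore:
            best, bestscore = fname, score
    return best, bestscore
```

Capping the score is one simple way to keep pathological inputs (such
as zero-filled files whose blocks all match each other) from producing
similarities above 100%.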
Final notes
-----------
Any comments would be much appreciated. If people are keen to introduce
this type of functionality to mercurial (and are happy with this
particular approach), I will be happy to continue to tweak these patches
based on any feedback.
Thanks so much,
David