[PATCH 0 of 5] Implement O(n) algorithm for findrenames() [version 2]

David Greenaway hg-dev at davidgreenaway.com
Sun Mar 21 02:33:25 CDT 2010


Hi all,

The current findrenames() algorithm has O(n^2) time complexity
(where 'n' is the number of files to be added or removed). This is
problematic on large projects (for example, attempting to move
a repository from Linux 2.6.24 to 2.6.33 using 'hg addremove -s75'),
where the number of comparisons reaches into the tens of millions.

This series of patches modifies findrenames() to use an O(n)
algorithm, significantly improving its performance on larger
projects.

The algorithm used (inspired by Andrew Tridgell's "rsync" algorithm)
splits the source files into k-byte blocks, and places a 32-bit hash of
each block into a dictionary. We then check whether any target file
contains hashes matching blocks from the source files. The more blocks
a particular target file has in common with a particular source file,
the more similar those files are considered. Full details of the
algorithm are below.

Patches should apply to the current crew head (#16b9aa398c28).

Changelog
---------

The following changes have been made since the first submission of
these patches two weeks ago:

    - Patches have been re-ordered and re-mixed to something more
      sensible (I hope).

    - Patches have been modified to be more in line with Mercurial's
      coding style.

    - File contexts are now used to access file data. The previous
      version of these patches attempted to introduce a new layer of
      abstraction that was confusing.

    - Support for the progress bar extension has been merged in.

    - The kind reviews given by Benoit Boissinot and others have
      (hopefully) been incorporated.

Benchmarks
----------

(Benchmarks run on the current crew head.)

addremove 100% similarity:

  # rm -rf *; hg up -C; mv tests tests.new
  # hg --time addremove -s100 --dry-run

  before             : real 176.350 secs (user 128.890+0.000 sys 47.430+0.000)
  exact-match   (2)  : real   2.130 secs (user   1.890+0.000 sys  0.240+0.000)
  partial-match (4)  : real   2.120 secs (user   1.930+0.000 sys  0.190+0.000)
  c-hash-code   (5)  : real   2.020 secs (user   1.740+0.000 sys  0.270+0.000)

addremove 75% similarity:

  # rm -rf *; hg up -C; mv tests tests.new; \
      for i in tests.new/*; do echo x >> $i; done
  # hg --time addremove -s75  --dry-run

  before             : real 264.560 secs (user 215.130+0.000 sys 49.410+0.000)
  exact-match   (2)  : real 218.710 secs (user 172.790+0.000 sys 45.870+0.000)
  partial-match (4)  : real  34.790 secs (user  34.010+0.000 sys  0.780+0.000)
  c-hash-code   (5)  : real  10.300 secs (user   9.690+0.000 sys  0.590+0.000)

These improvements become more significant on larger projects.

Algorithm Details
-----------------

The algorithm we use here (inspired by Andrew Tridgell's "rsync"
algorithm) attempts to find "similar" files between a list of 'n'
source files and a list of 'm' target files in O(n + m) time.

The algorithm works by splitting all of the removed files into k-byte
blocks and calculating a hash for each. (A file consisting of 'n' bytes
will have 'n / k' hashes). We then store each hash in a dictionary,
mapping to the filename of where the block came from:

            +------------+------------+------------+-->
   foo.txt: |To be, or not to be: that is the question> ...
            +------------+------------+------------+-->
             \          / \          / \          /
              0xaabc30aa   0xbb9b0fbb   0xcc94bbcc

   hashes:  { 0xaabc30aa : "foo.txt",
              0xbb9b0fbb : "foo.txt",
              0xcc94bbcc : "foo.txt",
              ... }
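
In Python, this first stage might look something like the sketch
below. Here zlib.adler32 merely stands in for the 32-bit block hash,
and the block size and helper names are illustrative rather than
taken from the patches:

    import zlib

    BLOCKSIZE = 64  # illustrative choice of k

    def block_hashes(data, k=BLOCKSIZE):
        """Yield a 32-bit hash of each non-overlapping k-byte block."""
        for i in range(0, len(data) - k + 1, k):
            yield zlib.adler32(data[i:i + k]) & 0xffffffff

    def build_hash_table(sources):
        """Map each block hash to the source files containing it.

        'sources' maps filename -> file contents (bytes).
        """
        table = {}
        for fname, data in sources.items():
            for h in block_hashes(data):
                table.setdefault(h, set()).add(fname)
        return table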

For each file in the list of added files, we then calculate hashes for
each k-byte substring. (A file consisting of 'n' bytes will have 'n
- k + 1' hashes). We search the dictionary of source-file hashes and
track which source files we have common blocks with:

            +------------+---------------+------------->
   baz.txt: |To be, or not to be --- That is the questi> ...
            +------------+---------------+------------->
             \          /                 \          /
              0xaabc30aa                   0xcc94bbcc

Because we calculate the hash for every k-byte substring (and not just
at block boundaries, like we did in the first stage), we find all
matching blocks (even if the matches are not aligned to block
boundaries).
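
Note that because k is a fixed constant, even recomputing the hash
for every window keeps this scan linear in the file size; a rolling
checksum in the style of rsync improves the constant factor by
updating the hash in O(1) per one-byte shift rather than rehashing
k bytes. Continuing the sketch above (again illustrative, not the
patches' actual code):

    def window_hashes(data, k=BLOCKSIZE):
        """Yield a hash for every k-byte window of 'data'.

        Recomputed per window for clarity; rsync's rolling checksum
        would update the hash in O(1) per shift instead.
        """
        for i in range(len(data) - k + 1):
            yield zlib.adler32(data[i:i + k]) & 0xffffffff

    def count_matches(data, table, k=BLOCKSIZE):
        """Tally how many windows of 'data' hit blocks of each source."""
        counts = {}
        for h in window_hashes(data, k):
            for src in table.get(h, ()):
                counts[src] = counts.get(src, 0) + 1
        return counts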

Once all matching blocks have been found, we can determine which
source file we have the most in common with, and hence which file we
are most "similar" to.
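
With the tallies from the sketch above, that final selection is a
single pass over the counts (again illustrative):

    def most_similar(target_data, table, k=BLOCKSIZE):
        """Return the source file sharing the most blocks, or None."""
        counts = count_matches(target_data, table, k)
        if not counts:
            return None
        return max(counts, key=counts.get)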

The algorithm, as stated, will misbehave when either the source or
target file contains duplicate blocks. We perform some trickery to avoid
such problems. Pathological cases such as large files filled with zeroes
should be handled correctly.
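
As a purely illustrative example of such a guard (the patches may
well use a different one), counting each distinct hash only once per
target file stops a file full of identical blocks from inflating the
tally; on the source side, the per-hash set of filenames in the
sketch above already collapses duplicate blocks within a single file:

    def count_matches_dedup(data, table, k=BLOCKSIZE):
        """As count_matches, but count each distinct hash once, so
        e.g. a large all-zero file contributes one match per source
        file rather than one per window."""
        counts = {}
        for h in set(window_hashes(data, k)):
            for src in table.get(h, ()):
                counts[src] = counts.get(src, 0) + 1
        return counts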

Final notes
-----------

Any comments would be much appreciated. If people are keen to
introduce this type of functionality to Mercurial (and are happy with
this particular approach), I will be happy to continue tweaking these
patches based on any feedback.

Thanks so much,
David


