[PATCH 0 of 5] Implement O(n) algorithm for findrenames() [version 2]

Sun Mar 21 21:00:01 CDT 2010

Benoit Boissinot wrote:
> Did you have the time to look at the earlier findrenames() attempt? If
> yes could you outline the differences?

As far as I understand, Herbert Griebel's algorithm worked by
statistically analysing files (by looking at sizes and histograms of the
bytes in the file) prior to comparing them. If, after looking at the
statistics, it could be determined that the two files had no change of
being similar, an expensive "diff" operation would be avoided on the
files. Otherwise, the diff operation would take place. Herbert's
algorithm still involved comparing every pair of files (hence was O(n^2)
time complexity), though each comparison was _much_ cheaper.

The proposed algorithm breaks source files into k-byte blocks
(with 'k' == 64 in the patch). It will then search for matching blocks
in the destination files. Each time a match is found between
a destination block and a source block, it is recorded. Once all
matching blocks have been found, the source file with the most matches
is considered the "winner". In the best case, the algorithm requires
only O(n) work. In the worst case, (where every file has the same
64-byte blocks), it also increases to O(n^2) time complexity (also with
a small constant).

So, in summary:

    * In pathological cases, both algorithms are O(n^2) time complexity.

    * In the best case, Herbert's algorithm is O(n^2) with a small
      constant, while the proposed algorithm is O(n).

    * Herbert's algorithm (like the existing algorithm) defines two
      files as being "similar" when they have a long subsequence of
      common lines. The proposed algorithm defines two files as
      "similar" when they have several common sequences of bytes. The
      former definition is more useful for short files. The latter
      definition is more useful for files where the contents have been
      rearranged and for binary files.

Cheers,
David