[PATCH 0 of 5] Implement O(n) algorithm for findrenames() [version 2]

Mads Kiilerich mads at kiilerich.com
Sun Mar 21 15:07:53 CDT 2010


David Greenaway wrote, On 03/21/2010 08:33 AM:
> Hi all,
>
> The current findrenames() algorithm requires O(n^2) time complexity
> (where 'n' is the number of files needed to be added/removed). This is
> problematic on large projects (for example, attempting to move
> a repository from Linux 2.6.24 to 2.6.33 using 'hg addremove -s75'), as
> the number of comparisons that need to take place starts reaching into
> the tens-of-millions.
>
> This series of patches modifies findrenames() to use an O(n) time
> complexity algorithm, significantly improving its performance on larger
> projects.
>    

Nice work. Thanks!

> The algorithm used (inspired by Andrew Tridgell's "rsync" algorithm)
> splits the source files into k-byte blocks,

"k-byte" refers to _BLOCK_SIZE = 64?

> and places a 32-bit hash of
> each block into a dictionary. We then see if any of the target files
> contain any matching hashes from the source files. The more blocks
> a particular target file has in common with a particular source file,
> the more similar those files are considered. Full details of the
> algorithm are below.
>    

So if one extra character is inserted at the beginning of a file, none of
the hashes will match? There is no rolling checksum?
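
If I read it right, the approach is roughly like this (untested sketch,
my own names, with crc32 just standing in for whatever 32-bit hash the
patch actually uses):

    import zlib

    _BLOCK_SIZE = 64

    def block_hashes(data):
        # One 32-bit hash per fixed-size block.  Block boundaries are
        # absolute file offsets, so inserting a single byte up front
        # shifts every following block.
        return set(zlib.crc32(data[i:i + _BLOCK_SIZE]) & 0xffffffff
                   for i in range(0, len(data), _BLOCK_SIZE))

    old = bytes(range(256)) * 4      # 1 KB of varying content
    new = b'!' + old                 # one byte inserted at the front
    print(len(block_hashes(old) & block_hashes(new)))   # -> 0 shared hashes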

Assuming my assumptions above can be confirmed, I have some 
comments/suggestions/questions which might be useful - and if not, they 
can be ignored:

So instead of storing the raw 64 bytes we store 4 bytes? That is "just" 
an optimization of one order of magnitude over the naïve approach. 
Is that good enough? Shouldn't we aim higher, towards something that 
scales better? Or do other existing bottlenecks make this a non-issue?

I can see how a block-based algorithm will be useful for files like 
database files or disk images. But Mercurial is mostly made and 
optimized for source code - though it is often "abused" for "binary" 
files like executables or compressed files (and probably less often for 
database files and disk images). Is this attempt at matching blocks 
inside the file really a relevant optimization? How often in your test 
case do you get any relevant matches after one block has failed? 
Wouldn't it be just as efficient (and perhaps better?) to store hashes 
of, for example, the first and last 128, 256, 512, ... bytes and just 
look for files with as much as possible in common at the beginning and end?
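
Something along these lines, say (untested sketch, my names):

    import zlib

    def edge_hashes(data, sizes=(128, 256, 512)):
        # Hash a few windows at the start and end of the file instead
        # of every interior block.
        hashes = set()
        for n in sizes:
            hashes.add(('head', n, zlib.crc32(data[:n]) & 0xffffffff))
            hashes.add(('tail', n, zlib.crc32(data[-n:]) & 0xffffffff))
        return hashes

    def edge_similarity(a, b):
        # The more head/tail windows two files share, the more similar
        # they are considered.
        ha, hb = edge_hashes(a), edge_hashes(b)
        return len(ha & hb) / float(len(ha | hb) or 1)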

IIRC the Mercurial differential storage format splits on \n - even for 
"binary" files. Perhaps it would be more efficient (at least in the 
common Mercurial use cases) if the hashes worked on \n-delimited blocks 
instead of fixed-size ones? (Short blocks could perhaps be skipped.)
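
I am thinking of something like (again just a sketch, threshold picked
out of thin air):

    import zlib

    _MIN_LINE = 8    # arbitrary threshold for skipping short blocks

    def line_hashes(data):
        # Hash \n-delimited blocks instead of fixed-size ones, so the
        # hashes stay aligned after insertions elsewhere in the file.
        return set(zlib.crc32(line) & 0xffffffff
                   for line in data.split(b'\n')
                   if len(line) >= _MIN_LINE)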

Isn't it a valid assumption/optimization that for moved "binary" files 
we only care about 100% matches? Shouldn't all util.binary files be 
ignored in the expensive similarity-detection?
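
For instance a filter like this in front of the expensive path (sketch;
block_score stands for whatever scoring function the patch ends up with):

    def score(old_data, new_data, block_score):
        # Roughly util.binary's test: a NUL byte means "binary".  For
        # such files only an exact match matters, so skip the block
        # comparison entirely.
        if b'\0' in old_data or b'\0' in new_data:
            return 1.0 if old_data == new_data else 0.0
        return block_score(old_data, new_data)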

"Synchronizing" on \n would give less need for rolling checksum - but 
wouldn't it help a lot anyway to avoid false matches? Perhaps just 
something simple like not using a block hash directly but always xor it 
with the 3 (and/or 30?) previous hashes. Wouldn't that also mitigate the 
"duplicate block" issue a bit?

> Benchmarks
> ----------
>    

Perhaps it would be relevant to include memory consumption in the 
measurements. (See contrib/memory.py if you are on Linux.)

/Mads


