[PATCH 2 of 5] findrenames: Replace current similarity detection algorithm with an O(n + m) algorithm

David Greenaway hg-dev at davidgreenaway.com
Sun Mar 7 18:22:19 CST 2010


On Sun, Mar 07, 2010 at 07:10:54PM +0100, Benoit Boissinot wrote:
> On Sun, Mar 07, 2010 at 04:12:49AM -0000, David Greenaway wrote:
> > # HG changeset patch
> > # User David Greenaway <hg-dev at davidgreenaway.com>
> > # Date 1267934964 -39600
> > # Node ID 838395f8b30b2077d387c87aa5f0408f3071a345
> > # Parent  10649eca0e852b7f229e392f36812bbd6f89773c
> > findrenames: Replace current similarity detection algorithm with an O(n + m) algorithm.
[...]

> Quick review below.
> 
> > diff --git a/mercurial/hash.py b/mercurial/hash.py
> > new file mode 100644
> > --- /dev/null
> > +++ b/mercurial/hash.py
> > @@ -0,0 +1,80 @@
> > +# hash.py - misc hashing functionality
> > +#
> > +# Copyright 2010 David Greenaway <hg-dev at davidgreenaway.com>
> > +#
> > +# This software may be used and distributed according to the terms of the
> > +# GNU General Public License version 2, incorporated herein by reference.
> 
> GPLv2+

Ack. My bad for starting development on old sources, and merging up.

[...]
> > +    #
> > +    # There are a few catches:
> > +    #
> > +    # 1. We assume that if hash(x) == hash(y), then x == y.
> > +    #    This assumption will not always be true. We rely on false
> > +    #    matches being uncommon. Two unrelated files having one
> > +    #    or two collisions should not matter; if unrelated files
> > +    #    have many false hashes, incorrect matchings may ensue.
> 
> If you keep the sha1 check from the earlier findrename version, this
> won't trigger.

This part of the algorithm needs to use 32-bit hashes, which are pretty
much guaranteed to collide at some point for larger repositories. Files
are only detected as matches together if they have a high number of
matches, though, so single collisions do not matter too much.

Patch 4 re-introduces SHA1 hashes for finding exact matches.


All your other comments are well pointed out. Thank you for the review.

Cheers,
David



More information about the Mercurial-devel mailing list