[PATCH 2 of 5] findrenames: Optimise "addremove -s100" by matching files by their SHA1 hashes

David Greenaway hg-dev at davidgreenaway.com
Sun Mar 21 21:50:13 CDT 2010


Mads Kiilerich wrote:
> David Greenaway wrote, On 03/21/2010 08:33 AM:
> ># HG changeset patch
> ># User David Greenaway<hg-dev at davidgreenaway.com>
> ># Date 1269151891 -39600
> ># Node ID 153c602f7b20c95484033d2b91d05171f62f91b2
> ># Parent  8bf2c8f1b35705f4f5b615aaf954a3cb034c5dc0
> >findrenames: Optimise "addremove -s100" by matching files by their SHA1 hashes.
> ...
> >+def findrenames(repo, added, removed, threshold):
> >+    '''find renamed files -- yields (before, after, score) tuples'''
> >+    commitedctx = repo['.']
> >+    workingctx = repo[None]
> >
> >+    # Fetch contexts for added and removed files.
> >+    addedfiles = [workingctx[fp] for fp in added]
> >+    removedfiles = [commitedctx[fp] for fp in removed if fp in commitedctx]
> >+
> >+    #
> >+    # Zero length files will be frequently unrelated to each other, and
> >+    # tracking the deletion/addition of such a file will probably cause more
> >+    # harm than good. We strip them out here to avoid matching them later on.
> >+    #
> 
> (The Mercurial style seems to be compact code without blank lines -
> especially in comments.)
> 
> >+    addedfiles = [x for x in addedfiles if x.size()>  0]
> >+    removedfiles = [x for x in removedfiles if x.size()>  0]
> 
> Why not do this while building the "sets" in the first place?
> 
> >+    # Find exact matches.
> >+    handledfiles = set()
> >+    for (a, b) in _findexactmatches(repo, addedfiles, removedfiles):
> >+        handledfiles.add(b)
> >+        yield (a, b, 1.0)
> 
> Shouldn't "a" be "removed" from removedfiles as well?

The findrenames() code (both before these patches and after) allows
multiple "new" files to be copied from the same "old" file.

Your other comments are quite correct, and I will look into fixing them.

Thanks for the review,
David



More information about the Mercurial-devel mailing list