[PATCH 2 of 5] findrenames: Optimise "addremove -s100" by matching files by their SHA1 hashes

Mads Kiilerich mads at kiilerich.com
Sun Mar 21 15:21:17 CDT 2010


David Greenaway wrote, On 03/21/2010 08:33 AM:
> # HG changeset patch
> # User David Greenaway<hg-dev at davidgreenaway.com>
> # Date 1269151891 -39600
> # Node ID 153c602f7b20c95484033d2b91d05171f62f91b2
> # Parent  8bf2c8f1b35705f4f5b615aaf954a3cb034c5dc0
> findrenames: Optimise "addremove -s100" by matching files by their SHA1 hashes.
>    
...
> +def findrenames(repo, added, removed, threshold):
> +    '''find renamed files -- yields (before, after, score) tuples'''
> +    commitedctx = repo['.']
> +    workingctx = repo[None]
>
> +    # Fetch contexts for added and removed files.
> +    addedfiles = [workingctx[fp] for fp in added]
> +    removedfiles = [commitedctx[fp] for fp in removed if fp in commitedctx]
> +
> +    #
> +    # Zero length files will be frequently unrelated to each other, and
> +    # tracking the deletion/addition of such a file will probably cause more
> +    # harm than good. We strip them out here to avoid matching them later on.
> +    #
>    

(The Mercurial style seems to be compact code without blank lines - 
especially in comments.)

> +    addedfiles = [x for x in addedfiles if x.size()>  0]
> +    removedfiles = [x for x in removedfiles if x.size()>  0]
>    

Why not do this while building the "sets" in the first place?

> +    # Find exact matches.
> +    handledfiles = set()
> +    for (a, b) in _findexactmatches(repo, addedfiles, removedfiles):
> +        handledfiles.add(b)
> +        yield (a, b, 1.0)
>    

Shouldn't "a" be "removed" from removedfiles as well?

Perhaps it would be simpler to let addedfiles and removedfiles be sets 
from the beginning and just remove handled files here. (The sets could 
perhaps be converted to (non-mutated) lists in the call to 
_findexactmatches.

> +    # If the user requested similar files to be matched, search for
> +    # them also.
> +    if threshold<  1.0:
> +        # Remove files already discovered.
> +        addedfiles = [x for x in addedfiles if x.path() not in handledfiles]
> +
> +        # Find partial matches.
> +        for (a, b, score) in _findsimilarmatches(repo,
> +                addedfiles, removedfiles, threshold):
> +            yield (a, b, score)
>    

/Mads


More information about the Mercurial-devel mailing list