[PATCH 1 of 5] findrenames: Separate repository access commands from similarity algorithm

Sun Mar 7 11:44:09 CST 2010

On Sun, Mar 07, 2010 at 04:12:48AM -0000, David Greenaway wrote:
> # HG changeset patch
> # User David Greenaway <hg-dev at davidgreenaway.com>
> # Date 1267934964 -39600
> # Node ID 10649eca0e852b7f229e392f36812bbd6f89773c
> # Parent  033d2fdc3b9d3e33fd33d45109aafdb4a5cb3273
> findrenames: Separate repository access commands from similarity algorithm.
> 
> The current 'findrenames' function mixes concerns of retrieving data from the
> repository with actually computing similarity between old and new files.
> This patch splits out data retrieval back into addremove(), leaving the
> pure similarity detection algorithm in findrenames().

I'm not sure this is the way to go. If you want to separate out the
similarity algorithm, just create a new function (maybe in context.py?).
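
To make that concrete, here is a rough sketch of a standalone similarity
function. The name similarity() is hypothetical, and difflib from the Python
standard library is swapped in for the diff code that findrenames() actually
uses, so treat it as an illustration of the shape only, not as the real
algorithm.

```python
import difflib


def similarity(old, new):
    """Return a score in [0.0, 1.0] for how much of 'old' survives in 'new'.

    Hypothetical helper: difflib stands in for the real diff machinery.
    """
    if not old or not new:
        return 0.0
    if old == new:
        return 1.0
    old_lines = old.splitlines(True)
    new_lines = new.splitlines(True)
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines,
                                      autojunk=False)
    # Count the characters on lines that appear unchanged in both texts.
    matched = 0
    for block in matcher.get_matching_blocks():
        for line in old_lines[block.a:block.a + block.size]:
            matched += len(line)
    return matched * 2.0 / (len(old) + len(new))
```

A function like this takes only the two texts, so it does not care how the
caller obtained them.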

> Upcoming changes will increase the complexity of findrenames(), making these
> changes desirable. Additionally, separating the two allows findrenames() to be
> used from callers in other contexts in the future.

I really don't think you should abstract data retrieval this way; the
caller should have contexts anyway.

cheers,

Benoit
> 
> diff --git a/mercurial/cmdutil.py b/mercurial/cmdutil.py
> --- a/mercurial/cmdutil.py
> +++ b/mercurial/cmdutil.py
> @@ -285,23 +285,26 @@
>  def matchfiles(repo, files):
>      return _match.exact(repo.root, repo.getcwd(), files)
>  
> -def findrenames(repo, added, removed, threshold):
> -    '''find renamed files -- yields (before, after, score) tuples'''
> +def findrenames(added, removed, threshold):
> +    """
> +    Given two lists of files, yield (source, destination, score) tuples of
> +    similar files.
> +
> +    The input 'added' and 'removed' lists should be lists of tuples containing
> +    (filename, function to retrieve file data). The retrieval functions will
> +    be given a single argument: the name of the file to retrieve.
> +    """
>      copies = {}
> -    ctx = repo['.']
> -    for r in removed:

Maybe just pass filectx objects in added/removed.

> -        if r not in ctx:
> -            continue
> -        fctx = ctx.filectx(r)
> +    for (r, r_data) in removed:
> +        orig = r_data(r)
>  
>          def score(text):
>              if not len(text):
>                  return 0.0
> -            if not fctx.cmp(text):
> +            if orig == text:

Then you can keep the optimized comparison (fctx.cmp) here
>                  return 1.0
>              if threshold == 1.0:
>                  return 0.0
> -            orig = fctx.data()

and lazily load the text
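
Roughly what I have in mind, as a sketch only. FakeFileCtx and make_score()
are hypothetical names used for illustration; the only things assumed about a
file context are the cmp() and data() methods seen in the diff above, with
cmp() returning true when the contents differ and being cheaper than data().
The difflib ratio is a placeholder for the real similarity computation.

```python
import difflib


class FakeFileCtx:
    """Hypothetical stand-in for a file context (cmp() and data() only)."""

    def __init__(self, text):
        self._text = text
        self.loads = 0  # how many times the full text was fetched

    def cmp(self, text):
        # Same convention as in the quoted diff: true when contents differ.
        return self._text != text

    def data(self):
        self.loads += 1
        return self._text


def make_score(fctx, threshold):
    """Build a score(text) function that fetches the old text only on demand."""
    cache = []

    def score(text):
        if not len(text):
            return 0.0
        if not fctx.cmp(text):
            return 1.0          # identical, no need to load anything
        if threshold == 1.0:
            return 0.0          # only exact matches wanted
        if not cache:
            cache.append(fctx.data())   # first real comparison: load once
        orig = cache[0]
        # Placeholder similarity measure, not the one findrenames() uses.
        return difflib.SequenceMatcher(None, orig, text).ratio()

    return score
```

With this shape an exact match or a threshold of 1.0 never touches data(), and
repeated comparisons against the same removed file load its text once.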

-- 
:wq