[PATCH] findrenames: sort files not by object id but by path for stable result

Ryan McElroy rm at fb.com
Wed Mar 22 12:14:01 EDT 2017


On 3/22/17 1:48 PM, Yuya Nishihara wrote:
> # HG changeset patch
> # User Yuya Nishihara <yuya at tcha.org>
> # Date 1426413536 -32400
> #      Sun Mar 15 18:58:56 2015 +0900
> # Node ID 425379e72e817b6c79e955173b015b9a0fe090e9
> # Parent  102f291807c92864a2231e5e925d6cd64783bb59
> findrenames: sort files not by object id but by path for stable result
>
> It seems the original implementation tried to sort added/removed files
> alphabetically, but actually it did sort fctx objects by address.
>
> This patch removes the use of set()s in order to preserve the order of
> added/removed files. addedfiles.remove() could be slightly slower than
> before, but it doesn't make much difference.
...except perhaps when a large number of files is added/removed, right? 
Why not remove from a set and re-sort afterwards, turning the possible 
O(n^2) into at worst O(n log n)? It doesn't seem like it would be much 
more work to me.
>
> benchmark (on tmpfs):
>    $ hg up -C 15afda349b11; hg purge --all; mv tests tests.new
>    $ LANG=C hg --time addremove -n > /dev/null
>
>    original:   real 0.420 secs (user 0.390+0.000 sys 0.030+0.000)
>    this patch: real 0.430 secs (user 0.390+0.000 sys 0.040+0.000)
I think the algorithm scales as O(n^2), so I'd be much more interested in a 
benchmark moving, say, 100k files than a move of 1 file that is dominated 
by startup time.
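A quick way to see whether the quadratic cost matters at scale, independent of hg startup time, is a micro-benchmark of the two removal strategies on synthetic file names (a sketch, not a real repository; N is scaled down here so the demo finishes quickly, and the gap widens as N grows):

```python
import timeit

N = 20_000
paths = ["file-%06d" % i for i in range(N)]
to_remove = paths[::2]            # pretend half the files found exact matches

def via_list():
    items = list(paths)
    for p in to_remove:
        items.remove(p)           # O(n) scan + shift per call
    return items

def via_set():
    items = set(paths)
    for p in to_remove:
        items.discard(p)          # O(1) amortized
    return sorted(items)

# The list version does on the order of N^2/4 element moves;
# the set version is near-linear plus one sort.
print("list:", timeit.timeit(via_list, number=1))
print("set: ", timeit.timeit(via_set, number=1))
```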

>
> diff --git a/mercurial/similar.py b/mercurial/similar.py
> --- a/mercurial/similar.py
> +++ b/mercurial/similar.py
> @@ -101,19 +101,18 @@ def findrenames(repo, added, removed, th
>       # Zero length files will be frequently unrelated to each other, and
>       # tracking the deletion/addition of such a file will probably cause more
>       # harm than good. We strip them out here to avoid matching them later on.
> -    addedfiles = set([workingctx[fp] for fp in added
> -            if workingctx[fp].size() > 0])
> -    removedfiles = set([parentctx[fp] for fp in removed
> -            if fp in parentctx and parentctx[fp].size() > 0])
> +    addedfiles = [workingctx[fp] for fp in sorted(added)
> +                  if workingctx[fp].size() > 0]
> +    removedfiles = [parentctx[fp] for fp in sorted(removed)
> +                    if fp in parentctx and parentctx[fp].size() > 0]
>   
>       # Find exact matches.
> -    for (a, b) in _findexactmatches(repo,
> -            sorted(addedfiles), sorted(removedfiles)):
> +    for (a, b) in _findexactmatches(repo, addedfiles[:], removedfiles):
Since you're making a copy here anyway, why not do the removes from a 
set and then re-sort the set into a list afterwards?

That should give us the "best of both worlds", right?

I'll drop this change request if a 100k-file move isn't noticeably 
slower, though.
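Concretely, the suggested shape could look something like the sketch below. This is a paraphrase of the loop structure, not the actual patch: the matcher is a stand-in for similar._findexactmatches, passed in as a parameter so the sketch is self-contained.

```python
def findrenames_sketch(added, removed, find_exact_matches):
    """Yield (removed, added) exact-match pairs; keep results path-sorted.

    ``find_exact_matches`` is a stand-in for similar._findexactmatches:
    it yields (removed_path, added_path) pairs.
    """
    addedfiles = sorted(added)
    removedfiles = sorted(removed)

    # Track survivors in a set so each removal is O(1)...
    remaining = set(addedfiles)
    for a, b in find_exact_matches(addedfiles, removedfiles):
        remaining.discard(b)
        yield a, b

    # ...then re-sort once before the similarity pass, so the order
    # stays stable and the total cost is O(n log n), not O(n^2).
    addedfiles = sorted(remaining)
    # _findsimilarmatches(addedfiles, removedfiles, threshold) would run here.
```

This mirrors the quoted patch but replaces the copied list and per-item list.remove() with one set plus one re-sort.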

>           addedfiles.remove(b)
>           yield (a.path(), b.path(), 1.0)
>   
>       # If the user requested similar files to be matched, search for them also.
>       if threshold < 1.0:
> -        for (a, b, score) in _findsimilarmatches(repo,
> -                sorted(addedfiles), sorted(removedfiles), threshold):
> +        for (a, b, score) in _findsimilarmatches(repo, addedfiles,
> +                                                 removedfiles, threshold):
>               yield (a.path(), b.path(), score)
> diff --git a/tests/test-addremove-similar.t b/tests/test-addremove-similar.t
> --- a/tests/test-addremove-similar.t
> +++ b/tests/test-addremove-similar.t
> @@ -55,6 +55,57 @@ comparing two empty files caused ZeroDiv
>   
>     $ hg commit -m B
>   
> +should be sorted by path for stable result
> +
> +  $ for i in `python $TESTDIR/seq.py 0 9`; do
> +  >     cp small-file $i
> +  > done
> +  $ rm small-file
> +  $ hg addremove
> +  adding 0
> +  adding 1
> +  adding 2
> +  adding 3
> +  adding 4
> +  adding 5
> +  adding 6
> +  adding 7
> +  adding 8
> +  adding 9
> +  removing small-file
> +  recording removal of small-file as rename to 0 (100% similar)
> +  recording removal of small-file as rename to 1 (100% similar)
> +  recording removal of small-file as rename to 2 (100% similar)
> +  recording removal of small-file as rename to 3 (100% similar)
> +  recording removal of small-file as rename to 4 (100% similar)
> +  recording removal of small-file as rename to 5 (100% similar)
> +  recording removal of small-file as rename to 6 (100% similar)
> +  recording removal of small-file as rename to 7 (100% similar)
> +  recording removal of small-file as rename to 8 (100% similar)
> +  recording removal of small-file as rename to 9 (100% similar)
> +  $ hg commit -m '10 same files'
> +
> +  $ cp 0 a
> +  $ for i in `python $TESTDIR/seq.py 0 9`; do
> +  >     echo $i >> $i
> +  > done
> +  $ hg commit -m 'make them slightly different'
> +  $ rm `python $TESTDIR/seq.py 0 9`
> +  $ hg addremove -s50
> +  removing 0
> +  removing 1
> +  removing 2
> +  removing 3
> +  removing 4
> +  removing 5
> +  removing 6
> +  removing 7
> +  removing 8
> +  removing 9
> +  adding a
> +  recording removal of 9 as rename to a (99% similar)
> +  $ hg commit -m 'always the same file should be selected'
> +
>   should all fail
>   
>     $ hg addremove -s foo


