[PATCH] auto rename: best matches and speed improvement UPDATE4

Fri Aug 15 20:54:01 CDT 2008

On Sat, 2008-08-16 at 02:16 +0200, Herbert Griebel wrote:
> Matt Mackall wrote:
> > On Sat, 2008-08-16 at 01:48 +0200, Herbert Griebel wrote:
> >> # HG changeset patch
> >> # User Herbert Griebel <herbertg at gmx.at>
> >> # Date 1218843942 -7200
> >> # Node ID 6cf73b7c3671f0737e25c72d45fa3ca805426df6
> >> # Parent  438e02b4be73ffd80d4a6cc45b7c9273d6b26604
> >> auto rename: best matches and speed improvement
> >>
> >> The auto rename feature of addremove did not rename to the best match.
> >> The old implementation looped over "added files" and selected the best
> >> match for each individually. The next "added file" could also have its
> >> best match with the same removed file, but which is far lower than the
> >> one before. Therefore the last "added file" wins which may not be the
> >> best.
> >> The new implementation fixes this. The rename list is also sorted in
> >> reverse order to make it easier to find the wrong matches with low
> >> similarity. For speed improvements the loops are dynamically
> >> interchanged which can speed up the matching considerably.
> >> There is also a very rough estimation of the matching time, and
> >> output has been enhanced to give the user feedback about the
> >> progress.
> > 
> > Thanks for looking into this, Herbert.
> > 
> > First off, this will want a test case.
> ok, this will take me some time, since I have no Linux currently.
> 
> > 
> > Second, I get the impression that the time complexity here is going from
> > O(n) to O(n**2), is that right?
> 
> The complexity is the same, only all matches are now checked against each other.

If I do 1000 renames, I end up doing 1000**2 comparisons, right? This
was probably already the case with the original algorithm, but it's
still a worry.

> For this there is an extra loop at the end of findrenames which is fast.
> Speed has improved because the loop order is exchanged. There are also
> no more multiple occurances in the list.
> 
> What really is lacking is the quality of the diff. I have files with almost no similarity
> which get a score of almost 50%. I will take a look at that.

I think this is simply taking the size of a delta as the measure of
similarity. And that effectively doesn't count deletions. So if the
delta says "delete whole file, replace with smaller one", the ratio of
delta to original file might not be too bad. Of course, this should
actually score 0%.

-- 
Mathematics is the supreme nostalgia of our time.