[PATCH] auto rename: best matches and speed improvement UPDATE4

Sun Aug 17 07:36:43 CDT 2008

Matt Mackall wrote:
> On Sat, 2008-08-16 at 11:24 +0200, Herbert Griebel wrote:
>> Matt Mackall wrote:
>>> On Sat, 2008-08-16 at 02:16 +0200, Herbert Griebel wrote:
>>>> Matt Mackall wrote:
>>>>> Thanks for looking into this, Herbert.
>>
>>  - if the file sizes differ more than the similarity threshold,
>>    don't even read the files.
> 
> This can be extended so that files of similar sizes are compared first.

Exactly. I did that, so the files are now sorted by size: big files
first, to get rid of them as early as possible (100% matches are removed
from the list and never checked again), because loading is by far the most
expensive step. Then I search for moves, i.e. I move files with the same name
and size to the beginning of the list, again big files first. This, together
with swapping the loops, gives me around a 60% speed improvement on average.
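
To illustrate the size check and the ordering, here is a minimal sketch. It
assumes the similarity score has the form 2 * matched_bytes / (size_a + size_b),
so the sizes alone give an upper bound on the score. The function names are
hypothetical and not taken from the patch.

```python
def max_possible_similarity(size_a, size_b):
    """Upper bound on the similarity of two files, from their sizes alone.

    Assumption: score = 2 * matched_bytes / (size_a + size_b), and
    matched_bytes can never exceed the size of the smaller file.
    """
    total = size_a + size_b
    if total == 0:
        return 1.0  # two empty files are identical
    return 2.0 * min(size_a, size_b) / total


def could_match(size_a, size_b, threshold):
    """Return False if the pair cannot reach the threshold, so neither
    file needs to be read for this comparison."""
    return max_possible_similarity(size_a, size_b) >= threshold


def order_candidates(files):
    """Sort (name, size) tuples so that the biggest files come first."""
    return sorted(files, key=lambda f: f[1], reverse=True)
```

With a 75% threshold, a 100-byte file and a 50-byte file can score at most
about 67%, so the pair is skipped without reading either file.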

> And if we have better than xx% match, we needn't compare files more than
> xx% different in size.

The details are tricky. Example: an added file has the best score for
two removed files. The larger score wins, so I need the second-best score
for the other removed file. Taking only the best match is therefore
not enough, because the best match can be lost, and the second-best score
could be exactly the one we did not compare. What you need are
deferred compares, which are done only when needed. I hope this is not too
confusing. I will give it a try and implement this as well, although the search
is very fast already (~5X).
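
One possible way to do deferred compares, as a rough sketch only: keep every
candidate pair in a priority queue keyed by a cheap optimistic score, and run
the expensive compare only when a pair reaches the front of the queue. This is
my reading of the idea, not the implementation in the patch. The names
match_renames, upper_bound and compare are hypothetical, and upper_bound is
assumed never to be lower than the real score.

```python
import heapq


def match_renames(removed, added, upper_bound, compare, threshold):
    """Greedy one-to-one matching with deferred comparisons.

    upper_bound(r, a): cheap optimistic score (e.g. derived from sizes).
    compare(r, a):     expensive real similarity in [0, 1].
    Returns {removed_file: (added_file, score)}.
    """
    heap = []
    for r in removed:
        for a in added:
            bound = upper_bound(r, a)
            if bound >= threshold:
                # heapq is a min-heap, so scores are negated.
                # The flag is False while the score is only a bound.
                heapq.heappush(heap, (-bound, False, r, a))

    matches = {}
    used_added = set()
    while heap:
        neg_score, exact, r, a = heapq.heappop(heap)
        if r in matches or a in used_added:
            continue  # one side is already paired: this compare is skipped
        if not exact:
            score = compare(r, a)  # the deferred compare happens only here
            if score >= threshold:
                heapq.heappush(heap, (-score, True, r, a))
            continue
        # An exact score at the front beats every remaining bound or score.
        matches[r] = (a, -neg_score)
        used_added.add(a)
    return matches
```

In the example above, if the added file scores 0.95 against one removed file
and 0.9 against another, the 0.95 pair wins and the other removed file falls
back to its second-best candidate, which is compared only if it is still
unpaired at that point.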

Loading is the most expensive part, so getting the file size first is another
huge boost. I have also added lazy loading of the outer-loop file,
so if no better match is possible because of the sizes, or the file extension
does not match, the file is not loaded at all.
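
A minimal sketch of what lazy loading could look like, assuming plain files on
disk. The class name LazyFile is hypothetical and not from the patch: the size
comes from a stat call, and the contents are read only on first access.

```python
import os


class LazyFile:
    """File whose size is known up front but whose data is read on demand."""

    def __init__(self, path):
        self.path = path
        self.size = os.path.getsize(path)  # stat only, contents are not read
        self._data = None

    @property
    def loaded(self):
        """True once the contents have been read from disk."""
        return self._data is not None

    @property
    def data(self):
        """Read the file on first access and cache the bytes."""
        if self._data is None:
            with open(self.path, "rb") as f:
                self._data = f.read()
        return self._data
```

If every comparison for a file is ruled out by size or extension, its data
property is never touched and the file is never read.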

> 
>>  - take file pathnames into account:
>>    - a *.cpp file will never get a *.bmp file
>>    - it is unlikely that a binary file will get an ascii file
> 
> We could try to compare things with similar names or matching extensions
> first. But Mercurial intentionally knows as little as possible about the
> meaning of file names and whether or not things are 'binary'.

Not searching for file extension renames is the major boost: over 90% speed
gain with mixed files. My impression is that file extension renames are very
rare, so for almost all searches it would be OK not to check for them.
This is why I would like to make this the default. Any objections?
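
For illustration, the extension check could be sketched roughly like this. The
function names and the check_ext_renames flag are hypothetical, and the
comparison is assumed to be case-sensitive.

```python
import os


def extension(path):
    """Return the file extension including the dot, or '' if there is none."""
    return os.path.splitext(path)[1]


def candidate_pairs(removed, added, check_ext_renames=False):
    """Yield the (removed, added) pairs that are worth comparing.

    By default, pairs with different extensions are skipped, so neither
    file has to be loaded for that pair.
    """
    for r in removed:
        for a in added:
            if check_ext_renames or extension(r) == extension(a):
                yield r, a
```

With two removed and two added files, half of them *.cpp and half *.bmp, this
leaves 2 pairs to compare instead of 4. Passing check_ext_renames=True restores
the full search.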