auto rename: best matches and speed improvement

Sat Sep 27 18:47:31 CDT 2008

Matt Mackall wrote:
> On Sat, 2008-08-16 at 02:16 +0200, Herbert Griebel wrote:
>> What really is lacking is the quality of the diff. I have files with almost no similarity
>> which get a score of almost 50%. I will take a look at that.
> 
> I think this is simply taking the size of a delta as the measure of
> similarity. And that effectively doesn't count deletions. So if the
> delta says "delete whole file, replace with smaller one", the ratio of
> delta to original file might not be too bad. Of course, this should
> actually score 0%.

I checked the binary diff algorithm to see why the similarity values
are wrong and found the reason: the calculation of the score was
faulty. It is

        score = equal*2.0 / (len(aa) + len(rr))

but should be

        score = float(equal) / max([len(aa), len(rr)])

(Variable "equal" is the number of equal bytes).

The old formula only gave correct results when the two files had the
same size, as in moves; otherwise it produced much larger, incorrect
similarities.
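
To make the difference concrete, here is a minimal sketch of both
formulas side by side. It is not the code from the patch: the function
names and the parameters `equal`, `len_a` and `len_r` are made up for
illustration (they stand for the number of equal bytes, len(aa) and
len(rr)), and the guard for empty files is my own assumption.

```python
# Sketch comparing the two similarity formulas from this message.
# All names here are illustrative; they are not taken from the patch.


def old_score(equal, len_a, len_r):
    """Old formula: equal bytes relative to the average of both sizes."""
    total = len_a + len_r
    if total == 0:  # assumption: two empty files score 0.0
        return 0.0
    return equal * 2.0 / total


def new_score(equal, len_a, len_r):
    """Corrected formula: equal bytes relative to the larger file."""
    largest = max(len_a, len_r)
    if largest == 0:  # assumption: two empty files score 0.0
        return 0.0
    return float(equal) / largest


if __name__ == "__main__":
    # Same size, e.g. a moved file with small edits: both formulas agree.
    print(old_score(90, 100, 100), new_score(90, 100, 100))
    # Very different sizes: the old formula reports a higher similarity.
    print(old_score(10, 100, 10), new_score(10, 100, 10))
```

With 10 equal bytes between a 100-byte file and a 10-byte file, the old
formula gives about 0.18 while the corrected one gives 0.10. For files
of equal size the two formulas are identical, which is why moves were
scored correctly before.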

This is also fixed in my latest patch.