[PATCH 0 of 5] RFC: Implement O(n + m) algorithm for findrenames()

Benoit Boissinot benoit.boissinot at ens-lyon.org
Sun Mar 7 17:24:26 CST 2010


[added the list back in cc; was dropping it an accident?]

On Mon, Mar 08, 2010 at 12:08:43AM +0100, Mads Kiilerich wrote:
> Benoit Boissinot wrote, On 03/07/2010 07:12 PM:
> >On Sun, Mar 07, 2010 at 06:01:06PM +0100, Mads Kiilerich wrote:
> >>Another comment:
> >>How many files are unchanged and identical in 2.6.24 and 2.6.33?
> >>The linux code churn is impressive, but I think that very often in
> >>similar cases many of the files will be moved/renamed but not
> >>modified. Would it make sense to make a first cheap scan only
> >>looking for 100% match? Perhaps by using the cheap "len" as a coarse
> >>hash value? How much improvement will this give in your example?
> >
> >Unmodified files are already handled cheaply in the current code
> >(sha1 match, which in most cases doesn't even need to retrieve the
> >data from the filelog; see fctx.cmp()).
> 
> FWIW it is not obvious to me where that optimization is implemented
> and how efficient it is.

filelog.cmp() (in filelog.py), which first tests whether the revision
is a rename; if not, it calls revlog.cmp(), which compares the sha1s.
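
For reference, the logic is roughly this (paraphrased from filelog.py,
so treat it as a sketch rather than a verbatim quote; it assumes
Mercurial's revlog module is importable as revlog):

    class filelog(revlog.revlog):
        def cmp(self, node, text):
            """Return True if text differs from the revision at node."""
            # Renames embed copy metadata in the stored filelog data,
            # so the nodeid no longer matches a hash of the plain
            # content: read the data and compare it directly.
            if self.renamed(node):
                return self.read(node) != text
            # Otherwise revlog.cmp() can answer from the sha1 alone,
            # without retrieving the revision data.
            return revlog.revlog.cmp(self, node, text)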
> 
> I had convinced myself that we always include the ancestry in the
> calculated and stored sha1 value, but I'm glad to learn that I'm
> wrong and that we also have the hash of the revision's content
> available.

We don't have a content-only hash, but we don't need one (see revlog.py).
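
To spell it out: the nodeid stored in the revlog is a sha1 over the
sorted parent nodeids followed by the revision text, so revlog.cmp()
can recompute that hash from the candidate text and compare nodeids,
with no separate content-only hash needed. A self-contained sketch
(function names are mine, not Mercurial's):

    import hashlib

    def node_hash(text, p1, p2):
        """Mercurial-style nodeid: sha1 over the sorted parent nodeids
        followed by the revision text (see revlog.py)."""
        s = hashlib.sha1(min(p1, p2) + max(p1, p2))
        s.update(text)
        return s.digest()

    def revlog_cmp(stored_node, text, p1, p2):
        """Return True if text differs from the stored revision:
        recompute the hash and compare nodeids, no data retrieval."""
        return node_hash(text, p1, p2) != stored_node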
> 
> But still, O(n*m) isn't good - especially for n=m=10000. The more
> real-world cases we can change to O(n+m) the better. David's patches
> seem to do that quite well in all cases - at the cost of increased
> code complexity and high but linear computational complexity. We
> should consider how much can be achieved at a lower cost.

Indeed, especially if we want to use it in more places.
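
For the exact-match part at least, O(n + m) is straightforward with a
hash table; a minimal sketch (illustrative only: the names and the
read() helper are mine, and David's patches also handle partial
matches):

    import hashlib

    def exact_matches(removed, added, read):
        """Pair each added file with an identical removed file in
        O(n + m): index removed files by content hash, then do one
        dict lookup per added file instead of comparing against all
        n removed files."""
        by_hash = {}
        for name in removed:
            h = hashlib.sha1(read(name)).digest()
            by_hash.setdefault(h, []).append(name)
        matches = {}
        for name in added:
            sources = by_hash.get(hashlib.sha1(read(name)).digest())
            if sources:
                matches[name] = sources[0]
        return matches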


Benoit

-- 
:wq

