Good riddance to the lazy index!

Matt Mackall mpm at selenic.com
Tue Jan 18 00:32:36 CST 2011


On Sat, 2011-01-15 at 17:12 +0100, Benoit Boissinot wrote:
> On Wed, Jan 12, 2011 at 6:55 PM, Matt Mackall <mpm at selenic.com> wrote:
> > ** You'll need to rebuild your extensions **
> >
> > I've just pushed some changes that get rid of the lazy index (that ugly
> > mass of code at the start of revlog.py) and delay building the nodemap.
> > As part of this, I've made searching for single nodes use a linear
> > search back from tip when the nodemap isn't available. This gives a
> > substantial improvement on a repo with 220k revisions:
> >
> >                  old    new
> > tip              .647   .519
> > parents          .726   .566   <- very close to tip
> > log -r ab6eb0   2.108   .650   <- 70k revisions away from tip
> > log -r c0ffee   2.069   .773   <- non-existent revision
> > perfindex        .584   .266   <- building the nodemap is expensive
> 
> As you said on IRC, this makes verify a lot slower (same for hgtk,
> probably for the same reasons).
> You said you fixed it. I assume it's not yet pushed, but does it fix
> just verify (for example by forcing a preload of the nodemap), or is
> it more generic so that it will also work for hgtk?

Here's some theory.

Consider a repository with N changesets. There are two important
patterns of operations we can have per Mercurial invocation:

a) look up a small number of changesets, generally near the tip (let's
call this distance M, with M usually << N)

b) look up a significant fraction of all changesets (i.e. operations
that walk history)

And there are several strategies we can use (with their related costs);
a rough sketch of (2) and (3) in code follows the list:

1) build a nodemap or similar structure at startup with cost O(N log N)
   a) effectively O(N log N) cost for looking up single revisions, ouch!
   b) O(N log N) cost for visiting all revisions, pretty good

2) build a nodemap only when needed, linear otherwise
   a) O(M) lookup cost, and M is small - awesome!
   b) O(N log N) cost here - good, but we need to explicitly remember
      to build the map wherever we might need it (ouch!)

3) progressively build the nodemap as we do our linear scan
   a) O(M log M) cost for individual lookups - slower than 2a
   b) O(N log N) cost for bulk lookup - slower than 1b and 2b
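
Here's a minimal sketch of (2) and (3), assuming the index is a plain
list whose entries carry the node in their last field (roughly revlog's
layout); the names (LazyNodeMap, rev_for_node, buildmap) are made up
for illustration and aren't the real revlog API:

class LazyNodeMap(object):
    def __init__(self, index):
        self._index = index   # index[rev][-1] is the node for rev
        self._map = {}        # node -> rev, filled as we scan
        self._scanned = 0     # revs cached so far, counting back from tip

    def rev_for_node(self, node):
        if node in self._map:          # cached by an earlier scan
            return self._map[node]
        # Strategy (3): scan backwards from tip, caching as we go. A lookup
        # M revisions from tip costs O(M), and anything at or above that
        # point is a plain dict hit afterwards. Strategy (2) is this same
        # scan with the caching removed.
        n = len(self._index)
        for rev in range(n - 1 - self._scanned, -1, -1):
            cand = self._index[rev][-1]
            self._map[cand] = rev
            self._scanned = n - rev
            if cand == node:
                return rev
        raise LookupError(node)

    def buildmap(self):
        # Strategy (2)'s explicit hint: build the whole map up front,
        # before a bulk operation walks most of history.
        for rev, entry in enumerate(self._index):
            self._map[entry[-1]] = rev
        self._scanned = len(self._index)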

The code currently in tip is strategy (2), and we're missing a couple
of hints that we need to build the map. Strategy (3) is pretty good, but
there's a fairly large constant-factor difference between (2a) and (3a)
because Python inner loops are slow.
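
Concretely, the missing hints are just calls like this before any bulk
walk (using the illustrative class above, not the real API - the actual
fix lands in revlog's callers such as verify):

# What verify-style consumers must remember to do under strategy (2):
# build the map once, so each lookup is a dict hit rather than a fresh
# linear scan from tip.
nodemap = LazyNodeMap(index)
nodemap.buildmap()
for rev in range(len(index)):
    assert nodemap.rev_for_node(index[rev][-1]) == rev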

Note that we're limited to no better than O(N log N) with Python because
hash tables are dynamically reallocated O(log N) times as they grow, and
there's no good way to pre-size them.
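
As a rough illustration of the resizing claim (exact counts depend on
the CPython version), you can watch a dict reallocate as it grows:

import sys

d = {}
last = sys.getsizeof(d)
resizes = 0
for i in range(1000000):
    d[i] = i
    size = sys.getsizeof(d)
    if size != last:   # the internal table was reallocated
        resizes += 1
        last = size
# Prints a small number, roughly logarithmic in the final size; each
# resize copies every entry inserted so far.
print("resizes while inserting 1M keys: %d" % resizes)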

In an ideal world, we could do:

4) progressively fill in a full-sized hash table
   a) O(M) single lookup
   b) O(N) bulk lookup

That could be done by replacing the list object we use for an index
with a C extension that does on-demand parsing and searching, which is
what I'm currently researching. It'd probably give us an order of
magnitude speed improvement for (a) and (b), as well as dropping the
index parsing time substantially.
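
For illustration only, here's the shape of (4) in Python - a table
sized up front from the known revision count and filled progressively,
so it never reallocates. The class and method names are invented, and
the real payoff only comes from doing this in C over the raw index
data:

class PresizedNodeMap(object):
    def __init__(self, nrevs):
        # Size the table once, with slack, so it never needs to grow.
        size = 1
        while size < nrevs * 2:
            size *= 2
        self._mask = size - 1
        self._nodes = [None] * size
        self._revs = [0] * size

    def insert(self, node, rev):
        # Linear probing from the node's hash slot; the table always has
        # free slots because it was sized for nrevs up front.
        i = hash(node) & self._mask
        while self._nodes[i] is not None:
            i = (i + 1) & self._mask
        self._nodes[i] = node
        self._revs[i] = rev

    def lookup(self, node):
        i = hash(node) & self._mask
        while self._nodes[i] is not None:
            if self._nodes[i] == node:
                return self._revs[i]
            i = (i + 1) & self._mask
        return None   # not inserted (yet)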

In the meantime, I'll probably push a quick implementation of (3) to
avoid the current trouble.

-- 
Mathematics is the supreme nostalgia of our time.



