Good riddance to the lazy index!

Tue Jan 18 09:34:24 CST 2011

On Tue, Jan 18, 2011 at 01:32, Matt Mackall <mpm at selenic.com> wrote:

> On Sat, 2011-01-15 at 17:12 +0100, Benoit Boissinot wrote:
> > On Wed, Jan 12, 2011 at 6:55 PM, Matt Mackall <mpm at selenic.com> wrote:
> > > ** You'll need to rebuild your extensions **
> > >
> > > I've just pushed some changes that get rid of the lazy index (that ugly
> > > mass of code at the start of revlog.py) and delay building the nodemap.
> > > As part of this, I've made searching for single nodes use a linear
> > > search back from tip when the nodemap isn't available. This gives a
> > > substantial improvement on a repo with 220k revisions:
> > >
> > >                  old    new
> > > tip              .647   .519
> > > parents          .726   .566   <- very close to tip
> > > log -r ab6eb0   2.108   .650   <- 70k revisions away from tip
> > > log -r c0ffee   2.069   .773   <- non-existent revision
> > > perfindex        .584   .266   <- building the nodemap is expensive
> >
> > As you said on IRC, this makes verify a lot slower (same for hgtk,
> > probably for the same reasons).
> > You said you fixed it. I assume it's not yet pushed, but does it fix
> > it for just verify (for example by forcing a preload of the nodemap)
> > or is it more generic and it will work for hgtk.
>
> Here's some theory.
>
> Consider a repository with N changesets. There are two important
> patterns of operations we can have per Mercurial invocation:
>
> a) look up a small number of changesets, generally near the tip (let's
> call this distance M, with M usually << N)
>
> b) look up a significant fraction of all changesets (ie operations that
> walk history)
>
> And there are several strategies we can use (with their related costs):
>
> 1) build a nodemap or similar structure at startup with cost O(N log N)
>   a) effectively O(N log N) cost for looking up single revisions, ouch!
>   b) O(N log N) cost for visiting all revisions, pretty good
>
> 2) build a nodemap only when needed, linear otherwise
>   a) O(M) lookup cost, and M is small - awesome!
>   b) O(N log N) cost here - good, but we need to explicitly remember to
> build the map wherever we might need it (ouch!)
>
> 3) progressively build the nodemap as we do our linear scan
>   a) O(M log M) cost for individual lookups - slower than 2a
>   b) O(N log N) cost for bulk lookup - slower than 1b and 2b
>
> The code currently in tip is strategy (2), and we're missing a couple
> hints that we need to build the map. Strategy (3) is pretty good, but
> there's a fairly large constant factor difference between (2a) and (3a)
> because Python inner loops are slow.
>
> Note that we're limited to no better than O(N log N) with Python because
> hash tables are dynamically reallocated O(log N) times as they grow, and
> there's no good way to pre-size them.

Although pre-sizing the tables would certainly be faster, the dynamic
allocation does not make them O(N log N). Each of the O(log N) allocations
copies about twice as much as the one before it, so the total amount of
copying is O(N). This means the asymptotic behavior of (2), (3), and (4)
should be the same. Whether the constant factor is appreciable is still to
be determined.

-Eric Eisner

In an ideal world, we could do:
>
> 4) progressively fill in a full-sized hash table
>   a) O(M) single lookup
>   b) O(N) bulk lookup
>
> That could be done by replacing the list object we use for an index with
> a C extension with on-demand parsing and searching, which is what I'm
> currently researching. It'd probably give us an order of magnitude speed
> improvement for (a) and (b), as well as dropping the index parsing time
> substantially.
>
> In the mean time, I'll probably push a quick implementation of (3) to
> avoid the current trouble.
>
> --
> Mathematics is the supreme nostalgia of our time.
>
>
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel at selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://selenic.com/pipermail/mercurial-devel/attachments/20110118/37f61390/attachment.htm>