Good riddance to the lazy index!

Tue Jan 18 01:08:33 CST 2011

On Jan 18, 2011 7:32 AM, "Matt Mackall" <mpm at selenic.com> wrote:
>
> On Sat, 2011-01-15 at 17:12 +0100, Benoit Boissinot wrote:
''''
> > As you said on IRC, this makes verify a lot slower (same for hgtk,
> > probably for the same reasons).
> > You said you fixed it. I assume it's not yet pushed, but does it fix
> > it for just verify (for example by forcing a preload of the nodemap)
> > or is it more generic and it will work for hgtk.
>
> Here's some theory.
>
> Consider a repository with N changesets. There are two important
> patterns of operations we can have per Mercurial invocation:
>
> a) look up a small number of changesets, generally near the tip (let's
> call this distance M, with M usually << N)
>
> b) look up a significant fraction of all changesets (ie operations that
> walk history)
>
> And there are several strategies we can use (with their related costs):
>
> 1) build a nodemap or similar structure at startup with cost O(N log N)
>   a) effectively O(N log N) cost for looking up single revisions, ouch!
>   b) O(N log N) cost for visiting all revisions, pretty good
>
> 2) build a nodemap only when needed, linear otherwise
>   a) O(M) lookup cost, and M is small - awesome!
>   b) O(N log N) cost here - good, but we need to explicitly remember to
> build the map wherever we might need it (ouch!)
>
> 3) progressively build the nodemap as we do our linear scan
>   a) O(M log M) cost for individual lookups - slower than 2a
>   b) O(N log N) cost for bulk lookup - slower than 1b and 2b
>
> The code currently in tip is strategy (2), and we're missing a couple
> hints that we need to build the map. Strategy (3) is pretty good, but
> there's a fairly large constant factor difference between (2a) and (3a)
> because Python inner loops are slow.
>
> Note that we're limited to no better than O(N log N) with Python because
> hash tables are dynamically reallocated O(log N) times as they grow, and
> there's no good way to pre-size them.
>
> In an ideal world, we could do:
>
> 4) progressively fill in a full-sized hash table
>   a) O(M) single lookup
>   b) O(N) bulk lookup
>
> That could be done by replacing the list object we use for an index with
> a C extension with on-demand parsing and searching, which is what I'm
> currently researching. It'd probably give us an order of magnitude speed
> improvement for (a) and (b), as well as dropping the index parsing time
> substantially.
>
> In the mean time, I'll probably push a quick implementation of (3) to
> avoid the current trouble.
>
FYI I tried 3 during the sprint and it was quite painful since you can't
just override __getitem__ but also del, get, etc.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://selenic.com/pipermail/mercurial-devel/attachments/20110118/e5834e7d/attachment.htm>