[PATCH 1 of 2] parsers: incrementally parse the revlog index in C

Tue Apr 3 16:15:38 CDT 2012

On Tue, 2012-04-03 at 14:03 -0700, Bryan O'Sullivan wrote:
> On Thu, Mar 29, 2012 at 12:51 PM, Matt Mackall <mpm at selenic.com> wrote:
> 
> > +index_ass_subscript(indexObject* self, PyObject* item, PyObject* value)
> >
> > I'm not sure what that abbreviation means here. Perhaps something like
> > this? http://t.co/4CQiLKim
> 
> 
> Almost. That's Pythonese for the slot that handles assignment on dict-like
> types.
> 
> 
> > I'm afraid this isn't actually backward compatible. If I update to this
> > changeset without recompiling, I get an exception:
> >
> >  ValueError: need more than 2 values to unpack
> >
> 
> I'll fix that. I'll have to find some other way to indicate that the index
> is a fancy lazyish one.
> 
> Is it really the case that the API for a C extension is so inflexible?
> That's rather unfortunate.
> 
> So I think there are two ways we can go in the future to make this
> > faster:
> >
> > - cache each entry we build
> >
> 
> That's easily enough done. I'll calloc a big array and see how caching goes.
> 
> 
> > - treat the index as a tuple of lists rather than a list of tuples
> >
> > The latter idea involves two observations. First, an operation like
> > heads() only uses two elements out of each retrieved entry, so most of
> > the work building the entry tuple is wasted. Second, having a separate
> > list per property does away with one tuple allocation per entry as well
> > as all the extra overhead of parsing a value description.
> 
> 
> I think I follow you, but it took me a little while, so let me parrot back
> what I think you said, and you can tell me if we're on the same page.
> Actually, now that I write this, I don't think it's what you said, but
> maybe I like it better.
> 
> Instead of using Py_BuildValue to create each element, we introduce another
> extension type that tracks a rev without building a tuple, and lazily
> returns the elements currently contained in the tuple (perhaps caching them
> in case of reuse?).

No, I'm suggesting something a bit more fundamental. Instead of this
data structure:

  [(x1, x2...), (y1, y2...)] # an N-element list of 8-tuples

..we want this one:

  ([x1, y1...], [x2, y2...]) # an 8-tuple of N-element lists

But again, that's for the future. Let's wrap up the existing patch
before adding any further performance tweaks.

-- 
Mathematics is the supreme nostalgia of our time.