Exploring a new data structure for the dirstate

Matt Mackall mpm at selenic.com
Fri Jan 4 18:41:45 CST 2013


On Fri, 2013-01-04 at 13:40 -0800, Siddharth Agarwal wrote:
> On 01/04/2013 12:08 PM, Matt Mackall wrote:
> > I'm pretty mixed on this. As it is, dict construction is pretty fast, as
> > you've already discovered. This despite having to grow the dict several
> > times! In essence, we're trading:
> >
> >    O(n) construction + O(n log n) sorted traversal + O(1) lookup
> >
> > for
> >
> >    O(n log n) construction + O(n) traversal + O(log n) lookup
> 
> The list of files in the dirstate is much smaller than the list of all 
> the possible values we can have for a given maximum string length. In 
> other words, log n << string length.

Yes, if I have 100-character strings, I'm probably going to have to
chase far less than 800 pointers. I am not reassured.

To force a max depth of 800, you only need 801 files. Start with
100 characters/name... and add names to the tree that each differ at a
different bit.
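
For instance (a hypothetical construction, just to make the
adversarial case concrete):

    # 801 names of 100 bytes (800 bits): a base name plus one variant
    # per bit position, each flipping exactly that bit. A critbit tree
    # over this set must test bit 0 at the root, bit 1 below it, and
    # so on, so looking up the base name chases 800 internal nodes.
    base = bytearray(b'x' * 100)
    names = [bytes(base)]
    for bit in range(800):
        b = bytearray(base)
        b[bit // 8] ^= 1 << (7 - bit % 8)   # flip bit `bit`, MSB-first
        names.append(bytes(b))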

But you have the data: let's see a histogram of depth for 180k files.

> Hash tables aren't O(1) for strings -- they're O(string length) since to 
> compute the hash of a string you need to read the entire string.

Yes, but there are BIG constant factor differences here:

 - strings are sequential in memory and thus prefetch nicely
 - strings cover a small number of 64-byte cache lines
 - hashing is pretty fast (~1GB/s):
        $ python -m timeit '"a" * 100000000'
        10 loops, best of 3: 51 msec per loop
        $ python -m timeit 'hash("a" * 100000000)'
        10 loops, best of 3: 172 msec per loop
 - Python memoizes its hashes
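
That last point is easy to check (a minimal timing sketch, not a
rigorous benchmark; absolute numbers will vary by machine):

    import time

    s = "a" * 100000000           # ~100MB string
    t0 = time.time()
    hash(s)                       # first call walks all 100M bytes
    t1 = time.time()
    hash(s)                       # second call returns the cached hash
    t2 = time.time()
    print("first: %.3fs  second: %.6fs" % (t1 - t0, t2 - t1))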

 - chasing N critbit pointers means fetching N non-sequential cachelines
 - a single non-sequential L2 cache miss can cost hundreds of cycles,
more than hashing a long path
 - chasing 10 cached critbit pointers is probably slower than hashing
100 characters
 - a perfectly-balanced critbit tree for 180k files is going to be at
least 18 levels deep
 - 180k 24-byte internal nodes is > 4MB: going to spill out of L2
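
The last two bullets are just arithmetic:

    import math

    # minimum depth of a balanced binary tree with 180k leaves
    print(math.ceil(math.log(180000, 2)))   # -> 18
    # footprint of the ~180k 24-byte internal nodes
    print(180000 * 24 / (1024.0 * 1024))    # -> ~4.1MB, too big for L2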

Also note that even critbit is going to have to compare the search key
against the leaf node, as will basically any data structure. But unless
you're dealing with a small set of extravagantly large strings.. that
tends to be in the noise.

> There's also the prefix lookup operation, which is performed by anything 
> that uses dirstate._dirs. That's impossible with a hash table (and the 
> reason for the _dirs map with all its complicated bookkeeping [2]), but 
> simply O(prefix length) with a trie.

And with a sorted list.
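
A sorted list gives you the same range query: binary-search to the
first match, then scan the contiguous run. A sketch with the stdlib
bisect module (files_under is a made-up helper, not dirstate code):

    import bisect

    def files_under(sortedfiles, prefix):
        # bisect_left lands on the first possible match in O(log n);
        # all entries sharing the prefix form one contiguous run.
        i = bisect.bisect_left(sortedfiles, prefix)
        while i < len(sortedfiles) and sortedfiles[i].startswith(prefix):
            yield sortedfiles[i]
            i += 1

    files = sorted(['a/b', 'a/c', 'b/d', 'b/e/f'])
    print(list(files_under(files, 'b/')))    # ['b/d', 'b/e/f']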

Again:

> > As you've noted (and as I've recently discussed with Bryan), the big win
> > is in having a sorted version on disk. And you don't need a new data
> > structure in C to do that..

The win is not in the data structure in _memory_, it's with the data
structure _on disk_. Here are your own numbers:

building a dict: .1s
sorting dict keys (not in place): .2s
building a critbit tree from unsorted data: .25s
building a critbit tree from sorted data: .11s

You don't get rid of that .2s of sorting by throwing a tree at the
problem (building trees and sorting are morally equivalent). You get rid
of it by _reading and writing sorted data_.
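
Concretely, something like this (a hypothetical text format, not the
real dirstate encoding):

    def writesorted(path, dmap):
        # Persist entries in sorted key order so the next reader
        # gets a pre-sorted list for free, with no O(n log n) step.
        f = open(path, 'w')
        for k in sorted(dmap):
            f.write('%s\t%s\n' % (k, dmap[k]))
        f.close()

    def readsorted(path):
        dmap, keys = {}, []
        for line in open(path):
            k, v = line.rstrip('\n').split('\t', 1)
            dmap[k] = v
            keys.append(k)        # already in sorted order on disk
        return dmap, keys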

> > Before making yet another non-trivial investment in C data structures,
> > I'd like to see how far we can get with:
> >
> > - on read, store the sorted order
> 
> Into what, a list? Python lacks a data structure for efficiently storing 
> a mutable set of items in a sorted manner.

A vanilla list. Insert? Don't care much, we probably do hundreds or
thousands of reads for every insert with this data structure. Throw the
list out, rebuild at write time.
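
I.e. roughly (a sketch of the idea, not actual dirstate code):

    class sortedcachedict(object):
        """Plain dict for O(1) lookups plus a lazily built sorted
        key list that is simply discarded on any mutation. Reads
        vastly outnumber inserts here, so rebuilds are rare."""

        def __init__(self, dmap):
            self._map = dmap        # path -> dirstate entry
            self._sorted = None     # cached sorted key list

        def __getitem__(self, path):
            return self._map[path]

        def __setitem__(self, path, entry):
            self._map[path] = entry
            self._sorted = None     # throw the list out

        def sortedkeys(self):
            if self._sorted is None:
                self._sorted = sorted(self._map)  # rebuild on demand
            return self._sorted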

> I also think it isn't a very significant investment.

Sorry, you are simply wrong. Every single time we've added more C to
Mercurial, it's resulted in at least one nasty bug to chase down, and
usually several portability bugs. Said code ends up being at least an
order of magnitude harder to hack on than the Python code it replaces,
with an order of magnitude fewer people interested in touching it
afterwards. The initial cost may be attractive, but the ongoing
maintenance cost and opportunity cost are not.

If you were not wrong, Mercurial would not have been written in Python
to start with. Before I started writing Mercurial, I was a kernel hacker
for over a decade; I chose Python precisely because I knew how expensive
C is to develop in. And now I've got most of a decade of evidence
convincing me that was the right choice, putting aside my lack of
enthusiasm for Py3k. You may have even heard of another comparable SCM
written mostly in C that a major internet company gave up on because it
was such a pain to hack on.

As a real-world example of this cost, it literally took me 5 minutes to
prototype the core of my "keep a sorted copy cached" idea including
fixing walk(), but it ran into a wall because the read and write side
are written in C now, so I can't actually do a performance comparison
without spending an hour or so on the C side rather than making two more
one-liner changes.

The answer to "more C" will _always_ be "convince me it can't be done in
Python". Then my follow-up answer will be "convince me the win is worth
the ongoing maintenance cost". If the win isn't huge, don't expect me to
be excited.

This is not to say I'm not interested in this approach, just that I need
convincing.

-- 
Mathematics is the supreme nostalgia of our time.



