Exploring a new data structure for the dirstate

Fri Jan 4 15:40:03 CST 2013

On 01/04/2013 12:08 PM, Matt Mackall wrote:
> I'm pretty mixed on this. As it is, dict construction is pretty fast, as
> you've already discovered. This despite having to grow the dict several
> times! In essence, we're trading:
>
>    O(n) construction + O(n log n) sorted traversal + O(1) lookup
>
> for
>
>    O(n log n) construction + O(n) traversal + O(log n) lookup

The list of files in the dirstate is much smaller than the list of all 
the possible values we can have for a given maximum string length. In 
other words, log n << string length.

Hash tables aren't O(1) for strings -- they're O(string length) since to 
compute the hash of a string you need to read the entire string.

Lookup/insertion/deletion in a trie is also more accurately described as 
O(string length). In practice the number of nodes followed down is the 
number of bits that differ from any other string already in the trie, 
which is typically much less than the string length.

So more accurately, we're trading

   O(n * string length) construction + O(n log n * string length) sorted 
traversal [1] + O(string length) lookup

for

   O(n * string length) construction + O(n * string length) traversal + 
O(string length) lookup.

There's also the prefix lookup operation, which is performed by anything 
that uses dirstate._dirs. That's impossible with a hash table (and the 
reason for the _dirs map with all its complicated bookkeeping [2]), but 
simply O(prefix length) with a trie.

Now a case could perhaps be made that the constant factor for a dict is 
lower than for a trie -- perhaps because of more cache misses due to all 
the indirection -- but this needs to be backed up empirically, and at 
least during construction time, once I have a sorted dirstate, I'm not 
seeing that in practice.

[1] (n log n) from the sort algorithm and (string length) from the 
comparison function.
[2] http://selenic.com/repo/hg/file/2c1276825e93/mercurial/dirstate.py#l30

> When we don't do a traversal, it's not so attractive.
>
> As you've noted (and as I've recently discussed with Bryan), the big win
> is in having a sorted version on disk. And you don't need a new data
> structure in C to do that..
>
> Before making yet another non-trivial investment in C data structures,
> I'd like to see how far we can get with:
>
> - on read, store the sorted order

Into what, a list? Python lacks a data structure for efficiently storing 
a mutable set of items in a sorted manner.

>    - detecting already sorted input should be fast
>    - otherwise, sort
> - cache the sorted order for subsequent traversals and writes

To add a file to the dirstate, you'd have to find the right position of 
the file in the list. With a sorted list you'd have to do a binary 
search, and the operations you'd have to perform there would be 
equivalent to constructing a binary tree. Inserting into the middle of a 
Python list is O(n). That + supporting prefix lookups really seem to 
indicate that a trie would be a great approach here.

I also think it isn't a very significant investment. As someone who 
hadn't written a line of C interfacing with Python before, I was able to 
go from the C code in that github repo (which implements just trees, not 
maps) to parsing the dirstate and returning a Python-accessible map in 
two days.