Exploring a new data structure for the dirstate
Siddharth Agarwal
sid0 at fb.com
Fri Jan 4 15:40:03 CST 2013
On 01/04/2013 12:08 PM, Matt Mackall wrote:
> I'm pretty mixed on this. As it is, dict construction is pretty fast, as
> you've already discovered. This despite having to grow the dict several
> times! In essence, we're trading:
>
> O(n) construction + O(n log n) sorted traversal + O(1) lookup
>
> for
>
> O(n log n) construction + O(n) traversal + O(log n) lookup
The list of files in the dirstate is much smaller than the list of all
the possible values we can have for a given maximum string length. In
other words, log n << string length.
Hash tables aren't O(1) for strings -- they're O(string length) since to
compute the hash of a string you need to read the entire string.
Lookup/insertion/deletion in a trie is also more accurately described as
O(string length). In practice the number of nodes followed down is the
number of bits that differ from any other string already in the trie,
which is typically much less than the string length.
So more accurately, we're trading
O(n * string length) construction + O(n log n * string length) sorted
traversal [1] + O(string length) lookup
for
O(n * string length) construction + O(n * string length) traversal +
O(string length) lookup.
There's also the prefix lookup operation, which is performed by anything
that uses dirstate._dirs. That's impossible with a hash table (and the
reason for the _dirs map with all its complicated bookkeeping [2]), but
simply O(prefix length) with a trie.
Now a case could perhaps be made that the constant factor for a dict is
lower than for a trie -- perhaps because of more cache misses due to all
the indirection -- but this needs to be backed up empirically, and at
least during construction time, once I have a sorted dirstate, I'm not
seeing that in practice.
[1] (n log n) from the sort algorithm and (string length) from the
comparison function.
[2] http://selenic.com/repo/hg/file/2c1276825e93/mercurial/dirstate.py#l30
> When we don't do a traversal, it's not so attractive.
>
> As you've noted (and as I've recently discussed with Bryan), the big win
> is in having a sorted version on disk. And you don't need a new data
> structure in C to do that..
>
> Before making yet another non-trivial investment in C data structures,
> I'd like to see how far we can get with:
>
> - on read, store the sorted order
Into what, a list? Python lacks a data structure for efficiently storing
a mutable set of items in a sorted manner.
> - detecting already sorted input should be fast
> - otherwise, sort
> - cache the sorted order for subsequent traversals and writes
To add a file to the dirstate, you'd have to find the right position of
the file in the list. With a sorted list you'd have to do a binary
search, and the operations you'd have to perform there would be
equivalent to constructing a binary tree. Inserting into the middle of a
Python list is O(n). That + supporting prefix lookups really seem to
indicate that a trie would be a great approach here.
I also think it isn't a very significant investment. As someone who
hadn't written a line of C interfacing with Python before, I was able to
go from the C code in that github repo (which implements just trees, not
maps) to parsing the dirstate and returning a Python-accessible map in
two days.
More information about the Mercurial-devel
mailing list