Exploring a new data structure for the dirstate

Siddharth Agarwal sid0 at fb.com
Thu Jan 3 23:10:39 CST 2013


When I was looking at why Mercurial's add() function was slow, I found 
that much of the time (~60%) was spent trying to figure out all the 
directories currently tracked in the dirstate. In the past, I was also 
able to identify several other suboptimal operations, such as a sort 
operation in walk() [1] that takes around 0.2 seconds for ~180k file 
entries.

 From inspecting dirstate.py, the operations that seem to be important are:
(1) membership testing in the dirstate
(2) sorted traversal of the dirstate
(3) testing that a directory has at least one non-removed file in the 
dirstate
(4) iterating over all the files in a directory

These requirements, particularly (3) and (4), indicate that perhaps maps 
based on prefix trees (aka tries) [2] would be more appropriate than a 
simple dict. I've been testing out a particular variant called crit-bit 
trees [3], using a modified version of a C implementation [4] and the 
Python/C API.

The results are highly encouraging. So far I've mostly tested loading 
the dirstate, and for the same ~180k entries as above loading the 
dirstate into a dict with parsers.c:parse_dirstate takes 0.10 seconds. 
Reading in a dirstate generated by a dictionary into a crit-bit tree map 
takes 0.25 seconds. That sounds pretty bad, but it turns out that 
sorting the dirstate helps a lot. Reading a sorted dirstate into a 
crit-bit tree map only takes 0.11 seconds. For this relatively trivial 
hit, we should be able to get roughly O(1) operations 3 and 4 [5].

The great thing about merely sorting the dirstate is that since the 
order of entries in the dirstate file isn't defined, this is both 
forward and backward compatible. The moment a dirstate gets modified by 
an hg using critbit tree maps, it will be written out in sorted order, 
and so any parse-time perf penalties will vanish.

The code is currently not really in a presentable state, but I'm hoping 
to get a public repo up very soon. I'm going to pursue this further and 
see how far I can push this.

[1] http://selenic.com/repo/hg/file/2c1276825e93/mercurial/dirstate.py#l706
[2] http://en.wikipedia.org/wiki/Trie
[3] http://cr.yp.to/critbit.html
[4] https://github.com/jgehring/critbit89
[5] Technically it's O(length of string), but so is a hash table (since 
you need to read the entire string to calculate its hash) and no one 
says those are anything but O(1) :)


More information about the Mercurial-devel mailing list