dirstate and improving the performance of hg-add
Joshua Redstone
joshua.redstone at fb.com
Mon Jun 11 14:08:59 CDT 2012
Hi mercurial-devel,
I've been looking into how to improve the performance of hg-add and wanted to get people's thoughts on removing dirstate._dirs and adding a sorted list of entries to dirstate to mirror the stuff in dirstate._map.
Background:
hg-add is fast for small repos, but for larger repos, we've been seeing the time to add a file grow to over 1.5 seconds, and it's only going to get slower. Using statprof, it looks like the time goes into two activities. First, doing a case-insensitive comparison of the path to be added against all existing paths in the repo, in order to warn about adding a file that might conflict with other files in the repo on a case-insensitive file system. Second, iterating through dirstate and calculating the directory components for all paths, to check if the path being added is a dir that already exists in the repo.
A few observations:
- the number of case-insensitive comparisons could be dramatically reduced by indexing into the dirstate rather than exhaustively iterating through the whole thing. One way to do this is to keep a sorted list of entries and doing a binary search for the path being added. It suffices to do a case-insensitive comparison with only the directory components of the entries in the list surrounded where the path would be added.
- the calculation of _dirs, iterating over every file in the repo, is done just so that _addpath() can check if the path being added corresponds to a dir that exists. This is a lot of extra work. Again, indexing can help reduce this. E.g., by doing a binary search on a sorted list, you can drastically cut down on the number of entries to be considered.
- scanning through the code in dirstate, it looks like there are other places that do something like 'if path in self._dirs', which can be optimized similar to above.
Proposal:
1. Add an index to dirstate. My inclination is to maintain a sorted list, but there are other possibilities here. In the sorted-list approach, we could populate a list of tuples of all paths and metadata info, and optionally modify _map to point into this list. Or we could leave _map as is and duplicate the paths in the sorted list.
2. Convert uses like 'if f in _dirs' to a more efficient lookup based on the index. Probably I'd add a few new routines to do the more efficient lookup.
I'm still poking around at the code, but it looks like this may cover almost all the uses of _dirs. With a bit more work, maybe we can even get rid of _dirs entirely.
As a side note, I noticed that dirstate.__iter__ calls sorted(self._map), so every time you iterate through the dirstate, you make a copy of the whole thing and sort it. I saw a few use cases where they didn't need
the sorting or the copying, so that may be another opportunity to speed things up.
Thoughts?
Josh
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://selenic.com/pipermail/mercurial-devel/attachments/20120611/97257d38/attachment.html>
More information about the Mercurial-devel
mailing list