dirstate and improving the performance of hg-add

Joshua Redstone joshua.redstone at fb.com
Mon Jun 11 14:08:59 CDT 2012


Hi mercurial-devel,
I've been looking into how to improve the performance of hg-add and wanted to get people's thoughts on removing dirstate._dirs and adding a sorted list of entries to dirstate to mirror the stuff in dirstate._map.

Background:

hg-add is fast for small repos, but for larger repos, we've been seeing the time to add a file grow to over 1.5 seconds, and it's only going to get slower.  Using statprof, it looks like the time goes into two activities.  First, doing a case-insensitive comparison of the path to be added against all existing paths in the repo, in order to warn about adding a file that might conflict with other files in the repo on a case-insensitive file system.  Second, iterating through dirstate and calculating the directory components for all paths, to check if the path being added is a dir that already exists in the repo.


A few observations:

- the number of case-insensitive comparisons could be dramatically reduced by indexing into the dirstate rather than exhaustively iterating through the whole thing.  One way to do this is to keep a sorted list of entries and doing a binary search for the path being added.  It suffices to do a case-insensitive comparison with only the directory components of the entries in the list surrounded where the path would be added.

- the calculation of _dirs, iterating over every file in the repo, is done just so that _addpath() can check if the path being added corresponds to a dir that exists.  This is a lot of extra work.  Again, indexing can help reduce this.  E.g., by doing a binary search on a sorted list, you can drastically cut down on the number of entries to be considered.

- scanning through the code in dirstate, it looks like there are other places that do something like 'if path in self._dirs', which can be optimized similar to above.


Proposal:

1. Add an index to dirstate.  My inclination is to maintain a sorted list, but there are other possibilities here.  In the sorted-list approach, we could populate a list of tuples of all paths and metadata info, and optionally modify _map to point into this list.  Or we could leave _map as is and duplicate the paths in the sorted list.
2. Convert uses like 'if f in _dirs' to a more efficient lookup based on the index.  Probably I'd add a few new routines to do the more efficient lookup.

I'm still poking around at the code, but it looks like this may cover almost all the uses of _dirs.  With a bit more work, maybe we can even get rid of _dirs entirely.

As a side note, I noticed that dirstate.__iter__  calls sorted(self._map), so every time you iterate through the dirstate, you make a copy of the whole thing and sort it.  I saw a few use cases where they didn't need
the sorting or the copying, so that may be another opportunity to speed things up.


Thoughts?

Josh

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://selenic.com/pipermail/mercurial-devel/attachments/20120611/97257d38/attachment.html>


More information about the Mercurial-devel mailing list