dirstate and improving the performance of hg-add

Joshua Redstone joshua.redstone at fb.com
Mon Jun 11 16:06:41 CDT 2012


If we can afford it, it seems like it's nice to have that warning to
maintain portability.  And, at least in principle, it seems like we should
be able to afford it with ease.
Josh

On 6/11/12 3:58 PM, "Adrian Buehlmann" <adrian at cadifra.com> wrote:

>On 2012-06-11 21:08, Joshua Redstone wrote:
>> Hi mercurial-devel,
>> I've been looking into how to improve the performance of hg-add and
>> wanted to get people's thoughts on removing dirstate._dirs and adding a
>> sorted list of entries to dirstate to mirror the stuff in dirstate._map.
>> 
>> Background:
>> 
>> hg-add is fast for small repos, but for larger repos, we've been seeing
>> the time to add a file grow to over 1.5 seconds, and it's only going to
>> get slower.  Using statprof, it looks like the time goes into two
>> activities.  First, doing a case-insensitive comparison of the path to
>> be added against all existing paths in the repo, in order to warn about
>> adding a file that might conflict with other files in the repo on a
>> case-insensitive file system.  Second, iterating through dirstate and
>> calculating the directory components for all paths, to check if the path
>> being added is a dir that already exists in the repo.
>> 
>> 
>> A few observations:
>> 
>> - the number of case-insensitive comparisons could be dramatically
>> reduced by indexing into the dirstate rather than exhaustively iterating
>> through the whole thing.  One way to do this is to keep a sorted list of
>> entries and doing a binary search for the path being added.  It suffices
>> to do a case-insensitive comparison with only the directory components
>> of the entries in the list surrounded where the path would be added.
>
>Do you actually need the default value of the config setting
>ui.portablefilenames?
>
>The default is "warn", which does that case-insensitive checking. But if
>you don't need that warning, then setting it to "ignore" or "false"
>should not do this case-insensitive comparison.
>
>> - the calculation of _dirs, iterating over every file in the repo, is
>> done just so that _addpath() can check if the path being added
>> corresponds to a dir that exists.  This is a lot of extra work.  Again,
>> indexing can help reduce this.  E.g., by doing a binary search on a
>> sorted list, you can drastically cut down on the number of entries to be
>> considered.
>> 
>> - scanning through the code in dirstate, it looks like there are other
>> places that do something like 'if path in self._dirs', which can be
>> optimized similar to above.
>> 
>> 
>> Proposal:
>> 
>> 1. Add an index to dirstate.  My inclination is to maintain a sorted
>> list, but there are other possibilities here.  In the sorted-list
>> approach, we could populate a list of tuples of all paths and metadata
>> info, and optionally modify _map to point into this list.  Or we could
>> leave _map as is and duplicate the paths in the sorted list.
>> 2. Convert uses like 'if f in _dirs' to a more efficient lookup based on
>> the index.  Probably I'd add a few new routines to do the more efficient
>> lookup.
>> 
>> I'm still poking around at the code, but it looks like this may cover
>> almost all the uses of _dirs.  With a bit more work, maybe we can even
>> get rid of _dirs entirely.
>> 
>> As a side note, I noticed that dirstate.__iter__  calls
>> sorted(self._map), so every time you iterate through the dirstate, you
>> make a copy of the whole thing and sort it.  I saw a few use cases where
>> they didn't need
>> the sorting or the copying, so that may be another opportunity to speed
>> things up.



More information about the Mercurial-devel mailing list