dirstate and improving the performance of hg-add

Idan Kamara idankk86 at gmail.com
Mon Jun 11 16:16:47 CDT 2012


On Mon, Jun 11, 2012 at 10:08 PM, Joshua Redstone <joshua.redstone at fb.com>
wrote:
>
> Hi mercurial-devel,
> I've been looking into how to improve the performance of hg-add and wanted
> to get people's thoughts on removing dirstate._dirs and adding a sorted
list
> of entries to dirstate to mirror the stuff in dirstate._map.

Might be nice to benchmark this against a prefix tree. Should consume less
memory as well.

>
> Background:
>
> hg-add is fast for small repos, but for larger repos, we've been seeing
> the time to add a file grow to over 1.5 seconds, and it's only going to
get
> slower.  Using statprof, it looks like the time goes into two activities.
>  First, doing a case-insensitive comparison of the path to be added
against
> all existing paths in the repo, in order to warn about adding a file that
> might conflict with other files in the repo on a case-insensitive file
> system.  Second, iterating through dirstate and calculating the directory
> components for all paths, to check if the path being added is a dir that
> already exists in the repo.
>
>
> A few observations:
>
> - the number of case-insensitive comparisons could be dramatically reduced
> by indexing into the dirstate rather than exhaustively iterating through
the
> whole thing.  One way to do this is to keep a sorted list of entries and
> doing a binary search for the path being added.  It suffices to do a
> case-insensitive comparison with only the directory components of the
> entries in the list surrounded where the path would be added.
>
> - the calculation of _dirs, iterating over every file in the repo, is done
> just so that _addpath() can check if the path being added corresponds to a
> dir that exists.  This is a lot of extra work.  Again, indexing can help
> reduce this.  E.g., by doing a binary search on a sorted list, you can
> drastically cut down on the number of entries to be considered.
>
> - scanning through the code in dirstate, it looks like there are other
> places that do something like 'if path in self._dirs', which can be
> optimized similar to above.
>
>
> Proposal:
>
> 1. Add an index to dirstate.  My inclination is to maintain a sorted list,
> but there are other possibilities here.  In the sorted-list approach, we
> could populate a list of tuples of all paths and metadata info, and
> optionally modify _map to point into this list.  Or we could leave _map as
> is and duplicate the paths in the sorted list.
> 2. Convert uses like 'if f in _dirs' to a more efficient lookup based on
> the index.  Probably I'd add a few new routines to do the more efficient
> lookup.
>
> I'm still poking around at the code, but it looks like this may cover
> almost all the uses of _dirs.  With a bit more work, maybe we can even get
> rid of _dirs entirely.
>
> As a side note, I noticed that dirstate.__iter__  calls sorted(self._map),
> so every time you iterate through the dirstate, you make a copy of the
whole
> thing and sort it.  I saw a few use cases where they didn't need
> the sorting or the copying, so that may be another opportunity to speed
> things up.
>
>
> Thoughts?
>
> Josh
>
>
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel at selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://selenic.com/pipermail/mercurial-devel/attachments/20120612/7ae3b2e0/attachment.html>


More information about the Mercurial-devel mailing list