[PATCH 1 of 5 V5] dirstate: rename the dirstate parsing and packing methods

Mon Dec 21 14:44:39 CST 2015

> On Dec 19, 2015, at 2:06 PM, Matt Mackall <mpm at selenic.com> wrote:
> 
> On Sat, 2015-12-19 at 03:14 +0000, Laurent Charignon wrote:
>>> On Dec 18, 2015, at 3:02 PM, Matt Mackall <mpm at selenic.com> wrote:
>>> 
>>> On Thu, 2015-12-17 at 15:44 -0800, Laurent Charignon wrote:
>>>> # HG changeset patch
>>>> # User Laurent Charignon <lcharignon at fb.com>
>>>> # Date 1450380353 28800
>>>> #      Thu Dec 17 11:25:53 2015 -0800
>>>> # Node ID 44eafafb98c9d24bdff7d6c46213ffe2cf8edc8d
>>>> # Parent  7e8a883da1711008262150bb6f64131812de3e0b
>>>> dirstate: rename the dirstate parsing and packing methods
>>> 
>>> Explodes my hg on apply. The pure/ code is not available for
>>> fallback
>>> unless the C module doesn't exist at all, fallback doesn't happen
>>> on a
>>> function by function basis.
>>> 
>>> Consider this strategy:
>>> 
>>> - don't touch parse_dirstate
>>> - don't touch the read/write methods
>>> - add a simple Python func (in dirstate, not pure) to build
>>> nonnormalmap from map
>>> - attach this to a propertycache so we only build it when needed
>>> - add a C version of that function in parsers
>>> - add a clause in Python function to call C function if available
>> 
>> Thanks for the review.
>> 
>> Your solution should work but it forces us to pick one of the
>> following way build the non-normal map:
>> 1) build the non-normal map from map (ad you suggest)
>> 2) parse the dirstate file again
>> 
>> Both 1) and 2) come at a performance cost that is prohibitive
>> according to Durham's first email on the topic sent to the list.
>> The time breakdown he sent was:
>> "A) parsing the data (370ms)
> 
> Most of this cost is allocating objects and boxing values for a large
> Python data structure, and not the actual parsing of the data. If we
> were to just scan over this data to find the presumably relatively
> small set of non-normal entries, it'd be way faster.
> 
>> B) iterating over the dirstate looking for added/removed/lookup files
>> (350ms)
>> C) 100ms of GC time
>> D) 60ms of reading the raw data off disk"
>> 
>> The goal of this series was to eliminate the cost of B by increasing
>> the cost of A a little bit.
> 
> Yes, but that 350ms is in Python. What is the cost in C? A big
> dictionary scan in Python is slow:
> 
> $ python -m timeit -s 'd = {x: (0, 0, 0, 0) for x in xrange(1000000)}'
> -c 'dict((k, v) for k,v in d.iteritems() if v[0] != "n" or v[3] == -1)'
> 10 loops, best of 3: 269 msec per loop
> 
> $ python -m timeit -s 'd = {x: (0, 0, 0, 0) for x in xrange(1000000)}'
> -c 'set(k for k,v in d.iteritems() if v[0] != "n" or v[3] == -1)'
> 10 loops, best of 3: 210 msec per loop
> 
> But if you reduce it to mostly C primitives, it's fast:
> 
> $ python -m timeit -s 'd = {x: x%100 for x in xrange(1000000)}' -c
> 'set(d.itervalues())'
> 100 loops, best of 3: 17 msec per loop
> 
> And it'll be even faster in pure C.
> 
> But in fact, now that you mention it, we can probably do something
> like:
> 
> 60ms: read the data off the disk
> 20ms: scan the data for only non-normal entries
>  0ms: GC

It seems like it cannot be as cheap as you claim since we still need to populate the dmap to know about files that are not tracked.
We can get to the numbers that you suggest by changing the dirstate format to:
- Add ordering of the entries to make lookups logarithmic and/or
- Add a bloom filter to reduce the number of lookups that we do

I will do the following:

Series 1:
- add a simple Python func (in dirstate, not pure) to build
nonnormalmap from map
- add functions to keep the nonnormalmap up to date:
	- Reusing what I did in the series I sent
	- Recomputing the non-normal map every time we pack the dirstate (as we cannot change it and it changes the dmap)
- attach the nonnormalmap to a propertycache so we only build it when needed
- add a C version of that function in parsers
- add a clause in Python function to call C function if available
- add test extension to verify whole behavior pure/non-pure
* send patch to hgwatchman to speed up hg status

Series 2 RFC:
- Introduce a new extensible dirstate format containing a version number and logic to call the right pack/parse dirstate
- Add pack_dirstatev2, parse_dirstatev2 with ordered entries
- If V2 dirstate is available, make the dmap lazy by using the ordered entries

Matt, does that look good to you?

Thanks,

Laurent

> 
> That is, we add scan_nonnormal and be lazy about the map itself, saving
> an additional 350ms and probably the GC time as well. Something like
> this:
> 
> @propertycache
> def _map(self):
>     self._load()
>     return self._parsemap(self._data)
> 
> @propertycache
> def _nonnormal(self):
>     if parsers.hasattr("scan_nonnormal"):
>         return parsers.scan_nonnormal(self._data)
>     else:
> 	d = {}
>         for k, v in self._map().iteritems():
>             ...
> 	return d
> 
> -- 
> Mathematics is the supreme nostalgia of our time.