[PATCH 1 of 5 V5] dirstate: rename the dirstate parsing and packing methods

Mon Dec 21 15:14:32 CST 2015

On Mon, 2015-12-21 at 20:44 +0000, Laurent Charignon wrote:
> > On Dec 19, 2015, at 2:06 PM, Matt Mackall <mpm at selenic.com> wrote:
> > 
> > On Sat, 2015-12-19 at 03:14 +0000, Laurent Charignon wrote:
> > > > On Dec 18, 2015, at 3:02 PM, Matt Mackall <mpm at selenic.com> wrote:
> > > > 
> > > > On Thu, 2015-12-17 at 15:44 -0800, Laurent Charignon wrote:
> > > > > # HG changeset patch
> > > > > # User Laurent Charignon <lcharignon at fb.com>
> > > > > # Date 1450380353 28800
> > > > > #      Thu Dec 17 11:25:53 2015 -0800
> > > > > # Node ID 44eafafb98c9d24bdff7d6c46213ffe2cf8edc8d
> > > > > # Parent  7e8a883da1711008262150bb6f64131812de3e0b
> > > > > dirstate: rename the dirstate parsing and packing methods
> > > > 
> > > > Explodes my hg on apply. The pure/ code is not available for
> > > > fallback
> > > > unless the C module doesn't exist at all, fallback doesn't happen
> > > > on a
> > > > function by function basis.
> > > > 
> > > > Consider this strategy:
> > > > 
> > > > - don't touch parse_dirstate
> > > > - don't touch the read/write methods
> > > > - add a simple Python func (in dirstate, not pure) to build
> > > > nonnormalmap from map
> > > > - attach this to a propertycache so we only build it when needed
> > > > - add a C version of that function in parsers
> > > > - add a clause in Python function to call C function if available
> > > 
> > > Thanks for the review.
> > > 
> > > Your solution should work but it forces us to pick one of the
> > > following way build the non-normal map:
> > > 1) build the non-normal map from map (ad you suggest)
> > > 2) parse the dirstate file again
> > > 
> > > Both 1) and 2) come at a performance cost that is prohibitive
> > > according to Durham's first email on the topic sent to the list.
> > > The time breakdown he sent was:
> > > "A) parsing the data (370ms)
> > 
> > Most of this cost is allocating objects and boxing values for a large
> > Python data structure, and not the actual parsing of the data. If we
> > were to just scan over this data to find the presumably relatively
> > small set of non-normal entries, it'd be way faster.
> > 
> > > B) iterating over the dirstate looking for added/removed/lookup files
> > > (350ms)
> > > C) 100ms of GC time
> > > D) 60ms of reading the raw data off disk"
> > > 
> > > The goal of this series was to eliminate the cost of B by increasing
> > > the cost of A a little bit.
> > 
> > Yes, but that 350ms is in Python. What is the cost in C? A big
> > dictionary scan in Python is slow:
> > 
> > $ python -m timeit -s 'd = {x: (0, 0, 0, 0) for x in xrange(1000000)}'
> > -c 'dict((k, v) for k,v in d.iteritems() if v[0] != "n" or v[3] == -1)'
> > 10 loops, best of 3: 269 msec per loop
> > 
> > $ python -m timeit -s 'd = {x: (0, 0, 0, 0) for x in xrange(1000000)}'
> > -c 'set(k for k,v in d.iteritems() if v[0] != "n" or v[3] == -1)'
> > 10 loops, best of 3: 210 msec per loop
> > 
> > But if you reduce it to mostly C primitives, it's fast:
> > 
> > $ python -m timeit -s 'd = {x: x%100 for x in xrange(1000000)}' -c
> > 'set(d.itervalues())'
> > 100 loops, best of 3: 17 msec per loop
> > 
> > And it'll be even faster in pure C.
> > 
> > But in fact, now that you mention it, we can probably do something
> > like:
> > 
> > 60ms: read the data off the disk
> > 20ms: scan the data for only non-normal entries
> >  0ms: GC
> 
> It seems like it cannot be as cheap as you claim since we still need to
> populate the dmap to know about files that are not tracked.
> We can get to the numbers that you suggest by changing the dirstate format to:
> - Add ordering of the entries to make lookups logarithmic and/or
> - Add a bloom filter to reduce the number of lookups that we do
> 
> I will do the following:
> 
> Series 1:
> - add a simple Python func (in dirstate, not pure) to build
> nonnormalmap from map
> - add functions to keep the nonnormalmap up to date:
> 	- Reusing what I did in the series I sent
> 	- Recomputing the non-normal map every time we pack the dirstate (as we
> cannot change it and it changes the dmap)
> - attach the nonnormalmap to a propertycache so we only build it when needed
> - add a C version of that function in parsers
> - add a clause in Python function to call C function if available
> - add test extension to verify whole behavior pure/non-pure
> * send patch to hgwatchman to speed up hg status
> 
> Series 2 RFC:
> - Introduce a new extensible dirstate format containing a version number and
> logic to call the right pack/parse dirstate
> - Add pack_dirstatev2, parse_dirstatev2 with ordered entries
> - If V2 dirstate is available, make the dmap lazy by using the ordered entries
> 
> Matt, does that look good to you?

Seems fine. I'm skeptical that you'll get reasonable performance for bloom
filters in Python though.

-- 
Mathematics is the supreme nostalgia of our time.