Dirstate perf and format change idea

Matt Mackall mpm at selenic.com
Wed Nov 11 17:17:12 CST 2015


On Wed, 2015-11-11 at 13:19 -0800, Durham Goode wrote:
> I was brainstorming with Ryan and Eric about hg status performance and
> we had some ideas that could make hg status with hgwatchman instant on
> large repos.  But it requires an on-disk format change, so I wanted to
> throw it by you guys.
> 
> Problem:
> With hgwatchman enabled, hg status performance on a repo with a large
> number of files is dominated by
> A) parsing the data (370ms)

I'm guessing a lot of this time is putting parsed entries into a
growing dictionary?

> B) iterating over the dirstate looking for added/removed/lookup files
> (350ms)

We could also have the parser build supplemental lists to avoid this
step.
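
A rough sketch of what I mean, in Python for readability. It assumes the
current on-disk layout (two 20-byte parent nodes, then entries packed as
">cllll" followed by the filename, with an optional copy source after a
NUL byte). The name parse_with_nonnormal is made up for illustration, and
treating mtime == -1 as "needs lookup" is an assumption of the sketch:

```python
import struct

# state, mode, size, mtime, filename length
ENTRY = struct.Struct(">cllll")


def parse_with_nonnormal(data):
    """Parse dirstate bytes; return (parents, entries, nonnormal)."""
    parents = (data[:20], data[20:40])
    entries = {}
    nonnormal = []  # added/removed/merged files and files needing lookup
    pos = 40
    end = len(data)
    while pos < end:
        state, mode, size, mtime, flen = ENTRY.unpack_from(data, pos)
        pos += ENTRY.size
        name = data[pos:pos + flen]
        pos += flen
        # A copy source, if present, follows the filename after a NUL byte.
        name = name.split(b"\0", 1)[0]
        entries[name] = (state, mode, size, mtime)
        # Assumption: mtime == -1 marks an entry that must be rechecked.
        if state != b"n" or mtime == -1:
            nonnormal.append(name)
    return parents, entries, nonnormal
```

Status would then walk only the nonnormal list instead of the whole map.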

> C) 100ms of GC time

Can we disable the GC here?
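
Something along these lines, as a minimal sketch. The names nogc and
load_entries are hypothetical, and the record tuple shape is only an
assumption for the example:

```python
import gc
from contextlib import contextmanager


@contextmanager
def nogc():
    """Pause the cyclic garbage collector for the duration of the block."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        # Only turn the collector back on if it was on before.
        if was_enabled:
            gc.enable()


def load_entries(records):
    """Build the filename -> entry map with the collector paused."""
    entries = {}
    with nogc():
        # Each tuple allocated here would otherwise count toward the
        # collector's allocation threshold and trigger collection passes.
        for name, state, mode, size, mtime in records:
            entries[name] = (state, mode, size, mtime)
    return entries
```

Reference counting still frees objects as usual; only the cycle detector
is paused while the map is built.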

> A2) add a bloom filter to the beginning of the dirstate file (maybe
> just after the parent nodes). This will allow us to check if a given
> file is in the dirstate cheaply, so detecting untracked files is cheap.

My concern here would be how expensive it is to build the filter. We
don't want to raise the cost of writing the dirstate much either. Bit
ops probably mean dedicated C code.
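
To make the cost concrete, here is a toy filter in Python. The class name,
the SHA-1 double-hashing scheme, and the parameters are all assumptions of
the sketch, not a proposed format. The point is that every filename costs
one hash plus nhashes bit operations on each write:

```python
import hashlib


class BloomFilter:
    def __init__(self, nbits, nhashes):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = bytearray((nbits + 7) // 8)

    def _positions(self, name):
        # Derive nhashes bit positions from one digest (double hashing).
        digest = hashlib.sha1(name).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.nhashes):
            yield (h1 + i * h2) % self.nbits

    def add(self, name):
        for pos in self._positions(name):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, name):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(name))
```

A miss proves the file is untracked; a hit still needs a real lookup.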

> In theory we could then read just the first few kilobytes of the
> dirstate and leave the rest alone (or fall back to a binary search if
> we need to look up a normal file).
> 
> I tested this idea by dumping the dirstate contents into sqlite and 
> using sqlite to answer questions Q1 and Q2 during status, instead of 
> parsing the dirstate.  It results in a 300ms 'hg status' even with 
> millions of files in the working copy.
> 
> Thoughts? It probably couldn't be changed upstream initially, but 
> perhaps with a new .hg/requirement it could be deployed eventually.

I like Sid's cache-on-the-side idea.

If we want to think about a new dirstate format, there are a bunch of
things I'd add:

- awareness of directories / tree-structured / stem compression
- checksums for files in lookup state so we don't have to visit revlogs
- sorted order
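
As an illustration of what sorted order buys, here is a small sketch with
an in-memory sorted list of names standing in for the file. The function
names are hypothetical, and a real format with variable-length entries
would presumably need an offset table to bisect on disk:

```python
from bisect import bisect_left


def contains(sorted_names, name):
    """Binary search for one tracked file."""
    i = bisect_left(sorted_names, name)
    return i < len(sorted_names) and sorted_names[i] == name


def under_directory(sorted_names, directory):
    """Return the names below directory (given without trailing slash)."""
    lo = bisect_left(sorted_names, directory + b"/")
    # b"0" is the byte right after b"/", so this bounds every name that
    # starts with directory + b"/".
    hi = bisect_left(sorted_names, directory + b"0")
    return sorted_names[lo:hi]
```

Both lookups touch O(log n) names, and everything under a directory is
one contiguous slice.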

-- 
Mathematics is the supreme nostalgia of our time.


