[PATCH 2 of 4] parsers: write dirstate starting with non-normal entries

Mon Nov 30 11:53:28 CST 2015

On Nov 25, 2015, at 6:24 AM, Yuya Nishihara <yuya at tcha.org> wrote:

On Tue, 24 Nov 2015 17:19:09 -0800, Laurent Charignon wrote:
# HG changeset patch
# User Laurent Charignon <lcharignon at fb.com>
# Date 1448413597 28800
#      Tue Nov 24 17:06:37 2015 -0800
# Node ID ea9d03d4e85ea3949bb8d16bd9e1a80246a8247b
# Parent  3bd86861a1618aabe6ec7f2cde1223282f9569be
parsers: write dirstate starting with non-normal entries

Before this patch we were writing the dirstate entries in a "random" way,
following the *unstable* order of a Python dictionary. This patch changes the
order in which we write the dirstate entries.

We now start with the non-normal files (those that have changed or are likely
to have changed) and end with the normal files. This makes the job of hg status
easier as, in most cases, it only needs to access the non-normal entries of the
dirstate. This new ordering allows hg status to stop iterating over the dirstate
after processing those entries.

On our large repos, for hg status, we achieve a 40% improvement.
On the same repo, the cost of this change is a slowdown for writing the
dirstate to disk (as we do two passes). I measured the execution time of
hg debugrebuilddirstate with and without the change and observed a 5% slowdown
for the overall command.

diff --git a/mercurial/parsers.c b/mercurial/parsers.c
--- a/mercurial/parsers.c
+++ b/mercurial/parsers.c
@@ -602,7 +602,10 @@
}
memcpy(p, s, l);
p += 20;
- if (0 == 0) {
+ int pass;

A new variable can't be declared here according to C89, and MSVC does complain about it.

I will fix that.
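
For reference, a minimal sketch of the C89-friendly shape, with the declaration hoisted above the first statement of the block. The function name count_passes and its parameters are hypothetical stand-ins, not the actual code in parsers.c:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the packing code; only the declaration
 * placement matters here. */
static int count_passes(char *p, const char *s, size_t l)
{
	int pass;        /* C89: declarations come before any statement */
	int passes = 0;

	memcpy(p, s, l); /* first statement, after all declarations */
	for (pass = 0; pass <= 1; pass++)
		passes++;
	return passes;
}
```

Declaring pass next to the other locals at the top of the enclosing block should keep MSVC happy without changing behavior.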

+ /* First pass, non normal files, second pass normal files. This is to improve
+  * status performance as status generally only need the non normal files */
+ for (pass = 0; pass <= 1; pass++) {
for (pos = 0; PyDict_Next(map, &pos, &k, &v); ) {
dirstateTupleObject *tuple;
char state;
@@ -610,6 +613,7 @@
Py_ssize_t len, l;
PyObject *o;
char *t;
+ int normal;

if (!dirstate_tuple_check(v)) {
PyErr_SetString(PyExc_TypeError,
@@ -622,6 +626,9 @@
mode = tuple->mode;
size = tuple->size;
mtime = tuple->mtime;
+ normal = (state == 'n' && mtime != -1);
+ if (normal != pass)
+ continue;

As we have the pass to figure out memory size before the "first pass", maybe
we can know the offset to the first "normal" file beforehand. Then, we won't
need the "second pass".

I don't really see how that could work. We can indeed know the offset (pos here) of the first normal file beforehand. However, we would very likely still have to go through most of the dict again, since the number of non-normal files is typically much smaller than the number of normal files, isn't it?
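
To make the discussion concrete, here is a rough sketch of the two-pass ordering on a plain array instead of a Python dict. The struct entry type and the is_normal and order_entries helpers are hypothetical simplifications (the real code walks the dict with PyDict_Next and works on dirstateTupleObject); only the predicate is taken from the patch:

```c
#include <stddef.h>

/* Hypothetical, simplified dirstate entry. */
struct entry {
	const char *name;
	char state;
	int mtime;
};

/* Same predicate as in the patch: state 'n' with a known mtime. */
static int is_normal(const struct entry *e)
{
	return e->state == 'n' && e->mtime != -1;
}

/*
 * Copy entry names into out[], non-normal entries first, then normal
 * ones. Returns the index of the first normal entry in out[], which is
 * also the number of non-normal entries.
 */
static size_t order_entries(const struct entry *in, size_t n,
			    const char **out)
{
	size_t i, w = 0, first_normal = 0;
	int pass;

	for (pass = 0; pass <= 1; pass++) {
		for (i = 0; i < n; i++) {
			if (is_normal(&in[i]) != pass)
				continue;
			out[w++] = in[i].name;
		}
		if (pass == 0)
			first_normal = w;
	}
	return first_normal;
}
```

Both passes scan all n entries, because the input is not grouped by state. Knowing where the first normal entry sits in the input does not let the second pass skip much when almost every entry is normal.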

