[PATCH 16 of 16] setdiscovery: avoid a full changelog graph traversal

Sun Nov 16 03:17:18 CST 2014

# HG changeset patch
# User Siddharth Agarwal <sid0 at fb.com>
# Date 1416127229 28800
#      Sun Nov 16 00:40:29 2014 -0800
# Node ID 693ed8e6c179a8892c9720e2937b9d30d7571ee1
# Parent  58d5580ad38482f1613ee8d639dd5ca849f61476
setdiscovery: avoid a full changelog graph traversal

We were definitely being suboptimal here: we were constructing two full sets,
one with the full set of common nodes (i.e. a graph traversal) and one with all
nodes. Then we subtract one set from the other. This whole process is
O(commits) and causes discovery to be significantly slower than it should be.

Instead, keep track of common incrementally and keep undecided as small as
possible.

This makes discovery massively faster on large repos: on one such repo, 'hg
debugdiscovery' over SSH with one commit missing on the client and five on the
server went from 4.5 seconds to 1.5. (An 'hg debugdiscovery' with no commits
missing on the client, i.e. connection startup time, was 1.2 seconds.)

diff --git a/mercurial/setdiscovery.py b/mercurial/setdiscovery.py
--- a/mercurial/setdiscovery.py
+++ b/mercurial/setdiscovery.py
@@ -40,7 +40,7 @@
 classified with it (since all ancestors or descendants will be marked as well).
 """
 
-from node import nullid
+from node import nullid, nullrev
 from i18n import _
 import random
 import util, dagutil
@@ -177,27 +177,23 @@
     # own nodes where I don't know if remote knows them
     undecided = dag.nodeset()
     # own nodes I know we both know
-    common = set()
+    # treat remote heads (and maybe own heads) as a first implicit sample
+    # response
+    common = cl.incrementalmissingrevs(srvheads)
+    commoninsample = set(n for i, n in enumerate(sample) if yesno[i])
+    common.addbases(commoninsample)
+    undecided = set(common.missingancestors(ownheads))
     # own nodes I know remote lacks
     missing = set()
 
-    # treat remote heads (and maybe own heads) as a first implicit sample
-    # response
-    common.update(dag.ancestorset(srvheads))
-    undecided.difference_update(common)
-
     full = False
     while undecided:
 
         if sample:
-            commoninsample = set(n for i, n in enumerate(sample) if yesno[i])
-            common.update(dag.ancestorset(commoninsample, common))
-
             missinginsample = [n for i, n in enumerate(sample) if not yesno[i]]
             missing.update(dag.descendantset(missinginsample, missing))
 
             undecided.difference_update(missing)
-            undecided.difference_update(common)
 
         if not undecided:
             break
@@ -206,7 +202,7 @@
             ui.note(_("sampling from both directions\n"))
             sample = _takefullsample(dag, undecided, size=fullsamplesize)
             targetsize = fullsamplesize
-        elif common:
+        elif common.hasbases():
             # use cheapish initial sample
             ui.debug("taking initial sample\n")
             sample = _takefullsample(dag, undecided, size=fullsamplesize)
@@ -228,7 +224,17 @@
         yesno = remote.known(dag.externalizeall(sample))
         full = True
 
-    result = dag.headsetofconnecteds(common)
+        if sample:
+            commoninsample = set(n for i, n in enumerate(sample) if yesno[i])
+            common.addbases(commoninsample)
+            common.removeancestorsfrom(undecided)
+
+    # heads(common) == heads(common.bases) since common represents common.bases
+    # and all its ancestors
+    result = dag.headsetofconnecteds(common.bases)
+    # common.bases can include nullrev, but our contract requires us to not
+    # return any heads in that case, so discard that
+    result.discard(nullrev)
     ui.progress(_('searching'), None)
     ui.debug("%d total queries\n" % roundtrips)