[PATCH 5 of 5 V2] changegroup: compute seen files as changesets are added (issue4750)

Gregory Szorc gregory.szorc at gmail.com
Wed Jul 15 17:32:08 CDT 2015


# HG changeset patch
# User Gregory Szorc <gregory.szorc at gmail.com>
# Date 1436998444 25200
#      Wed Jul 15 15:14:04 2015 -0700
# Node ID e588eefb8ae309b18d4a2e4e28134d363fd9cf8e
# Parent  ed5d4ffdff877bfe632376a17a5a50cece817209
changegroup: compute seen files as changesets are added (issue4750)

Before this patch, addchangegroup() would walk the changelog and compute
the set of seen files between applying changesets and applying
manifests. When cloning large repositories such as mozilla-central,
this consumed a non-trivial amount of time. On my MBP, this walk takes
~10s. On a dainty EC2 instance, this was measured to take ~125s! On the
latter machine, this delay was enough for the Mercurial server to
disconnect the client, thinking it had timed out, thus causing a clone
to abort.

This patch enables the changelog to compute the set of changed files as
new revisions are added. By doing so, we:

* avoid a potentially heavy computation between changelog and manifest
  processing by spreading the computation across all changelog additions
* avoid extra reads from the changelog by operating on the data as it is
  added
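
To illustrate the difference between the two strategies, here is a
minimal toy sketch. It is not Mercurial's code: ToyChangelog, its
files() method, and the read counter are hypothetical stand-ins, and
the only detail borrowed from the patch is that the file list sits at
index 3 of the entry passed to the callback.

```python
# Toy model only: ToyChangelog is a hypothetical stand-in and does not
# reflect Mercurial's real changelog API.

class ToyChangelog:
    def __init__(self):
        self._entries = []
        self.reads = 0  # how many times stored entries were read back

    def addgroup(self, entries, addrevisioncb=None):
        """Append entries, invoking the optional callback for each one."""
        for entry in entries:
            self._entries.append(entry)
            if addrevisioncb is not None:
                addrevisioncb(self, entry)
        return len(self._entries)

    def files(self, rev):
        """Read a stored entry back and return its file list."""
        self.reads += 1
        return self._entries[rev][3]


# Each entry mimics (manifest, user, date, files); files is at index 3.
INCOMING = [
    ("m1", "alice", 0, ["a.txt", "b.txt"]),
    ("m2", "bob", 1, ["b.txt", "c.txt"]),
    ("m3", "alice", 2, ["a.txt"]),
]


def count_with_second_pass(entries):
    """Old approach: add everything, then walk the changelog again."""
    cl = ToyChangelog()
    end = cl.addgroup(entries)
    seen = set()
    for rev in range(0, end):
        seen.update(cl.files(rev))
    return len(seen), cl.reads


def count_with_callback(entries):
    """New approach: collect file names while entries are being added."""
    cl = ToyChangelog()
    seen = set()

    def onchangelog(cl, entry):
        seen.update(entry[3])

    cl.addgroup(entries, addrevisioncb=onchangelog)
    return len(seen), cl.reads


if __name__ == "__main__":
    print(count_with_second_pass(INCOMING))
    print(count_with_callback(INCOMING))
```

Both functions arrive at the same file count; the callback variant just
never reads an entry back, which is the effect the second bullet above
describes.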

On my MBP, the total CPU times for an `hg unbundle` with a local
mozilla-central gzip bundle containing 251,934 changesets and 211,065
files are as follows:

before: 360.1s
after:  359.0s

While the new time does appear to be a bit faster, I think this is
within the margin of error, so no net change in performance was
achieved.

In addition, there is no longer a visible pause between applying
changeset data and applying manifest data. Before, it sure felt like
Mercurial was lethargic making this transition. Now, the transition is
nearly instantaneous, giving the impression that Mercurial is faster. Of course,
eliminating this pause means that the potential for network disconnect due
to channel inactivity during the changelog walk is eliminated as well.
And that is the impetus behind this change and the somewhat hacky
features required for it.

diff --git a/mercurial/changegroup.py b/mercurial/changegroup.py
--- a/mercurial/changegroup.py
+++ b/mercurial/changegroup.py
@@ -719,9 +719,8 @@ def addchangegroup(repo, source, srctype
     if not source:
         return 0
 
     changesets = files = revisions = 0
-    efiles = set()
 
     tr = repo.transaction("\n".join([srctype, util.hidepassword(url)]))
     # The transaction could have been created before and already carries source
     # information. In this case we use the top level data. We overwrite the
@@ -752,17 +751,21 @@ def addchangegroup(repo, source, srctype
                                  total=self._total)
                 self._count += 1
         source.callback = prog(_('changesets'), expectedtotal)
 
+        efiles = set()
+        def onchangelog(cl, entry):
+            efiles.update(entry[3])
+
         source.changelogheader()
-        srccontent = cl.addgroup(source, csmap, trp)
+        srccontent = cl.addgroup(source, csmap, trp,
+                                 addrevisioncb=onchangelog)
+        efiles = len(efiles)
+
         if not (srccontent or emptyok):
             raise util.Abort(_("received changelog group is empty"))
         clend = len(cl)
         changesets = clend - clstart
-        for c in xrange(clstart, clend):
-            efiles.update(repo[c].files())
-        efiles = len(efiles)
         repo.ui.progress(_('changesets'), None)
 
         # pull off the manifest group
         repo.ui.status(_("adding manifests\n"))

