[PATCH 4 of 4] changegroup: compute seen files as changesets are added (issue4750)

Gregory Szorc gregory.szorc at gmail.com
Thu Jul 9 19:09:01 CDT 2015


# HG changeset patch
# User Gregory Szorc <gregory.szorc at gmail.com>
# Date 1436485900 25200
#      Thu Jul 09 16:51:40 2015 -0700
# Node ID e86c10381256d2996583f7978c9b1b7f636ec1d8
# Parent  c9abd93973708d02d7a945cbddaba9b420a18cfa
changegroup: compute seen files as changesets are added (issue4750)

Before this patch, addchangegroup() would walk the changelog and compute
the set of seen files between applying changesets and applying
manifests. When cloning large repositories such as mozilla-central,
this walk took a noticeable amount of time: ~10s on my MBP and ~125s
on a small EC2 instance. On the latter machine, the delay was long
enough for the Mercurial server to disconnect the client, thinking it
had timed out, thus causing the clone to abort.

This patch enables the changelog to compute the set of changed files as
new revisions are added. By doing so, we:

* avoid a potentially heavy computation between changelog and manifest
  processing by spreading the computation across all changelog additions
* avoid extra reads from the changelog by operating on the data as it is
  added
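
As a rough illustration of the two points above, here is a minimal
sketch that contrasts walking the revisions after the fact with
collecting file names while revisions are added. ToyChangelog and both
helper functions are hypothetical stand-ins written for this example;
they are not Mercurial's changelog class or API, and only the
seenfiles attribute name is borrowed from the patch.

```python
class ToyChangelog:
    """Hypothetical stand-in for a changelog; not Mercurial's real class."""

    def __init__(self):
        self.revisions = []    # one list of touched file names per revision
        self.seenfiles = None  # assign a set() to enable tracking

    def add(self, files):
        self.revisions.append(list(files))
        # Record files as the revision is added, but only when tracking is on.
        if self.seenfiles is not None:
            self.seenfiles.update(files)

    def addgroup(self, group):
        for files in group:
            self.add(files)


def count_files_by_walking(cl, start, end):
    """Old approach: re-read every new revision after the group is applied."""
    seen = set()
    for rev in range(start, end):
        seen.update(cl.revisions[rev])
    return len(seen)


def count_files_while_adding(cl, group):
    """New approach: let the changelog accumulate files during addgroup()."""
    cl.seenfiles = set()
    try:
        cl.addgroup(group)
        return len(cl.seenfiles)
    finally:
        # Always switch tracking back off, even if addgroup() raises.
        cl.seenfiles = None


group = [["a.txt", "b.txt"], ["b.txt"], ["c.txt", "a.txt"]]

old = ToyChangelog()
start = len(old.revisions)
old.addgroup(group)
walked = count_files_by_walking(old, start, len(old.revisions))

new = ToyChangelog()
incremental = count_files_while_adding(new, group)

print(walked, incremental)
```

Both approaches yield the same count; the second simply does the set
updates as part of work that is already happening, so no separate pass
over the new revisions is needed.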

On my MBP, the total CPU times for an `hg unbundle` with a local
mozilla-central gzip bundle containing 251,934 changesets and 211,065
files are as follows:

before: 360.1s
after:  359.0s

The new time is about 1.1s (roughly 0.3%) faster, which I think is
within the margin of error.

In addition, there is no longer a visible pause between applying
changeset data and applying manifest data. Before, Mercurial appeared
to stall during this transition. Now, the transition is nearly
instantaneous, giving the impression that Mercurial is faster.
Eliminating this pause also eliminates the potential for a network
disconnect due to channel inactivity during the changelog walk.

diff --git a/mercurial/changegroup.py b/mercurial/changegroup.py
--- a/mercurial/changegroup.py
+++ b/mercurial/changegroup.py
@@ -719,9 +719,8 @@ def addchangegroup(repo, source, srctype
     if not source:
         return 0
 
     changesets = files = revisions = 0
-    efiles = set()
 
     tr = repo.transaction("\n".join([srctype, util.hidepassword(url)]))
     # The transaction could have been created before and already carries source
     # information. In this case we use the top level data. We overwrite the
@@ -753,16 +752,19 @@ def addchangegroup(repo, source, srctype
                 self._count += 1
         source.callback = prog(_('changesets'), expectedtotal)
 
         source.changelogheader()
-        srccontent = cl.addgroup(source, csmap, trp)
+        try:
+            cl.seenfiles = set()
+            srccontent = cl.addgroup(source, csmap, trp)
+            efiles = len(cl.seenfiles)
+        finally:
+            cl.seenfiles = None
+
         if not (srccontent or emptyok):
             raise util.Abort(_("received changelog group is empty"))
         clend = len(cl)
         changesets = clend - clstart
-        for c in xrange(clstart, clend):
-            efiles.update(repo[c].files())
-        efiles = len(efiles)
         repo.ui.progress(_('changesets'), None)
 
         # pull off the manifest group
         repo.ui.status(_("adding manifests\n"))

