[PATCH 2 of 6] graphlog: turn getlogrevs() into a generator

Tue May 8 15:58:07 CDT 2012

# HG changeset patch
# User Patrick Mezard <patrick at mezard.eu>
# Date 1336509824 -7200
# Node ID 53db775e252c0f4f7add32b62e08096814d3ab7e
# Parent  d30440da7b38c671266ae1734db3921475a63caa
graphlog: turn getlogrevs() into a generator

This improves the poor "time to first changeset" compared to the
original log command. When running:

  $ hg log -u user

log will enumerate the changelog and display matching revisions when
they are found. But:

  $ hg log -G -u user

will first find all revisions matching the user then start to display
them.

Initially, I considered turning revset.match() into a generator. This is
doable but requires a fair amount of work. Instead,
cmdutil.increasingwindows() is reused to call the revset matcher
repeatedly. This has the nice properties of:
- Let us reorder the windows after filtering, which is necessary as the
  matcher can reorder inputs but is an internal detail not a feature.
- Let us feed the matcher with windows in changelog order, which is good
  for performances.
- Have a generator designed for log-like commands, returning small
  windows at first then batching larger ones.

I feel that calling the matcher multiple times is correct, at least with
the revsets involved in getlogrevs() because they are:
- stateless (no limit())
- respecting f(a|b) = f(a) | f(b), though I have no valid argument about
  that.

Known issues compared to log code:
- Calling the revset matcher multiple times can be slow when revset
  functions have to create expensive data structure for filtering. This
  will be addressed in a followup.
- Predicate combinations like "--user foo --user bar" or "--user foo and
  --branch bar" are inherently slower because all input revision are
  checked against the first condition, then against the second, and so
  forth. log would enumerate the input revisions once and check each of
  them once against all conditions, which is faster. There are solutions
  but nothing cheap to implement.

Some numbers against mozilla repository:

                             first line    total

  * hg log -u rnewman
  /Users/pmezard/bin/hg-2.2      0.148s   7.293s
  /Users/pmezard/bin/hgdev       0.132s   5.747s

  * hg log -u rnewman -u girard
  /Users/pmezard/bin/hg-2.2      0.146s   7.323s
  /Users/pmezard/bin/hgdev       0.136s  11.096s

  * hg log -l 10
  /Users/pmezard/bin/hg-2.2      0.137s   0.153s
  /Users/pmezard/bin/hgdev       0.128s   0.144s

  * hg log -l 10 -u rnewman
  /Users/pmezard/bin/hg-2.2      0.146s   0.265s
  /Users/pmezard/bin/hgdev       0.133s   0.236s

  * hg log -b GECKO193a2_20100228_RELBRANCH
  /Users/pmezard/bin/hg-2.2      2.332s   6.618s
  /Users/pmezard/bin/hgdev       1.972s   5.543s

  * hg log xulrunner
  /Users/pmezard/bin/hg-2.2      5.829s   5.958s
  /Users/pmezard/bin/hgdev       0.194s   6.017s

  * hg log --follow xulrunner/build.mk
  /Users/pmezard/bin/hg-2.2      0.353s   0.438s
  /Users/pmezard/bin/hgdev       0.394s   0.580s

  * hg log -u girard tools
  /Users/pmezard/bin/hg-2.2      5.853s   6.012s
  /Users/pmezard/bin/hgdev       0.195s   6.030s

  * hg log -b COMM2000_20110314_RELBRANCH --copies
  /Users/pmezard/bin/hg-2.2      2.231s   6.653s
  /Users/pmezard/bin/hgdev       1.897s   5.585s

  * hg log --follow
  /Users/pmezard/bin/hg-2.2      0.137s  14.140s
  /Users/pmezard/bin/hgdev       0.381s  44.246s

  * hg log --follow -r 80000:90000
  /Users/pmezard/bin/hg-2.2      0.127s   1.611s
  /Users/pmezard/bin/hgdev       0.147s   1.847s

  * hg log --follow -r 90000:80000
  /Users/pmezard/bin/hg-2.2      0.130s   1.702s
  /Users/pmezard/bin/hgdev       0.368s   6.106s

  * hg log --follow -r 80000:90000 js/src/jsproxy.cpp
  /Users/pmezard/bin/hg-2.2      0.343s   0.388s
  /Users/pmezard/bin/hgdev       0.437s   0.631s

  * hg log --follow -r 90000:80000 js/src/jsproxy.cpp
  /Users/pmezard/bin/hg-2.2      0.342s   0.389s
  /Users/pmezard/bin/hgdev       0.442s   0.628s

diff --git a/hgext/graphlog.py b/hgext/graphlog.py
--- a/hgext/graphlog.py
+++ b/hgext/graphlog.py
@@ -391,15 +391,41 @@
     return expr, filematcher
 
 def getlogrevs(repo, pats, opts):
-    """Return (revs, expr, filematcher) where revs is a list of
+    """Return (revs, expr, filematcher) where revs is an iterable of
     revision numbers, expr is a revset string built from log options
     and file patterns or None, and used to filter 'revs'. If --stat or
     --patch are not passed filematcher is None. Otherwise it is a
     callable taking a revision number and returning a match objects
     filtering the files to be detailed when displaying the revision.
     """
+    def increasingrevs(repo, revs, matcher):
+        # The sorted input rev sequence is chopped in sub-sequences
+        # which are sorted in ascending order and passed to the
+        # matcher. The filtered revs are sorted again as they were in
+        # the original sub-sequence. This achieve several things:
+        #
+        # - getlogrevs() now returns a generator which behaviour is
+        #   adapted to log need. First results come fast, last ones
+        #   are batched for performances.
+        #
+        # - revset matchers often operate faster on revision in
+        #   changelog order, because most filters deal with the
+        #   changelog.
+        #
+        # - revset matchers can reorder revisions. "A or B" typically
+        #   returns returns the revision matching A then the revision
+        #   matching B. We want to hide this internal implementation
+        #   detail from the caller, and sorting the filtered revision
+        #   again achieves this.
+        for i, window in cmdutil.increasingwindows(0, len(revs), windowsize=1):
+            orevs = revs[i:i + window]
+            nrevs = set(matcher(repo, sorted(orevs)))
+            for rev in orevs:
+                if rev in nrevs:
+                    yield rev
+
     if not len(repo):
-        return [], None, None
+        return iter([]), None, None
     # Default --rev value depends on --follow but --follow behaviour
     # depends on revisions resolved from --rev...
     follow = opts.get('follow') or opts.get('follow_first')
@@ -411,18 +437,17 @@
         else:
             revs = range(len(repo) - 1, -1, -1)
     if not revs:
-        return [], None, None
+        return iter([]), None, None
     expr, filematcher = _makelogrevset(repo, pats, opts, revs)
     if expr:
-        # Evaluate revisions in changelog order for performance
-        # reasons but preserve the original sequence order in the
-        # filtered result.
-        matched = set(revset.match(repo.ui, expr)(repo, sorted(revs)))
-        revs = [r for r in revs if r in matched]
+        matcher = revset.match(repo.ui, expr)
+        revs = increasingrevs(repo, revs, matcher)
     if not opts.get('hidden'):
         # --hidden is still experimental and not worth a dedicated revset
         # yet. Fortunately, filtering revision number is fast.
-        revs = [r for r in revs if r not in repo.changelog.hiddenrevs]
+        revs = (r for r in revs if r not in repo.changelog.hiddenrevs)
+    else:
+        revs = iter(revs)
     return revs, expr, filematcher
 
 def generate(ui, dag, displayer, showparents, edgefn, getrenamed=None,