MQ performance on large repo

Wed Mar 3 21:42:10 CST 2010

OK, I think we've picked the low-hanging fruit of my "qrefresh"
profile.  It's now down from ~9 sec to ~5.8 sec in an fncache repo.
(I also have a clone of that repo in the older 'store' format, made on
the assumption that avoiding fncache would speed things up, but
performance is very similar in the two repos.)

So what's left?  First, here are the numbers again:

  qpop       3.26  3.26  3.33  3.41  3.32
  qpush      2.22  2.18  2.18  2.19  2.18
  qrefresh   5.77  5.83  5.93  5.74  5.76
  strip      1.99  2.02  1.98  2.00  2.00

Each operation was repeated five times, and the table shows the runtime
of each of the five runs in seconds.  The patch queue had one small
patch that modifies one file.

So here is the profile from that qrefresh operation:

"""
   CallCount    Recursive    Total(ms)   Inline(ms) module:lineno(function)
         347            0      0.7987      0.7987   <zlib.decompress>
           3            0      2.0807      0.7816   mercurial.revlog:1313(strip)
     +213165            0      0.7809      0.4841   +mercurial.revlog:269(__getitem__)
     +213187            0      0.1694      0.1694   +mercurial.revlog:504(__iter__)
          +5            0      0.0001      0.0000   +mercurial.transaction:19(_active)
          +3            0      0.0000      0.0000   +mercurial.revlog:522(start)
          +6            0      0.0001      0.0000   +<len>
      320858            0      1.4716      0.7254   mercurial.revlog:269(__getitem__)
     +320858            0      0.4479      0.4479   +<_struct.unpack>
        +115            0      0.2983      0.0004   +mercurial.revlog:264(load)
         340            0      0.6453      0.6308   mercurial.revlog:164(loadblock)
        +340            0      0.0127      0.0127   +<method 'read' of 'file' objects>
        +340            0      0.0012      0.0012   +<method 'seek' of 'file' objects>
        +680            0      0.0005      0.0005   +<len>
         +10            0      0.0000      0.0000   +<max>
        2767            0      0.5721      0.5721   <mercurial.osutil.listdir>
      320871            0      0.4480      0.4480   <_struct.unpack>
           2            0      0.4127      0.4014   mercurial.revlog:137(loadmap)
        +834            0      0.0113      0.0113   +<method 'read' of 'file' objects>
          +2            0      0.0000      0.0000   +<method 'seek' of 'file' objects>
      106601            0      0.9892      0.3206   mercurial.revlog:514(linkrev)
     +106582            0      0.6685      0.2380   +mercurial.revlog:269(__getitem__)
       19239            0      0.2706      0.2706   <posix.lstat>
           3            0      1.7075      0.2637   mercurial.dirstate:433(walk)
       +2767            0      0.5721      0.5721   +<mercurial.osutil.listdir>
          +2            0      0.3685      0.0456   +<zip>
          +5            0      0.0316      0.0316   +<sorted>
      +38737            0      0.0285      0.0285   +mercurial.match:74(<lambda>)
      +21963            0      0.0258      0.0258   +mercurial.dirstate:126(_join)
Time: real 9.070 secs (user 6.910+0.000 sys 1.780+0.000)
"""

If I'm interpreting this correctly, I think it means that the bulk of
the runtime is from loading the revlog index for 00changelog.i and
00manifest.i into memory.  (We see 2*N calls to revlog.__getitem__()
and __iter__() because 00changelog.i and 00manifest.i both have N
revisions.)

I don't understand why there are ~320k (3*N?) calls to struct.unpack().  Hmmmm.
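
For what it's worth, here is a quick tally of the call counts, read
literally from the profile.  It only does arithmetic on the numbers
shown above; using the linkrev() call count as a stand-in for N is my
assumption.  It does not explain *why* each caller touches every index
entry, only where the unpack() calls seem to be coming from.

```python
# Call counts copied from the profile above.
getitem_total = 320858         # mercurial.revlog:269(__getitem__)
getitem_from_strip = 213165    # __getitem__ called from revlog.strip
getitem_from_linkrev = 106582  # __getitem__ called from revlog.linkrev
unpack_from_getitem = 320858   # _struct.unpack called from __getitem__
unpack_total = 320871          # all _struct.unpack calls

# Assumption: the linkrev() call count is roughly N.
n_estimate = 106601

# __getitem__ calls not accounted for by strip or linkrev.
other_getitem = getitem_total - getitem_from_strip - getitem_from_linkrev

# unpack() calls that did not come from __getitem__.
other_unpack = unpack_total - unpack_from_getitem

# How many times N fits into the __getitem__ count.
ratio = round(getitem_total / float(n_estimate), 2)

print(other_getitem)  # 1111
print(other_unpack)   # 13
print(ratio)          # 3.01
```

So on this reading nearly every unpack() is one-per-__getitem__(), and
the ~3*N looks like ~2*N from strip plus ~N from linkrev.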

One thing that occurs to me: of *course* strip needs to know the
offset of the revisions it's going to strip.  But does it really need
the entire index?  No, but revlog always ends up loading the entire
index even if we only need the last entry or the last 10 entries.
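
To make the "only the last 10 entries" idea concrete, here is a minimal
sketch.  It assumes a toy index made of fixed-size records; the record
layout (TOY_RECORD) and the helper name read_tail() are made up for
illustration and are not the real revlog format or API.

```python
import os
import struct
import tempfile

# Hypothetical toy record: (data offset, data length, linkrev).
# This is NOT the real revlog index layout.
TOY_RECORD = struct.Struct(">QII")


def read_tail(path, count):
    """Return the last `count` records, reading only those bytes."""
    total = os.path.getsize(path) // TOY_RECORD.size
    count = min(count, total)
    with open(path, "rb") as fp:
        # Seek straight past everything we do not need.
        fp.seek((total - count) * TOY_RECORD.size)
        data = fp.read(count * TOY_RECORD.size)
    return [TOY_RECORD.unpack_from(data, i * TOY_RECORD.size)
            for i in range(count)]


# Build a toy index with 100000 entries, then read only the last 10.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for rev in range(100000):
        tmp.write(TOY_RECORD.pack(rev * 10, 10, rev))
    toy_path = tmp.name

tail = read_tail(toy_path, 10)
os.remove(toy_path)

print(len(tail))  # 10
print(tail[-1])   # (999990, 10, 99999)
```

The point is just that with fixed-size records the cost depends on how
many entries you ask for, not on N.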

What if revlog were a little more selective about reading the index
into memory?  Could it be made even lazier than it already is?

I don't *think* there are further tweaks that can be snuck in before
1.5, but it might be worth tossing some ideas around.  (Here's one:
mmap() the index so we don't need both a list and a map in memory.
Then try to delay building the map as much as possible.  Or maybe
write a combined index/map data structure in C.  Just thinking out
loud here.)
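
Here is a rough sketch of the mmap idea, still thinking out loud.  The
class name LazyIndex, the toy record layout (TOY_ENTRY) and the fake
node ids are all hypothetical; this is not how revlog stores or exposes
its index.  It only shows entries being decoded on demand from an
mmap()ed file, with the node-to-rev map built on first lookup.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical toy entry: (data offset, data length, 20-byte node id).
# This is NOT the real revlog index layout.
TOY_ENTRY = struct.Struct(">QI20s")


class LazyIndex(object):
    """Decode entries on demand; build the node map only when asked."""

    def __init__(self, path):
        self._fp = open(path, "rb")
        self._mm = mmap.mmap(self._fp.fileno(), 0, access=mmap.ACCESS_READ)
        self._nodemap = None  # built lazily by rev()

    def __len__(self):
        return len(self._mm) // TOY_ENTRY.size

    def __getitem__(self, rev):
        if rev < 0:
            rev += len(self)
        if not 0 <= rev < len(self):
            raise IndexError(rev)
        # Only this one entry is unpacked; no list of all entries exists.
        return TOY_ENTRY.unpack_from(self._mm, rev * TOY_ENTRY.size)

    def rev(self, node):
        if self._nodemap is None:
            # First node lookup pays for the full scan, not the open.
            self._nodemap = dict((self[r][2], r) for r in range(len(self)))
        return self._nodemap[node]

    def close(self):
        self._mm.close()
        self._fp.close()


# Build a toy index with 1000 entries and fake 20-byte node ids.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for r in range(1000):
        tmp.write(TOY_ENTRY.pack(r * 10, 10, struct.pack(">I", r) * 5))
    toy_index_path = tmp.name

index = LazyIndex(toy_index_path)
last = index[-1]                              # no map needed for this
map_built_early = index._nodemap is not None  # False
found = index.rev(struct.pack(">I", 42) * 5)  # map gets built here
index.close()
os.remove(toy_index_path)

print(last[0], map_built_early, found)
```

Something like strip, which mostly wants the last few entries by
revision number, would never trigger the map build in this scheme.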

Greg