Tag caching, at last

Greg Ward greg at gerg.ca
Fri Jul 10 10:34:10 CDT 2009


I finally have a working tag cache implementation.  The "ah-ha" moment
came when I realized we don't *have* to cache .hgtags contents, since
that's not the most expensive part of reading tags.  The only thing
that's critical to cache is (as Matt keeps saying) the head->filenode
mapping.  So I threw together a tag cache that *only* caches that
stuff and doesn't even try to cache .hgtags contents.
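
To make that concrete, here is a rough sketch of what the cache does
and does not hold (hypothetical names and values, not the real code):

    # Illustration only: the cache persists just the head -> .hgtags
    # filenode mapping, one entry per repository head that actually
    # has a .hgtags file.
    cachefnode = {
        "some-head-node": "hgtags-filenode-on-that-head",
        "another-head-node": "hgtags-filenode-on-that-other-head",
    }
    # The expensive part this avoids is the per-head manifest lookup,
    # i.e. repo[head].filenode('.hgtags').  The .hgtags *contents* are
    # still read from the filelog every time tags are computed.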

Benefits of this approach: strip and rollback are much less of a
problem.  And tag rank is not an issue, since we always read the
.hgtags contents the same way.

Drawback: it's not caching as much as it could, so it's probably
possible to squeeze out some more performance.  But this should also
make the eventual patch smaller, easier to review, and less risky.

I don't have a patch suitable for review yet, since my localrepo.py is
full of scars from the last week of implementing three different tag
caches.  (Plus a "null" tag cache so it's easy to revert to the
current behaviour.)  But I'd like to give a little guided tour of
where things are going... so here goes.

Obviously, it starts in localrepository.tags().  You may recall that I
refactored a bit here, so tags() itself is now trivial:

    def tags(self):
        '''return a mapping of tag to node'''
        if self._tags is None:
            (self._tags, self._tagtypes) = self._findtags()

        return self._tags
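
Repeated calls in the same process are thus just attribute lookups; a
quick illustration (hypothetical usage, not part of the patch):

    t1 = repo.tags()    # first call computes and caches the mapping
    t2 = repo.tags()    # later calls return the cached dict unchanged
    assert t1 is t2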

This just means that tags() is responsible for *in-memory* caching
(self._tags, self._tagtypes), which relieves mqrepo and bookmarksrepo
of knowing about those two attrs.  So the real work is in _findtags(),
which starts like this:

    def _findtags(self):
        '''Do the hard work of finding tags.  Return a pair of dicts
        (tags, tagtypes) where tags maps tag name to node, and tagtypes
        maps tag name to a string like \'global\' or \'local\'.
        Subclasses or extensions are free to add their own tags, but
        should be aware that the returned dicts will be retained for the
        duration of the localrepo object.'''

        # XXX what tagtype should subclasses/extensions use?  Currently
        # mq and bookmarks add tags, but do not set the tagtype at all.
        # Should each extension invent its own tag type?  Should there
        # be one tagtype for all such "virtual" tags?  Or is the status
        # quo fine?

        # tag names are in UTF-8 in these two dicts
        alltags = {}                    # map tag name to (node, hist)
        tagtypes = {}

        def updatetags(filetags, tagtype):
            [...skipping: it's just the second half of current readtags()...]

        self._findglobaltags3(updatetags)

        [...skip the rest: reading localtags is the same as current]

Finding global tags moved to its own method (mainly because I have
three different implementations of it that I'm juggling right now):

    def _findglobaltags3(self, updatetags):
        cache = tagcache3(self.ui, self)
        (heads, tagfnode) = cache.readcache()

        #self.ui.debug("iterating over %d head(s) for .hgtags...\n" % len(heads))
        seen = set()                    # set of fnode
        fctx = None
        for head in reversed(heads):    # oldest to newest
            fnode = tagfnode.get(head)
            if fnode and fnode not in seen:
                seen.add(fnode)
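                # reuse the filelog from the first filectx rather than
                # opening .hgtags from scratch for every head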
                if not fctx:
                    fctx = self.filectx('.hgtags', fileid=fnode)
                else:
                    fctx = fctx.filectx(fnode)

                filetags = self._readtags(fctx.data().splitlines(), fctx)
                updatetags(filetags, 'global')

        # and update the cache (if necessary)
        cache.writecache(heads, tagfnode)

And the tag cache itself is a separate class, again because I'm
juggling three of them.  But the separation of responsibilities is
fairly clean.  And so far, everything you've seen has been existing
code rearranged and refactored.  All the new stuff is in tagcache3:

class tagcache3(object):
    """Object for managing persistent tag cache.  Only needs to live for
    as long as it takes to open, read, update, and write the cache."""

    # Version 3: stores only info about heads, not the tag contents from
    # each head: i.e. this doesn't try to squeeze out the maximum in
    # performance, but it has a better chance at being correct and
    # reviewable on its own.  But this is the biggest optimization, and
    # dodges all sorts of difficulties that arise when caching .hgtags
    # content (e.g. handling strip/rollback gets harder, handling tag
    # rank is harder, etc.).


    __slots__ = ['ui',
                 'repo',
                 'shouldwrite',
                ]

    cacheversion = 0                    # format version

    def __init__(self, ui, repo):
        self.ui = ui
        self.repo = repo
        self.shouldwrite = False

    def readcache(self):
        (ui, repo) = (self.ui, self.repo)
        try:
            cachefile = repo.opener("tags.cache", "rt")
            #ui.debug("reading tag cache from %s\n" % cachefile.name)
        except IOError:
            cachefile = None
        else:
            canuse = self._readversion(cachefile)
            if not canuse:
                cachefile.close()
                cachefile = None

        # The cache file consists of lines like
        #   <headrev> <headnode> [<tagnode>]
        # where <headrev> and <headnode> redundantly identify a
        # repository head from the time the cache was written, and
        # <tagnode> is the filenode of .hgtags on that head.  Heads with
        # no .hgtags file will have no <tagnode>.  The cache is ordered
        # from tip to oldest (and that's why <headrev> is there, so a
        # quick visual check is all that's required to ensure correct
        # order).
        #
        # This information is enough to let us avoid the most expensive
        # part of finding global tags, which is looking up <tagnode> in
        # the manifest for each head.
        cacheheads = []                 # list of headnode
        cachefnode = {}                 # map headnode to filenode
        if cachefile:
            for line in cachefile:
                line = line.rstrip().split()
                head = bin(line[1])
                cacheheads.append(head)
                if len(line) == 3:
                    fnode = bin(line[2])
                    cachefnode[head] = fnode

        # Determine which heads have been destroyed by strip or
        # rollback.  That'll ensure that writecache() writes accurate
        # data, and it makes life easier for
        # localrepo._findglobaltags().
        goodheads = []
        for head in cacheheads:
            try:
                repo.changelog.rev(head)
                goodheads.append(head)
            except error.LookupError:
                pass
        if len(goodheads) < len(cacheheads):
            self.shouldwrite = True

        # Optimization #1: if the first head == current tip, there have
        # been no new changesets since the cache was written.  All we
        # need to do is read .hgtags from each head, and that's up to
        # our caller (localrepo._findglobaltags()).
        if goodheads and goodheads[0] == repo.changelog.tip():
            #ui.debug("tag cache: tip not changed, so cache is up-to-date\n")
            return (goodheads, cachefnode)

        # Tip has changed, so we have to find new heads.
        currentheads = repo.heads()
        goodheadset = set(goodheads)    # for fast membership tests
        newheads = [head
                    for head in currentheads
                    if head not in goodheadset]
        if newheads:
            self.shouldwrite = True
        #ui.debug("tag cache: found %d uncached head(s)\n" % len(newheads))

        # Now we have to lookup the .hgtags filenode for every new head.
        # This is the most expensive part of finding tags, so
        # performance will depend primarily on the size of newheads.
        # When there is no cache file, newheads == currentheads, so
        # that's the worst case.
        for head in newheads:
            cctx = self.repo[head]
            try:
                fnode = cctx.filenode('.hgtags')
                cachefnode[head] = fnode
            except error.LookupError:
                # no .hgtags file on this head
                pass

        # Everything in newheads should be closer to tip than everything
        # in goodheads.  And both are already sorted tip-to-oldest.  So
        # we can just concatenate them.  (XXX unless someone edits the
        # cache file in a sneaky attempt to trip us up.)
        return (newheads + goodheads, cachefnode)

    def _readversion(self, cachefile):
        """Read the first line of the cache file, which contains the
        file version number.  Return true if we can use this cache file."""

        # The first line looks like "hgtagcache 0000", where 0000 is the
        # version number for the file format in hex.  (This should make
        # it possible to switch to a binary format if necessary in
        # future, where the first 16 chars will be "hgtagcache xxxx\n".)
        firstline = cachefile.next()
        if not (len(firstline) == 16 and
                firstline[:11] == 'hgtagcache '):
            self.ui.warn(_("invalid tag cache file (ignoring it)\n"))
            return False
        else:
            try:
                version = int(firstline[11:15], 16)
            except ValueError:
                ui.warn(_("invalid tag cache version (%r)") % firstline[11:15])
                return False

        if version > self.cacheversion:
            ui.warn(_("tag cache file from a later Mercurial version "
                      "(ignoring it)"))
            return False

        return True

    def writecache(self, heads, tagfnode):
        if not self.shouldwrite:
            return

        tmpfile = self.repo.opener("tags.cache.tmp", "wt")
        #self.ui.debug("writing cache file %s\n" % tmpfile.name)

        tmpfile.write("hgtagcache %04x\n" % self.cacheversion)
        for head in heads:
            rev = self.repo[head].rev()
            fnode = tagfnode.get(head)
            if fnode:
                tmpfile.write("%d %s %s\n" % (rev, hex(head), hex(fnode)))
            else:
                tmpfile.write("%d %s\n" % (rev, hex(head)))

        tmpfilename = tmpfile.name
        cachefilename = self.repo.join("tags.cache")
        tmpfile.close()

        os.rename(tmpfilename, cachefilename)
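
For what it's worth, the resulting .hg/tags.cache ends up looking
something like this (made-up revision numbers and node hashes, purely
to illustrate the format):

    hgtagcache 0000
    4382 e63c23eaa88ae77967edcf4ea194d31167c478b0 0c04f2a8af31de17fab7422878ee5a2dadbc943d
    4379 bf1bb2660267c5e99b20e36bc0fca8a8e5cbf441

Here the tip (rev 4382) has a .hgtags file, and the older head (rev
4379) does not, so its line has no <tagnode>.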

So that's it.  The whole test suite passes with one tiny change to
test-empty.out (an empty repo now contains tags.cache if you've done
any tag-reading operation, which test-empty does).

Please let me know what you think.  I'm not convinced that having the
tag cache in a separate class is necessary, but it's been very handy
for keeping me sane while juggling multiple implementations.  It also
leaves open the possibility of keeping a 'nulltagcache' class, so that
you can easily revert to 1.3 behaviour if you see tag caching bugs
between now and 1.4.

Also, I'm not *opposed* to caching .hgtags content.  It's just a
harder problem than I expected, and it's a harder problem than caching
head/fnode info.  And the two problems can be tackled separately,
thank goodness.  If we want to cache .hgtags content, IMHO that
belongs in a separate patch.  The big performance win should come with
the above code.

Greg

