[PATCH 5 of 6] hidden: add hiddenset and updatevisibility logic

Durham Goode durham at fb.com
Mon May 22 19:07:13 EDT 2017


On 5/18/17 8:25 PM, Gregory Szorc wrote:
> On Thu, May 18, 2017 at 11:23 AM, Durham Goode <durham at fb.com
> <mailto:durham at fb.com>> wrote:
>
>     # HG changeset patch
>     # User Durham Goode <durham at fb.com <mailto:durham at fb.com>>
>     # Date 1495129620 25200
>     #      Thu May 18 10:47:00 2017 -0700
>     # Node ID 0979ce9d40803103441b6df1ebefafe80bb9a057
>     # Parent  2aef3c4be6ba81e9d2453760515517df6fc2b816
>     hidden: add hiddenset and updatevisibility logic
>
>     This adds a function that can incrementally update the hidden state
>     for a
>     repository when given a list of nodes that have been changed. This
>     will be
>     called by all actions that can affect hiddenness (bookmarks, tags,
>     obsmarkers,
>     working copy parents). It uses a hash of the inputs that affect the
>     hidden store
>     to ensure things are up-to-date.
>
>     This has the nice property that all logic for propagating hiddenness is
>     contained to one function, and all the consumers just have to report
>     what
>     commits they touched. If an external extension wants to modify how
>     hiding is
>     computed, it can just wrap this one function.
>
>     It also is nice because it updates the hidden state incrementally,
>     instead of
>     recomputing the entire repo at once.
>
>
> Most comments inline. My biggest concerns so far in this series are
> around performance and cache semantics. You said in a separate email
> (and the code reinforces) that this file is a cache. Yet it isn't in
> .hg/cache nor does it have a repo requirements entry guarding it. The
> nearest analogue is the phaseroots file. But that's a canonical store
> (not a cache) and old clients don't care about it (like caches) so we
> don't need a requirements file. So there's a bit of work that needs to
> be done to turn this into a proper cache.

My long term hopes were that it would stop being a cache, but you are 
right. For the interim it should be in .hg/cache.

>     diff --git a/mercurial/hidden.py b/mercurial/hidden.py
>     --- a/mercurial/hidden.py
>     +++ b/mercurial/hidden.py
>     @@ -5,12 +5,38 @@
>
>      from __future__ import absolute_import
>
>     +import hashlib
>     +import heapq
>     +
>      from .i18n import _
>     -from .node import bin, hex
>     +from .node import bin, hex, nullid, nullrev
>      from . import (
>          error,
>     +    obsolete,
>     +    tags as tagsmod,
>     +    util,
>      )
>
>     +class lazychildset(object):
>     +    """Data structure for efficiently answering the query 'what are
>     my children'
>     +    for a series of revisions."""
>     +    def __init__(self, repo):
>     +        self.repo = repo.unfiltered()
>     +        self._childmap = {}
>     +        self._minprocessed = len(repo.changelog)
>     +
>     +    def get(self, rev):
>     +        childmap = self._childmap
>     +        cl = self.repo.changelog
>     +        if rev < self._minprocessed:
>     +            for r in xrange(rev, self._minprocessed):
>     +                for p in cl.parentrevs(r):
>     +                    if p != nullrev:
>     +                        childmap.setdefault(p, []).append(r)
>     +            self._minprocessed = rev
>     +
>     +        return childmap.get(rev, ())
>     +
>      class rootset(object):
>          """A commit set like object where a commit being in the set
>     implies that all
>          descendants of that commit are also in the set.
>     @@ -135,3 +161,168 @@ class rootset(object):
>          def __len__(self):
>              self._load()
>              return len(self._cache)
>     +
>     +class hiddenset(rootset):
>     +    """A set representing what commits should be hidden. It
>     contains logic to
>     +    update itself based on what nodes are reported as having been
>     changed.
>     +    """
>     +    def __init__(self, repo, opener, name):
>     +        super(hiddenset, self).__init__(repo, opener, name)
>     +        self._load()
>     +
>     +    def _load(self):
>     +        if self.roots is None:
>     +            try:
>     +                cachehash = None
>     +                raw = self.opener.read(self.filename)
>     +                if len(raw) >= 20:
>     +                    cachehash = raw[:20]
>     +                    raw = raw[20:]
>     +                else:
>     +                    raise RuntimeError("invalid cache")
>
>
> RuntimeError feels like the wrong exception here since lots of Python
> code can raise it and we don't want to be catching and swallowing
> unrelated errors. Can you invent a dummy type for this or validate the
> exception in the except block?
>
>
>     +
>     +                realhash = self._gethash()
>
>
> _gethash() calls _gethiderevs() which queries a bunch of things. I'm
> worried it is doing too much for a perf critical read path.
> Specifically, the call to obsolete.getrevs(repo, 'obsolete') requires
> loading the obsstore. I thought a point of this work was alleviate
> scaling problems of the obsstore? Note that since the obsstore is append
> only, perhaps we only need to store and compare the size of the obsstore
> file for hash verification?

The first step is to get all the scaling problems into one area (in this 
case _gethash), then we can iterate on why _gethash is slow, like using 
the obstore's new hash function.

> Another concern is loading tags. I'd have to look at the code, but if
> this needs to resolve .hgtags, that's not great for perf. Related to
> that, presumably this will get hooked into the visible repoview filter
> someday. Ideally, we shouldn't have to load tags, bookmarks, etc on the
> critical path of loading the default repoview.

Pierre-Yves addressed this, but this is only local tags and therefore 
doesn't read .hgtags.

> While I'm here, could v2 please include a large comment block in this
> file about the caching strategy? That really helps us re-evaluate the
> roles of caches in the future. See tags.py for inspiration.
>
>
>     +                if cachehash == realhash:
>     +                    self.roots = self._deserialize(raw)
>     +                else:
>     +                    raise RuntimeError("cache hash mismatch")
>     +            except (IOError, RuntimeError):
>     +                # Rebuild the hidden roots from scratch
>     +                self.roots = set()
>     +                self._cache = set()
>     +                revs = self.repo.revs('not public()')
>     +                self.updatevisibility(revs)
>
>
> Calling self.updatevisibility() here is problematic because
> updatevisibility() takes out a lock unconditionally and repo.hidden
> could be called by a process that doesn't have write permissions. See
> tags.py for how cache files should handle permissions errors. You'll
> also want a test for this scenario. There's one for the tags fnodes cache.

True. I can make it gracefully handle permission issues. I'll have to 
think about if there's an abstraction that would work here.

> Also, this pattern of cache I/O complete with hash verification is
> common enough and has been the source of enough bugs over the years that
> I think it is time to formalize the pattern (possibly at the vfs layer)
> in order to cut down on bugs...
>
> Finally, since this is a cache, why is it in .hg/store instead of .hg/cache?
>
>
>
>     +
>     +            self._fillcache()
>     +            self.dirty = False
>     +
>     +    def _write(self, fp):
>     +        fp.write(self._gethash())
>     +        fp.write(self._serialize())
>     +        self.dirty = False
>     +
>     +    def _gethash(self):
>     +        hiderevs, nohiderevs = self._gethiderevs()
>     +        cl = self.repo.changelog
>     +        hash = hashlib.sha1()
>     +        hash.update("hide")
>     +        hash.update(''.join(sorted(cl.node(r) for r in hiderevs)))
>     +        hash.update("nohide")
>     +        hash.update(''.join(sorted(cl.node(r) for r in nohiderevs)))
>     +
>     +        return hash.digest()
>     +
>     +    def _gethiderevs(self):
>     +        """Returns the set of revs that potentially could be hidden
>     and a set of
>     +        revs that definitely should not be hidden."""
>     +        hiderevs = set()
>     +        nohiderevs = set()
>     +
>     +        repo = self.repo
>     +        cl = repo.changelog
>     +
>     +        # Hide obsolete commits
>     +        hiderevs.update(obsolete.getrevs(repo, 'obsolete'))
>     +
>     +        # Don't hide bookmarks
>     +        nohiderevs.update(cl.rev(node) for name, node in
>     +                          repo._bookmarks.iteritems())
>     +
>     +        # Don't hide workingcopy parents
>     +        nohiderevs.update(cl.rev(node) for node in
>     repo.dirstate.parents()
>     +                          if node != nullid and node in cl.nodemap)
>     +
>     +        # Don't hide local tags
>     +        tags = {}
>     +        tagsmod.readlocaltags(repo.ui, repo, tags, {})
>     +        nohiderevs.update(cl.rev(t[0]) for t in tags.values()
>     +                          if t[0] in cl.nodemap)
>     +
>     +        # Don't hide rebase commits
>     +        if util.safehasattr(repo, '_rebaseset'):
>     +            nohiderevs.update(repo._rebaseset)
>     +
>     +        return hiderevs, nohiderevs
>     +
>     +    def _shouldhide(self, rev, childmap, hiderevs, nohiderevs):
>     +        repo = self.repo
>     +        ispublic = repo._phasecache.phase(repo, rev) == 0
>
>
> Nit: use constant in phases module instead of 0
>
>
>     +        cl = repo.changelog
>     +        return (not ispublic and
>     +                rev in hiderevs and
>     +                rev not in nohiderevs and
>     +                all(cl.node(r) in self for r in childmap.get(rev)))
>     +
>     +    def _needupdate(self, revs):
>     +        repo = self.repo
>     +        cl = repo.changelog
>     +
>     +        hiderevs, nohiderevs = self._gethiderevs()
>     +
>     +        childmap = lazychildset(repo)
>     +        for rev in revs:
>     +            node = cl.node(rev)
>     +            # If the node should be visible...
>     +            if not self._shouldhide(rev, childmap, hiderevs,
>     nohiderevs):
>
>
> Resolving repo.changelog on every invocation inside _shouldhide() can
> add overhead. I wonder if _shouldhide() should accept an iterable of
> revs and return a dict...
>
>
>     +                # ...but it's hidden
>     +                if node in self:
>     +                    return True
>     +            # Otherwise, the node should be hidden
>     +            elif node not in self:
>     +                return True
>     +
>     +        return False
>     +
>     +    def updatevisibility(self, revs):
>     +        """Updates the visibility of the given revs, and propagates
>     those
>     +        updates to the rev parents as necessary."""
>     +        self._load()
>     +
>     +        repo = self.repo
>     +        seen = set(revs)
>     +        if not seen or not self._needupdate(seen):
>     +            # Even if there are no changes, we need to rewrite the
>     file so the
>     +            # cache hash is up to date with the latest changes.
>     +            with self.opener(self.filename, 'w+', atomictemp=True)
>     as fp:
>     +                self._write(fp)
>
>
> This being a cache, it needs to handle I/O failures gracefully if the
> user doesn't have permissions.
>
>
>     +            return
>     +
>     +        # Sort the revs so we can process them in topo order
>     +        heap = []
>     +        for rev in seen:
>     +            heapq.heappush(heap, -rev)
>     +
>     +        with repo.lock():
>     +            with repo.transaction('hidden') as tr:
>
>
> Do we need a transaction here? We don't use transactions for other caches.

The transaction is nice because it let's us defer writing the cache 
until the transaction completes.  I think all writes to disk should be 
in transactions, and the fact that our current caches avoid it is just a 
symptom of the lack of support in our transaction layer.

>
>     +                hiderevs, nohiderevs = self._gethiderevs()
>     +
>     +                cl = repo.changelog
>     +                childmap = lazychildset(repo)
>     +                while heap:
>     +                    rev = -heapq.heappop(heap)
>     +
>     +                    # If it should be visible...
>     +                    node = cl.node(rev)
>     +                    if not self._shouldhide(rev, childmap, hiderevs,
>     +                                            nohiderevs):
>
>
> Another repo.changelog perf issue lurking in _shouldhide() inside a loop.
>
>
>     +                        # And it's currently invisible...
>     +                        if node in self:
>     +                            # Make it visible and propagate
>     +                            # (propogate first, since otherwise we
>     can't access
>     +                            # the parents)
>     +                            for parent in cl.parentrevs(rev):
>     +                                if parent != nullrev and parent not
>     in seen:
>     +                                    heapq.heappush(heap, -parent)
>     +                                    seen.add(parent)
>     +                            self.set(tr, cl.node(rev), False,
>     +                                     childmap=childmap)
>     +
>     +                    # If it should not be visible, and it's not
>     already hidden..
>     +                    elif node not in self:
>     +                        # Hide it and propagate (propogate first, since
>     +                        # otherwise we can't access the parents)
>     +                        for parent in cl.parentrevs(rev):
>     +                            if parent != nullrev and parent not in
>     seen:
>     +                                heapq.heappush(heap, -parent)
>     +                                seen.add(parent)
>     +                        self.set(tr, node, True, childmap=childmap)
>     diff --git a/mercurial/localrepo.py b/mercurial/localrepo.py
>     --- a/mercurial/localrepo.py
>     +++ b/mercurial/localrepo.py
>     @@ -388,7 +388,7 @@ class localrepository(object):
>
>          @storecache('hidden.roots')
>          def hidden(self):
>     -        return hidden.rootset(self, self.svfs, 'hidden')
>     +        return hidden.hiddenset(self, self.svfs, 'hidden')
>
>          def close(self):
>              self._writecaches()
>     @@ -1384,6 +1384,10 @@ class localrepository(object):
>              l = self._lock(self.svfs, "lock", wait, None,
>                             self.invalidate, _('repository %s') %
>     self.origroot)
>              self._lockref = weakref.ref(l)
>     +        # Force the hidden cache to load, so it can validate it's
>     cache hash
>     +        # against the unmodified bookmark, obsolete, tags, and
>     working copy
>     +        # states.
>     +        self.hidden
>              return l
>
>          def _wlockchecktransaction(self):
>     diff --git a/tests/test-empty.t b/tests/test-empty.t
>     --- a/tests/test-empty.t
>     +++ b/tests/test-empty.t
>     @@ -26,6 +26,7 @@ Check the basic files created:
>      Should be empty:
>
>        $ ls .hg/store
>     +  hidden.roots
>
>      Poke at a clone:
>
>     @@ -49,5 +50,6 @@ Poke at a clone:
>      Should be empty:
>
>        $ ls .hg/store
>     +  hidden.roots
>
>        $ cd ..
>     diff --git a/tests/test-fncache.t b/tests/test-fncache.t
>     --- a/tests/test-fncache.t
>     +++ b/tests/test-fncache.t
>     @@ -92,6 +92,7 @@ Non store repo:
>        .hg/data/tst.d.hg
>        .hg/data/tst.d.hg/foo.i
>        .hg/dirstate
>     +  .hg/hidden.roots
>        .hg/last-message.txt
>        .hg/phaseroots
>        .hg/requires
>     @@ -129,6 +130,7 @@ Non fncache repo:
>        .hg/store/data
>        .hg/store/data/tst.d.hg
>        .hg/store/data/tst.d.hg/_foo.i
>     +  .hg/store/hidden.roots
>        .hg/store/phaseroots
>        .hg/store/undo
>        .hg/store/undo.backupfiles
>     @@ -313,6 +315,7 @@ Aborted transactions can be recovered la
>        data/z.i
>        $ hg recover
>        rolling back interrupted transaction
>     +  warning: ignoring unknown working parent 4b0f1917ccc5!
>
>
> What's this about? Accessing repo.dirstate as part of cache validation?

Yup. The issue was there before, this just caused it to show up in a 
warning.


More information about the Mercurial-devel mailing list