[PATCH 5 of 6] hidden: add hiddenset and updatevisibility logic
Durham Goode
durham at fb.com
Mon May 22 19:07:13 EDT 2017
On 5/18/17 8:25 PM, Gregory Szorc wrote:
> On Thu, May 18, 2017 at 11:23 AM, Durham Goode <durham at fb.com
> <mailto:durham at fb.com>> wrote:
>
> # HG changeset patch
> # User Durham Goode <durham at fb.com <mailto:durham at fb.com>>
> # Date 1495129620 25200
> # Thu May 18 10:47:00 2017 -0700
> # Node ID 0979ce9d40803103441b6df1ebefafe80bb9a057
> # Parent 2aef3c4be6ba81e9d2453760515517df6fc2b816
> hidden: add hiddenset and updatevisibility logic
>
> This adds a function that can incrementally update the hidden state
> for a
> repository when given a list of nodes that have been changed. This
> will be
> called by all actions that can affect hiddenness (bookmarks, tags,
> obsmarkers,
> working copy parents). It uses a hash of the inputs that affect the
> hidden store
> to ensure things are up-to-date.
>
> This has the nice property that all logic for propagating hiddenness is
> contained to one function, and all the consumers just have to report
> what
> commits they touched. If an external extension wants to modify how
> hiding is
> computed, it can just wrap this one function.
>
> It also is nice because it updates the hidden state incrementally,
> instead of
> recomputing the entire repo at once.
>
>
> Most comments inline. My biggest concerns so far in this series are
> around performance and cache semantics. You said in a separate email
> (and the code reinforces) that this file is a cache. Yet it isn't in
> .hg/cache nor does it have a repo requirements entry guarding it. The
> nearest analogue is the phaseroots file. But that's a canonical store
> (not a cache) and old clients don't care about it (like caches) so we
> don't need a requirements file. So there's a bit of work that needs to
> be done to turn this into a proper cache.
My long term hopes were that it would stop being a cache, but you are
right. For the interim it should be in .hg/cache.
> diff --git a/mercurial/hidden.py b/mercurial/hidden.py
> --- a/mercurial/hidden.py
> +++ b/mercurial/hidden.py
> @@ -5,12 +5,38 @@
>
> from __future__ import absolute_import
>
> +import hashlib
> +import heapq
> +
> from .i18n import _
> -from .node import bin, hex
> +from .node import bin, hex, nullid, nullrev
> from . import (
> error,
> + obsolete,
> + tags as tagsmod,
> + util,
> )
>
> +class lazychildset(object):
> + """Data structure for efficiently answering the query 'what are
> my children'
> + for a series of revisions."""
> + def __init__(self, repo):
> + self.repo = repo.unfiltered()
> + self._childmap = {}
> + self._minprocessed = len(repo.changelog)
> +
> + def get(self, rev):
> + childmap = self._childmap
> + cl = self.repo.changelog
> + if rev < self._minprocessed:
> + for r in xrange(rev, self._minprocessed):
> + for p in cl.parentrevs(r):
> + if p != nullrev:
> + childmap.setdefault(p, []).append(r)
> + self._minprocessed = rev
> +
> + return childmap.get(rev, ())
> +
> class rootset(object):
> """A commit set like object where a commit being in the set
> implies that all
> descendants of that commit are also in the set.
> @@ -135,3 +161,168 @@ class rootset(object):
> def __len__(self):
> self._load()
> return len(self._cache)
> +
> +class hiddenset(rootset):
> + """A set representing what commits should be hidden. It
> contains logic to
> + update itself based on what nodes are reported as having been
> changed.
> + """
> + def __init__(self, repo, opener, name):
> + super(hiddenset, self).__init__(repo, opener, name)
> + self._load()
> +
> + def _load(self):
> + if self.roots is None:
> + try:
> + cachehash = None
> + raw = self.opener.read(self.filename)
> + if len(raw) >= 20:
> + cachehash = raw[:20]
> + raw = raw[20:]
> + else:
> + raise RuntimeError("invalid cache")
>
>
> RuntimeError feels like the wrong exception here since lots of Python
> code can raise it and we don't want to be catching and swallowing
> unrelated errors. Can you invent a dummy type for this or validate the
> exception in the except block?
>
>
> +
> + realhash = self._gethash()
>
>
> _gethash() calls _gethiderevs() which queries a bunch of things. I'm
> worried it is doing too much for a perf critical read path.
> Specifically, the call to obsolete.getrevs(repo, 'obsolete') requires
> loading the obsstore. I thought a point of this work was alleviate
> scaling problems of the obsstore? Note that since the obsstore is append
> only, perhaps we only need to store and compare the size of the obsstore
> file for hash verification?
The first step is to get all the scaling problems into one area (in this
case _gethash), then we can iterate on why _gethash is slow, like using
the obstore's new hash function.
> Another concern is loading tags. I'd have to look at the code, but if
> this needs to resolve .hgtags, that's not great for perf. Related to
> that, presumably this will get hooked into the visible repoview filter
> someday. Ideally, we shouldn't have to load tags, bookmarks, etc on the
> critical path of loading the default repoview.
Pierre-Yves addressed this, but this is only local tags and therefore
doesn't read .hgtags.
> While I'm here, could v2 please include a large comment block in this
> file about the caching strategy? That really helps us re-evaluate the
> roles of caches in the future. See tags.py for inspiration.
>
>
> + if cachehash == realhash:
> + self.roots = self._deserialize(raw)
> + else:
> + raise RuntimeError("cache hash mismatch")
> + except (IOError, RuntimeError):
> + # Rebuild the hidden roots from scratch
> + self.roots = set()
> + self._cache = set()
> + revs = self.repo.revs('not public()')
> + self.updatevisibility(revs)
>
>
> Calling self.updatevisibility() here is problematic because
> updatevisibility() takes out a lock unconditionally and repo.hidden
> could be called by a process that doesn't have write permissions. See
> tags.py for how cache files should handle permissions errors. You'll
> also want a test for this scenario. There's one for the tags fnodes cache.
True. I can make it gracefully handle permission issues. I'll have to
think about if there's an abstraction that would work here.
> Also, this pattern of cache I/O complete with hash verification is
> common enough and has been the source of enough bugs over the years that
> I think it is time to formalize the pattern (possibly at the vfs layer)
> in order to cut down on bugs...
>
> Finally, since this is a cache, why is it in .hg/store instead of .hg/cache?
>
>
>
> +
> + self._fillcache()
> + self.dirty = False
> +
> + def _write(self, fp):
> + fp.write(self._gethash())
> + fp.write(self._serialize())
> + self.dirty = False
> +
> + def _gethash(self):
> + hiderevs, nohiderevs = self._gethiderevs()
> + cl = self.repo.changelog
> + hash = hashlib.sha1()
> + hash.update("hide")
> + hash.update(''.join(sorted(cl.node(r) for r in hiderevs)))
> + hash.update("nohide")
> + hash.update(''.join(sorted(cl.node(r) for r in nohiderevs)))
> +
> + return hash.digest()
> +
> + def _gethiderevs(self):
> + """Returns the set of revs that potentially could be hidden
> and a set of
> + revs that definitely should not be hidden."""
> + hiderevs = set()
> + nohiderevs = set()
> +
> + repo = self.repo
> + cl = repo.changelog
> +
> + # Hide obsolete commits
> + hiderevs.update(obsolete.getrevs(repo, 'obsolete'))
> +
> + # Don't hide bookmarks
> + nohiderevs.update(cl.rev(node) for name, node in
> + repo._bookmarks.iteritems())
> +
> + # Don't hide workingcopy parents
> + nohiderevs.update(cl.rev(node) for node in
> repo.dirstate.parents()
> + if node != nullid and node in cl.nodemap)
> +
> + # Don't hide local tags
> + tags = {}
> + tagsmod.readlocaltags(repo.ui, repo, tags, {})
> + nohiderevs.update(cl.rev(t[0]) for t in tags.values()
> + if t[0] in cl.nodemap)
> +
> + # Don't hide rebase commits
> + if util.safehasattr(repo, '_rebaseset'):
> + nohiderevs.update(repo._rebaseset)
> +
> + return hiderevs, nohiderevs
> +
> + def _shouldhide(self, rev, childmap, hiderevs, nohiderevs):
> + repo = self.repo
> + ispublic = repo._phasecache.phase(repo, rev) == 0
>
>
> Nit: use constant in phases module instead of 0
>
>
> + cl = repo.changelog
> + return (not ispublic and
> + rev in hiderevs and
> + rev not in nohiderevs and
> + all(cl.node(r) in self for r in childmap.get(rev)))
> +
> + def _needupdate(self, revs):
> + repo = self.repo
> + cl = repo.changelog
> +
> + hiderevs, nohiderevs = self._gethiderevs()
> +
> + childmap = lazychildset(repo)
> + for rev in revs:
> + node = cl.node(rev)
> + # If the node should be visible...
> + if not self._shouldhide(rev, childmap, hiderevs,
> nohiderevs):
>
>
> Resolving repo.changelog on every invocation inside _shouldhide() can
> add overhead. I wonder if _shouldhide() should accept an iterable of
> revs and return a dict...
>
>
> + # ...but it's hidden
> + if node in self:
> + return True
> + # Otherwise, the node should be hidden
> + elif node not in self:
> + return True
> +
> + return False
> +
> + def updatevisibility(self, revs):
> + """Updates the visibility of the given revs, and propagates
> those
> + updates to the rev parents as necessary."""
> + self._load()
> +
> + repo = self.repo
> + seen = set(revs)
> + if not seen or not self._needupdate(seen):
> + # Even if there are no changes, we need to rewrite the
> file so the
> + # cache hash is up to date with the latest changes.
> + with self.opener(self.filename, 'w+', atomictemp=True)
> as fp:
> + self._write(fp)
>
>
> This being a cache, it needs to handle I/O failures gracefully if the
> user doesn't have permissions.
>
>
> + return
> +
> + # Sort the revs so we can process them in topo order
> + heap = []
> + for rev in seen:
> + heapq.heappush(heap, -rev)
> +
> + with repo.lock():
> + with repo.transaction('hidden') as tr:
>
>
> Do we need a transaction here? We don't use transactions for other caches.
The transaction is nice because it let's us defer writing the cache
until the transaction completes. I think all writes to disk should be
in transactions, and the fact that our current caches avoid it is just a
symptom of the lack of support in our transaction layer.
>
> + hiderevs, nohiderevs = self._gethiderevs()
> +
> + cl = repo.changelog
> + childmap = lazychildset(repo)
> + while heap:
> + rev = -heapq.heappop(heap)
> +
> + # If it should be visible...
> + node = cl.node(rev)
> + if not self._shouldhide(rev, childmap, hiderevs,
> + nohiderevs):
>
>
> Another repo.changelog perf issue lurking in _shouldhide() inside a loop.
>
>
> + # And it's currently invisible...
> + if node in self:
> + # Make it visible and propagate
> + # (propogate first, since otherwise we
> can't access
> + # the parents)
> + for parent in cl.parentrevs(rev):
> + if parent != nullrev and parent not
> in seen:
> + heapq.heappush(heap, -parent)
> + seen.add(parent)
> + self.set(tr, cl.node(rev), False,
> + childmap=childmap)
> +
> + # If it should not be visible, and it's not
> already hidden..
> + elif node not in self:
> + # Hide it and propagate (propogate first, since
> + # otherwise we can't access the parents)
> + for parent in cl.parentrevs(rev):
> + if parent != nullrev and parent not in
> seen:
> + heapq.heappush(heap, -parent)
> + seen.add(parent)
> + self.set(tr, node, True, childmap=childmap)
> diff --git a/mercurial/localrepo.py b/mercurial/localrepo.py
> --- a/mercurial/localrepo.py
> +++ b/mercurial/localrepo.py
> @@ -388,7 +388,7 @@ class localrepository(object):
>
> @storecache('hidden.roots')
> def hidden(self):
> - return hidden.rootset(self, self.svfs, 'hidden')
> + return hidden.hiddenset(self, self.svfs, 'hidden')
>
> def close(self):
> self._writecaches()
> @@ -1384,6 +1384,10 @@ class localrepository(object):
> l = self._lock(self.svfs, "lock", wait, None,
> self.invalidate, _('repository %s') %
> self.origroot)
> self._lockref = weakref.ref(l)
> + # Force the hidden cache to load, so it can validate it's
> cache hash
> + # against the unmodified bookmark, obsolete, tags, and
> working copy
> + # states.
> + self.hidden
> return l
>
> def _wlockchecktransaction(self):
> diff --git a/tests/test-empty.t b/tests/test-empty.t
> --- a/tests/test-empty.t
> +++ b/tests/test-empty.t
> @@ -26,6 +26,7 @@ Check the basic files created:
> Should be empty:
>
> $ ls .hg/store
> + hidden.roots
>
> Poke at a clone:
>
> @@ -49,5 +50,6 @@ Poke at a clone:
> Should be empty:
>
> $ ls .hg/store
> + hidden.roots
>
> $ cd ..
> diff --git a/tests/test-fncache.t b/tests/test-fncache.t
> --- a/tests/test-fncache.t
> +++ b/tests/test-fncache.t
> @@ -92,6 +92,7 @@ Non store repo:
> .hg/data/tst.d.hg
> .hg/data/tst.d.hg/foo.i
> .hg/dirstate
> + .hg/hidden.roots
> .hg/last-message.txt
> .hg/phaseroots
> .hg/requires
> @@ -129,6 +130,7 @@ Non fncache repo:
> .hg/store/data
> .hg/store/data/tst.d.hg
> .hg/store/data/tst.d.hg/_foo.i
> + .hg/store/hidden.roots
> .hg/store/phaseroots
> .hg/store/undo
> .hg/store/undo.backupfiles
> @@ -313,6 +315,7 @@ Aborted transactions can be recovered la
> data/z.i
> $ hg recover
> rolling back interrupted transaction
> + warning: ignoring unknown working parent 4b0f1917ccc5!
>
>
> What's this about? Accessing repo.dirstate as part of cache validation?
Yup. The issue was there before, this just caused it to show up in a
warning.
More information about the Mercurial-devel
mailing list