[PATCH 5 of 8] obscache: add a cache for 1/2 of the "obsolete" property

Fri May 19 17:53:20 EDT 2017

On Fri, May 19, 2017 at 04:28:04PM +0200, Pierre-Yves David wrote:
> # HG changeset patch
> # User Pierre-Yves David <pierre-yves.david at octobus.net>
> # Date 1495197986 -7200
> #      Fri May 19 14:46:26 2017 +0200
> # Node ID c2b1c81b3c4cba8cc21b1a1b92f4e34e6aa1807a
> # Parent  b5c984cd0640255b94f22d56a4bfe398b2c5de2e
> # EXP-Topic obscache
> # Available At https://www.mercurial-scm.org/repo/users/marmoute/mercurial/
> #              hg pull https://www.mercurial-scm.org/repo/users/marmoute/mercurial/ -r c2b1c81b3c4c
> obscache: add a cache for 1/2 of the "obsolete" property

So, can you point me to other (upcoming) concrete subclasses of
dualsourcecache, or is this the only one?

Also, could we put these big new classes in obscache.py or some other
new file? I'm wary of adding more code to the already-1200-lines
obsolete.py, and since these are coming from evolve I'm hoping that
they're still easily separable from obsolete.py...

>
> Knowing if a changeset is obsolete requires two data:
>
> 1) is the change non-public,
> 2) is the changeset affected by any obsolescence markers.
>
> The phase related data has some fast implementation already. However, the
> obsmarkers based property currently requires to parse the whole obsstore, a slow
> operation.
>
> That information is monotonic (new changeset are not affected, and once they are
> affected, they will be for ever), making it is easy to cache. We introduce a new
> class dedicated to this information. That first implementation still needs to
> parse the full obsstore when updating for the sake of simplicity. It will be
> improved later to allow lighter upgrade.
>
> The next changesets will put this new cache to use.
>
> That code is coming from the evolve extension, were it matured. To keep this
> changeset simple, there are a couple of improvement in the extension that will
> be ported later.
>
> diff --git a/mercurial/obsolete.py b/mercurial/obsolete.py
> --- a/mercurial/obsolete.py
> +++ b/mercurial/obsolete.py
> @@ -1363,6 +1363,138 @@ class dualsourcecache(object):
>
>          return reset, revs, markers, (obssize, obskey)
>
> +class obscache(dualsourcecache):
> +    """cache "does a rev is the precursors of some obsmarkers" property
> +
> +    This is not directly holding the "is this revision obsolete" information,
> +    because phases data gets into play here. However, it allow to compute the
> +    "obsolescence" set without reading the obsstore content.
> +
> +    The cache use a bytearray to store that data and simply write it on disk
> +    for storage.
> +
> +    Implementation note #1:
> +
> +      The obsstore is implementing only half of the transaction logic it
> +      should. It properly record the starting point of the obsstore to allow
> +      clean rollback. However it still write to the obsstore file directly
> +      during the transaction. Instead it should be keeping data in memory and
> +      write to a '.pending' file to make the data vailable for hooks.
> +
> +      This cache is not going further than what the obsstore is doing, so it
> +      does not has any '.pending' logic. When the obsstore gains proper
> +      '.pending' support, adding it to this cache should not be too hard. As
> +      the flag always move from 0 to 1, we could have a second '.pending' cache
> +      file to be read. If flag is set in any of them, the value is 1. For the
> +      same reason, updating the file in place should be possible.
> +
> +    Implementation note #2:
> +
> +        Storage-wise, we could have a "start rev" to avoid storing useless
> +        zero. That would be especially useful for the '.pending' overlay.
> +    """
> +
> +    _filepath = 'cache/obscache-v01'
> +    _headerformat = '>q20sQQ20s'
> +
> +    _cachename = 'obscache' # used for error message
> +
> +    def __init__(self, repo):
> +        super(obscache, self).__init__()
> +        self._ondiskkey = None
> +        self._vfs = repo.vfs
> +        self._data = bytearray()
> +
> +    def get(self, rev):
> +        """return True if "rev" is used as "precursors for any obsmarkers
> +
> +        IMPORTANT: make sure the cache has been updated to match the repository
> +        content before using it"""
> +        return self._data[rev]
> +
> +    def clear(self, reset=False):
> +        """invalidate the in memory cache content"""
> +        super(obscache, self).clear(reset=reset)
> +        self._data = bytearray()
> +
> +    def _updatefrom(self, repo, revs, obsmarkers):
> +        if revs:
> +            self._updaterevs(repo, revs)
> +        if obsmarkers:
> +            self._updatemarkers(repo, obsmarkers)
> +
> +    def _updaterevs(self, repo, revs):
> +        """update the cache with new revisions
> +
> +        Newly added changesets might be affected by obsolescence markers we
> +        already have locally. So we needs to have some global knowledge about
> +        the markers to handle that question.
> +
> +        XXX performance note:
> +
> +        Right now this requires parsing all markers in the obsstore. We could
> +        imagine using various optimisation (eg: another cache, network
> +        exchange, etc).
> +
> +        A possible approach to this is to build a set of all nodes used as
> +        precursors in `obsstore._obscandidate`. If markers are not loaded yet,
> +        we could initialize it by doing a quick scan through the obsstore data
> +        and filling a (pre-sized) set. Doing so would be much faster than
> +        parsing all the obsmarkers since we would access less data, not create
> +        any object beside the nodes and not have to decode any complex data.
> +
> +        For now we stick to the simpler approach of paying the
> +        performance cost on new changesets.
> +        """
> +        node = repo.changelog.node
> +        succs = repo.obsstore.successors
> +        for r in revs:
> +            val = int(node(r) in succs)
> +            self._data.append(val)
> +        cl = repo.changelog
> +        assert len(self._data) == len(cl), (len(self._data), len(cl))
> +
> +    def _updatemarkers(self, repo, obsmarkers):
> +        """update the cache with new markers"""
> +        rev = repo.changelog.nodemap.get
> +        for m in obsmarkers:
> +            r = rev(m[0])
> +            if r is not None:
> +                self._data[r] = 1
> +
> +    def save(self, repo):
> +        """save the data to disk
> +
> +        Format is pretty simple, we serialise the cache key and then drop the
> +        bytearray.
> +        """
> +
> +        # XXX we'll need some pending related logic when the obsstore get it
> +
> +        if self._cachekey is None or self._cachekey == self._ondiskkey:
> +            return
> +
> +        cachefile = repo.vfs(self._filepath, 'w', atomictemp=True)
> +        headerdata = struct.pack(self._headerformat, *self._cachekey)
> +        cachefile.write(headerdata)
> +        cachefile.write(self._data)
> +        cachefile.close()
> +
> +    def load(self, repo):
> +        """load data from disk"""
> +        assert repo.filtername is None
> +
> +        data = repo.vfs.tryread(self._filepath)
> +        if not data:
> +            self._cachekey = self.emptykey
> +            self._data = bytearray()
> +        else:
> +            headersize = struct.calcsize(self._headerformat)
> +            self._cachekey = struct.unpack(self._headerformat,
> +                                           data[:headersize])
> +            self._data = bytearray(data[headersize:])
> +        self._ondiskkey = self._cachekey
> +
>  # mapping of 'set-name' -> <function to compute this set>
>  cachefuncs = {}
>  def cachefor(name):
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel at mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel