D4782: remotefilelog: import pruned-down remotefilelog extension from hg-experimental

indygreg (Gregory Szorc) phabricator at mercurial-scm.org
Mon Oct 1 22:54:10 UTC 2018


indygreg added a comment.


  So much code.
  
  I would prefer we not vendor RFL because by the time we adapt it to work properly with recent wire protocol and storage refactorings, we'll hardly have any of the original code remaining. But, you've made your point that having it in the official repo will make it easier to keep running while we rewrite the world around it. And having the code readily available will give us easier insight into future areas where we need to teach core about shallow stores. I'm specifically thinking about all the filectx/filelog hacks, including the need for prefetching. These are problems that we'll need to solve in core pretty soon now.
  
  I'd strongly prefer to remove some non-essential features from RFL so the code base is more manageable. This includes background prefetching and standalone cache server support. Those are both useful features, don't get me wrong. But I don't see an obvious path for them into core at this time. Not having their code lingering around would make the RFL extension much easier to approach.
  
  Anyway, I'm willing to rubber stamp a review. But I figure @durin42 will want to comment on my comments first.

INLINE COMMENTS

> __init__.py:278-324
> +        # Wrap the stream logic to add requirements and to pass include/exclude
> +        # patterns around.
> +        def setup_streamout(repo, remote):
> +            # Replace remote.stream_out with a version that sends file
> +            # patterns.
> +            def stream_out_shallow(orig):
> +                caps = remote.capabilities()

We probably want to rip out this narrow stream clone bit, since Pulkit is implementing this properly in core.

> __init__.py:725-726
> +
> +@command('gc', [], _('hg gc [REPO...]'), norepo=True)
> +def gc(ui, *args, **opts):
> +    '''garbage collect the client and server filelog caches

Maybe rename this to `debuggc` or something else?

> __init__.py:1057
> +
> +@command('prefetch', [
> +    ('r', 'rev', [], _('prefetch the specified revisions'), _('REV')),

Add `debug` prefix?

> __init__.py:1083
> +
> +@command('repack', [
> +     ('', 'background', None, _('run in a background process'), None),

Add `debug` prefix?

> basepack.py:151
> +        try:
> +            for filename, type, stat in osutil.listdir(self.path, stat=True):
> +                id = None

This should be using a `vfs` instance for I/O.

> basepack.py:374-377
> +        self.packfp, self.packpath = opener.mkstemp(
> +            suffix=self.PACKSUFFIX + '-tmp')
> +        self.idxfp, self.idxpath = opener.mkstemp(
> +            suffix=self.INDEXSUFFIX + '-tmp')

I thought there was an API on `vfs` instances for using temp files and doing an atomic rename on close?
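For reference, the API in question writes to a temporary file in the same directory and atomically renames it into place on close, so readers never see a half-written pack. A rough self-contained sketch of that pattern using only the stdlib (not the actual Mercurial `vfs`/`atomictempfile` API):

```python
import os
import tempfile

def atomicwrite(path, data):
    # Write to a temp file in the destination directory, then rename
    # it into place; the rename is atomic on POSIX and Windows, so a
    # concurrent reader sees either the old file or the new one.
    dirname = os.path.dirname(path) or '.'
    fd, tmppath = tempfile.mkstemp(dir=dirname, suffix='-tmp')
    try:
        with os.fdopen(fd, 'wb') as fp:
            fp.write(data)
        os.replace(tmppath, path)
    except Exception:
        os.unlink(tmppath)
        raise
```

Putting the temp file in the same directory matters: `os.replace()` is only atomic within a single filesystem.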

> basestore.py:252-256
> +        """Call this to add the given repo path to the store's list of
> +        repositories that are using it. This is useful later when doing garbage
> + collection, since it allows us to inspect the repos to see what nodes
> +        they want to be kept alive in the store.
> +        """

Oh hey - it is `hg share` done righter. But it should be doing this file write with a lock. If we were using the `vfs` layer, we'd get a devel warning about this.
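In core the fix would be to take the store lock around the write (`with repo.lock(): ...`). As a generic, self-contained illustration of the shape (a hypothetical lock-file guard, not Mercurial's actual locking API), guarding a shared-file append looks like:

```python
import os
import time

def appendwithlock(path, line, timeout=5.0):
    # Exclusive lock via an O_EXCL lock file: the same "take the lock
    # before mutating shared state" rule the vfs devel warning enforces.
    lockpath = path + '.lock'
    deadline = time.time() + timeout
    while True:
        try:
            fd = os.open(lockpath, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.time() > deadline:
                raise
            time.sleep(0.05)
    try:
        with open(path, 'ab') as fp:
            fp.write(line + b'\n')
    finally:
        os.close(fd)
        os.unlink(lockpath)
```

Without this, two repos registering themselves concurrently can interleave or lose writes to the shared repos list.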

> basestore.py:326
> +                # Don't delete pack files
> +                if '/packs/' in root:
> +                    continue

This is assuming POSIX paths.
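A portable version would compare path components instead of substring-matching `'/packs/'`, which misses `\packs\` on Windows. A hedged sketch:

```python
import os

def ispackpath(root):
    # Split into components so this works with both '/' and '\\'
    # separators rather than assuming POSIX paths.
    parts = os.path.normpath(root).split(os.sep)
    return 'packs' in parts
```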

> cacheclient.py:2
> +#!/usr/bin/env python
> +# cacheclient.py - example cache client implementation
> +#

We might want to nuke this direct-to-memcache functionality. At least as currently implemented. https://phab.mercurial-scm.org/D4774 is a much more robust solution to this general problem and we could shoehorn memcache support into that.

> connectionpool.py:1
> +# connectionpool.py - class for pooling peer connections for reuse
> +#

This is also functionality we should consider ripping out.

I mean it is potentially useful. But this type of thing should exist at the peer/ui level in core.

> contentstore.py:321
> +        if rl is None:
> +            revlogname = '00manifesttree.i'
> +            if name != '':

???

Is Facebook storing a flat and tree manifest in the same store?

This isn't functionality we'll want in core. At least not in the RFL extension. (The ability to store both versions of a manifest in the same repo is useful and I would like to create an extension to do that.)

> debugcommands.py:91-92
> +
> +def debugindex(orig, ui, repo, file_=None, **opts):
> +    """dump the contents of an index file"""
> +    if (opts.get('changelog') or

This looks to be a copy of `hg debugindex` from some point in time.

I had to split out revlog index dumping from `hg debugindex` a few weeks back in order to get things happy with the new file storage interface. RFL should probably define its own `debug*` command for dumping a pack's index.

> debugcommands.py:141-142
> +
> +def debugindexdot(orig, ui, repo, file_):
> +    """dump an index DAG as a graphviz dot file"""
> +    if not shallowrepo.requirement in repo.requirements:

This also looks like a copy.

> fileserverclient.py:1
> +# fileserverclient.py - client for communicating with the cache process
> +#

This may be a candidate for removal, since we may not want the standalone cache process feature in core.

> fileserverclient.py:81
> +
> +        def _updatecallstreamopts(self, command, opts):
> +            if command != 'getbundle':

This can be nuked if removing the stream clone feature.

> lz4wrapper.py:9-13
> +if util.safehasattr(demandimport, 'IGNORES'):
> +    # Since 670eb4fa1b86
> +    demandimport.IGNORES.update(['pkgutil', 'pkg_resources', '__main__'])
> +else:
> +    demandimport.ignore.extend(['pkgutil', 'pkg_resources', '__main__'])

This compatibility code isn't needed in core.

> remotefilectx.py:25
> +
> +class remotefilectx(context.filectx):
> +    def __init__(self, repo, path, changeid=None, fileid=None,

The number of changes to `filectx` needed to get this to work scares me a little. I feel like these things need to be on the fast path to moving to core, because some of them feel generic enough to apply to all shallow storage backends.

> remotefilectx.py:80
> +
> +        for rev in range(len(cl) - 1, 0, -1):
> +            node = cl.node(rev)

This should use `cl.revs()`.

> remotefilelog.py:43
> +
> +class remotefilelog(object):
> +

We should declare that this implements `repository.filestorage` and hook things up to `test-check-interfaces.py`.

> remotefilelog.py:45
> +
> +    _generaldelta = True
> +

This is no longer part of the storage interface.

> remotefilelog.py:182
> +
> +    rawsize = size
> +

This is no longer part of the storage interface.

> remotefilelog.py:266-293
> +    def emitrevisiondeltas(self, requests):
> +        prevnode = None
> +        for request in requests:
> +            node = request.node
> +            p1, p2 = self.parents(node)
> +            if prevnode is None:
> +                prevnode = p1

https://phab.mercurial-scm.org/D4803 should allow us to use a more correct implementation.

> remotefilelogserver.py:204-206
> +            if isinstance(proto, _sshv1server):
> +                # legacy getfiles method which only works over ssh
> +                caps.append(shallowrepo.requirement)

Maybe we can drop this?

> remotefilelogserver.py:212-221
> +    def _adjustlinkrev(orig, self, *args, **kwargs):
> +        # When generating file blobs, taking the real path is too slow on large
> +        # repos, so force it to just return the linkrev directly.
> +        repo = self._repo
> +        if util.safehasattr(repo, 'forcelinkrev') and repo.forcelinkrev:
> +            return self._filelog.linkrev(self._filelog.rev(self._filenode))
> +        return orig(self, *args, **kwargs)

Eek. So just activating this extension on the server can do bad things to linkrev resolution?

> shallowbundle.py:163-164
> +            if includepattern or excludepattern:
> +                repo.shallowmatch = match.match(repo.root, '', None,
> +                    includepattern, excludepattern)
> +            else:

This is allowing any pattern type. We recently decided to limit the types of patterns to allow. This is potentially a security issue.
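The narrow work settled on accepting only a small allowlist of pattern kinds from the wire (`path:` and `rootfilesin:`), since kinds like `re:` let a peer feed arbitrary regexes into the matcher. A sketch of that validation (names are illustrative, modeled on core's narrowspec validation, not quoted from it):

```python
VALID_PREFIXES = ('path:', 'rootfilesin:')

def validatepatterns(pats):
    """Reject wire-supplied patterns whose kind we have not audited
    for safety; only allowlisted prefixes may reach the matcher."""
    for pat in pats:
        if not pat.startswith(VALID_PREFIXES):
            raise ValueError('invalid prefix on narrow pattern: %s' % pat)
    return pats
```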

> shallowrepo.py:156-163
> +        def file(self, f):
> +            if f[0] == '/':
> +                f = f[1:]
> +
> +            if self.shallowmatch(f):
> +                return remotefilelog.remotefilelog(self.svfs, f, self)
> +            else:

This should be done by wrapping `localrepo.makefilestorage()`.

> shallowrepo.py:193-204
> +        def backgroundprefetch(self, revs, base=None, repack=False, pats=None,
> +                               opts=None):
> +            """Runs prefetch in background with optional repack
> +            """
> +            cmd = [_hgexecutable(), '-R', repo.origroot, 'prefetch']
> +            if repack:
> +                cmd.append('--repack')

Do we want to rip out the background prefetching feature?

> shallowrepo.py:301-304
> +    repo.includepattern = repo.ui.configlist("remotefilelog", "includepattern",
> +                                             None)
> +    repo.excludepattern = repo.ui.configlist("remotefilelog", "excludepattern",
> +                                             None)

I feel like the narrow matcher already covers this functionality?

> remotefilelog-getflogheads.py:12-20
> +@command('getflogheads',
> +         [],
> +         'path')
> +def getflogheads(ui, repo, path):
> +    """
> +    Extension printing a remotefilelog's heads
> +

`hg debugwireproto` can be used for testing whatever uses this.

> remotefilelog-library.sh:19
> +[experimental]
> +changegroup3=True
> +[rebase]

I don't recall seeing any extension code for validating cg3 before using RFL. Should there be?

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D4782

To: durin42, #hg-reviewers
Cc: indygreg, spectral, mjpieters, mercurial-devel

