D1856: wireproto: server-side support for pullbundles

indygreg (Gregory Szorc) phabricator at mercurial-scm.org
Sun Jan 14 17:43:47 EST 2018


indygreg added subscribers: durin42, indygreg.
indygreg requested changes to this revision.
indygreg added a comment.
This revision now requires changes to proceed.


  This patch needs a lot of work. But I'm very supportive of the feature and the preliminary implementation!
  
  I wish we could find a way to send multiple, inline, pre-generated bundles in one response. However, the existing design of compression within the bundle2 format doesn't easily facilitate this. We should think long and hard about whether to implement this feature as //partial pull// or extend bundle2 to allow us to do nice things and avoid the extra client-server roundtrips.
  
  Speaking of compression, we probably need some more code around e.g. the mismatch between on-disk compression and wire compression. There is a potential for local bundles to be stored uncompressed and for the server to compress this data as part of streaming it out. That involves CPU. We still come out ahead because it avoids having to generate the bundle content from revlogs. But it isn't as ideal as streaming an already compressed bundle over the wire. We may want to open the bundle2 files and look at their compression info and make sure we don't do stupid things. It should be fast to parse the bundle2 header to obtain this knowledge.
  
  My biggest concern with the architecture of this feature is the multiple roundtrips. I **really** wish we could stream multiple bundles off disk to the wire with no decompression/compression involved. That would mean storing compressed bundles on disk. But this would require some additional bundle2 magic. The existing solution is simple and elegant. I do like that. I'd very much like to get the opinion of someone like @durin42 (who also likes designing protocols).

INLINE COMMENTS

> wireproto.py:834
>  
> +def find_pullbundle(repo, opts, clheads, heads, common):
> +    def decodehexstring(s):

This function needs a docstring. And the expected behavior of this function needs to be documented in the docstring and/or inline as comments.

> wireproto.py:836
> +    def decodehexstring(s):
> +        return set([h.decode('hex') for h in s.split(':')])
> +

`:` as a delimiter for nodes doesn't feel right because `:` has meaning for revsets. I would use `,` or `;`.

That being said, it may make the manifest format slightly less human friendly because URI encoding may come into play. That should be fine though: we already escape values for `BUNDLESPEC` when using `packed1` bundles.

> wireproto.py:838
> +
> +    manifest = repo.vfs.tryread('pullbundles.manifest')
> +    res = exchange.parseclonebundlesmanifest(repo, manifest)

I think there should be an `if not manifest: return` here to handle the common case.

> wireproto.py:848
> +    for entry in res:
> +        if 'heads' in entry:
> +            try:

We'll need documentation of the fields in the manifest somewhere. If we combine this feature with the `clonebundles.py` extension, that seems like the logical place to document things.

> wireproto.py:854-862
> +            if len(bundle_heads) > len(heads):
> +                # Client wants less heads than the bundle contains
> +                continue
> +            if bundle_heads.issubset(common):
> +                continue # Nothing new
> +            if all(cl.rev(rev) in common_anc for rev in bundle_heads):
> +                continue # Still nothing new

I worry about performance issues here. You were making some noise about this on IRC. So it sounds like it is on your radar.

I'm just not sure how to make things more efficient. I just know that doing so many DAG tests feels like it will lead to performance problems for repos with hundreds of thousands or millions of changesets.

But we shouldn't let performance issues derail this feature. Pull bundles have the potential to offload tens of seconds of CPU time from the server. So even if DAG operations consume a few seconds of CPU, we come out ahead. It would be nice to get that down to 100's of milliseconds at most though. But this feels like the territory of follow-up work, possibly involving caching or more efficient stores of which revisions are in which bundles.

> wireproto.py:874-877
> +        path = entry['URL']
> +        repo.ui.debug('sending pullbundle "%s"\n' % path)
> +        try:
> +            return repo.vfs.open(path)

Using a `URL` field for a vfs path (which doesn't have nor handle protocol schemes - e.g. `file://`` IIRC) feels wrong. I think the entry in the manifest should be `PATH`.

> wireproto.py:918
> +
> +        if repo.ui.configbool('server', 'pullbundle') and
> +           self.capable('partial-pull'):

I'd consider enabling this feature by default. Checking for the presence of a `pullbundles.manifest` should be a cheap operation. Especially when you consider that serving a bundle will open likely dozens of revlog files.

Another idea to consider is to tie this feature into the `clonebundles.py` extension. We already have extensive documentation in that extension for offloading server load via pre-generated bundles. I view this feature as a close sibling. I think they could live under the same umbrella. But because "pull bundles" are part of the `getbundle` wire protocol command, the bulk of the server code needs to remain in core.

> wireproto.py:919
> +        if repo.ui.configbool('server', 'pullbundle') and
> +           self.capable('partial-pull'):
> +            # Check if a pre-built bundle covers this request.

AFAICT this capability isn't defined anywhere. Is there a missing patch in this series?

I do agree with the use of a `partial-pull` capability rather than exposing //pull bundles// functionality in the capabilities layer. Pull bundles should be purely a server-side implementation detail. It's //partial pull// that is the generic feature that should be exposed as part of the wire protocol.

> wireproto.py:924
> +                return streamres(gen=util.filechunkiter(bundle),
> +                                 prefer_uncompressed=True)
> +

`prefer_uncompressed` doesn't exist.

I know you were talking about compression behavior on IRC. Please do invent whatever APIs you need to communicate to the wire protocol layer that you want to stream data with whatever compression behavior that you need.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1856

To: joerg.sonnenberger, #hg-reviewers, indygreg
Cc: indygreg, durin42, mercurial-devel


More information about the Mercurial-devel mailing list