[RFC] external revsets and templates

Thu Jul 14 06:26:35 EDT 2016

I think it would also be very cool to introduce a file:// protocol to load revsets from files.

From: Mercurial-devel <mercurial-devel-bounces at mercurial-scm.org> on behalf of Gregory Szorc <gregory.szorc at gmail.com>
Date: Thursday, July 14, 2016 at 7:50 AM
To: Matt Mackall <mpm at selenic.com>
Cc: mercurial-devel <mercurial-devel at mercurial-scm.org>
Subject: Re: [RFC] external revsets and templates

On Wed, Jul 13, 2016 at 11:37 AM, Matt Mackall <mpm at selenic.com> wrote:
These are features we've been playing with a bit in the new review flow:

https://www.mercurial-scm.org/wiki/AcceptProcess#Setting_up_the_revset_helper

These have been extremely useful and I'd like to make them a core feature, so I'd
like to further iron out the syntax and feature set before moving forward.

Currently, external revsets work like this:

 [extrevset]
 foo = shell:some-shell-command

Then some-shell-command is expected to return a series of Mercurial identifiers
(hash, rev, tag..), one per line. When "foo" is used in a revset, Mercurial
calls the shell command, looks up each result, and returns a corresponding
revset.
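A minimal sketch of the "one identifier per line" contract described above, assuming the command is run through the shell and that blank lines are ignored. The function name read_identifiers is hypothetical and is not part of Mercurial; looking up each identifier in the repository is left out.

```python
import subprocess


def read_identifiers(command):
    """Run a shell command and return the identifiers it prints, one per line.

    Hypothetical helper for illustration only; Mercurial would still have to
    resolve each identifier (hash, rev, tag, ...) against the repository.
    """
    result = subprocess.run(
        command,
        shell=True,
        check=True,  # a non-zero exit status raises CalledProcessError
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    # Assumption: blank lines and surrounding whitespace carry no meaning.
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

Treating a non-zero exit status as an error is one possible choice here; it keeps a failed command from looking like an empty revset.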

I think we should also be able to support arguments:

 [extrevset]
 cvs = shell:/path/to/lookup-cvs-rev $1

Then we can do:

 $ hg log -r "cvs(123)"
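One way the $1 substitution could work is sketched below, assuming positional placeholders $1, $2, ... and shell-quoting of every argument so that revset arguments cannot inject shell syntax. The function name expand_command is hypothetical, and the quoting policy is an assumption, not something the proposal specifies.

```python
import re
import shlex


def expand_command(template, args):
    """Replace $1, $2, ... in a command template with shell-quoted arguments.

    Hypothetical helper for illustration only.
    """
    def substitute(match):
        position = int(match.group(1))
        if position > len(args):
            raise ValueError("missing argument $%d" % position)
        # Quote so that an argument such as "1; rm -rf ~" stays one word.
        return shlex.quote(args[position - 1])

    # Only $1 and up are placeholders; $0 is left untouched.
    return re.sub(r"\$([1-9]\d*)", substitute, template)
```

With this sketch, "cvs(123)" would expand the configured template to "/path/to/lookup-cvs-rev 123" before the command is run.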

Cool idea with lots of potential for quick hacks.
One item we may want to bikeshed is scaling. We know from existing revsets that lazy evaluation helps.

Presumably we could lazily read process output so that, e.g., if we're only interested in 10 items, the command doesn't spend a long time printing 1M revisions. I /think/ by keeping the size of the inter-process pipe in check we can make the invoked process block early, which throttles how much work it does.
What about the scenario where we want to examine a limited set? e.g. "-r not public() & externalset()". Should the external process read candidate revisions from stdin and filter so it doesn't do too much work? I /think/ this behavior could be optional.
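The lazy-reading idea can be sketched with a generator over the process's stdout, assuming a POSIX-like shell. The names iter_identifiers and first_n are hypothetical. The child only runs ahead as far as the pipe buffer allows, and it is terminated once the consumer stops reading; filtering candidates via stdin is not covered here.

```python
import itertools
import subprocess


def iter_identifiers(command):
    """Yield identifiers from a shell command's output as they arrive."""
    proc = subprocess.Popen(
        command,
        shell=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    try:
        for line in proc.stdout:
            line = line.strip()
            if line:
                yield line
    finally:
        # Runs when the output is exhausted or the consumer stops early.
        proc.stdout.close()
        proc.terminate()  # signals the shell; has no effect if it already exited
        proc.wait()


def first_n(command, n):
    """Return at most n identifiers without reading the full output."""
    identifiers = iter_identifiers(command)
    try:
        return list(itertools.islice(identifiers, n))
    finally:
        identifiers.close()  # triggers the cleanup in iter_identifiers
```

How much work the child does before it blocks depends on the platform's pipe buffer size and on the child's own output buffering, so this bounds the waste rather than eliminating it.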

Also, we should allow data sources that are arbitrary URLs:

 [extrevset]
 tested = url:http://build.corp.example.com/hg-tested.dat
 good = url:http://build.corp.example.com/hg-passed.dat
 deployed = url:http://prod.example.com/hg-deployed.cgi
 fulltext = url:http://hg-fulltext-db.example.com/query?string=$1

..which will allow very easy integration with complex production automation. The
url: piece might be redundant here? We might also allow calling Python, similar
to how we allow it in hooks.

That's really hot. We probably want to bikeshed HTTP semantics a bit, e.g. how do you differentiate between an empty result and a server error? HTTP status code? If the server streams results via chunked transfer and hits a server error mid-stream, how do we detect that (HTTP 200 has already been issued)? Do we need some kind of light protocol in the content stream? That would be nice, but it does take away the simplicity. I suppose it could be optional for those wishing to opt into stronger guarantees, e.g. "//HGREVSETBEGIN\nrev0\nrev1\n//HGREVSETEND."
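The optional framing suggested above could be checked along these lines. This is a sketch under assumptions: the markers are taken literally from the example, the trailing period in that example is treated as sentence punctuation rather than part of the marker, and the name parse_framed is hypothetical.

```python
BEGIN_MARKER = "//HGREVSETBEGIN"
END_MARKER = "//HGREVSETEND"  # assumption: no trailing period in the marker


def parse_framed(lines):
    """Return the revisions between the begin and end markers.

    Raises ValueError when either marker is missing, so a stream that was
    cut off mid-transfer is not mistaken for a short or empty result.
    """
    stripped = [line.strip() for line in lines]
    if not stripped or stripped[0] != BEGIN_MARKER:
        raise ValueError("missing begin marker")
    revisions = []
    for line in stripped[1:]:
        if line == END_MARKER:
            return revisions
        if line:
            revisions.append(line)
    raise ValueError("stream ended before end marker")
```

An empty but well-formed body (begin marker followed directly by the end marker) yields an empty list, which is how this sketch tells "no results" apart from a failed response.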

My current implementation has no caching, which is usually fine. My plan is to
cache the non-argument version for the repo object lifetime and leave the
argument version uncached, but the chg use case might need a better plan.

External templates are very similar and allow adding data to the display side
(including in hgweb!).

As cool as this sounds, the security and latency aspects of this scare me. But since this is something you have to explicitly configure, a server operator can make reasonable judgements, I suppose.

Instead of simply getting a list of revisions, it gets a
list of revision[space]description pairs. For instance, I can currently get a
list of reviewers on draft changesets thusly:

 [exttemplate]
 reviewers = shell:ssh mercurial-cm accept/reviewed

..and simply add {reviewers} to my log template. Again, this can be used for
many things, like displaying number of test failures, deployment status,
mappings to other SCMs or review tools.
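The revision[space]description format described above could be parsed roughly as follows. This is a sketch assuming the first space on each line separates the revision from its description and that a line without a space carries an empty description; the name parse_template_data is hypothetical.

```python
def parse_template_data(output):
    """Map each revision identifier to the description that follows it."""
    mapping = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        # Split on the first space only; the description may contain spaces.
        revision, _, description = line.partition(" ")
        mapping[revision] = description
    return mapping
```

A template keyword such as {reviewers} would then look up the current changeset in this mapping, and the keys alone give the plain identifier list that an external revset expects.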

Caching here is more important as templates get evaluated once per changeset. My
current hack keeps a global cache, but caching per repo is probably saner.

Passing unbounded lists of revisions into HTTP seemingly requires a custom protocol or sending multiple requests.
Also, new process creation on Windows is ~10x slower than on POSIX systems. If you spawn hundreds or even dozens of processes on Windows you are going to have a bad time. Process re-use would be really nice. But I worry that requires too much of a "protocol" and raises the barrier to entry too much. Perhaps we need separate "namespaces" for processes that talk a protocol versus ones that do a single process per item?

Because the data format for external templates is a superset of the one used by
external revsets, the same source can probably be shared in the cases where it
makes sense.

Thoughts?

This is a really cool idea that will allow people to extend Mercurial's querying and formatting abilities (2 major selling points over e.g. Git) without requiring an extension. That's huge. The existing proposal should work well on most repos. Of course, I have to support a very large repo and Windows, so I naturally have scaling concerns. But I suppose if you need the perf you can write an extension.