[RFC] external revsets and templates

Matt Mackall mpm at selenic.com
Thu Jul 14 17:17:30 EDT 2016


On Wed, 2016-07-13 at 23:50 -0700, Gregory Szorc wrote:
> On Wed, Jul 13, 2016 at 11:37 AM, Matt Mackall <mpm at selenic.com> wrote:
> 
> > 
> > These are features we've been playing with a bit in the new review flow:
> > 
> > 
> > https://www.mercurial-scm.org/wiki/AcceptProcess#Setting_up_the_revset_helper
> > 
> > These have been extremely useful and I'd like to make them core features,
> > so I'd like to further iron out the syntax and feature set before moving
> > forward.
> > 
> > Currently, external revsets work like this:
> > 
> >  [extrevset]
> >  foo = shell:some-shell-command
> > 
> > Then some-shell-command is expected to return a series of Mercurial
> > identifiers (hash, rev, tag..), one per line. When "foo" is used in a
> > revset, Mercurial calls the shell command, looks up each result, and
> > returns a corresponding revset.
> > 
> > I think we should also be able to support arguments:
> > 
> >  [extrevset]
> >  cvs = shell:/path/to/lookup-cvs-rev $1
> > 
> > Then we can do:
> > 
> >  $ hg log -r "cvs(123)"
> > 
> Cool idea with lots of potential for quick hacks.
> 
> One item we may want to bikeshed is scaling. We know from existing revsets
> that lazy evaluation helps.
> 
> Presumably we could lazily read process output so that e.g. if we're only
> interested in 10 items the command doesn't spend a long time printing 1M
> revisions. I /think/ that by keeping the size of the inter-process pipe
> small we can make the invoked process block on writes, throttling how much
> work it does.

Yep, that's a pretty reasonable idea. My current hack doesn't do this, but it's
easy to add.
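
Roughly this shape (a sketch only; externalrevs and how it would plug into
the real revset machinery are hypothetical):

  import subprocess

  def externalrevs(repo, cmd):
      # Popen with a pipe: the kernel's pipe buffer gives us
      # backpressure for free, so if we stop reading, the helper
      # blocks on write and stops burning cycles.
      proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
      try:
          for line in proc.stdout:
              ident = line.strip()
              if ident:
                  # resolve hash/rev/tag/etc. to a rev number
                  yield repo[ident].rev()
      finally:
          # consumer stopped early or we ran dry: closing the pipe
          # lets the helper die on SIGPIPE instead of running on
          proc.stdout.close()
          proc.wait()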

> What about the scenario where we want to examine a limited set? e.g.
> -r "not public() & externalset()". Should the external process read
> candidate revisions from stdin and filter so it doesn't do too much work?
> I /think/ this behavior could be optional.

This is harder as it brings us into classic read/write deadlock territory.
There's also the issue that we could be making the problem worse: the not
public() set could be much larger than our externalset(). 

Most people will be able to work around most of the overhead here in practice
by doing things like returning only the latest 1k results or pre-filtering out
stuff that's already public on the server.
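
For instance, something as dumb as this already keeps the set bounded
(assuming the data file is ordered oldest-first, so tail gives the latest):

 [extrevset]
 tested = shell:curl -s http://build.corp.example.com/hg-tested.dat | tail -n 1000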

> > Also, we should allow data sources that are arbitrary URLs:
> > 
> >  [extrevset]
> >  tested = url:http://build.corp.example.com/hg-tested.dat
> >  good = url:http://build.corp.example.com/hg-passed.dat
> >  deployed = url:http://prod.example.com/hg-deployed.cgi
> >  fulltext = url:http://hg-fulltext-db.example.com/query?string=$1
> > 
> > ..which will allow very easy integration with complex production
> > automation. The url: piece might be redundant here? We might also allow
> > calling Python, similar to how we allow it in hooks.
> > 
> That's really hot. We probably want to bikeshed HTTP semantics a bit. e.g.
> how do you differentiate between an empty result and a server error. HTTP
> status code?

Yes, my thought is to stay really simple here. An exception raised while
reading results = abort, to fail safe. We don't want deploy scripts silently
updating back to null because their buildbot went down. But if there's no
error raised.. well, an empty set is valid.
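
Concretely, I'm imagining the url: case looking something like this (a
self-contained sketch; RevsetAbort stands in for hg's real abort mechanism):

  import urllib.request

  class RevsetAbort(Exception):
      pass

  def urlrevs(url):
      try:
          data = urllib.request.urlopen(url, timeout=30).read()
      except OSError as e:
          # fail safe: any error talking to the server aborts
          raise RevsetAbort('error reading revset from %s: %s' % (url, e))
      # ..but a clean response with an empty body is a valid empty set
      return [l for l in data.decode().splitlines() if l.strip()]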

>  If the server streams results via chunked transfer and hits a
> server error mid stream, how do we detect that (HTTP 200 has already been
> issued). Do we need some kind of light protocol in the content stream? That
> would be nice. But it does take away the simplicity. I suppose it could be
> optional for those wishing to opt into stronger guarantees. e.g.
> "//HGREVSETBEGIN\nrev0\nrev1\n//HGREVSETEND."

Again, this could potentially send more data over the wire.
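
If we did go the sentinel route, the client-side check for Greg's framing is
at least trivial:

  def parse_framed(text):
      # require both sentinels so a stream truncated by a
      # mid-response server error can't masquerade as a short
      # but valid result
      lines = text.splitlines()
      if not lines or lines[0] != '//HGREVSETBEGIN':
          raise ValueError('missing begin sentinel')
      if lines[-1] != '//HGREVSETEND':
          raise ValueError('truncated stream: missing end sentinel')
      return lines[1:-1]

The catch is that you can't trust any of it until you've read to the end,
which works against lazy consumption.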

> As cool as this sounds, the security and latency aspects of this scare me.

Indeed. For hgweb, you almost certainly want to either be reading a static file
or doing a pretty lightweight local db query.

> But since this is something you have to explicitly configure, a server
> operator can make reasonable judgements, I suppose.

Right.

> Passing unbound lists of revisions into HTTP seemingly requires a custom
> protocol or sending multiple requests.

Yep, we already know this is hard from our wire protocol.

> Also, new process creation on Windows is ~10x slower than on POSIX systems.
> If you spawn dozens or even hundreds of processes on Windows you are going
> to have a bad time. Process re-use would be really nice. But I worry that
> requires too much of a "protocol" and raises the barrier to entry too much.
> Perhaps we need separate "namespaces" for processes that talk a protocol
> versus ones that do a single process per item?

Having http support opens up the possibility of just talking to a long-lived
local server over a well-defined protocol. Then it just becomes a matter of
arranging for that server to be running..
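
For shape rather than as a proposal: a toy version of such a server, with
every port and path in it made up, fits on a screen:

  from http.server import BaseHTTPRequestHandler, HTTPServer
  from urllib.parse import urlparse, parse_qs

  class RevsetHandler(BaseHTTPRequestHandler):
      def do_GET(self):
          q = parse_qs(urlparse(self.path).query)
          name = q.get('name', [''])[0]
          if not name.isalnum():
              self.send_error(400)  # keep the file lookup sane
              return
          try:
              # a real helper would hit a db or a cache here
              with open('/var/cache/revsets/%s.dat' % name, 'rb') as f:
                  body = f.read()
          except OSError:
              self.send_error(500)  # fail loudly, never silently empty
              return
          self.send_response(200)
          self.end_headers()
          self.wfile.write(body)

  HTTPServer(('127.0.0.1', 8099), RevsetHandler).serve_forever()

..wired up with something like:

 [extrevset]
 tested = url:http://localhost:8099/revset?name=tested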

-- 
Mathematics is the supreme nostalgia of our time.


