[RFC] external revsets and templates

Thu Jul 14 02:50:31 EDT 2016

On Wed, Jul 13, 2016 at 11:37 AM, Matt Mackall <mpm at selenic.com> wrote:

> These are features we've been playing with a bit in the new review flow:
>
>
> https://www.mercurial-scm.org/wiki/AcceptProcess#Setting_up_the_revset_helper
>
> These have been extremely useful and I'd like to make them core features,
> so I'd like to further iron out the syntax and feature set before moving
> forward.
>
> Currently, external revsets work like this:
>
>  [extrevset]
>  foo = shell:some-shell-command
>
> Then some-shell-command is expected to return a series of Mercurial
> identifiers (hash, rev, tag, ...), one per line. When "foo" is used in a
> revset, Mercurial calls the shell command, looks up each result, and
> returns a corresponding revset.
>
> I think we should also be able to support arguments:
>
>  [extrevset]
>  cvs = shell:/path/to/lookup-cvs-rev $1
>
> Then we can do:
>
>  $ hg log -r "cvs(123)"
>
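
To make the argument case concrete, here is a minimal sketch of how "$1"
could be expanded before the command runs. Nothing here is from the actual
implementation: expand_command and run_external_revset are hypothetical
names, and quoting each argument with shlex.quote (POSIX shells only) is my
assumption about how to keep a revset argument from injecting shell syntax.

```python
import re
import shlex
import subprocess


def expand_command(template, args):
    """Replace $1, $2, ... in template with shell-quoted arguments."""
    def substitute(match):
        index = int(match.group(1))
        if index < 1 or index > len(args):
            raise ValueError("no argument for $%d" % index)
        # Assumption: quote every argument so it stays a single shell word.
        return shlex.quote(str(args[index - 1]))

    return re.sub(r"\$(\d+)", substitute, template)


def run_external_revset(template, args=()):
    """Run the expanded command and return its non-empty output lines."""
    command = expand_command(template, args)
    result = subprocess.run(
        command,
        shell=True,
        check=True,  # a non-zero exit raises instead of looking like "no revs"
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

With this, "cvs(123)" would run "/path/to/lookup-cvs-rev 123", and each
returned line would still need to be looked up as a Mercurial identifier.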

Cool idea with lots of potential for quick hacks.

One item we may want to bikeshed is scaling. We know from existing revsets
that lazy evaluation helps.

Presumably we could lazily read the process output so that, e.g., if we're
only interested in 10 items, the command doesn't spend a long time printing
1M revisions. I /think/ that by keeping the size of the inter-process pipe
in check, we can make the invoked process block early on its writes, which
throttles how much work it does.
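
A rough sketch of the lazy-read idea, assuming the command is given as an
argv list rather than a shell string (so terminating it hits the real
process and not an intermediate shell). iter_external_revs and first_n are
hypothetical names, not existing Mercurial functions.

```python
import itertools
import subprocess


def iter_external_revs(argv):
    """Yield identifiers from the command's stdout as they are read."""
    proc = subprocess.Popen(
        argv, stdout=subprocess.PIPE, universal_newlines=True
    )
    try:
        # Only what we consume (plus a read buffer) leaves the pipe; once
        # the pipe is full, the child blocks in write() until we read more.
        for line in proc.stdout:
            ident = line.strip()
            if ident:
                yield ident
    finally:
        # Runs on exhaustion and also when the consumer stops early and
        # the generator is closed.
        if proc.poll() is None:
            proc.terminate()
        proc.stdout.close()
        proc.wait()


def first_n(argv, count):
    """Return at most count identifiers, then stop the command."""
    revs = iter_external_revs(argv)
    try:
        return list(itertools.islice(revs, count))
    finally:
        revs.close()
```

The child still gets to fill the pipe and the reader's buffer before it
blocks, so this bounds wasted work rather than eliminating it.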

What about the scenario where we want to examine a limited set, e.g.
-r "not public() & externalset()"? Should the external process read
candidate revisions from stdin and filter them so it doesn't do too much
work? I /think/ this behavior could be optional.
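
For illustration, the helper side of such a filtering mode might look like
the sketch below. The contract (candidates arrive one per line on stdin,
matches are echoed one per line on stdout) is my assumption, and "known"
stands in for whatever data the external tool really consults.

```python
def filter_candidates(candidates, known):
    """Yield each candidate identifier that is in known, in input order."""
    for line in candidates:
        ident = line.strip()
        if ident and ident in known:
            yield ident


def main(stdin, stdout, known):
    """Echo matching candidates as they arrive instead of dumping all of known."""
    for ident in filter_candidates(stdin, known):
        stdout.write(ident + "\n")
```

A real helper would call main(sys.stdin, sys.stdout, known) with its own
data set. The work done then scales with the number of candidates passed
in, not with the size of the external data.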

>
> Also, we should allow data sources that are arbitrary URLs:
>
>  [extrevset]
>  tested = url:http://build.corp.example.com/hg-tested.dat
>  good = url:http://build.corp.example.com/hg-passed.dat
>  deployed = url:http://prod.example.com/hg-deployed.cgi
>  fulltext = url:http://hg-fulltext-db.example.com/query?string=$1
>
> ..which will allow very easy integration with complex production
> automation. The url: piece might be redundant here? We might also allow
> calling Python, similar to how we allow it in hooks.
>

That's really hot. We probably want to bikeshed HTTP semantics a bit. E.g.,
how do you differentiate between an empty result and a server error? HTTP
status code? If the server streams results via chunked transfer and hits a
server error mid-stream, how do we detect that (HTTP 200 has already been
issued)? Do we need some kind of light protocol in the content stream? That
would be nice, but it does take away the simplicity. I suppose it could be
optional for those wishing to opt into stronger guarantees, e.g.
"//HGREVSETBEGIN\nrev0\nrev1\n//HGREVSETEND".

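A sketch of what checking such framing could look like on the client. The
two marker strings are the ones floated above (I'm assuming the end marker
carries no trailing period); parse_framed and TruncatedResponse are
hypothetical names.

```python
BEGIN = "//HGREVSETBEGIN"
END = "//HGREVSETEND"


class TruncatedResponse(Exception):
    """The stream did not contain one complete BEGIN ... END frame."""


def parse_framed(lines):
    """Return the identifiers between the markers, or raise if incomplete."""
    revs = []
    state = "before"
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if state == "before":
            if line != BEGIN:
                raise TruncatedResponse("data before begin marker: %r" % line)
            state = "inside"
        elif state == "inside":
            if line == END:
                state = "done"
            else:
                revs.append(line)
        else:
            raise TruncatedResponse("data after end marker: %r" % line)
    if state != "done":
        # A server error mid-stream ends up here: no end marker was seen.
        raise TruncatedResponse("missing end marker")
    return revs
```

An empty but complete frame returns [], so "no results" and "the stream
died" are no longer the same thing.
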
>
> My current implementation has no caching, which is usually fine. My plan
> is to cache the non-argument version for the repo object lifetime and
> leave the argument version uncached, but the chg use case might need a
> better plan.
>
>
> External templates are very similar and allow adding data to the display
> side (including in hgweb!).

As cool as this sounds, the security and latency aspects of this scare me.
But since this is something you have to explicitly configure, a server
operator can make reasonable judgements, I suppose.

> Instead of simply getting a list of revisions, it gets a list of
> revision[space]description pairs. For instance, I can currently get a
> list of reviewers on draft changesets thusly:
>
>  [exttemplate]
>  reviewers = shell:ssh mercurial-cm accept/reviewed
>
> ..and simply add {reviewers} to my log template. Again, this can be used
> for many things, like displaying number of test failures, deployment
> status, mappings to other SCMs or review tools.
>
> Caching here is more important as templates get evaluated once per
> changeset. My current hack keeps a global cache, but caching per repo is
> probably saner.
>
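
To check my understanding of the data format and the per-repo caching, a
small sketch. I'm assuming each line is "revision description" split on the
first space, and that one cache object would hang off each repo object.
parse_template_data and ExternalTemplateCache are hypothetical names.

```python
def parse_template_data(lines):
    """Map each revision identifier to its description text."""
    data = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # A bare identifier (the revset format) yields an empty description.
        rev, _sep, description = line.partition(" ")
        data[rev] = description
    return data


class ExternalTemplateCache:
    """Fetch each named external template source at most once."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable: source name -> iterable of lines
        self._tables = {}

    def lookup(self, name, rev):
        if name not in self._tables:
            self._tables[name] = parse_template_data(self._fetch(name))
        return self._tables[name].get(rev, "")
```

With this shape, rendering {reviewers} for a thousand changesets costs one
fetch of the source rather than a thousand.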

Passing unbounded lists of revisions into HTTP seemingly requires a custom
protocol or sending multiple requests.
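
The multiple-requests option could be as simple as the sketch below. The
"rev" query parameter and the batch size are made up for illustration;
nothing in the proposal defines them.

```python
from urllib.parse import urlencode


def batches(revs, size):
    """Split any iterable of revisions into lists of at most size items."""
    batch = []
    for rev in revs:
        batch.append(rev)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def batch_urls(base_url, revs, size):
    """Yield one query URL per batch, using a hypothetical 'rev' parameter."""
    for batch in batches(revs, size):
        yield base_url + "?" + urlencode([("rev", rev) for rev in batch])
```

Because both functions are generators, the revision list never has to be
materialized in full, but each batch still costs a round trip.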

Also, new process creation on Windows is ~10x slower than on POSIX systems.
If you spawn hundreds or even dozens of processes on Windows, you are going
to have a bad time. Process re-use would be really nice, but I worry that it
requires too much of a "protocol" and raises the barrier to entry too much.
Perhaps we need separate "namespaces" for processes that talk a protocol
versus ones that run a single process per item?
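
For a sense of how small a re-use "protocol" could be, here is a sketch
where the helper stays alive and answers one line per query line. That
one-line-in, one-line-out contract is purely my assumption, and LineHelper
is a hypothetical name.

```python
import subprocess


class LineHelper:
    """Keep one helper process alive and exchange one line per query."""

    def __init__(self, argv):
        self._proc = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            universal_newlines=True,
            bufsize=1,  # line-buffered on our side of the pipes
        )

    def query(self, rev):
        """Send one identifier and return the helper's one-line reply."""
        self._proc.stdin.write(rev + "\n")
        self._proc.stdin.flush()
        reply = self._proc.stdout.readline()
        if not reply:
            raise RuntimeError("helper exited unexpectedly")
        return reply.rstrip("\n")

    def close(self):
        """Closing stdin signals EOF so a well-behaved helper can exit."""
        self._proc.stdin.close()
        self._proc.stdout.close()
        self._proc.wait()
```

The catch is on the helper side: it has to flush after every reply, which
is exactly the kind of requirement that raises the barrier to entry.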

> Because the data format for external templates is a superset of the one
> used by external revsets, the same source can probably be shared in the
> cases where it makes sense.
>
> Thoughts?
>

This is a really cool idea that will allow people to extend Mercurial's
querying and formatting abilities (2 major selling points over e.g. Git)
without requiring an extension. That's huge. The existing proposal should
work well on most repos. Of course, I have to support a very large repo and
Windows, so I naturally have scaling concerns. But I suppose if you need
the perf you can write an extension.