[PATCH 06 of 11] internals: document compression negotiation

Mon Nov 21 17:46:35 EST 2016

On Sun, Nov 20, 2016 at 02:23:43PM -0800, Gregory Szorc wrote:
> # HG changeset patch
> # User Gregory Szorc <gregory.szorc at gmail.com>
> # Date 1479679271 28800
> #      Sun Nov 20 14:01:11 2016 -0800
> # Node ID 952478a50f2583be4400c0f6fcc156d73d46711c
> # Parent  8d1b65503e8b360dd5121488f31d52a3587a0819
> internals: document compression negotiation

Lots of comments here. I've taken patches 1-5, and will review the
rest of the series before stopping.

>
> As part of adding zstd support to all of the things, we'll need
> to teach the wire protocol to support non-zlib compression formats.
>
> This commit documents how we'll implement that.
>
> To understand how we arrived at this proposal, let's look at how
> things are done today.
>
> The wire protocol today doesn't have a unified format. Instead,
> there is a limited facility for differentiating replies as successful
> or not. And, each command essentially defines its own response format.
>
> A significant deficiency in the current protocol is the lack of
> payload framing over the SSH transport. In the HTTP transport,
> chunked transfer is used and the end of an HTTP response body (and
> the end of a Mercurial command response) can be identified by a 0
> length chunk. This is how HTTP chunked transfer works. But in the
> SSH transport, there is no such framing, at least for certain
> responses (notably the response to "getbundle" requests). Clients
> can't simply read until end of stream because the socket is
> persistent and reused for multiple requests. Clients need to know
> when they've encountered the end of a request but there is nothing
> simple for them to key off of to detect this. So what happens is
> the client must decode the payload (as opposed to being dumb and
> forwarding frames/packets). This means the payload itself needs
> to support identifying end of stream. In some cases (bundle2), it
> also means the payload can encode "error" or "interrupt" events
> telling the client to e.g. abort processing. The lack of framing
> on the SSH transport and the transfer of its responsibilities to
> e.g. bundle2 is a massive layering violation and a wart on the
> protocol architecture. It needs to be fixed someday by inventing a
> proper framing protocol.

Love this paragraph. It's a huge loss that the framing delegation
happened, because it means the existing batch() method isn't streaming
over ssh (but it is over http).

>
> So about compression.
>
> The client transport abstractions have a "_callcompressable()"
> API. This API is called to invoke a remote command that will
> send a compressable response. The response is essentially a
> "streaming" response (no framing data at the Mercurial layer)
> that is fed into a decompressor.
>
> On the HTTP transport, the decompressor is zlib and only zlib.
> There is currently no mechanism for the client to specify an
> alternate compression format. And, clients don't advertise what
> compression formats they support or ask the server to send a
> specific compression format. Instead, it is assumed that non-error
> responses to "compressable" commands are zlib compressed.
>
> On the SSH transport, there is no compression at the Mercurial
> protocol layer. Instead, compression must be handled by SSH
> itself (e.g. `ssh -C`) or within the payload data (e.g. bundle
> compression).
>
> For the HTTP transport, adding new compression formats is pretty
> straightforward. Once you know what decompressor to use, you can
> stream data into the decompressor until you reach a 0 size HTTP
> chunk, at which point you are at end of stream.
>
> So our wire protocol changes for the HTTP transport are pretty
> straightforward: the client and server advertise what compression
> formats they support and an appropriate compression format is
> chosen. We introduce a new HTTP media type to hold compressed
> payloads. The first 2 bytes of the payload define the compression
> format being used. Whoever is on the receiving end can sniff the
> first 2 bytes and handle the remaining data accordingly.
>
> Support for multiple compression formats is advertised on both
> server and client. The server advertises a "compression" capability
> saying which compression formats it supports and in what order they
> are preferred. Clients advertise their support for multiple
> compression formats via the HTTP "Accept" header.
>
> Strictly speaking, servers don't need to advertise which compression
> formats they support. But doing so allows clients to fail fast if
> they don't support any of the formats the server does. This is useful
> in situations like sending bundles, where the client may have to
> perform expensive computation before sending data to the server.
>
> By advertising compression support on each request in the "Accept"
> header and by introducing a new media type, the server is able
> to gradually transition existing commands/responses to use compression,
> even if they don't do so today. Contrast with the old world, where
> "application/mercurial-0.1" may or may not use zlib compression
> depending on the command being called. Compression is defined as
> part of "application/mercurial-0.2," so if a client supports this
> media type it supports compression.
>
> It's worth noting that we explicitly don't use "Accept-Encoding,"
> "Content-Encoding," or "Transfer-Encoding" for handling compression.
> People knowledgeable of the HTTP specifications will say that we
> should use these because compression is a media or transfer encoding,
> not a media type and dynamic compression is exactly what these
> headers should be used for. They have a point and I sympathize with
> the argument. However, my years of experience rolling out services
> leveraging HTTP has taught me to not trust the HTTP layer, especially
> if you are going outside the normal spec (such as using a custom
> "Content-Encoding" value to represent zstd streams). I've seen load
> balancers, proxies, and other network devices do very bad and
> unexpected things to HTTP messages (like insisting zlib compressed
> content is decoded and then re-encoded at a different compression level
> or even stripping compression completely). I've found that the best
> way to avoid surprises when writing protocols on top of HTTP is to use
> HTTP as a dumb transport as much as possible to minimize the chances
> that an "intelligent" agent between endpoints will muck with your data.

Totally agreed on this front, in case others have qualms. Experience
with git's smart-http protocol in the wild has motivated me to never
trust invisible intermediate proxies to do reasonable things. I've got
a variety of war stories debugging supposed bugs in the old google
code stack only to discover that it was an http proxy problem between
us and the client.

> While the widespread use of TLS is mitigating many intermediate
> network agents interfering with HTTP, there are still problems at the
> edges, with e.g. the origin HTTP server needing to convert HTTP to and
> from WSGI and buggy or feature-lacking HTTP client implementations.
> I've found the best way to avoid these problems is to avoid using
> headers like "Content-Encoding" and to bake as much logic as possible
> into media types and HTTP message bodies. The protocol changes in this
> commit do rely on the "Accept" and "Content-Type" headers. But we
> used them before, so we shouldn't be increasing our exposure to "bad"
> HTTP agents.
>
> What about SSH.

s/\./?/

>
> For the SSH transport, we can't easily implement content negotiation
> to determine compression formats because the SSH transport has no
> content negotiation capabilities today. And without a framing protocol,
> we don't know how much data to feed into a decompressor. So in order
> to implement compression support on the SSH transport, we'd need to
> invent a mechanism to represent content types and an outer framing
> protocol to stream data robustly. While I'm fully capable of doing
> that, it is a lot of work and not something that should be undertaken
> lightly.

> My opinion is that if we're going to change the SSH transport
> protocol, we should take a long hard look at implementing a grand
> unified protocol that attempts to address all the deficiencies with
> the existing protocol.

Yes, totally agreed.

> While I want this to happen, that would be
> massive scope bloat standing in the way of zstd support. So, I've
> decided to take the easy solution: the SSH transport will not gain
> support for multiple compression formats. Keep in mind it doesn't
> support *any* compression today. So essentially nothing is changing
> on the SSH front.

This sounds like a reasonable approach here. I'd like to get a clean
re-do on the wire protocol some day, but I suspect it'll be many moons
before someone is willing to pay for that work (and it seems unlikely
I'll get around to it for the laughs).

I wonder if it'd be reasonable to have an sshv2 protocol just *be*
http-over-ssh via stdin/stdout? I feel a touch dirty even suggesting
it, but the framing rules etc are already there...

>
> diff --git a/mercurial/help/internals/wireprotocol.txt b/mercurial/help/internals/wireprotocol.txt
> --- a/mercurial/help/internals/wireprotocol.txt
> +++ b/mercurial/help/internals/wireprotocol.txt
> @@ -68,8 +68,16 @@ Example HTTP requests::
>  The ``Content-Type`` HTTP response header identifies the response as coming
>  from Mercurial and can also be used to signal an error has occurred.
>
> -The ``application/mercurial-0.1`` media type indicates a generic Mercurial
> -response. It matches the media type sent by the client.
> +The ``application/mercurial-*`` media types indicate a generic Mercurial
> +data type.
> +
> +The ``application/mercurial-0.1`` media type is raw Mercurial data.

Perhaps the word legacy wants to be in this statement.

> +
> +The ``application/mercurial-0.2`` media type is compression framed Mercurial
> +data. The first 2 bytes of the payload indicate the compression format
> +used. The remaining bytes are compressed according to that compression
> +format. The decompressed data behaves the same as with
> +``application/mercurial-0.1``.
>
>  The ``application/hg-error`` media type indicates a generic error occurred.
>  The content of the HTTP response body typically holds text describing the
> @@ -81,15 +89,19 @@ type.
>  Clients also accept the ``text/plain`` media type. All other media
>  types should cause the client to error.
>
> +Behavior of media types is further described in the ``Content Negotiation``
> +section below.
> +
>  Clients should issue a ``User-Agent`` request header that identifies the client.
>  The server should not use the ``User-Agent`` for feature detection.
>
> -A command returning a ``string`` response issues the
> -``application/mercurial-0.1`` media type and the HTTP response body contains
> -the raw string value. A ``Content-Length`` header is typically issued.
> +A command returning a ``string`` response issues a
> +``application/mercurial-0.*`` media type and the HTTP response body contains
> +the raw string value (after compression decoding, if used). A
> +``Content-Length`` header is typically issued, but not required.
>
> -A command returning a ``stream`` response issues the
> -``application/mercurial-0.1`` media type and the HTTP response is typically
> +A command returning a ``stream`` response issues a
> +``application/mercurial-0.*`` media type and the HTTP response is typically
>  using *chunked transfer* (``Transfer-Encoding: chunked``).
>
>  SSH Transport
> @@ -233,6 +245,29 @@ 2006).
>  This capability was introduced at the same time as the ``lookup``
>  capability/command.
>
> +compression
> +-----------
> +
> +Declares support for negotiating compression formats.
> +
> +Presence of this capability indicates the server supports dynamic selection
> +of compression formats based on the client request.
> +
> +Servers advertising this capability are required to support the
> +``application/mercurial-0.2`` media type in response to commands returning
> +streams. Servers may support this media type on any command.
> +
> +The value of the capability is a comma-delimited list of strings declaring
> +supported compression formats. The order of the compression formats is in
> +server-preferred order, most preferred first.
> +
> +The compression format strings are 2 byte identifiers. These are the same
> +2 byte *header* values at the beginning of ``application/mercurial-0.2``
> +media types (as used by the HTTP transport).
> +
> +This capability was introduced in Mercurial 4.1 (released February
> +2017).

Mention that as of that release it was not yet used over the ssh
transport? Or state that it was only used over http?

> +
>  getbundle
>  ---------
>
> @@ -416,6 +451,46 @@ Mercurial server replies to the client-i
>  not conforming to the expected command responses is assumed to be not related
>  to Mercurial and can be ignored.
>
> +Content Negotiation
> +===================
> +
> +The wire protocol has some mechanisms to help peers determine what content
> +types and encoding the other side will accept. Historically, these mechanisms
> +have been built into commands themselves because most commands only send a
> +well-defined response type and only certain commands needed to support
> +functionality like compression.
> +
> +Currently, only the HTTP transport supports content negotiation at the protocol
> +layer.
> +
> +HTTP requests advertise accepted media types via the ``Accept`` header.
> +
> +All clients should advertise an ``application/mercurial-0.1`` value.
> +
> +Clients supporting it can also advertise ``application/mercurial-0.2``.
> +This media type supports the ``comp`` parameter to declare which compression
> +formats the client accepts. The value is a ``quoted-string`` (defined by
> +HTTP specification) containing a space-delimited list of 2 byte compression
> +format identifiers. e.g. ``application/mercurial-0.2; comp="ZS ZL UN"``.
> +If the ``comp`` parameter is absent, the server interprets this as equivalent
> +to ``ZL UN``.
> +
> +Clients may choose to only advertise the ``application/mercurial-0.2`` media
> +type if the server advertises the ``compression`` capability.
> +
> +A server that doesn't receive an ``Accept`` header listing any
> +``application/mercurial-*`` values should infer that
> +``application/mercurial-0.1`` was sent, as this media type should be supported
> +by all clients ever written.

I'd like to be more cautious in the wording here, and give servers
room to reject old clients for not understanding
application/mercurial-0.2. It's not hard for me to envision a future
where someone writes a modern-proto-only client in
Java/Go/Rust/Piet/whatever, and it'd be nice to have a defined way for
the client and server to identify each other in that case. I also want
to be able to run servers that intentionally lock out old clients
(potentially for scaling reasons, since we're talking about
compression performance).

> +
> +A server receiving multiple ``application/mercurial-*`` values may choose any
> +of them. For example, a server may issue ``application/mercurial-0.2`` only
> +for responses that it chooses to compress.
> +
> +A server may issue ``application/hg-*`` media types even though the client
> +does not specify support for them in an ``Accept`` header. This is for
> +backwards compatibility reasons.

Can we provide guidance here that new servers should not issue these
forms? Other than hg-error, we don't send them today, so I'd like to
put a stop to any growth of that now (especially
application/hg-changegroup, which looks super crufty and ancient).

> +
>  Commands
>  ========
>