Wire protocol futures

Wed Sep 5 12:00:23 EDT 2018

On Wed, Sep 5, 2018 at 8:24 AM Josef 'Jeff' Sipek <jeffpc at josefsipek.net>
wrote:

> There is a lot of info here, thanks for the write up!
>
> On Fri, Aug 31, 2018 at 15:47:34 -0700, Gregory Szorc wrote:
> ...
> > Assuming you only have primitive data retrieval commands, you are now
> > issuing a lot more commands.
>
> While I'm all for allowing simpler servers (and hopefully clients too), I'm
> worried about the chattiness of such a protocol - specifically the number of
> network round-trips that depend on previous commands completing.
>
> Over the years, I've seen plenty of protocols evolve to reduce chattiness.
> For example, NFSv4 added compounds - a way to pack up several RPCs and send
> them as a unit, SMB/CIFS reduced the number of RPCs, and so on.  I realize
> that both those examples are file systems, but I'd argue that their lessons
> apply here as well.
>
> Somewhat relatedly:  The jmap IETF working group [1] is working on a new way
> to access email - ideally replacing IMAP.  The interesting thing here is
> that the entire design is visibly targeting high latency links.
> (Personally, I think this is because the authors are from Australia and
> therefore they are very sensitive to latency.)  I don't know if there are
> any lessons in jmap that would apply here, but I would certainly encourage
> testing on high-latency & high-bandwidth links if there is any concern of
> chattiness in the new protocol.
>
> [1] https://datatracker.ietf.org/group/jmap/about/

I agree with the concerns about network round-trips: if we limit the
implementation to the set of commands I outlined, we will have problems on
high-latency networks.

I fully anticipate implementing supplemental commands which return larger
sets of data, to reduce both round-trip overhead and the amount of data
that clients need to send to the server. I'd like to think there is a
middle ground between the low-level commands and "getbundle" that is
friendlier to resumable clone, caching, etc. But if we need to implement a
command that returns all of the data for performance reasons (this command
may be "getbundle"), then so be it.
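
As a rough illustration of the round-trip concern, here is a
back-of-the-envelope sketch. The cost model (latency plus raw transfer
time) is a simplification, and every number in it is made up for the
example; nothing here was measured against a real server.

```python
def transfer_time(round_trips, rtt_s, payload_bytes, bandwidth_bps):
    """Estimated wall time: dependent round trips plus raw transfer time."""
    return round_trips * rtt_s + payload_bytes * 8 / bandwidth_bps


# Hypothetical link: 100 Mbit/s bandwidth, 200 ms round-trip time.
BANDWIDTH_BPS = 100 * 10**6
RTT_S = 0.2
PAYLOAD_BYTES = 100 * 10**6  # 100 MB of repository data (assumed)

# 500 primitive commands, each waiting on the previous one (assumed count).
chatty = transfer_time(500, RTT_S, PAYLOAD_BYTES, BANDWIDTH_BPS)

# One bulk command that returns everything, in the spirit of "getbundle".
bulk = transfer_time(1, RTT_S, PAYLOAD_BYTES, BANDWIDTH_BPS)

print(f"chatty: {chatty:.1f}s, bulk: {bulk:.1f}s")
```

With these assumed figures the transfer itself costs the same in both
cases, and the difference comes entirely from the dependent round trips.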

>
>
> ...
> > At the end of the day, the wire protocol command set will be driven by
> > practical needs, not by ivory tower architecting. We'll see what shortcuts
> > we need to employ in the name of performance and we'll implement them.
>
> That's good to hear.  I just hope that these "bonus" commands will fit more
> or less nicely into the new protocol design.  It'd be rather unfortunate if
> in the process of adding these bonus commands you reinvented getbundle.
>

Believe me, concern about reinventing "getbundle" and bundle2 has been on
my mind a lot :)

>
> ...
> > Since we are effectively talking about a new VCS at the wire protocol
> > level, let's talk about other crazy ideas. As Augie likes to say, once we
> > decide to incur a backwards compatibility break, we can drive a truck
> > through it.
> >
> > Let's talk about hashes.
> >
> > Mercurial uses SHA-1 for content indexing. We know we want to transition
> > off of SHA-1 eventually due to security weaknesses.
> ...
> > In addition, Mercurial has 2 ways to store manifests: flat and tree.
> ...
> >
> > One of the ideas I'm exploring in the new wire protocol is the idea of
> > "hash namespaces." Essentially, the server's capabilities will advertise
> > which hash flavors are supported. Example hash flavors could be
> > "hg-sha1-flat" for flat manifests using SHA-1 and "hg-blake2b-tree" for
> > tree manifests using blake2b. When a client makes a request, that request
> > will be associated with a "hash namespace" such that any nodes referenced
> > by that command are in the requested "hash namespace."
>
> While this idea is intriguing, it also means AFAICT that a changeset no
> longer has one globally unique ID.  E.g., consider the world where there
> are:
>
>         hg-sha256-flat
>         hg-blake2b-flat
>
> or:
>
>         hg-blake2b-flat
>         hg-blake2b-tree
>
> In both cases, the node id will be 32 bytes/64 hex chars long.  I can no
> longer paste at you a hash I see in 'hg log' and (1) know what hash function
> generated it, and (2) be certain that you can grep your 'hg log' output for
> it and find it.  This whole thing gets even more fun when you share
> abbreviated hashes - e.g., abc may be the shortest unique node prefix in
> both namespaces, but may map to completely different revisions.
>
> As a side note, wouldn't it be possible to deal with flat<->tree transitions
> by making a "dummy" commit that rewrites the manifest to the new format and
> sets some flag in .hg/requires?
>
> Anyway, as intriguing as this idea is, I'm skeptical that the resulting UX
> will be good.  It is also possible that I'm not fully understanding your idea
> here :)
>

I agree that the UI problems are concerning. I haven't fully thought
through them myself. From where I sit, I'm mostly concerned with building
an adaptable wire protocol and set of commands that's flexible for the next
10+ years. I think "hash namespaces" has the potential to solve a lot of
problems via server adaptability. Without them, every new hashing scheme is
a one-off and it makes transitions and experimenting much more difficult.
For example, with "hash namespaces" you could expose Git-indexed data
relatively easily. Without them, there is no obvious solution. And just
because the server feature exists doesn't mean we need to expose it to the
client.
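
A minimal sketch of how the negotiation might look, under stated
assumptions: the capability key "hashnamespaces", the request field
"hashnamespace", and both function names are hypothetical and are not
part of any existing Mercurial API. Only the flavor names come from the
proposal above.

```python
# What a server might advertise in its capabilities (hypothetical layout).
SERVER_CAPABILITIES = {
    "hashnamespaces": ["hg-sha1-flat", "hg-blake2b-tree"],
}


def pick_namespace(client_preference, server_caps):
    """Return the first namespace the client prefers that the server offers."""
    supported = set(server_caps.get("hashnamespaces", []))
    for namespace in client_preference:
        if namespace in supported:
            return namespace
    raise ValueError("no hash namespace in common with the server")


def build_request(command, namespace, nodes):
    """Tag a command so every node in it is read in the given namespace."""
    return {
        "command": command,
        "hashnamespace": namespace,
        "nodes": list(nodes),
    }


# A new clone prefers blake2b tree; an existing clone only knows SHA-1 flat.
new_clone = pick_namespace(["hg-blake2b-tree", "hg-sha1-flat"], SERVER_CAPABILITIES)
old_clone = pick_namespace(["hg-sha1-flat"], SERVER_CAPABILITIES)
request = build_request("changesetdata", old_clone, ["abc123"])  # made-up command and node
```

The point of the sketch is that the namespace travels with each request,
so the server can answer old and new clients from the same repository
without the two kinds of node being mixed in one command.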

I'm optimistic that by throwing this idea out there, others can think
about potential UX. e.g. clients could retrieve all hash namespaces and
automatically recognize which "namespace" a user-provided hash comes from
and convert among them transparently. I also agree that the lack of a
globally unique ID is concerning. Unfortunately, globally unique IDs seem
at odds with content-indexed IDs. There was talk in IRC ~22 hours ago about
using UUIDs for globally unique IDs. Apparently this isn't a new idea.
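
The client-side recognition and conversion idea could be sketched roughly
as follows. This is a toy model, not how Mercurial computes or stores
nodes: revisions are plain byte strings, node IDs are bare digests of
those bytes, and build_index, resolve and translate are hypothetical
helper names.

```python
import hashlib

# Toy hashers per namespace; the digest choices are assumptions.
HASHERS = {
    "hg-sha1-flat": lambda data: hashlib.sha1(data).hexdigest(),
    "hg-blake2b-flat": lambda data: hashlib.blake2b(data, digest_size=32).hexdigest(),
}


def build_index(revisions):
    """Map namespace -> {hex node: revision number} for toy revision data."""
    return {
        namespace: {hasher(data): rev for rev, data in enumerate(revisions)}
        for namespace, hasher in HASHERS.items()
    }


def resolve(index, prefix):
    """Return sorted (namespace, revision) pairs whose node starts with prefix."""
    return sorted(
        (namespace, rev)
        for namespace, nodes in index.items()
        for node, rev in nodes.items()
        if node.startswith(prefix)
    )


def translate(index, prefix, target_namespace):
    """Convert a user-provided hash to the node ID in another namespace."""
    revisions = {rev for _, rev in resolve(index, prefix)}
    if len(revisions) != 1:
        # Covers both an unknown prefix and one that matches different
        # revisions, possibly in different namespaces.
        raise LookupError(f"prefix {prefix!r} is unknown or ambiguous")
    wanted = revisions.pop()
    for node, rev in index[target_namespace].items():
        if rev == wanted:
            return node
    raise LookupError(f"revision {wanted} missing from {target_namespace}")


index = build_index([b"first", b"second", b"third"])
sha1_node = hashlib.sha1(b"second").hexdigest()
blake2b_node = translate(index, sha1_node, "hg-blake2b-flat")
```

A short prefix that matches different revisions in two namespaces makes
translate raise instead of guessing, which is the abbreviated-hash
problem raised above; the sketch only detects that case and does not
solve it.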

>
> > This feature, if implemented, would allow a server/repository to index and
> > serve data under multiple hashing methodologies simultaneously. For
> > example, pushes to the repository would be indexed under SHA-1 flat, SHA-1
> > tree, blake2b flat, and blake2b tree. Assuming the server operator opts
> > into this feature, new clones would use whatever format is
> > supported/recommended at that time. Existing clones would continue to
> > receive SHA-1 flat manifests. New clones would receive blake2b tree
> > manifests.
>
> See above about UX.
>
> Regardless, it is certainly something to experiment with and either keep or
> throw away.
>
> Thanks for all the work you've put in,
>
> Jeff.
>
> --
> Once you have their hardware. Never give it back.
> (The First Rule of Hardware Acquisition)
>