[PATCH 1 of 5] bundle2: very first version of a bundle2 bundler

Gregory Szorc gregory.szorc at gmail.com
Wed Mar 19 22:34:53 CDT 2014


On 3/19/14, 3:02 PM, pierre-yves.david at ens-lyon.org wrote:
> # HG changeset patch
> # User Pierre-Yves David <pierre-yves.david at fb.com>
> # Date 1395176450 25200
> #      Tue Mar 18 14:00:50 2014 -0700
> # Node ID f77651bc816e79ae36b17641f13194edb92e5dcf
> # Parent  29b159bd71bc68e9868c5d2d748ab166dc7a5287
> bundle2: very first version of a bundle2 bundler
>
> This changeset is the very first of a long series. It create a new bundle2
> module and add a simple class that generate and empty bundle2 container.
>
> The module is documented with the current state of the implementation. For
> information about the final goal you may want to consult the mercurial wiki
> page:
>
>      http://mercurial.selenic.com/wiki/BundleFormat2
>
> The documentation of the module will be updated with later patches adding more and
> more feature to the format.
>
> This patches also introduce a test case. This test case build and use its own
> small extension that use the new bundle2 module. Since the new format is unable
> to do anything right now, we could not use real mercurial code to test it.
> Moreover, some advanced feature of the bundle2 spec will not be used by core
> mercurial at all. So we need to have them tested.
>
> diff --git a/mercurial/bundle2.py b/mercurial/bundle2.py
> new file mode 100644
> --- /dev/null
> +++ b/mercurial/bundle2.py
> @@ -0,0 +1,84 @@
> +# bundle2.py - generic container format to transmit arbitrary data.
> +#
> +# Copyright 2013 Facebook, Inc.
> +#
> +# This software may be used and distributed according to the terms of the
> +# GNU General Public License version 2 or any later version.
> +"""Handling of the new bundle2 format
> +
> +The goal of bundle2 is to act as an atomically packet to transmit a set of
> +payloads in an application agnostic way. It consist in a sequence of "parts"
> +that will be handed to and processed by the application layer.
> +
> +
> +General format architecture
> +===========================
> +
> +The format is architectured as follow
> +
> + - magic string
> + - stream level parameters
> + - payload parts (any number)
> + - end of stream marker.
> +
> +The current implementation is limited to empty bundle.
> +
> +Details on the Binary format
> +============================
> +
> +All numbers are unsigned and big endian.
> +
> +stream level parameters
> +------------------------
> +
> +Binary format is as follow
> +
> +:params size: (16 bits integer)
> +
> +  The total number of Bytes used by the parameters
> +
> +  Currently force to 0.
> +
> +:params value: arbitrary number of Bytes
> +
> +  A blob of `params size` containing the serialized version of all stream level
> +  parameters.
> +
> +  Currently always empty.
> +
> +
> +Payload part
> +------------------------
> +
> +Binary format is as follow
> +
> +:header size: (16 bits inter)
> +
> +  The total number of Bytes used by the part headers. When the header is empty
> +  (size = 0) this is interpreted as the end of stream marker.
> +
> +  Currently forced to 0 in the current state of the implementation
> +"""

This is my first time really looking at bundle format 2. I want it to be 
a good and robust format. So, I put together an informal review of the 
format. I'm assuming this is the appropriate place to post it (I don't 
see any threads discussing it specifically - perhaps my memory and 
searching are failing me).

First, the composition of the bundle as a stream of extensible message 
types is really nice. It should adapt to serve Mercurial's needs - both 
for the core and for extensions - for the foreseeable future. If one 
wanted to go down the road of turning it into the next generation of the 
wire protocol, it's just a few changes away from fulfilling that need as 
well. I don't want the awesomeness of this feature to get lost in the 
criticisms that follow.

The content on http://mercurial.selenic.com/wiki/BundleFormat2 differs 
significantly from what's documented/implemented in this patch series! 
I'm not sure what I should critique, so I'll hit both.

As implemented, you have 16 bits for parameters/headers. The wiki says 
32. I think 64k for parameters/headers seems a bit restricting. You 
never know what people are going to cram in there. Why not go with 32? 
That being said, we should be discouraging parameters/headers in favor 
of separate message types, IMO, so maybe 64k is enough...

The wiki says there is a main header with a byte bundle version set to 
0. This patch has a magic string. Which one is it? Why use 0 instead of 
2 for the bundle version?

I think using spaces for delimiters in parameter encoding is somewhat 
silly for a binary protocol. The proposal on the wiki of using 
tightly-packed structs seems much more reasonable, IMO. What if you want 
to use a space in a parameter name or value? (Yes, existing wire 
protocol exchange uses space delimiting. I think that is a mistake.)

It's unclear to me why we need separate framing for the global 
parameters/headers and per-part metadata. Can't they share the same 
framing. This is how HTTP 2.0 does things. See 
http://tools.ietf.org/html/draft-ietf-httpbis-http2-10#section-4 for 
reference. If we really need global parameters, we could require the 
first frame in a stream to be of a certain type.

The wiki makes reference to a "stream" mode and a "plain" mode. I'm not 
sure why we need both. If you frame all messages and each message has a 
type, "stream" vs "plain" can be modeled by how different types 
interact. e.g. "start-changelog" followed by "continuation" would 
accomplish the same thing as "stream," without wasting a byte. Again, 
you could probably adopt some of HTTP 2.0's ideas here.

How exactly does compression work in this bundle format? Does each 
payload part gets its own context? Do they share contexts? What about 
headers, part type names, etc? These often have a lot of duplicated 
strings. Do they not get the benefits of compression because they aren't 
in the [compressible] payload? Should we add support for a compression 
context that operates on just the header? How will compression 
interact/interfere with the streaming of already-compressed deltas, such 
as revlogs?

The encoding of parts options according to the wiki seems to imply more 
space-delimited encoding. It's a binary protocol: talk structs, not 
strings. Don't leave the door open to escaping issues.

Why is there an end-of-stream marker? Can't that be represented as an 
empty frame or frame with an EOF bit set? Think about how clients must 
implement this. Slurping bytes looking for an end-of-stream marker is 
overhead and may be ambiguous, especially if that marker can appear in 
the header of the next frame! For reference, HTTP 2.0 has "end of 
segment" and "end of stream" bits in the frame.

I think that's good enough for a first pass informal review. What do you 
think?


More information about the Mercurial-devel mailing list