[PATCH 1 of 5] bundle2: very first version of a bundle2 bundler

Pierre-Yves David pierre-yves.david at ens-lyon.org
Wed Mar 19 23:01:38 CDT 2014



On 03/19/2014 08:34 PM, Gregory Szorc wrote:
> On 3/19/14, 3:02 PM, pierre-yves.david at ens-lyon.org wrote:
>> # HG changeset patch
>> # User Pierre-Yves David <pierre-yves.david at fb.com>
>> # Date 1395176450 25200
>> #      Tue Mar 18 14:00:50 2014 -0700
>> # Node ID f77651bc816e79ae36b17641f13194edb92e5dcf
>> # Parent  29b159bd71bc68e9868c5d2d748ab166dc7a5287
>> bundle2: very first version of a bundle2 bundler
>>
>> This changeset is the very first of a long series. It create a new
>> bundle2
>> module and add a simple class that generate and empty bundle2 container.
>>
>> The module is documented with the current state of the implementation.
>> For
>> information about the final goal you may want to consult the mercurial
>> wiki
>> page:
>>
>>      http://mercurial.selenic.com/wiki/BundleFormat2
>>
>> The documentation of the module will be updated with later patches
>> adding more and
>> more feature to the format.
>>
>> This patches also introduce a test case. This test case build and use
>> its own
>> small extension that use the new bundle2 module. Since the new format
>> is unable
>> to do anything right now, we could not use real mercurial code to test
>> it.
>> Moreover, some advanced feature of the bundle2 spec will not be used
>> by core
>> mercurial at all. So we need to have them tested.
>>
>> diff --git a/mercurial/bundle2.py b/mercurial/bundle2.py
>> new file mode 100644
>> --- /dev/null
>> +++ b/mercurial/bundle2.py
>> @@ -0,0 +1,84 @@
>> +# bundle2.py - generic container format to transmit arbitrary data.
>> +#
>> +# Copyright 2013 Facebook, Inc.
>> +#
>> +# This software may be used and distributed according to the terms of
>> the
>> +# GNU General Public License version 2 or any later version.
>> +"""Handling of the new bundle2 format
>> +
>> +The goal of bundle2 is to act as an atomically packet to transmit a
>> set of
>> +payloads in an application agnostic way. It consist in a sequence of
>> "parts"
>> +that will be handed to and processed by the application layer.
>> +
>> +
>> +General format architecture
>> +===========================
>> +
>> +The format is architectured as follow
>> +
>> + - magic string
>> + - stream level parameters
>> + - payload parts (any number)
>> + - end of stream marker.
>> +
>> +The current implementation is limited to empty bundle.
>> +
>> +Details on the Binary format
>> +============================
>> +
>> +All numbers are unsigned and big endian.
>> +
>> +stream level parameters
>> +------------------------
>> +
>> +Binary format is as follow
>> +
>> +:params size: (16 bits integer)
>> +
>> +  The total number of Bytes used by the parameters
>> +
>> +  Currently force to 0.
>> +
>> +:params value: arbitrary number of Bytes
>> +
>> +  A blob of `params size` containing the serialized version of all
>> stream level
>> +  parameters.
>> +
>> +  Currently always empty.
>> +
>> +
>> +Payload part
>> +------------------------
>> +
>> +Binary format is as follow
>> +
>> +:header size: (16 bits inter)
>> +
>> +  The total number of Bytes used by the part headers. When the header
>> is empty
>> +  (size = 0) this is interpreted as the end of stream marker.
>> +
>> +  Currently forced to 0 in the current state of the implementation
>> +"""
>
> This is my first time really looking at bundle format 2. I want it to be
> a good and robust format. So, I put together an informal review of the
> format. I'm assuming this is the appropriate place to post it (I don't
> see any threads discussing it specifically - perhaps my memory and
> searching are failing me).

No, we could probably make a thread of his own for generic design 
discussion..

> First, the composition of the bundle as a stream of extensible message
> types is really nice. It should adapt to serve Mercurial's needs - both
> for the core and for extensions - for the foreseeable future. If one
> wanted to go down the road of turning it into the next generation of the
> wire protocol, it's just a few changes away from fulfilling that need as
> well. I don't want the awesomeness of this feature to get lost in the
> criticisms that follow.
>
> The content on http://mercurial.selenic.com/wiki/BundleFormat2 differs
> significantly from what's documented/implemented in this patch series!
> I'm not sure what I should critique, so I'll hit both.

The BundleFormat2 page is lagging a bit behind. I had extensive 
discussion about it with David Soria Para and Matt Mackhall ealier and 
we made some adjustement.

I decided to start some coding as my coding/writing ratio was running 
low and that usually bad for my overall productivity.

> As implemented, you have 16 bits for parameters/headers. The wiki says
> 32. I think 64k for parameters/headers seems a bit restricting. You
> never know what people are going to cram in there. Why not go with 32?
> That being said, we should be discouraging parameters/headers in favor
> of separate message types, IMO, so maybe 64k is enough...

Yes, we reduced most size entry from 32 to 16 or 8 bits. re-increase of 
that may be discussed later before we actually release the format in the 
wild.

The main idea behind this reduction is that header data should see very 
few usage as people can had they own part. Unlike HTTP, bundle2 header 
are not used to transmit applicative data. They are restricted to 
"option about how to understand the payload"

>
> The wiki says there is a main header with a byte bundle version set to
> 0. This patch has a magic string. Which one is it? Why use 0 instead of
> 2 for the bundle version?

Magic string. "HG20" allow proper rejection of bundle2 file by old 
mercurial version.

>
> I think using spaces for delimiters in parameter encoding is somewhat
> silly for a binary protocol. The proposal on the wiki of using
> tightly-packed structs seems much more reasonable, IMO. What if you want
> to use a space in a parameter name or value? (Yes, existing wire
> protocol exchange uses space delimiting. I think that is a mistake.)

No, the wiki proposes tightly-packed structs for parts parameters only. 
It propose a similar text based format for the header.

The idea is that we can see very few valid usage for stream level 
paramaters and all of them should be simple enough to fit well into a 
text based format.

Both name and value are urlquoted in the final version, that should 
catch the few extra thing.

You can check it out here http://hg-lab.logilab.org/wip/hg


> It's unclear to me why we need separate framing for the global
> parameters/headers and per-part metadata. Can't they share the same
> framing. This is how HTTP 2.0 does things. See
> http://tools.ietf.org/html/draft-ietf-httpbis-http2-10#section-4 for
> reference. If we really need global parameters, we could require the
> first frame in a stream to be of a certain type.

The global header convey information about how the part should be 
retrieved (or very very high level data as debug flag). Must be read 
before we analyze any part. on the other hand is should be tiny.

> The wiki makes reference to a "stream" mode and a "plain" mode. I'm not
> sure why we need both. If you frame all messages and each message has a
> type, "stream" vs "plain" can be modeled by how different types
> interact. e.g. "start-changelog" followed by "continuation" would
> accomplish the same thing as "stream," without wasting a byte. Again,
> you could probably adopt some of HTTP 2.0's ideas here.

The mode idea have been dropped. The plain mode being a stream mode with 
a single chuck. With possible parameters to highlight this fact.

> How exactly does compression work in this bundle format? Does each
> payload part gets its own context? Do they share contexts? What about
> headers, part type names, etc? These often have a lot of duplicated
> strings. Do they not get the benefits of compression because they aren't
> in the [compressible] payload? Should we add support for a compression
> context that operates on just the header? How will compression
> interact/interfere with the streaming of already-compressed deltas, such
> as revlogs?

No clear plan here, but the current format should allow both bundle 
level compression (one of the valid usecase for stream parameter) and 
part level compression.

> The encoding of parts options according to the wiki seems to imply more
> space-delimited encoding. It's a binary protocol: talk structs, not
> strings. Don't leave the door open to escaping issues.

This encoding stuff is an -example- of what would qualify as a stream 
option. I do not encourage people to play with the encoding of part 
header. (Apparently the encoding used as example was not enough a hint 
;-) ). The same way, I do not expect anyone to switch to middle endian 
for the stream (I do not expect anyone to even use middle endian)

> Why is there an end-of-stream marker? Can't that be represented as an
> empty frame or frame with an EOF bit set? Think about how clients must
> implement this. Slurping bytes looking for an end-of-stream marker is
> overhead and may be ambiguous, especially if that marker can appear in
> the header of the next frame! For reference, HTTP 2.0 has "end of
> segment" and "end of stream" bits in the frame.

There is no special end-of-stream markers. The end-of-stream marker a 
part with an empty header. A single (probable) 16bits zero where people 
would find another part.


Does that address all your concern?

-- 
Pierre-Yves


More information about the Mercurial-devel mailing list