Review of bundle2: binary format

Tue Dec 23 15:44:31 CST 2014

On 12/19/14 3:47 PM, Pierre-Yves David wrote:
>
>
> On 12/18/2014 10:26 PM, Gregory Szorc wrote:
>> Pierre-Yves sent out a call to action to review the bundle2 format. Here
>> are my thoughts on... the binary format.
>
>> encoding of stream and part parameters
>> --------------------------------------
>>
>> I'm not a huge fan of either the stream level or part parameters
>> encodings.
>>
>> Stream level parameters are string-based and use "<name>=<value>" where
>> <name> and <value> are URL-encoded. The presence of binary size framing
>> already makes bundle2 a binary-based protocol (as opposed to say HTTP's
>> chunked-transfer encoding, which uses base 10 formatted integers to
>> delimit chunk lengths). I see little value in jumping through hoops to
>> make stream level parameters human readable. Let's aim to pack them as
>> tightly as possible, without having to encode the strings. The result
>> will be smaller and will be cheaper to process.
>
> If we end up having issues with the cost and size of stream level
> parameters, we really are in trouble. However the argument for binary
> encoding and convergence with part parameters sounds good to me.

You don't know what the future holds. And protocols are hard to change.
Down below, we talk about limiting frame sizes to 64k. For 1 GB of repo
data, that's 16k frames. Every 64 bytes of per-frame overhead therefore
costs 1 MB (16k frames x 64 bytes). At first glance, that doesn't seem
bad. But I don't know what text will appear in "repo data continuation"
frames.
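
The arithmetic behind those numbers, as a quick sketch. The 64 bytes per
frame is only the illustrative figure used above, not a measured header
size, and the names FRAME_SIZE, PER_FRAME_OVERHEAD and framing_overhead
are made up for this example:

```python
# Back-of-the-envelope framing overhead.
# PER_FRAME_OVERHEAD is an assumed figure for illustration only.
FRAME_SIZE = 64 * 1024       # 64k payload limit per frame
PER_FRAME_OVERHEAD = 64      # assumed bytes of header/text per frame


def framing_overhead(payload_bytes):
    """Return (frame count, total overhead in bytes) for a payload."""
    frames = -(-payload_bytes // FRAME_SIZE)  # ceiling division
    return frames, frames * PER_FRAME_OVERHEAD


frames, overhead = framing_overhead(1024 ** 3)  # 1 GB of repo data
print(frames, overhead)
```

Doubling the per-frame overhead doubles the total, which is why I care
about what ends up in each frame.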

>>
>> I like that the part parameters are encoded in binary and packed
>> efficiently. What I don't like is the packing format. Currently:
>>
>> * 1 byte # of mandatory parameters
>> * 1 byte # of advisory parameters
>> * N * 2 bytes of parameter name-value sizes
>> * N blobs of the raw (name, value) data
>>
>> What irks me the most is the "distance" between the parameter name/value
>> sizes and the raw data. Typically in framed protocols you have the size
>> immediately followed by the data. This common approach keeps readers and
>> writers simpler: they are very simple state machines that slurp single
>> frames at a time. As implemented, readers and writers need to pre-load a
>> bunch of data/sizes and iterate. This is needless state and complexity.
>> I'd rather see consecutive frames of parameters. e.g.
>
> The advantage of the current format is that it requires a minimal amount
> of reads, you can know the exact amount of data to read in 2 unpackings.

To clarify, # of reads is the same: it is # of accesses that increases. 
The whole frame/header is presumably pre-fetched from the wire. No extra 
system calls here. I don't think there is a significant performance 
difference between these two proposals.

> Having the parameters parsing stream-able does not seem valuable, as
> none of them are going to be processed until the whole part header is
> decoded.
>
> (so I'm not super convinced by this point)

I'm not arguing for stream-able over-the-wire parameter parsing: I'm 
arguing for a protocol that makes reading and writing the payload 
simpler (while not sacrificing performance).
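
A minimal sketch of the difference, in Python 3. It assumes 1-byte sizes
as in the layout listed above. The interleaved layout (each size
immediately followed by its data) is one hypothetical arrangement for
illustration, and all function names here are invented for the example,
not taken from the bundle2 code:

```python
import struct


def encode_current(mandatory, advisory):
    """Counts, then all name/value sizes, then all raw blobs."""
    params = mandatory + advisory
    out = [struct.pack('>BB', len(mandatory), len(advisory))]
    for name, value in params:
        out.append(struct.pack('>BB', len(name), len(value)))
    for name, value in params:
        out.append(name + value)
    return b''.join(out)


def decode_current(data):
    """The reader must pre-load every size before touching any blob."""
    mancount, advcount = struct.unpack_from('>BB', data, 0)
    n = mancount + advcount
    sizes = struct.unpack_from('>%dB' % (2 * n), data, 2)
    offset = 2 + 2 * n
    params = []
    for i in range(n):
        nsize, vsize = sizes[2 * i], sizes[2 * i + 1]
        name = data[offset:offset + nsize]
        offset += nsize
        value = data[offset:offset + vsize]
        offset += vsize
        params.append((name, value))
    return params[:mancount], params[mancount:]


def encode_interleaved(mandatory, advisory):
    """Hypothetical layout: each size is immediately followed by its data."""
    out = [struct.pack('>BB', len(mandatory), len(advisory))]
    for name, value in mandatory + advisory:
        out.append(struct.pack('>B', len(name)) + name)
        out.append(struct.pack('>B', len(value)) + value)
    return b''.join(out)


def decode_interleaved(data):
    """The reader is a loop over one tiny 'read a frame' step."""
    mancount, advcount = struct.unpack_from('>BB', data, 0)
    offset = 2

    def readframe():
        nonlocal offset
        size = data[offset]
        blob = data[offset + 1:offset + 1 + size]
        offset += 1 + size
        return blob

    params = [(readframe(), readframe())
              for _ in range(mancount + advcount)]
    return params[:mancount], params[mancount:]


# Parameter names and values below are illustrative only.
mandatory = [(b'version', b'02')]
advisory = [(b'hint', b'abc')]
current = encode_current(mandatory, advisory)
interleaved = encode_interleaved(mandatory, advisory)
print(len(current), len(interleaved))
```

Both encodings use the same number of bytes in this sketch. Only the
order of the fields changes, so the simpler reader does not cost any
extra space on the wire.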

>> efficiency of encoding of part types and parameters
>> ---------------------------------------------------
>>
>> The protocol uses strings to identify part types and their parameter
>> names and values. Presumably a lot of parts and their parameters will be
>> common and repeated.
>>
>> HTTP experienced pains with string metadata (headers) and they have
>> devised a mechanism for compressing headers in HTTP/2
>> (https://http2.github.io/http2-spec/compression.html).
>
> I'm not sure if this will be an issue. The main size of the repo should
> live in payload anyway.

Again, you don't know what the future holds. This wasn't initially an 
issue with HTTP either. But then internet pipes became fatter, browsers 
became more powerful, and pages started containing dozens or hundreds of 
resources.

Mercurial != HTTP, so yeah, we probably don't need to worry so much. I 
just wanted to raise this as an idea.
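
If repeated strings ever did become a concern, the general idea could be
sketched as below. This is a toy indexing table loosely in the spirit of
HTTP/2 header compression, not an implementation of it and not a
proposal for the actual wire format. The tag bytes, the 256-entry limit
and every name in the code are assumptions made up for this example:

```python
class StringTable:
    """Remember strings already sent so repeats can be sent as an index."""

    def __init__(self):
        self.strings = []
        self.index = {}

    def add(self, s):
        # Cap at 256 entries so an index always fits in one byte.
        if s not in self.index and len(self.strings) < 256:
            self.index[s] = len(self.strings)
            self.strings.append(s)


def encode_strings(strings):
    """Emit tag 0 + size + literal the first time, tag 1 + index after."""
    table = StringTable()
    out = []
    for s in strings:
        if s in table.index:
            out.append(b'\x01' + bytes([table.index[s]]))
        else:
            out.append(b'\x00' + bytes([len(s)]) + s)
            table.add(s)
    return b''.join(out)


def decode_strings(data):
    """Mirror of encode_strings: rebuild the same table while reading."""
    table = StringTable()
    out = []
    offset = 0
    while offset < len(data):
        tag = data[offset]
        if tag == 1:
            out.append(table.strings[data[offset + 1]])
            offset += 2
        else:
            size = data[offset + 1]
            s = data[offset + 2:offset + 2 + size]
            table.add(s)
            out.append(s)
            offset += 2 + size
    return out


# Illustrative names only; strings must be shorter than 256 bytes here.
names = [b'changegroup', b'version', b'changegroup', b'changegroup']
encoded = encode_strings(names)
print(len(encoded), sum(1 + len(s) for s in names))
```

The first occurrence of a string costs one extra byte compared to a
plain length-prefixed literal, and every repeat costs two bytes total.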