Differences between revisions 18 and 19
Revision 18 as of 2014-03-11 20:15:21
Size: 4892
Comment:
Revision 19 as of 2014-03-16 09:21:39
Size: 12906
Comment: add draft specification for the new bundle2 format
= Format of the Bundle2 Container =

== Goal ==

The goal of bundle2 is to act as an atomic packet to transmit a set of
payloads in an application agnostic way. It consists of a sequence of "parts"
that will be handed to and processed by the application layer.

A bundle2 can be read in a single pass from a stream.

A bundle2 starts with a small header, followed by a sequence of parts. Each
part has a header of its own.

== Main Header ==

This header contains information about the application agnostic bundle.

It is encoded as such:

 - bundle version (unsigned Byte)
 - main options:
   - size of main options (unsigned 32 bits integer)
   - main options (text)


Bundle version is a single unsigned Byte used to identify the version of the
bundle2 format used in this stream. It will be incremented when a major upgrade
of the protocol happens.

Current value is `0`.

Unbundling MUST abort when an unknown protocol version is met.

Note that aborts caused by an unknown protocol version are nasty, as we do not
know how much data remains to be read. This MUST result in a full scale panic
abort that invalidates the whole communication channel.
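
As a rough illustration of the version check, here is a minimal Python sketch. The names `read_version`, `BundleAbort` and `SUPPORTED_VERSIONS` are hypothetical and not part of this draft; the only value taken from the text is the current version `0`.

```python
import io

SUPPORTED_VERSIONS = {0}  # the draft gives 0 as the current value


class BundleAbort(Exception):
    """Hypothetical error: the stream can no longer be trusted."""


def read_version(stream):
    """Read the one-byte bundle version, aborting on unknown values."""
    raw = stream.read(1)
    if len(raw) != 1:
        raise BundleAbort("stream ended before the bundle version")
    version = raw[0]
    if version not in SUPPORTED_VERSIONS:
        # Nothing after this byte can be interpreted, so give up entirely.
        raise BundleAbort("unknown bundle2 version: %d" % version)
    return version
```

A caller would be expected to drop the whole connection when `BundleAbort` is raised, since the amount of remaining data is unknown.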

=== Main Options ===

First comes a 32 bits integer. It is the size in Bytes of the options themselves.

If people need more than 4GB of options I expect them to run into other
troubles first.

The main header data is the list of options that alter the behavior of the top
level bundle. This is intended only to control extraction of the payload from
it. This is -not- intended for any changes in the application level
understanding of the payload. The list of options follows these rules:

  - Content is a purely alphanumerical list of options,
  - The list is space separated,
  - Options MUST start with a letter,
  - upper case options are mandatory,
  - lower case options MAY be disregarded,
  - mixed case options SHOULD never happen but will be interpreted as upper case,
  - options may be given a value using the form `optionname=value`,
  - values are alphanumerical + `.`,

  - to summarise, options are a space separated list of things matching:
    `[A-Za-z][A-Za-z0-9]+(=[A-Za-z0-9.]+)?`

Note that the first piece sent is the size of the options section, so the
options themselves cannot be streamed. This is one more reason why you should
not intend to store huge data in the main options.

Note also that aborts caused by an unknown option are nasty, as we do not know
how much data remains to be read. This MUST result in a full scale panic abort
that invalidates the whole communication channel.
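
The rules above could be applied along the following lines. This is only a sketch under assumptions the draft does not state: integers are taken to be big-endian, the options text is taken to be ASCII, and the `=value` suffix is treated as optional because valueless options such as `debug` are listed as examples. `read_main_options` is a hypothetical name.

```python
import io
import re
import struct

# Pattern from the rules above, with the "=value" suffix treated as optional.
OPTION_RE = re.compile(r"[A-Za-z][A-Za-z0-9]+(=[A-Za-z0-9.]+)?")


def read_main_options(stream):
    """Return (mandatory, advisory) dicts parsed from the main options."""
    # Assumption: the 32 bits size is unsigned and big-endian.
    (size,) = struct.unpack(">I", stream.read(4))
    text = stream.read(size).decode("ascii")
    mandatory, advisory = {}, {}
    for item in text.split():
        if OPTION_RE.fullmatch(item) is None:
            raise ValueError("malformed main option: %r" % item)
        name, _, value = item.partition("=")
        # Only all lower case names are advisory; mixed case counts as upper.
        target = advisory if name.islower() else mandatory
        target[name] = value or None
    return mandatory, advisory
```

An unbundler would then abort if `mandatory` contains any name it does not know, and could silently ignore unknown entries of `advisory`.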

==== Examples of valid main options ====

 - Set a new format of part headers:

   `PARTVERSION=1`

 - Have the payload use a special compression algorithm

   `COMPRESSION=DOGEZIP`

 - Set encoding of strings in part-header to GOST13052 (or EBCDIC if you insist)

   `PARTENCODING=GOST13052`

 - Set integer format in part-header to middle-endian

   `ENDIANESS=PDP11`

==== Examples of -possibly- valid main options ====

 - ask for debug level output in the reply

   `debug`

 - inform of total number of parts:

   `nbparts=42`

 - inform of total size of the bundle:

   `totalsize=1337`

==== Examples of -invalid- main options ====

 - List of known heads (use a part for that)

 - username and/or credential (use a part for that)


== Parts ==

Parts convey the application level payload of the bundle. They are handled by
the application layer during the unbundle process.

A part consists of three elements: type, options and data.

Type is a simple alphanumerical identifier that lets the application level know
what kind of data the part contains and route it to the applicable processors.

 - upper case types are mandatory and MUST be processed by the server. If the
   server does not know how to handle an upper case type it MUST abort the
   unbundle process.

 - lower case types are advisory and MAY be disregarded during the unbundle process.

 - mixed case types SHOULD NOT appear and WILL be interpreted as upper case ones.


Options are a set of keys and values that may change the way the data from this
part will be processed. Some of them may be mandatory, others advisory.

Data are the actual payload of the part.

=== Parts Header ===


 - size of header (32bits integer)
 - header:
   - size of type (Byte)
   - part type (string (up to 255 char))
   - part mode (Byte)
   - data size (when applicable)
   - options: (see other section)


Note that the first entry is the full size of the header. So the header can't be
streamed and one should not plan to put massive data in the header itself.
(That is what the part data is meant for.)

The type is an alphanumerical string of arbitrary size (<256) that will be used
to find the application level handler that processes the data payload. It
follows the upper/lower case rules explained in the previous section. Note that
routing should be case insensitive: the lower case and upper case versions of
the same type MUST be handled by the same code. The case only matters when no
handler is found for a given type.
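
One possible shape for this case insensitive routing is sketched below. The registry `HANDLERS`, the helpers `register` and `find_handler`, and the exception `UnknownMandatoryPart` are all hypothetical names chosen for the illustration.

```python
class UnknownMandatoryPart(Exception):
    """Hypothetical error: a mandatory part type has no handler."""


HANDLERS = {}  # hypothetical registry, keyed by lower-cased part type


def register(parttype, handler):
    """Register one handler for both spellings of a part type."""
    HANDLERS[parttype.lower()] = handler


def find_handler(parttype):
    """Return the handler for a part type, or None if it may be skipped."""
    handler = HANDLERS.get(parttype.lower())
    if handler is None and not parttype.islower():
        # Upper case (or mixed case) types are mandatory: abort.
        raise UnknownMandatoryPart(parttype)
    return handler  # None means an advisory part with no handler
```

The case of the type is only inspected after the lookup fails, which matches the rule that it matters solely when no handler is found.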

The mode is an enum that defines how the data can be retrieved from the part.
The unbundle process MUST be aborted if an unknown mode is met. The two modes
foreseen for now are:

 - stream (0x0): the total size of the data is not yet known. See the dedicated
   section below for details.

 - plain (0x1): we know the total size of the data, which will be available
   directly after the header. The total size of the data is encoded in a 32bits
   integer right after the mode field.

Note that aborts caused by an unknown mode are nasty, as we do not know how much
data remains to be read. This MUST result in a full scale panic abort that
invalidates the whole communication channel.
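
To make the field order concrete, here is a hedged Python sketch that reads a part header up to the options. It assumes big-endian integers and an ASCII part type, neither of which is stated in this draft; `read_part_header` and the `MODE_*` constants are hypothetical names.

```python
import io
import struct

MODE_STREAM = 0x0
MODE_PLAIN = 0x1


def read_part_header(stream):
    """Return (type, mode, data size or None, raw options bytes)."""
    # Assumption: sizes are unsigned big-endian 32 bits integers.
    (headersize,) = struct.unpack(">I", stream.read(4))
    header = io.BytesIO(stream.read(headersize))
    typesize = header.read(1)[0]
    parttype = header.read(typesize).decode("ascii")
    mode = header.read(1)[0]
    if mode == MODE_PLAIN:
        # The data size is only present when the mode announces it.
        (datasize,) = struct.unpack(">I", header.read(4))
    elif mode == MODE_STREAM:
        datasize = None
    else:
        raise ValueError("unknown part mode: %#x" % mode)
    # Whatever is left in the header is the options block.
    return parttype, mode, datasize, header.read()
```

Because the whole header is read up front using its announced size, the unknown mode case is detected before any part data is consumed.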

==== Parts Options ====

Parts options are able to carry arbitrary bytes. Their encoding is therefore more
complicated than that of the main options.

 - number of mandatory options (32 bits integer)
 - number of advisory options (32 bits integer)
 - pairs of option sizes (sequence of 32bits integer couples)
 - options themselves

First come the numbers of mandatory and advisory options. Once the number of
options is known, we can read Nx2 integers to get the lengths of the key and
value of each option. Then we can proceed to reading all the options.

Note that this forces all mandatory options to be read before the advisory ones.
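
The reading order described above might look as follows in Python. This sketch follows the 32 bits counts listed in this section (the summary of the general structure lists them as Byte) and assumes big-endian integers; `read_part_options` is a hypothetical name.

```python
import io
import struct


def read_part_options(raw):
    """Return (mandatory, advisory) lists of (key, value) byte pairs."""
    stream = io.BytesIO(raw)
    # Assumption: both counts are unsigned big-endian 32 bits integers.
    nbmandatory, nbadvisory = struct.unpack(">II", stream.read(8))
    total = nbmandatory + nbadvisory
    # N pairs of (key length, value length), all read before any option.
    sizes = struct.unpack(">%dI" % (2 * total), stream.read(8 * total))
    options = []
    for i in range(total):
        key = stream.read(sizes[2 * i])
        value = stream.read(sizes[2 * i + 1])
        options.append((key, value))
    # Mandatory options come first, so a simple split is enough.
    return options[:nbmandatory], options[nbmandatory:]
```

Keys and values stay as raw bytes here, since parts options are allowed to carry arbitrary bytes.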

=== Part Data: ===

==== Parts Data: Stream Mode ====

In stream mode, data will be read as `<sizeofchunk><chunk>` until an empty chunk is found.

The size of each chunk is encoded in a 32 bits integer.

There is no constraint on the chunk size, but the bundler REALLY SHOULD NOT
use 1 Byte long chunks as that would be very inefficient. The bundler MAY WISH
TO stick to a stable and sensible chunk size, such as the 4096 Bytes used
elsewhere in the code base.
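
A minimal sketch of both sides of stream mode, assuming big-endian chunk sizes (the byte order is not specified in this draft). `iter_chunks` and `write_chunks` are hypothetical helper names.

```python
import io
import struct


def iter_chunks(stream):
    """Yield the chunks of a stream mode part until the empty chunk."""
    while True:
        (size,) = struct.unpack(">I", stream.read(4))
        if size == 0:
            return  # empty chunk: end of this part's data
        chunk = stream.read(size)
        if len(chunk) != size:
            raise ValueError("stream ended in the middle of a chunk")
        yield chunk


def write_chunks(data, chunksize=4096):
    """Encode data as <sizeofchunk><chunk> pairs plus the empty chunk."""
    out = b""
    for start in range(0, len(data), chunksize):
        chunk = data[start:start + chunksize]
        out += struct.pack(">I", len(chunk)) + chunk
    return out + struct.pack(">I", 0)
```

The reader never needs the total size in advance, which is what lets the bundler start sending before it knows how much data it will produce.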


==== Parts data: Plain Mode ====

In plain mode we know the total size of the data, so we can just read until we reach that amount of data.


=== End Of Bundle Marker ===

End of bundle is marked by an "empty part" with a 0 size header.
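
Putting the pieces together from the writer's point of view, a tiny bundle could be assembled as sketched below. The sketch assumes big-endian integers, an ASCII part type, and 32 bits option counts as in the Parts Options section; `build_plain_part` and `build_bundle` are hypothetical names, and the optional leading HG10 bundle is left out.

```python
import struct


def build_plain_part(parttype, data):
    """Encode one plain mode part carrying no options."""
    rawtype = parttype.encode("ascii")
    header = struct.pack(">B", len(rawtype)) + rawtype  # size of type, type
    header += struct.pack(">B", 0x1)                    # mode: plain
    header += struct.pack(">I", len(data))              # data size
    header += struct.pack(">II", 0, 0)                  # no mandatory/advisory options
    return struct.pack(">I", len(header)) + header + data


def build_bundle(parts):
    """Encode a version 0 bundle with no main options."""
    out = struct.pack(">B", 0)   # bundle version
    out += struct.pack(">I", 0)  # size of main options: none
    for parttype, data in parts:
        out += build_plain_part(parttype, data)
    out += struct.pack(">I", 0)  # end of bundle marker: 0 size header
    return out
```

With these assumptions, a bundle containing no part at all is just nine zero bytes: the version, an empty options section and the end of bundle marker.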

== Summary of the general structure ==

(the bundle2 format WOULD PROBABLY start with a fixed invalid//empty HG10 bundle)

 - main header
   - bundle version (unsigned Byte)
   - main options:
     - size of main options (unsigned 32 bits integer)
     - main options (text)

 - part: (any number of them)
   - size of header (32bits integer)
   - header:
     - size of type (Byte)
     - part type (string (up to 255 char))
     - part mode (Byte)
     - data size (when applicable)
     - options: (see other section)
       - number of mandatory options (Byte)
       - number of advisory options (Byte)
       - pair of options size (sequence of 32bits integer couple)
       - options themselves
   - data (Bytes (plenty of them))

 - end of bundle marker (empty part)

Note:

This page is primarily intended for developers of Mercurial.

This page describes the current plan to get a more modern and complete bundle format. (for old content of this page check BundleFormatHG19)

(current content is copy pasted from 2.9 sprint note)

New bundle format

  • lightweight
  • new manifest
  • general delta
  • bookmarks
  • phase boundaries
  • obsolete markers
  • >sha1 support

  • pushkey
  • extensible for new features (required and optional)
  • progress information
  • resumable?
  • transaction commit markers?

It's possible to envision a format that sends a change, its manifest, and filenodes in each chunk rather than sending all changesets, then all manifests, etc. capabilities

Changes in current command

Push Orchestration

Current situation
  • push:
    • changesets:
      • discovery
      • validation
      • actual push
    • phase:
      • discovery
      • pull
      • push
    • obsolescence
      • discovery
      • push
    • bookmark
      • discovery
      • push

Aimed orchestration

* push:

  • discovery:
    • changesets
    • phase
    • obs
    • bookmark
  • post-discovery action:
    • current use case: move phase for common changesets seen as public.
  • local-validation:
    • (much easier with everything in hand)
    • complains about:
      • multiple heads
      • new branch
      • troubled changeset
      • divergent bookmark
      • Rent in Manhattan
      • etc…
  • push:
    • (using multipart-bundle when possible)
      • The one and only remote side transaction happens here
  • (post-push) pull:
    • The server sends back its own multipart-bundle to the client
      • (The server would be able to reply with a multi-bundle, to inform the client of potential phase//bookmark//changeset rewrites etc…)

post-push pull

If we let the protocol send arbitrary data to the server, we need the server to be able to send back arbitrary data too.

The idea is to use the very same top level format. It could contain any kind of thing the client has advertised that it understands. This last phase is advisory: the client can totally decide to ignore its content.

Possible use cases are:

  • sending standard output back
  • sending standard error back
  • notification that a changeset was made public on push
  • notification of partially accepted changeset
  • notification of automatic bookmark move on the server
  • test case result (or test run key)
  • Automatic shipment of Pony to contributor address
  • … (Possibilities are endless)

Changes in Pull

Same kind of stuff will happen but pull is much simpler. (I'm not worried at all about it)

Change in Bundle/Unbundle

Unbundle would learn to unbundle both

Maybe we can have the new bundle format start with an invalid entry to prevent old unbundle from trying to import it

bundle should be able to produce the new bundle format. It probably cannot do so by default for a long time however :-/

Top level Bundle

content

On the remote side, the server will need to redo the validation that was done on the local side to ensure that nothing interesting happened between discovery and push. We need to send appropriate data to the remote for validation. This implies either arguments in the command data or a dedicated section in the bundle. The dedicated section seems the way to go as it feels more flexible. We do not know what kind of data will be monitored and sent, so we cannot build a sensible set of arguments doing the job. With a dedicated section in the multi-part bundle, we can make this section evolve over time to match the evolution of the data we send to the server.

foreseen sections

Here are the ideas we already have about sections

  • HG10 (old changeset bundle format)
  • HG19 (new changeset bundle with support for modern stuff)
  • pushkey data (phase, bookmarks)
  • obsolescence markers (format 1 and upcoming format 2 ?)
  • client capacity (to be used for the reply multi part bundle)

Changesets exchange

New header

type Header struct {
    length       uint32
    lNode        byte
    node         [lNode]byte

    // if empty (lP1 ==0) then default to previous node in the stream
    lP1          byte
    p1           [lP1]byte

    // if empty, nullrev
    lP2          byte
    p2           [lP2]byte

    // if empty, self (for changelogs)
    lLinknode    byte
    linknode     [lLinknode]byte

    // if empty, p1
    lDeltaParent byte
    deltaParent  [lDeltaParent]byte 
}
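
A rough Python sketch of how the length-prefixed fields and their defaults could be read. It deliberately leaves out the leading `length` field, whose exact meaning is not spelled out here, and uses a 20 byte null node as a stand-in for "nullrev"; both choices, as well as the names `read_field`, `read_fields` and `NULLID`, are assumptions made for the illustration.

```python
import io

NULLID = b"\0" * 20  # assumption: stand-in for "nullrev" as a null node


def read_field(stream):
    """Read one field stored as a length byte followed by that many bytes."""
    size = stream.read(1)[0]
    return stream.read(size)


def read_fields(stream, previousnode):
    """Read the five length-prefixed fields and apply the stated defaults."""
    node = read_field(stream)
    p1 = read_field(stream) or previousnode  # empty: previous node in the stream
    p2 = read_field(stream) or NULLID        # empty: nullrev
    linknode = read_field(stream) or node    # empty: self
    deltaparent = read_field(stream) or p1   # empty: p1
    return node, p1, p2, linknode, deltaparent
```

In the common linear case every field except the node is empty, so each entry costs only four extra length bytes.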

We'll modify the existing changegroup type so it can pretend to be a new changegroup that just has a variety of empty fields. Progress information fields might be optional.


CategoryNewFeatures

BundleFormat2 (last edited 2018-02-10 00:05:58 by AviKelman)