Differences between revisions 24 and 25
Revision 24 as of 2014-03-23 19:32:14
Size: 12289
Editor: mpm
Comment: pages start with titles
Revision 25 as of 2014-03-27 17:49:48
Size: 12080
Comment:



Note:

This page is primarily intended for developers of Mercurial.

BundleFormat2

This page describes the current plan to get a more modern and complete bundle format. (For the old content of this page, see BundleFormatHG19.)

(The current content is copy-pasted from the 2.9 sprint notes.)

Why a New bundle format?

  • lightweight
  • new manifest
  • general delta
  • bookmarks
  • phase boundaries
  • obsolete markers
  • >sha1 support

  • pushkey
  • extensible for new features (required and optional)
  • progress information
  • resumable?
  • transaction commit markers?
  • recursive (to be able to bundle subrepos)

It's possible to envision a format that sends a change, its manifest, and filenodes in each chunk rather than sending all changesets, then all manifests, etc. capabilities

Changes in current command

Push Orchestration

Current situation

  • push:
    • changesets:
      • discovery
      • validation
      • actual push
    • phase:
      • discovery
      • pull
      • push
    • obsolescence
      • discovery
      • push
    • bookmark
      • discovery
      • push

Aimed orchestration

* push:

  • discovery:
    • changesets
    • phase
    • obs
    • bookmark
  • post-discovery action:
    • current use case: move the phase of common changesets seen as public.
  • local-validation:
    • (much easier with everything in hand)
    • complains about:
      • multiple heads
      • new branch
      • troubled changesets
      • divergent bookmark
      • missing subrepo revisions
      • Rent in Manhattan
      • etc…
  • push:
    • (using multipart-bundle when possible)
      • The one and only remote-side transaction happens here
  • (post-push) pull:
    • The server sends back its own multipart-bundle to the client
      • (The server would be able to reply with a multipart-bundle, to inform the client of potential phase/bookmark/changeset rewrites etc…)

post-push pull

If we let the protocol send arbitrary data to the server, we need the server to be able to send back arbitrary data too.

The idea is to use the very same top-level format. It could contain any kind of thing the client has advertised that it understands. This last phase is advisory: the client can decide to ignore its content entirely.

Possible use cases are:

  • sending standard output back
  • sending standard error back
  • notification that a changeset was made public on push
  • notification of partially accepted changeset
  • notification of automatic bookmark move on the server
  • test case result (or test run key)
  • Automatic shipment of Pony to contributor address
  • … (Possibilities are endless)

Changes in Pull

The same kind of changes will happen, but pull is much simpler. (I'm not worried at all about it.) It may also allow pulling subrepo revisions efficiently.

Change in Bundle/Unbundle

Unbundle would learn to unbundle both formats.

Maybe we can have the new bundle format start with an invalid entry to prevent old unbundle code from trying to import it.

bundle should be able to produce the new bundle format. It probably cannot do so by default for a long time, however :-/

We could also do a "recursive bundle" in the presence of subrepos. A bundle could contain parts that are bundles of the subrepo revisions referenced by the revisions contained in the main bundle.

Top level Bundle

content

On the remote side, the server will need to redo the validation that was done on the local side, to ensure that nothing interesting happened between discovery and push. We need to send appropriate data to the remote for validation. This implies either arguments in the command data or a dedicated section in the bundle. The dedicated section seems the way to go, as it feels more flexible: we do not know what kind of data will be monitored and sent, so we cannot build a sensible set of arguments doing the job. With a dedicated section in the multi-part bundle, we can make this section evolve over time to match the evolution of the data we send to the server.

Foreseen sections

Here are the ideas we already have about sections:

  • HG10 (old changeset bundle format)
  • HG19 (new changeset bundle with support for modern stuff)
  • pushkey data (phase, bookmarks)
  • obsolescence markers (format 1 and upcoming format 2 ?)
  • client capabilities (to be used for the reply multi-part bundle)
  • presence of subrepo bundles

Format of the Bundle2 Container

Goal

The goal of bundle2 is to act as an atomic packet to transmit a set of payloads in an application-agnostic way. It consists of a sequence of "parts" that will be handed to and processed by the application layer.

A bundle2 can be read in a single pass from a stream.

A bundle2 starts with a small header, followed by a sequence of parts. Parts have a header of their own.

Main Header

This header contains information about the application agnostic bundle.

It is encoded as such:

  • Magic string 'HG20'
  • stream parameter:
    • size of main stream parameters (unsigned 16 bits integer)
    • main stream parameter (text)

Unbundling MUST abort when an unknown magic string is met.

Note that aborts caused by an unknown magic string are nasty, as we do not know how much data remains to be read. This MUST result in a full-scale panic abort that invalidates the whole communication channel.
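As a minimal sketch of the main header layout described above, the reader below checks the magic string and then reads the size-prefixed parameter text. The names `read_main_header` and `BundleAbort` are hypothetical, and the byte order (big-endian) and the ASCII decoding of the parameter text are assumptions, since this page does not specify them.

```python
import struct

MAGIC = b"HG20"


class BundleAbort(Exception):
    """Hypothetical error: the stream cannot be processed any further."""


def _read_exact(stream, n):
    data = stream.read(n)
    if len(data) != n:
        raise BundleAbort("truncated stream")
    return data


def read_main_header(stream):
    """Return the raw stream-parameter text that follows the magic string."""
    magic = stream.read(len(MAGIC))
    if magic != MAGIC:
        # Unknown magic: we cannot know how much data remains, so give up.
        raise BundleAbort("unknown magic string: %r" % magic)
    # Unsigned 16-bit size; big-endian is an assumption of this sketch.
    (size,) = struct.unpack(">H", _read_exact(stream, 2))
    # The parameters are urlquoted text, so ASCII is assumed here.
    return _read_exact(stream, size).decode("ascii")
```

The caller is expected to treat `BundleAbort` as fatal for the whole communication channel, not only for the current bundle.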

Stream Options

First comes a 16-bit integer: the size in Bytes of the parameters themselves, that is, the size of the data in the main header. Anyone who needs more than 64k of parameters can be expected to run into other trouble first.

The main header data is the list of parameters that alter the behavior of the top-level bundle. It is intended only to control extraction of the payload parts; it is -not- intended for any change in the application-level understanding of the payload. The parameters are formatted as a space-separated list of entries. Each entry has the form <name>[=<value>]. Both name and value are urlquoted. The entry name MUST start with a letter. Entries whose name starts with a capital letter are mandatory: the unbundling process MUST abort if an unknown mandatory parameter is encountered. Entries whose name starts with a lower-case letter may be safely ignored when unknown.

Note that the first piece sent is the size of the parameters section, so the parameters themselves cannot be streamed. This is one more reason why you should not intend to store huge data in the main options.

Note also that aborts caused by an unknown option are nasty, as we do not know how much data remains to be read. This MUST result in a full-scale panic abort that invalidates the whole communication channel.
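The parameter rules above can be sketched as follows. This is an illustration only: `parse_stream_params`, `UnknownMandatoryParameter` and the `known` argument are hypothetical names, and `urllib.parse.unquote` stands in for "urlquoted" decoding.

```python
from urllib.parse import unquote


class UnknownMandatoryParameter(Exception):
    """Hypothetical error: the whole unbundle must abort when this is raised."""


def parse_stream_params(text, known):
    """Parse '<name>[=<value>]' entries separated by single spaces.

    `known` is the set of parameter names this reader understands.
    """
    params = {}
    for entry in text.split(" "):
        if not entry:
            continue
        name, sep, value = entry.partition("=")
        name = unquote(name)
        if not name[:1].isalpha():
            raise ValueError("parameter name must start with a letter: %r" % name)
        if name not in known:
            if name[0].isupper():
                # Capital first letter: mandatory, so an unknown one is fatal.
                raise UnknownMandatoryParameter(name)
            continue  # unknown advisory parameter: safely ignored
        params[name] = unquote(value) if sep else None
    return params
```

For instance, a reader that only knows `COMPRESSION` and `nbparts` would accept `COMPRESSION=DOGEZIP debug nbparts=42` and silently drop `debug`, but would abort on `ENDIANESS=PDP11`.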

Examples of valid stream options

These are examples, **not actual proposals of final parameters**. Some of them are actually very clowny.

  • Set a new format of part headers:
    • PARTVERSION=1

  • Have the payload use a special compression algorithm
    • COMPRESSION=DOGEZIP

  • Set encoding of string in part-header to GOST13052 (or EBCDIC if you insist)
    • PARTENCODING=GOST13052

  • Set integer format in part-header to middle-endian
    • ENDIANESS=PDP11

Examples of -possibly- valid main options
  • ask for debug level output in the reply
    • debug

  • inform of total number of parts:
    • nbparts=42

  • inform of total size of the bundle:
    • totalsize=1337

Examples of -invalid- main options
  • List of known heads (use a part for that)
  • username and/or credential (use a part for that)

Parts

Parts convey the application level payload of the bundle. They are handled by the application layer during the unbundle process.

A part consists of three elements: type, parameters and data.

Type is a simple alphanumerical identifier that lets the application level know what kind of data the part contains and route it to the applicable processors.

  • Types with a capital first letter are mandatory and MUST be processed by the server. If the server does not know how to handle an upper-case type, it MUST abort the unbundle process.
  • Types with a lower-case first letter are advisory and CAN be disregarded during the unbundle process.

Options are a set of keys and values that may change the way the data from this part is processed. Some of them may be mandatory, others advisory.

Data are the actual payload of the part.

Parts Header

  • size of header (16bits integer)
  • header:
    • size of type (Byte)
    • part type (string (up to 255 char))
    • parameters: (see other section)

Note that the first entry is the full size of the header, so the header can't be streamed and one should not plan to put massive data in the header itself. (That is what the part data is meant for.)

The type is an alphanumerical string of arbitrary size (<256) that is used to find the application-level handler that processes the data payload. It follows the upper/lower-case rules explained in the previous section. Note that routing should be case-insensitive: the lower-case and upper-case versions of the same type MUST be handled by the same code. The case only matters when no handler is found for a given type.
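A minimal sketch of that routing rule, assuming handlers are kept in a dictionary keyed by lower-case type name. `find_handler`, `UnknownMandatoryPart` and the handler table are hypothetical and only illustrate the case rules.

```python
class UnknownMandatoryPart(Exception):
    """Hypothetical error: the unbundle process must abort."""


def find_handler(parttype, handlers):
    """Look up the handler for a part type, ignoring case.

    `handlers` maps lower-case type names to callables. Returns None for
    an unknown advisory part, which the caller may simply skip.
    """
    handler = handlers.get(parttype.lower())
    if handler is None and parttype[:1].isupper():
        # The case of the first letter only matters when no handler exists.
        raise UnknownMandatoryPart(parttype)
    return handler
```

With this shape, `CHANGEGROUP` and `changegroup` reach the same code, and only the missing-handler path looks at the case of the first letter.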

Parts Options

Parts parameters are able to carry arbitrary bytes. Their encoding is therefore more complicated than that of the stream parameters.

  • number of mandatory parameters (Byte)
  • number of advisory parameters (Byte)
  • pairs of parameter sizes (sequence of Byte couples)
  • parameters themselves

First come the numbers of mandatory and advisory parameters. Once the total number N of parameters is known, we can read N×2 Bytes to get the length of the key and of the value of each parameter. Then we can proceed to read all the parameters.

Note that this forces all mandatory parameters to be read before the advisory ones.
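The decoding order described above could look like the sketch below. It assumes that the key and value bytes of each parameter follow one another in the same order as their size couples, which this page implies but does not state outright; `decode_part_params` is a hypothetical name.

```python
import struct


def decode_part_params(blob):
    """Split an encoded parameter block into (mandatory, advisory) lists.

    Each list holds (key, value) pairs of raw bytes.
    """
    nmandatory, nadvisory = struct.unpack_from("BB", blob, 0)
    total = nmandatory + nadvisory
    # N x 2 size bytes: (key length, value length) for each parameter.
    sizes = struct.unpack_from("%dB" % (2 * total), blob, 2)
    offset = 2 + 2 * total
    pairs = []
    for i in range(total):
        keylen, valuelen = sizes[2 * i], sizes[2 * i + 1]
        key = blob[offset:offset + keylen]
        offset += keylen
        value = blob[offset:offset + valuelen]
        offset += valuelen
        pairs.append((key, value))
    if offset > len(blob):
        raise ValueError("truncated parameter block")
    # Mandatory parameters come first, advisory ones after.
    return pairs[:nmandatory], pairs[nmandatory:]
```

Because every length fits in one Byte, keys and values are limited to 255 bytes each in this layout.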

Part Data:

Data is read as <sizeofchunk><chunk> until an empty chunk is found.

The size of each chunk is encoded in a 32 bits integer.

There is no constraint on the chunk size, but the bundler REALLY SHOULD NOT use 1-Byte chunks as that would be very inefficient. The bundler MAY WISH TO stick to a stable and sensible chunk size, such as the 4096 Bytes used elsewhere in the code base.
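To illustrate the chunk framing, here is a small writer and reader pair. The helper names are hypothetical, and the sketch assumes the 32-bit size is unsigned and big-endian, which this page does not specify.

```python
import struct


def write_chunks(out, payload, chunksize=4096):
    """Write payload as <sizeofchunk><chunk> pairs, then an empty chunk."""
    for start in range(0, len(payload), chunksize):
        chunk = payload[start:start + chunksize]
        out.write(struct.pack(">I", len(chunk)))
        out.write(chunk)
    out.write(struct.pack(">I", 0))  # empty chunk: end of part data


def iter_chunks(stream):
    """Yield chunks until the empty chunk is found."""
    while True:
        header = stream.read(4)
        if len(header) != 4:
            raise ValueError("truncated chunk size")
        (size,) = struct.unpack(">I", header)
        if size == 0:
            return
        chunk = stream.read(size)
        if len(chunk) != size:
            raise ValueError("truncated chunk")
        yield chunk
```

Since the terminator is an empty chunk, a producer can start emitting data without knowing the total payload size in advance.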

End Of Bundle Marker

End of bundle is marked by an "empty part" with a 0-size header.
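Putting the part framing together, a reader can loop over parts until it meets the 0-size header. The sketch below is only an illustration: `iter_parts` is a hypothetical name, the big-endian byte order is an assumption, and the payload is joined in memory for brevity where a real reader would stream it to the handler.

```python
import struct


def _read_exact(stream, n):
    data = stream.read(n)
    if len(data) != n:
        raise ValueError("truncated stream")
    return data


def iter_parts(stream):
    """Yield (raw_header, payload) for each part, stopping at the end marker."""
    while True:
        (headersize,) = struct.unpack(">H", _read_exact(stream, 2))
        if headersize == 0:
            return  # empty part: end of bundle
        header = _read_exact(stream, headersize)
        chunks = []
        while True:
            (chunksize,) = struct.unpack(">I", _read_exact(stream, 4))
            if chunksize == 0:
                break  # empty chunk: end of this part's data
            chunks.append(_read_exact(stream, chunksize))
        yield header, b"".join(chunks)
```

Nothing after the end marker is consumed, so the caller can keep using the same stream for whatever follows the bundle.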

Summary of the general structure

(the bundle2 format WOULD PROBABLY start with a fixed invalid/empty HG10 bundle)

  • main header
    • bundle version (unsigned Byte)
    • main parameters:
      • size of main parameters (unsigned 16 bits integer)
      • main parameters (text)
  • part: (any number of them)
    • size of header (16bits integer)
    • header:
      • size of type (Byte)
      • part type (string (up to 255 char))
      • parameters: (see other section)
        • number of mandatory parameters (Byte)
        • number of advisory parameters (Byte)
        • pair of parameters size (sequence of Byte couple)
        • parameters themselves
    • data (Bytes (plenty of them))
  • empty part (acts as end of bundle marker)

New type of Part

Changesets exchange

New header

type Header struct {
    length uint32

    lNode byte
    node  []byte // lNode bytes

    // if empty (lP1 == 0), defaults to the previous node in the stream
    lP1 byte
    p1  []byte // lP1 bytes

    // if empty, nullrev
    lP2 byte
    p2  []byte // lP2 bytes

    // if empty, self (for changelogs)
    lLinknode byte
    linknode  []byte // lLinknode bytes

    // if empty, p1
    lDeltaParent byte
    deltaParent  []byte // lDeltaParent bytes
}

We'll modify the existing changegroup type so it can pretend to be a new changegroup that just has a variety of empty fields. Progress information fields might be optional.
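The defaulting rules in the header struct can be sketched as a reader like the one below. Several points are assumptions of the sketch and not part of the plan: the byte order of `length` (big-endian), returning `length` without interpreting it, and representing "nullrev" as a run of zero bytes as long as the node. The name `read_delta_header` is hypothetical.

```python
import struct


def _read_field(stream):
    """Read one length-prefixed field: a size Byte, then that many bytes."""
    (size,) = struct.unpack("B", stream.read(1))
    return stream.read(size)


def read_delta_header(stream, prevnode):
    """Read one header and fill in the defaults for empty fields.

    `prevnode` is the node of the previous entry in the stream.
    """
    (length,) = struct.unpack(">I", stream.read(4))
    node = _read_field(stream)
    p1 = _read_field(stream) or prevnode            # empty: previous node
    p2 = _read_field(stream) or b"\0" * len(node)   # empty: null (assumed encoding)
    linknode = _read_field(stream) or node          # empty: self
    deltaparent = _read_field(stream) or p1         # empty: p1
    return {
        "length": length,
        "node": node,
        "p1": p1,
        "p2": p2,
        "linknode": linknode,
        "deltaparent": deltaparent,
    }
```

In the common linear case all four optional fields are empty, so each entry costs four extra size Bytes on top of the node itself.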


CategoryNewFeatures

BundleFormat2 (last edited 2018-02-10 00:05:58 by AviKelman)