Differences between revisions 5 and 6
Revision 5 as of 2009-10-24 22:31:56
Size: 7170
Editor: JuanDelgado
Comment:
Revision 6 as of 2009-10-24 22:51:56
Size: 7179
Editor: JuanDelgado
Comment:
Deletions are marked like this. Additions are marked like this.
Line 79: Line 79:
Uncompressed bundle files consist of the ''HG10UN'' header followed by a ''changegroup'', which, according to the WireProtocol page, is: Uncompressed bundle files consist of the optional ''HG10UN'' header followed by a ''changegroup'', which, according to the WireProtocol page, is:

Bundle format is the format in which changegroups are exchanged. It is used in the WireProtocol, as well as from the command line.

On the command line, bundle files are generated with the hg bundle command. They consist of a header, followed by a block of binary data, which may be compressed. The header is 6 bytes long and indicates the compression type:

  • HG10BZ - Compressed with the Python 'bz2' module

  • HG10GZ - Compressed with the Python 'zlib' module

  • HG10UN - Not compressed.

Decompressor in Python

The following is a Python program to convert an HG10BZ or HG10GZ file into an HG10UN file:

   1 #!/usr/bin/python
   2 
   3 # Program to decompress Mercurial bundles
   4 # This program contains extracts from the Mercurial
   5 # source, and is therefore subject to the GNU General
   6 # Public License.
   7 
   8 import bz2, zlib, sys
   9 
  10 def decompress(filename):
  11         infile = open(filename, "rb")
  12         outfile = open(str.join('',[filename,'.uncompressed']), "wb")
  13         outfile.write('HG10UN')
  14         for chunk in unbundle(infile):
  15                 outfile.write(chunk)
  16 
  17 def filechunkiter(f, size=65536, limit=None):
  18     """Create a generator that produces the data in the file size
  19     (default 65536) bytes at a time, up to optional limit (default is
  20     to read all data).  Chunks may be less than size bytes if the
  21     chunk is the last chunk in the file, or the file is a socket or
  22     some other type of file that sometimes reads less data than is
  23     requested."""
  24     assert size >= 0
  25     assert limit is None or limit >= 0
  26     while True:
  27         if limit is None: nbytes = size
  28         else: nbytes = min(limit, size)
  29         s = nbytes and f.read(nbytes)
  30         if not s: break
  31         if limit: limit -= len(s)
  32         yield s
  33 
  34 
  35 def unbundle(fh):
  36     header = fh.read(6)
  37     if header == 'HG10UN':
  38         return fh
  39     elif not header.startswith('HG'):
  40         # old client with uncompressed bundle
  41         def generator(f):
  42             yield header
  43             for chunk in f:
  44                 yield chunk
  45     elif header == 'HG10GZ':
  46         def generator(f):
  47             zd = zlib.decompressobj()
  48             for chunk in f:
  49                 yield zd.decompress(chunk)
  50     elif header == 'HG10BZ':
  51         def generator(f):
  52             zd = bz2.BZ2Decompressor()
  53             zd.decompress("BZ")
  54             for chunk in filechunkiter(f, 4096):
  55                 yield zd.decompress(chunk)
  56     return generator(fh)
  57 
  58 if len(sys.argv) != 2:
  59    print "Usage: expandbundle <file>"
  60    exit()
  61 
  62 decompress(sys.argv[1])

Breakdown of uncompressed bundle files

Uncompressed bundle files consist of the optional HG10UN header followed by a changegroup, which, according to the WireProtocol page, is:

  • A changelog group

  • A manifest group

  • A list of:
    • filename length
    • filename
    • file group

A group is a series of chunks, terminated by a null chunk. Non-null chunks have the following form:

4 bytes - big-endian

20 bytes - big-endian

20 bytes - big-endian

20 bytes - big-endian

20 bytes - big-endian

(len - 84) bytes

len

node

p1

p2

cs

Data

A null chunk is simply a 4-byte big-endian integer that is less than or equal to 4.

The Changelog ''group''

Each non-null chunk in the Changelog group contains in its Data section a patch against a text that is only stored in the form of other patches, beginning with a patch against the empty string (""). Each Changelog entry patches the result of all previous patches (the previous, or parent patch of a given patch p is the patch that has a node equal to p's p1 field), so that the final text contains the following newline-terminated fields:

  • A node-id, in hexadecimal notation, that corresponds to an entry in the Manifest group

  • A string describing the user who committed this entry.
  • A Unix time stamp, followed by a space, followed by an offset from UTC, in seconds.
  • The name of a file.
  • A description of the change that was made.

Or, in BNF:

<final-text> ::= <node-id> "\n" <user-id> "\n" <unix-timestamp> " " <utc-offset> "\n"
                 <filename> "\n" <description>
<node-id> ::= {<hexadecimal-character>}{40}
<user-id> ::= {<printable-character>}*
<unix-timestamp> ::= {<numeral>}*
<utc-offset> ::= {<numeral>}*
<filename> ::= {<os-accepted-filename-character>}*
<description> ::= {<printable-character>}*

The Data of the chunk contains a series of binary blocks that represent patches. These blocks have the following form:

4 bytes - big-endian

4 bytes - big-endian

4 bytes - big-endian

blocklen bytes

start

end

blocklen

textual data

Producing the Changelog entry's text from the patches describing it

In all Changelog entries, start and end represent the byte positions within the previous entry text version that are to be replaced with the textual data. Each start within the same chunk must be greater than or equal to the previous block's end. No end can be greater than the total length of the previous entry version. Given any non-merging Changelog entry and the final text of the previous entry, the next final text is given by the following pseudocode (++ is the append operator):

 patch(blocks, previous_text) =
    /* blocks[-1] is the block before the one currently being
     * processed, or a block containing all zeros
     * if the first block in the entry is being processed.
     */
    if blocks == [] then
       return previous_text[blocks[-1].end .. $]
    else
       return previous_text[blocks[-1].end .. blocks[0].start]
              ++ blocks[0].textual_data ++ patch(blocks[1..$], previous_text)
    end if
 end

For non-merging Changelog entries, every patch except the first is a patch against the result of applying previous patches. The relationship of the nth Changelog entry to its ancestors is shown by the following pseudocode:

 entries[0] := diff(first_final_text, "")

 if p2 == "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0" then
    entries[n] := diff(final_text,
                       patch(entries[n-1], patch(entries[n-2],
                                                 patch(..., patch(entries[0], "")))))
 else
    error "I don't know what the relationship is for merging patches"
 end

Calculating a Changelog chunk's node field

The chunk containing a set of Changelog blocks must have a node that is calculated according to the following pseudocode:

 sha1_digest(min(p1,p2) ++ max(p1,p2) ++ final_text)

p1 and p2 must be in the 20-byte binary representation used within the Bundle format, not the hexadecimal notation used within the final text for Manifest nodes. Numerous libraries exist that implement the sha1_digest function.

BundleFormat (last edited 2018-02-10 00:09:33 by AviKelman)