Differences between revisions 13 and 14
Revision 13 as of 2014-11-02 05:34:41
Size: 9324
Editor: DurhamGoode
Revision 14 as of 2018-02-10 00:08:30
Size: 9329
Editor: AviKelman
Comment: move v2 item to the top

Bundle format v2

The new bundle format design is described on the BundleFormat2 page.

Bundle format is the format in which changegroups are exchanged. It is used in the WireProtocol, as well as from the command line.

On the command line, bundle files are generated with the hg bundle command. They consist of a header, followed by a block of binary data, which may be compressed. The header is 6 bytes long and indicates the compression type:

  • HG10BZ - Compressed with the Python 'bz2' module

  • HG10GZ - Compressed with the Python 'zlib' module

  • HG10UN - Not compressed.
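As a minimal sketch of how the header can be inspected, the following Python reads the first 6 bytes of a file-like object and maps them to one of the three compression types listed above. The names BUNDLE_HEADERS and bundle_compression are hypothetical and do not come from Mercurial itself.

```python
import io

# The three 6-byte headers listed above, mapped to a short description.
BUNDLE_HEADERS = {
    b"HG10BZ": "bz2",
    b"HG10GZ": "zlib",
    b"HG10UN": "uncompressed",
}

def bundle_compression(fh):
    """Read the 6-byte header from fh and return the compression type.

    Returns None when the header is not one of the three listed above.
    (Hypothetical helper, for illustration only.)
    """
    header = fh.read(6)
    return BUNDLE_HEADERS.get(header)

# Usage with an in-memory stand-in for a bundle file.
print(bundle_compression(io.BytesIO(b"HG10GZ" + b"\x00" * 8)))
```

Returning None for an unrecognised header leaves it to the caller to decide whether that is an error or an old-style bundle with no header.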

Decompressor in Python

The following is a Python program to convert an HG10BZ or HG10GZ file into an HG10UN file:

    #!/usr/bin/env python3
    # Program to decompress Mercurial bundles
    # This program contains extracts from the Mercurial
    # source, and is therefore subject to the GNU General
    # Public License.
    import bz2
    import sys
    import zlib

    def decompress(filename):
        """Write an HG10UN copy of the bundle to filename + '.uncompressed'."""
        with open(filename, "rb") as infile, \
                open(filename + ".uncompressed", "wb") as outfile:
            outfile.write(b"HG10UN")
            for chunk in unbundle(infile):
                outfile.write(chunk)

    def filechunkiter(f, size=65536, limit=None):
        """Create a generator that produces the data in the file size
        (default 65536) bytes at a time, up to optional limit (default is
        to read all data).  Chunks may be less than size bytes if the
        chunk is the last chunk in the file, or the file is a socket or
        some other type of file that sometimes reads less data than is
        requested."""
        assert size >= 0
        assert limit is None or limit >= 0
        while True:
            if limit is None:
                nbytes = size
            else:
                nbytes = min(limit, size)
            s = nbytes and f.read(nbytes)
            if not s:
                break
            if limit:
                limit -= len(s)
            yield s

    def unbundle(fh):
        """Return an iterator over the uncompressed data that follows the header."""
        header = fh.read(6)
        if header == b"HG10UN":
            return filechunkiter(fh)
        elif not header.startswith(b"HG"):
            # old-style uncompressed bundle with no header: the 6 bytes
            # already read are actual data, so emit them before the rest
            def generator(f):
                yield header
                for chunk in filechunkiter(f):
                    yield chunk
        elif header == b"HG10GZ":
            def generator(f):
                zd = zlib.decompressobj()
                for chunk in filechunkiter(f):
                    yield zd.decompress(chunk)
                # emit anything the decompressor is still buffering
                yield zd.flush()
        elif header == b"HG10BZ":
            def generator(f):
                zd = bz2.BZ2Decompressor()
                # the "BZ" at the end of the header is also the start of
                # the bz2 stream, so feed it to the decompressor first
                zd.decompress(b"BZ")
                for chunk in filechunkiter(f, 4096):
                    yield zd.decompress(chunk)
        else:
            raise ValueError("unknown bundle header: %r" % header)
        return generator(fh)

    if len(sys.argv) != 2:
        print("Usage: expandbundle <file>")
        sys.exit(1)

    decompress(sys.argv[1])

Overview of v1 Bundles

  • To understand how the bundle format works, it is helpful to understand how changesets are committed to the repository. You might want to take a look at ChangeSet#Committing_a_new_changeset.

A bundle contains all the information necessary to add one or more changesets to a repository. A changeset includes:

  • a particular version of the ChangeLog

  • a particular version of the Manifest

  • a particular version of content for each tracked file

As such, a bundle contains the following for each changeset it includes:

  • changes made to the ChangeLog

  • changes made to the Manifest
  • changes made to a set of files

The bundle is divided into three corresponding sections, with a common structure called a Group used in each section (with a slight variation for the files section). Each Group is composed of one or more structures called Chunks, each of which is simply a 4-byte len field followed by data. The len field is interpreted as a big-endian integer and specifies the number of bytes in the entire Chunk, i.e., it includes its own 4 bytes.


| Field | Size                 |
| len   | 4 bytes - big-endian |
| data  | (len - 4) bytes      |



The group is terminated by a NullChunk, which is simply a Chunk whose len is no more than 4 and therefore has no data (but a NullChunk always has all 4 bytes for the len field).


| Chunk 0 | ... | Chunk N-1 | NullChunk |
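The Chunk and Group layout described above can be sketched in Python as follows. This is an illustration rather than Mercurial's own reader: the helper names read_chunk, read_group and make_chunk are hypothetical, and the sketch assumes the len field is unsigned, which the text does not state.

```python
import io
import struct

def read_chunk(fh):
    """Return the data of the next Chunk, or None for a NullChunk."""
    raw = fh.read(4)
    if len(raw) < 4:
        raise ValueError("truncated chunk length")
    # Assumption: len is read as an unsigned big-endian integer.
    (length,) = struct.unpack(">I", raw)
    if length <= 4:
        # len of 4 or less means no data: this is a NullChunk
        return None
    data = fh.read(length - 4)  # len counts its own 4 bytes
    if len(data) != length - 4:
        raise ValueError("truncated chunk data")
    return data

def read_group(fh):
    """Yield the data of each Chunk in a Group, stopping at the NullChunk."""
    while True:
        data = read_chunk(fh)
        if data is None:
            return
        yield data

def make_chunk(data):
    """Build a Chunk from data (hypothetical helper for the demo below)."""
    return struct.pack(">I", len(data) + 4) + data

# A Group with two Chunks followed by a NullChunk.
stream = io.BytesIO(make_chunk(b"abc") + make_chunk(b"de") + struct.pack(">I", 0))
print(list(read_group(stream)))  # [b'abc', b'de']
```

Because the NullChunk still occupies 4 bytes, the reader is positioned exactly at the start of the next section once read_group finishes.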

The changelog section and the manifest section of the bundle are both simply one group each. The final section is called the Filelist, and it is a sequence of two-tuples, one two-tuple for each file that was modified by any one of the changesets in the bundle. Each two-tuple in the Filelist contains the file's path, in the form of a Chunk, and a Group. The Filelist is terminated by a NullChunk (in place of the next filepath Chunk):


| FileEntry 0                         | ... | FileEntry F-1                       |
| filepath (Chunk) | filedata (Group) | ... | filepath (Chunk) | filedata (Group) |


Putting it all together, a bundle looks like this:


| Changelog (Group)         | Manifest (Group)          | Filelist                          |
| Chunk 0 | ... | Chunk C-1 | Chunk 0 | ... | Chunk M-1 | FileEntry 0 | ... | FileEntry F-1 |
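To illustrate the overall layout, here is a sketch that splits an uncompressed bundle payload (the bytes after the 6-byte header) into its three sections. The names read_chunk, read_group, read_bundle_payload and chunk are hypothetical, the len field is assumed to be unsigned, and the demo data is made-up filler rather than real changeset content.

```python
import io
import struct

def read_chunk(fh):
    """Return the next Chunk's data, or None for a NullChunk."""
    (length,) = struct.unpack(">I", fh.read(4))
    return fh.read(length - 4) if length > 4 else None

def read_group(fh):
    """Return the data of every Chunk in a Group as a list."""
    chunks = []
    while True:
        data = read_chunk(fh)
        if data is None:
            return chunks
        chunks.append(data)

def read_bundle_payload(fh):
    """Split a payload into (changelog, manifest, files)."""
    changelog = read_group(fh)
    manifest = read_group(fh)
    files = []
    while True:
        path = read_chunk(fh)
        if path is None:
            # a NullChunk in place of a filepath ends the Filelist
            break
        files.append((path, read_group(fh)))
    return changelog, manifest, files

def chunk(data):
    """Build a Chunk from data (demo helper)."""
    return struct.pack(">I", len(data) + 4) + data

NULL = struct.pack(">I", 0)
payload = (chunk(b"c0") + NULL              # changelog Group
           + chunk(b"m0") + NULL            # manifest Group
           + chunk(b"a.txt") + chunk(b"f0") + NULL  # one FileEntry
           + NULL)                          # end of Filelist
print(read_bundle_payload(io.BytesIO(payload)))
```

The only difference between the Filelist and an ordinary Group is that each filepath Chunk is followed by a complete Group before the next filepath is read.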


Inside the Groups

Inside each Group (1 Group for the changelog, 1 Group for the manifest, 1 Group for each of the modified files), there is 1 Chunk for each changeset which is included in the Bundle. These Chunks are a special species of Chunk called a RevChunk, which has the Chunk data further divided into the following fields:

| Field               | Size                  |
| len                 | 4 bytes - big-endian  |
| node                | 20 bytes - big-endian |
| p1 (parent 1)       | 20 bytes - big-endian |
| p2 (parent 2)       | 20 bytes - big-endian |
| cs (changeset link) | 20 bytes - big-endian |
| revdata             | (len - 84) bytes      |


The len field is the same as for all other Chunks: it specifies the total number of bytes in the chunk. The next four fields are each 20-byte nodeids, stored in big-endian binary form (as opposed to the ASCII hexadecimal form commonly seen by the user). Each RevChunk contains the data needed to create a new entry in the corresponding revlog (revlog for the changelog, manifest, or tracked file), and the required nodeids are stored in the four fields. The node field is the identifier for the new entry that the RevChunk creates, while p1 and p2 are the nodeids for the new entry's parents. The cs field is the ChangeSetId for the changeset that this RevChunk belongs to.
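As a rough sketch, the four nodeid fields can be separated from the rest of a RevChunk with Python's struct module. The RevChunk namedtuple and parse_revchunk function are hypothetical names, and the input is assumed to be the Chunk data with the 4-byte len already removed, so the remainder after the nodeids is (len - 84) bytes long.

```python
import struct
from collections import namedtuple

# Field names follow the description above; the type itself is illustrative.
RevChunk = namedtuple("RevChunk", "node p1 p2 cs revdata")

def parse_revchunk(data):
    """Split Chunk data (without the len field) into RevChunk fields."""
    if len(data) < 80:
        raise ValueError("RevChunk data must hold four 20-byte nodeids")
    node, p1, p2, cs = struct.unpack("20s20s20s20s", data[:80])
    # whatever follows the four nodeids is the revdata field
    return RevChunk(node, p1, p2, cs, data[80:])

# Usage with made-up nodeids.
data = b"\x11" * 20 + b"\x22" * 20 + b"\x33" * 20 + b"\x44" * 20 + b"payload"
rev = parse_revchunk(data)
print(rev.node.hex())  # the ASCII hexadecimal form usually shown to users
print(rev.revdata)
```

bytes.hex() converts a binary nodeid into the 40-character hexadecimal form.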

Lastly, the revdata field contains a sequence of structures called RevDiffs, each with the following format:

| Field    | Size                 |
| start    | 4 bytes - big-endian |
| end      | 4 bytes - big-endian |
| blocklen | 4 bytes - big-endian |
| textdata | blocklen bytes       |





Note that the sequence of RevDiffs does not need to be terminated, because the total length of the revdata is known (len - 84).
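A minimal sketch of walking the revdata field until its bytes are exhausted might look like this. The function name parse_revdiffs is hypothetical, and the three 4-byte fields are assumed to be unsigned, which the text does not specify.

```python
import struct

def parse_revdiffs(revdata):
    """Return a list of (start, end, textdata) tuples from a revdata field."""
    diffs = []
    pos = 0
    # No terminator is needed: stop when all of revdata has been consumed.
    while pos < len(revdata):
        if pos + 12 > len(revdata):
            raise ValueError("truncated RevDiff header")
        # Assumption: start, end and blocklen are unsigned big-endian.
        start, end, blocklen = struct.unpack(">III", revdata[pos:pos + 12])
        pos += 12
        textdata = revdata[pos:pos + blocklen]
        if len(textdata) != blocklen:
            raise ValueError("truncated RevDiff textdata")
        pos += blocklen
        diffs.append((start, end, textdata))
    return diffs

# Two RevDiffs: one carrying 5 bytes of text, one carrying none.
revdata = (struct.pack(">III", 0, 0, 5) + b"hello"
           + struct.pack(">III", 3, 7, 0))
print(parse_revdiffs(revdata))  # [(0, 0, b'hello'), (3, 7, b'')]
```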

Each RevDiff item specifies a simple single-hunk patch, in the form of a replacement. The RevDiff item says to replace the bytes from offset start up to (but not including) offset end with the specified textdata. To delete text, the textdata would be empty (blocklen would be 0), and to insert text, the start and end would be the same.

For instance, the following Python code applies a patch specified by a RevDiff:

    #!/usr/bin/python
    def applypatch(original, start, end, textdata):
        # keep everything before start and everything from end onwards,
        # and put textdata in between
        pre = original[:start]
        post = original[end:]
        return pre + textdata + post

Taken all together, the sequence of RevDiffs in a RevChunk's revdata field indicates how to transform the parent version into the new version. For instance, for a RevChunk in the Changelog, it specifies how to change the parent changelog into the new changelog. For a RevChunk in one of the FileEntries, it specifies how to change the parent version of the file into the new version of the file.

Note that the RevDiffs in a given RevChunk must be in order from the beginning of the file to the end of the file, and the start and end offsets always refer to offsets in the original file, not the results of applying the previous RevDiffs in the sequence. This allows individual RevDiffs to be applied selectively, without applying any others. However, it means that in order to apply more than one RevDiff, they must be applied in reverse order: patches that have a greater start offset must be applied first so that they don't change the offsets for other patches.
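The reverse-order rule can be illustrated with a short sketch. The function name apply_revdiffs is hypothetical, and the input is assumed to be a list of (start, end, textdata) tuples already sorted from the beginning of the file to the end, as described above.

```python
def apply_revdiffs(original, revdiffs):
    """Apply (start, end, textdata) replacements whose offsets all refer
    to the original bytes.

    revdiffs is assumed to be ordered by increasing start offset, so
    walking it backwards never disturbs offsets that are still to be used.
    """
    result = original
    for start, end, textdata in reversed(revdiffs):
        result = result[:start] + textdata + result[end:]
    return result

original = b"hello world"
revdiffs = [
    (0, 5, b"HELLO"),   # replace "hello"
    (5, 6, b""),        # delete the space
    (11, 11, b"!"),     # insert at the end
]
print(apply_revdiffs(original, revdiffs))  # b'HELLOworld!'
```

Applying the same list front to back would give a different result here, because the deletion at offsets 5 to 6 shifts everything after it by one byte.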

For the first revision of a file (or of the changelog or manifest), the RevDiffs are applied against an empty string.

BundleFormat (last edited 2018-02-10 00:09:33 by AviKelman)