[RFC] kbfiles: an extension to track binary files with less wasted bandwidth

Andrew Pritchard andrewp at fogcreek.com
Tue Jul 26 13:23:50 CDT 2011


The goal of kbfiles is to maintain the benefit of version tracking for binary
files without requiring clones and pulls to download versions of large,
incompressible files that will likely never be needed.  These files are
replaced, according to the user's configuration, with small standin files
containing only the SHA1 sum of the binary file.  Mercurial then tracks these
standin files, keeping history small, while the binary files are retrieved
only as needed (when updating, for example).

The reasoning behind this is that binary files are frequently large and already
compressed as part of their format, and as such, compressed diffs don't work
very well to track their changes.  Many types of software development (game
development being a particularly strong example) involve large volumes of
binary assets.  Without an extension like kbfiles, a clone of such a repository
can end up being a single many-gigabyte transaction, whereas kbfiles splits
this into smaller transactions and avoids transferring most of the data
altogether.  kbfiles also avoids diffing the binary files, transferring
them as they are in any given revision.  Finally, the size of data stored
locally is greatly decreased for common use cases, in which old versions
of binary assets are not often needed.

The typical use case is to have these binary files available on a central
server, though retrieving bfiles from both SSH and HTTP Mercurial repositories
is supported in the wire protocol.  There are three locations that will be
checked to find the required bfiles:
- The repository-local cache, in .hg/kilnbfiles (this directory name will
  change if the extension is renamed);
- The configurable system cache, defaulting to $HOME/.kilnbfiles on POSIX-y
  systems and AppData\Local\kilnbfiles on Windows; and
- The default or default-push remote paths in .hg/hgrc.

The system cache may be on network storage, so that an entire network of
developers may share their files over NFS or SMB.
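
As a rough illustration of the lookup described above, here is a minimal
Python sketch.  It is not the extension's actual code: the function name
find_bfile, the fetch_remote callback, the idea that a cached bfile is stored
under a file name equal to its SHA1 hex digest, and the order in which the
locations are tried are all assumptions made for the example.

```python
import os


def find_bfile(sha1, repo_root, system_cache, fetch_remote=None):
    """Return a path to the bfile with the given SHA1, or None if not found.

    Assumption: each cache stores a bfile under its SHA1 hex digest.
    """
    candidates = [
        # Repository-local cache.
        os.path.join(repo_root, ".hg", "kilnbfiles", sha1),
        # System cache (possibly on NFS or SMB storage).
        os.path.join(system_cache, sha1),
    ]
    for path in candidates:
        if os.path.isfile(path):
            return path
    # Hypothetical hook standing in for a request to the default or
    # default-push remote; returns whatever the callback returns.
    if fetch_remote is not None:
        return fetch_remote(sha1)
    return None
```

Trying the local caches first means a network round trip is only needed when
neither cache already holds the requested version.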

When a file is committed as a bfile, it is copied to the repository-local cache
and to the system cache, and its standin is written in .kbf/.  When pushing
changes to bfiles to a remote repository, any changed bfiles are uploaded with
the changesets.  When pulling, though, only the changesets are transferred,
greatly reducing clone sizes for repositories containing heavily-edited binary
files.  Then, when updating to a revision with changes to bfiles, the required
versions of the files are retrieved from either the system cache or the remote
repository.
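
The commit-time steps above can be sketched roughly as follows.  This is only
an illustration under stated assumptions, not the extension's implementation:
the names sha1_of_file and commit_bfile are hypothetical, the standin is
assumed to mirror the tracked file's relative path under .kbf/ and to hold the
hex digest followed by a newline, and cached bfiles are assumed to be named by
their SHA1.

```python
import hashlib
import os
import shutil


def sha1_of_file(path, chunk_size=1 << 16):
    """Hash a file in chunks so large binaries are never fully in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def commit_bfile(repo_root, relpath, system_cache):
    """Copy a bfile into both caches and write its standin under .kbf/."""
    src = os.path.join(repo_root, relpath)
    sha1 = sha1_of_file(src)

    # Assumption: caches store each bfile under its SHA1 hex digest.
    local_cache = os.path.join(repo_root, ".hg", "kilnbfiles")
    for cache in (local_cache, system_cache):
        os.makedirs(cache, exist_ok=True)
        dest = os.path.join(cache, sha1)
        if not os.path.exists(dest):
            shutil.copyfile(src, dest)

    # Assumption: the standin mirrors the tracked path under .kbf/.
    standin = os.path.join(repo_root, ".kbf", relpath)
    os.makedirs(os.path.dirname(standin), exist_ok=True)
    with open(standin, "w") as f:
        f.write(sha1 + "\n")
    return sha1
```

Only the small standin would then be tracked by Mercurial; the binary itself
lives in the caches until a push uploads it.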

kbfiles defends its repositories against damage from non-kbfiles clients in
several ways.  It will:
- add a 'kbfiles' line to .hg/requires in order to keep non-kbfiles clients
  from breaking things;
- add a 'bfilestore' server capability, without which the client will not
  attempt to interact with a remote repository when the local repository uses
  kbfiles; and
- prepend 'kbfiles\n' to the output of the heads command when serving kbfiles
  repositories to prevent non-kbfiles clients from creating broken clones.
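
To make the first of these concrete, the following Python sketch shows the
on-disk effect of adding a requirement, given that .hg/requires lists one
requirement per line.  The helper name add_requirement is hypothetical, and
the extension itself presumably goes through Mercurial's own internals rather
than editing the file directly like this.

```python
import os


def add_requirement(repo_root, name="kbfiles"):
    """Ensure `name` is listed in .hg/requires; return True if it was added."""
    path = os.path.join(repo_root, ".hg", "requires")
    existing = []
    if os.path.exists(path):
        with open(path) as f:
            existing = f.read().splitlines()
    if name in existing:
        return False
    # Rewrite the whole file so a missing trailing newline in the
    # original cannot merge two requirements onto one line.
    with open(path, "w") as f:
        f.write("\n".join(existing + [name]) + "\n")
    return True
```

A client that does not recognize an entry in this file refuses to operate on
the repository, which is what keeps non-kbfiles clients from breaking things.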

The last of these is fairly likely to be controversial, but it currently seems
to be necessary.  Although the HG19 bundle format as described on the wiki
would appear to solve the problem with its feature strings, it does not
appear to be implemented yet.  If and when it is, kbfiles will replace the
heads command hack with a 'kbfiles' bundle feature.  Unfortunately, the result
of the current hack is that non-kbfiles clients throw an exception with no
mention of kbfiles, but we could not find a way to make such clients display a
useful error message while consistently preventing them from uploading
changesets without the corresponding bfiles or creating clones that are missing
files.

As it stands, there are no known cases that cause missing bfiles as long as
either the client or the server has the current version of kbfiles, or either
repo has been touched by the current version of kbfiles.

The extension wraps most operations on repositories to handle bfiles specially;
this can be seen in bfsetup.py.  It also explicitly handles cooperation with
several other extensions, including fetch, purge, and rebase.

Bfile transfer is implemented via three additions to the wire protocol on
servers with the extension loaded:
- statbfile, which returns 0, 1, or 2 depending on whether the requested bfile
  (as identified by the SHA1 sum) is present and valid, invalid, or missing;
- getbfile, which returns the requested bfile along with its length to allow
  the ssh protocol to avoid reading beyond its end (without modifying Mercurial
  core code that attempts to encode passed-in file-like objects as bundles); and
- putbfile, which hashes and verifies the received data and places it in the
  repository-local and system caches.
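
The server-side logic behind statbfile and putbfile might look roughly like
the sketch below.  It only illustrates the status codes and the hash check
described above; the function names, the constant names, and the assumption
that each cache stores a bfile under its SHA1 hex digest are hypothetical, and
the real commands are registered through Mercurial's wire protocol rather than
called directly.

```python
import hashlib
import os

# Status codes as described for statbfile.
BFILE_OK, BFILE_INVALID, BFILE_MISSING = 0, 1, 2


def stat_bfile(cache_dir, sha1):
    """Return 0 if the bfile is present and valid, 1 if invalid, 2 if missing."""
    path = os.path.join(cache_dir, sha1)
    if not os.path.isfile(path):
        return BFILE_MISSING
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            digest.update(chunk)
    return BFILE_OK if digest.hexdigest() == sha1 else BFILE_INVALID


def put_bfile(cache_dirs, sha1, data):
    """Verify received bytes against the claimed SHA1, then store them.

    Returns False, storing nothing, when the hash does not match.
    """
    if hashlib.sha1(data).hexdigest() != sha1:
        return False
    for cache_dir in cache_dirs:
        os.makedirs(cache_dir, exist_ok=True)
        with open(os.path.join(cache_dir, sha1), "wb") as f:
            f.write(data)
    return True
```

Verifying before storing means a corrupted upload never lands in a cache
under a name that claims it is valid.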

The extension also currently supports talking to previous versions of Kiln that
still serve bfiles over a different interface, via POST and GET requests to
$REPO/bfile/$SHA.  Although we would prefer to keep this in the extension, we
are able and willing to pull it out into its own meta-extension if necessary.

We are still in the process of cleaning up the code to ship with Mercurial, but
the current status can be seen at
http://developers.kilnhg.com/Repo/Kiln/Group/Unstable/Files.  Before the 'real'
pull request, we will collapse it into a single patch in the hgext directory.
Planned changes before then include removing compatibility shims for old
versions of Mercurial and some minor rebranding to remove mentions of 'Kiln'
from the code and repository layout.

We would prefer to avoid renaming the extension if possible, both to avoid
adding extra code to handle both old repositories and new ones and to reflect
the heritage of the extension, but we understand that parts of the Mercurial
community may be opposed to the name 'kbfiles', and as such we are willing to
rename it to 'terafiles' if the name would otherwise block the extension from
shipping with Mercurial.

