[RFC] kbfiles: an extension to track binary files with less wasted bandwidth

Greg Ward greg-hg at gerg.ca
Thu Aug 4 09:13:53 CDT 2011


Finally catching up on this thread -- sorry for the delay.

On Tue, Jul 26, 2011 at 2:23 PM, Andrew Pritchard <andrewp at fogcreek.com> wrote:
> The goal of kbfiles is to maintain the benefit of version tracking for binary
> files without requiring clones and pulls to download versions of large,
> incompressible files that will likely never be needed.  These files are
> replaced, according to the user's configuration, with small standin files
> containing only the SHA1 sum of the binary file.  Mercurial then tracks these
> standin files, keeping history small, while the binary files are retrieved
> only as needed (when updating, for example).

Gee, this sounds familiar. Did I write that? No, that's actually a
good paraphrase of my words and ideas.
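For anyone reading along who hasn't used bfiles: the standin scheme described above really is that simple. A standin is a tiny tracked text file holding the SHA-1 of the real content. Roughly (function name and layout are my own illustration, not the extension's actual code):

```python
import hashlib
import os

def write_standin(repo_root, path, data):
    """Hash the big file's content and record only the hash under .kbf/.

    The real bytes go to the cache/store; Mercurial tracks only this
    small standin, so history stays small.
    """
    sha = hashlib.sha1(data).hexdigest()
    standin = os.path.join(repo_root, '.kbf', path)
    os.makedirs(os.path.dirname(standin), exist_ok=True)
    with open(standin, 'w') as f:
        f.write(sha + '\n')
    return sha
```

On update, the process runs in reverse: read the hash out of the standin and fetch the matching content from a cache or the server.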

> The reasoning behind this is that binary files are frequently large and already
> compressed as part of their format, and as such, compressed diffs don't work
> very well to track their changes.

That also sounds familiar, but again it's a good paraphrase.

> When a file is committed as a bfile, it is copied to the repository-local cache
> and to the system cache, and its standin is written in .kbf/.  When pushing
> changes to bfiles to a remote repository, any changed bfiles are uploaded with
> the changesets.  When pulling, though, only the changesets are transferred,
> greatly reducing clone sizes for repositories containing heavily-edited binary
> files.  Then, when updating to a revision with changes to bfiles, the required
> versions of the files are retrieved from either the system cache or the remote
> repository.
>
> kbfiles has several mechanisms for defending its repositories against damage
> from non-kbfiles clients:
> - add a 'kbfiles' line to .hg/requires in order to keep non-kbfiles clients
>  from breaking things;
> - add a 'bfilestore' server capability, without which the client will not
>  attempt to interact with a remote repository when the local repository uses
>  kbfiles; and
> - prepend 'kbfiles\n' to the output of the heads command when serving kbfiles
>  repositories to prevent non-kbfiles clients from creating broken clones.

Good stuff. These are things I never addressed in bfiles, and they
needed addressing. I'm glad you've taken care of them.
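The first of those mechanisms is worth spelling out, since it's the one that protects repositories at rest. The check amounts to something like this (my own sketch of the idea, not code from the extension):

```python
import os

def check_requires(repo_root, supported=('revlogv1', 'store', 'kbfiles')):
    """Refuse to open a repository that needs features we lack.

    A non-kbfiles client reads .hg/requires, sees the unknown 'kbfiles'
    entry, and aborts -- instead of silently mangling the standins.
    """
    path = os.path.join(repo_root, '.hg', 'requires')
    with open(path) as f:
        required = {line.strip() for line in f if line.strip()}
    missing = required - set(supported)
    if missing:
        raise RuntimeError('repository requires features unknown to this '
                           'client: %s' % ', '.join(sorted(missing)))
```

The server-side tricks (the capability and the mangled heads output) apply the same principle to the wire: make the old client fail loudly rather than produce a broken clone.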

> Bfile transfer is implemented via three additions to the wire protocol on
> servers with the extension loaded:
> - statbfile, which returns 0, 1, or 2 depending on whether the requested bfile
>  (as identified by the SHA1 sum) is present and valid, invalid, or missing;
> - getbfile, which returns the requested bfile along with its length to allow
>  the ssh protocol to avoid reading beyond its end (without modifying Mercurial
>  core code that attempts to encode a passed-in file-like object as bundles); and
> - putbfile, which hashes and verifies the received data and places it in the
>  repository-local and system caches.

This also sounds better than bfiles -- I never touched Mercurial's
wire protocol.
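To make the statbfile semantics concrete -- this is just my reading of the description above, with constants and names invented for illustration -- the server-side check would look something like:

```python
import hashlib
import os

# Return codes as described above; the names are mine.
STAT_OK, STAT_INVALID, STAT_MISSING = 0, 1, 2

def statbfile(store_dir, sha):
    """Report whether the store holds a valid copy of the file named by sha.

    Files in the store are named by the SHA-1 of their content, so
    validity is just 'does the content hash back to the name?'.
    """
    path = os.path.join(store_dir, sha)
    if not os.path.exists(path):
        return STAT_MISSING
    with open(path, 'rb') as f:
        actual = hashlib.sha1(f.read()).hexdigest()
    return STAT_OK if actual == sha else STAT_INVALID
```

A client can then call statbfile before putbfile to skip uploads the server already has, which is presumably where the bandwidth savings on push come from.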

> The extension also currently supports talking to previous versions of Kiln that
> still serve bfiles over a different interface, via POST and GET requests to
> $REPO/bfile/$SHA.  Although we would prefer to keep this in the extension, we
> are able and willing to pull it out into its own meta-extension if necessary.

I think Matt is right: now is the time to jettison
backwards-compatibility legacy code, even if it makes life harder for
people like me (using bfiles, looking very carefully at
kbfiles/largefiles to see if switching is a win).

> We are still in the process of cleaning up the code to ship with Mercurial, but
> the current status can be seen at
> http://developers.kilnhg.com/Repo/Kiln/Group/Unstable/Files.  Before the 'real'
> pull request, we will collapse it into a single patch in the hgext directory.

That's a *terrible* idea! You should preserve history!

Actually, your existing kbfiles repository already discards history.
Fog Creek has conveniently collapsed all of *my* work, plus an unknown
amount of forking and hacking, into a single large revision 0. That's
just wrong. It's morally wrong because it deprives the original author
(me) of public credit for his work. And it's technically wrong because
it makes it much harder to trace a given line of code back in history.

So that gets me to my first gripe with kbfiles/largefiles, which is
that you (Fog Creek) have almost completely erased the record of my
contribution. It's one thing to fork a project for your own needs, but
it's something else entirely to erase the origins of that code from
the historical record. I have no objection to the fork. I am a bit
unhappy that you have not tried very much to contribute changes back
upstream (i.e. to me). But I am most unhappy that you have nearly
erased me from the history. That's not cool.

Luckily, it's fixable: start with a clone of bfiles, possibly
truncated at your fork point, alongside your private internal
repository. Apply patches from your internal repo to the bfiles clone.
Then apply patches from the public kbfiles repo. End result: a
legitimate repository that captures the true history of the project,
without erasing anyone's contribution. Final step: rename things into
hgext/largefiles so the whole thing can be pulled into Mercurial.

Finally, I have two *technical* objections: the use of dirstate and
the use of standin files. I know, it's pretty rich for *me* to
criticise kbfiles/largefiles for using my design. But I'm in a pretty
good position to know where I got things wrong.

First, the use of a dedicated dirstate for big files was dumb and lazy
on my part. Big files have a different life-cycle from regular files,
and trying to shoehorn them into a separate dirstate instance just
doesn't work very well. I think the right thing to do is 1) draw a
diagram of the complete life-cycle of big files, 2) implement a custom
data structure (similar in idea to dirstate) that tracks that
life-cycle, and 3) ditch the current hodge-podge of state-tracking
mechanisms. I
haven't got very far on this, since bfiles is now a
weekends-and-evenings (and occasional quiet days at work) project for
me. Anyways, this is fixable by just dedicating some programmer time
to the problem.
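To sketch what I mean by a dedicated life-cycle tracker (the states, transitions, and names here are invented for illustration -- this is a toy, not code from bfiles or kbfiles):

```python
# Possible states a big file can be in, per my (incomplete) mental diagram.
STATES = ('untracked', 'added', 'clean', 'modified', 'removed')

# Which transitions each user action is allowed to make: the life-cycle
# is explicit, instead of being inferred from a second dirstate.
TRANSITIONS = {
    'add':    {'untracked': 'added'},
    'commit': {'added': 'clean', 'modified': 'clean'},
    'edit':   {'clean': 'modified'},
    'remove': {'clean': 'removed', 'modified': 'removed'},
}

class BigFileTracker:
    """Track each big file's life-cycle with an explicit state machine."""

    def __init__(self):
        self._state = {}  # path -> current state

    def apply(self, action, path):
        current = self._state.get(path, 'untracked')
        try:
            self._state[path] = TRANSITIONS[action][current]
        except KeyError:
            raise ValueError('cannot %s %r while %s' % (action, path, current))

    def state(self, path):
        return self._state.get(path, 'untracked')
```

The point is that illegal transitions fail immediately and visibly, rather than leaving the two dirstates quietly disagreeing with each other.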

Second, I'm not convinced that the fundamental design of bfiles -- the
use of standin files -- is appropriate. It complicates things a lot,
and the main reason I chose it was to allow partial bfupdate -- i.e.
don't make me fetch all of the big files in my working directory; I
just want to fetch some of them. I still think that's a nice feature,
but I wonder if it's worth the complication.

The approach taken by 'snap', where the big file hashes are stored
right in file revlogs, sounds interesting. I peeked at the code for
snap once, and was put off by the sheer volume of code. But it's an
interesting idea.

Alas, I'm not sure this is fixable. There are people out there in the
real world using bfiles and/or kbfiles/largefiles, and changing the
fundamental design would break all of their repositories and bfile
stores. ;-(

Oh yeah, for the record, I like the name 'largefiles'.

Greg

