RFC: version (big) file snapshots with storage outside a Mercurial repo with snap

Greg Ward greg-hg at gerg.ca
Wed Aug 25 08:40:59 CDT 2010


On Thu, Aug 19, 2010 at 6:06 AM, Dirkjan Ochtman <dirkjan at ochtman.nl> wrote:
> Okay, but it seems you did not publish your concerns about bfiles'
> approach before embarking on coding your own, correct?
>
> You may dismiss Adrian's concerns about code size as FUD, but it would
> seriously have helped your case if there was a public conversation
> between you and Greg and Andrei about design for these features. Now,
> we just have your big wad of code (written by just you in five days)
> and their big wad of code (written by a bunch of people over the
> course of a few months, and in actual use at several companies IIUC).
>
> I'd hoped Greg Ward would comment here about his thoughts on your
> extension, but he hasn't done so yet.

I was on vacation last week.  I'm catching up on this thread with
great interest!

Naturally, my initial reaction was: "hey, you're reinventing bfiles!".
Then I re-read Klaus' post more carefully and saw that he's taking a
very different approach that sounds interesting and promising.  So I'm
not outraged or offended or inclined to get into a squabble over
"territory".  I'll leave that to my friends in academia.  ;-)  (No
really!  Scientists actually get peeved at other scientists for doing
good science that *they* wanted to do!)

Rather than comment on Klaus' extension, since I have not looked at
the code, I will bare my soul on the first eight months with bfiles in
production.  In a nutshell: it could be a lot better, but I don't
think it's doomed.  There are two obvious design flaws in bfiles:
  * representation of working dir state is a mess, because I reused dirstate
    and it doesn't really fit
  * having a tree .hgbfiles/ breaks rename badly

The first one I think I can fix: my idea is to invent a dirstate-like
data structure that represents everything bfiles needs to know about
the state of big files in the current dir.
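
To make the idea concrete, here is a minimal sketch in Python of what a
dirstate-like structure for big files might look like.  It is an
illustration only, not bfiles code: the names BigFileEntry, BigFileState
and hash_file are hypothetical, and the choice of SHA-1 plus size and
mtime as the recorded state is an assumption.

```python
import hashlib
import os
from dataclasses import dataclass


@dataclass
class BigFileEntry:
    """Hypothetical per-file record; not the actual bfiles data structure."""
    sha1: str     # content hash recorded when the file was last known clean
    size: int     # size in bytes at that time
    mtime: float  # modification time at that time


def hash_file(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class BigFileState:
    """Maps repo-relative paths to the last known state of each big file."""

    def __init__(self):
        self.entries = {}

    def record(self, root, relpath):
        """Remember the current on-disk state of a big file as clean."""
        full = os.path.join(root, relpath)
        st = os.stat(full)
        self.entries[relpath] = BigFileEntry(
            hash_file(full), st.st_size, st.st_mtime)

    def status(self, root, relpath):
        """Classify a path as 'untracked', 'missing', 'clean' or 'modified'."""
        entry = self.entries.get(relpath)
        if entry is None:
            return "untracked"
        full = os.path.join(root, relpath)
        if not os.path.exists(full):
            return "missing"
        st = os.stat(full)
        # Cheap check first: unchanged size and mtime are taken to mean clean.
        if st.st_size == entry.size and st.st_mtime == entry.mtime:
            return "clean"
        # Otherwise fall back to hashing the contents.
        if hash_file(full) == entry.sha1:
            return "clean"
        return "modified"
```

The size-and-mtime shortcut matters for big files in particular, since
hashing them on every status check would be slow; the sketch only hashes
when the cheap comparison fails.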

I'm not sure what to do about rename.  Rename + bfiles sucks very
badly right now.

As for automatic integration of big files, it's mostly done.  You can
configure it so "hg update", "hg status", "hg commit", and "hg push"
all do the right thing with respect to big files.  That's not the
default behaviour because I want the ability to dig in and see how
things really work for myself.  If a great tide of users clamours for
this to change, it can be changed.

The only missing piece of automatic integration is "hg add"; you still
need to explicitly "hg bfadd" new big files.  Fixing that is a matter
of coming up with acceptable criteria (file size? file name? file
contents?), a syntax for specifying those criteria, and implementing
it.  Not trivial but not insurmountable.
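
As a rough illustration of the three candidate criteria (size, name,
contents), here is a minimal Python sketch of a predicate that "hg add"
could consult.  Everything in it is an assumption: the function names,
the 10 MB threshold, the example patterns, and the NUL-byte test for
binary contents are hypothetical and not part of bfiles or Mercurial.

```python
import fnmatch
import os


def looks_binary(path, sample_size=8192):
    """Treat a file as binary if its first bytes contain a NUL byte."""
    with open(path, "rb") as fp:
        return b"\0" in fp.read(sample_size)


def is_big_file(path, min_size=10 * 1024 * 1024,
                patterns=("*.iso", "*.zip"), check_contents=False):
    """Decide whether a file should be added as a big file.

    A file qualifies if its name matches one of the patterns, if it is
    at least min_size bytes, or (optionally) if it looks binary.
    """
    name = os.path.basename(path)
    # Criterion 1: file name.
    if any(fnmatch.fnmatch(name, pattern) for pattern in patterns):
        return True
    # Criterion 2: file size.
    if os.path.getsize(path) >= min_size:
        return True
    # Criterion 3: file contents (off by default).
    return check_contents and looks_binary(path)
```

For example, is_big_file("disk.iso") is true on the name alone, while a
small text file is rejected unless one of the thresholds is lowered.  The
harder part, a syntax for users to specify these criteria, is left open
here.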

Anyways, the great big question mark hanging over bfiles is whether
the .hgbfiles/ directory tree is the right way to track big file
metadata.  It *works*, but it isn't perfect.  And I have a
110,000-changeset repo with 9 years of history in it, and files in
.hgbfiles/ going back to 2002 (our conversion from CVS automatically
added big files on the fly).  Replacing .hgbfiles/ with something else
would be a big pill for me to swallow; it would have to be a really
stunningly superior solution.

The other big question is "how many extensions does it take to handle
big files?"  Here's the current lay of the land:

  * bigfiles: small and simple, but underdocumented and untested (last
    time I looked), and therefore I could not get it to work
  * bfiles: moderately large and quite complex; decent docs and good
    tests; mostly works; similar basic design to bigfiles
    - Chad Dombrova has been working on major changes to bfiles to try
      to address various shortcomings; in the worst case, he might have
      to fork an incompatible version, but we have been communicating
      and both of us very much want to avoid a fork
  * snap: large (according to Adrian); radically different design

As has been pointed out, I wrote bfiles because bigfiles did not work
for me, despite the attempt by its author and me to get it to work.
They have a similar basic design, but bfiles has a lot more features.
In particular, it has wire protocols for getting big files back and
forth to a central store; as I understand it, bigfiles leaves that up
to the user.

My bottom line on snap: let a thousand flowers bloom.  I don't know if
putting metadata explicitly in the filelog is better than doing it
implicitly, but it sounds promising.

Greg

