RFC: version (big) file snapshots with storage outside a Mercurial repo with snap

Martin Geisler mg at aragost.com
Mon Aug 30 06:58:24 CDT 2010


Greg Ward <greg-hg at gerg.ca> writes:

> On Thu, Aug 19, 2010 at 6:06 AM, Dirkjan Ochtman <dirkjan at ochtman.nl> wrote:
>>
>> I'd hoped Greg Ward would comment here about his thoughts on your
>> extension, but he hasn't done so yet.
>
> I was on vacation last week. I'm catching up on this thread with great
> interest!

Ah, that's good to hear!

> Naturally, my initial reaction was: "hey, you're reinventing bfiles!".
> Then I re-read Klaus' post more carefully and saw that he's taking a
> very different approach that sounds interesting and promising. So I'm
> not outraged or offended or inclined to get into a squabble over
> "territory". I'll leave that to my friends in academia. ;-) (No
> really! Scientists actually get peeved at other scientists for doing
> good science that *they* wanted to do!)
>
> Rather than comment on Klaus' extension, since I have not looked at
> the code, I will bare my soul on the first eight months with bfiles in
> production.  In a nutshell: it could be a lot better, but I don't
> think it's doomed.  There are two obvious design flaws in bfiles:
>   * representation of working dir state is a mess, because I reused dirstate
>     and it doesn't really fit
>   * having a tree .hgbfiles/ breaks rename badly
>
> The first one I think I can fix: my idea is to invent a dirstate-like
> data structure that represents everything bfiles needs to know about
> the state of big files in the current dir.
>
> I'm not sure what to do about rename. Rename + bfiles sucks very badly
> right now.

Okay -- this was also one of the things Klaus highlighted when I asked
him why he had reinvented bfiles :) Using "in-place" files instead of
stand-in files in the .hgbfiles directory seems to simplify rename
handling since you can just let Mercurial do what it normally does,
i.e., you don't have to handle renames yourself.

> As for automatic integration of big files, it's mostly done. You can
> configure it so "hg update", "hg status", "hg commit", and "hg push"
> all do the right thing with respect to big files.

Ah, that's cool. I think tight integration is the best way to get
people to use this, or rather, to avoid wasting the users' time by
making them think about version control unless really necessary.

> The only missing piece of automatic integration is "hg add"; you still
> need to explicitly "hg bfadd" new big files.  Fixing that is a matter
> of coming up with acceptable criteria (file size? file name? file
> contents?), a syntax for specifying those criteria, and implementing
> it.  Not trivial but not insurmountable.

Right, that shouldn't be hard.
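
As a rough illustration of the criteria Greg lists, here is a minimal
sketch of a predicate based on file size and file name. Everything in it
is hypothetical: the function name, the threshold, and the patterns are
made up for this example and are not part of bfiles or of Mercurial.

```python
import fnmatch
import posixpath

# Hypothetical defaults for illustration only; bfiles defines no such settings.
DEFAULT_MIN_SIZE = 10 * 1024 * 1024          # 10 MiB
DEFAULT_PATTERNS = ("*.iso", "*.zip", "*.psd")


def is_big_file(name, size, min_size=DEFAULT_MIN_SIZE,
                patterns=DEFAULT_PATTERNS):
    """Return True if a file should be added as a big file.

    name is a repository-relative path using "/" separators and size is
    the file size in bytes. A file qualifies if it is at least min_size
    bytes long or if its base name matches one of the glob patterns.
    """
    if size >= min_size:
        return True
    base = posixpath.basename(name)
    # fnmatchcase avoids platform-dependent case folding.
    return any(fnmatch.fnmatchcase(base, pat) for pat in patterns)
```

A content-based criterion (e.g., "looks binary") would need to read the
file and is left out here.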

> Anyways, the great big question mark hanging over bfiles is whether
> the .hgbfiles/ directory tree is the right way to track big file
> metadata. It *works*, but it isn't perfect. And I have a
> 110,000-changeset repo with 9 years of history in it, and files in
> .hgbfiles/ going back to 2002 (our conversion from CVS automatically
> added big files on the fly). Replacing .hgbfiles/ with something else
> would be a big pill for me to swallow; it would have to be a really
> stunningly superior solution.

Yeah, that makes sense. I talked with Benoit about the snap extension on
IRC, and as I understood it, he had expected the integration to be done
on an even lower level. Right now snap stores the pointer to the big
file in the filelog as the file data. You can see this when 'hg diff'
shows lines like this:

-.snap://file.txt.271ac93c44ac198d92e706c6d6f1d84aefcfa337
+.snap://file.txt.7bee8f3b184e1e141ff76efe369c3b8bfc50e64c

He had expected the pointer to be implemented via metadata, similar to
how light-weight copies (lw-copy) will work: lw-copy makes one revlog
reference another, and transparent big files could (hopefully) use the
same mechanism to reference the external data.

(Benoit: please correct me if I'm misinterpreting you! :)
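
To make the pointer lines from the 'hg diff' output above concrete, here
is a small sketch that builds and parses strings of that shape. It only
assumes what the diff shows: a ".snap://" prefix, a file name, a dot,
and 40 hex digits. That the digits are a SHA-1 of the file content is my
assumption, and the function names are hypothetical, not snap's API.

```python
import hashlib
import re

PREFIX = ".snap://"

# File name, then a dot, then 40 lowercase hex digits, as in the diff output.
_POINTER_RE = re.compile(
    r"^\.snap://(?P<name>.+)\.(?P<digest>[0-9a-f]{40})$")


def make_pointer(name, data):
    """Build a pointer line for name, assuming a SHA-1 of the content."""
    return "%s%s.%s" % (PREFIX, name, hashlib.sha1(data).hexdigest())


def parse_pointer(line):
    """Return (name, digest) for a pointer line, or None if it isn't one."""
    m = _POINTER_RE.match(line.strip())
    if m is None:
        return None
    return m.group("name"), m.group("digest")
```

With such a pointer stored as the file data, the filelog only ever holds
one short line per revision, and the big file itself lives elsewhere.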

> The other big question is "how many extensions does it take to handle
> big files?"  Here's the current lay of the land:
>
>   * bigfiles: small and simple; but underdocumented and untested (last
>     time I looked) and therefore I could not get it to work
>
>   * bfiles: moderately large and quite complex; decent docs and good
>     tests; mostly works; similar basic design to bigfiles
>
>     - Chad Dombrova has been working on major changes to bfiles to try
>       to address various shortcomings; in the worst case, he might
>       have to fork an incompatible version, but we have been
>       communicating and both of us very much want to avoid a fork

Is that the fork mentioned here, or is this unrelated?

http://kiln.stackexchange.com/questions/1929/why-kbfiles-instead-of-improving-bfiles

>   * snap: large (according to Adrian); radically different design

The snap extension is twice as big as bfiles (~4,000 lines) and it comes
with 97 test scripts, though only 23 of them mention 'snap'.

> As has been pointed out, I wrote bfiles because bigfiles did not work
> for me, despite the attempt by its author and me to get it to work.
> They have a similar basic design, but bfiles has a lot more features.
> In particular, it has wire protocols for getting big files back and
> forth to a central store; as I understand it, bigfiles leaves that up
> to the user.
>
> My bottom line on snap: let a thousand flowers bloom.  I don't know if
> putting metadata explicitly in the filelog is better than doing it
> implicitly, but it sounds promising.

I've been asked by Klaus' company to take a look at the snap extension
with the aim that it will eventually be included with Mercurial.

Even before being contacted by them, I felt that handling big files with
a supported extension would be a great feature to have. Before knowing
anything about the snap extension, I had hoped we would eventually ship
the bfiles extension since I heard good things about it :)

Klaus then pointed out your earlier mails[1] about some problems in
bfiles and explained how the in-place files make renaming easier. So I
now hope we can eventually ship the snap extension or something similar.

[1]: http://markmail.org/message/43c3dlh3yq5ksbac

I think the ideal step forward would be for you and Klaus to discuss the
design of both extensions with the aim of merging the best of both. But
I realize that both of you have full-time jobs to take care of besides
hacking on these extensions, so I understand if you want to take the
path of least resistance and continue with bfiles/snap as they look now.


-- 
Martin Geisler

aragost Trifork
Professional Mercurial support
http://aragost.com/mercurial/

