RFC: version (big) file snapshots with storage outside a Mercurial repo with snap

Thu Aug 19 15:04:48 CDT 2010

On Aug 19, 2010, at 12:06 PM, Dirkjan Ochtman wrote:

> On Mon, Aug 16, 2010 at 21:10, Klaus Koch <kuk42 at gmx.net> wrote:
>> The main difference really is the architecture and the IMHO more
>> transparent use and development.
>> 
>> Another difference is the 'attitude'.  As far as I understood, for
>> Greg it was important to control explicitly what files are checked in
>> as big files.  In my company, we have many users who want to use the
>> same command for checking in a file.  (It seems, you can now configure
>> the Mercurial commands to execute the bfiles commands like bfadd,
>> bfrefresh, bfput etc.)  None of the 'attitudes' is more right than the
>> other, but for me it was important that it is impossible by default to
>> accidentally commit files with compressed deltas bigger than
>> Mercurial's limit of 4GiB.  (Actually, snap sets the hard limit to
>> 750000000 Bytes as this is the effective limit for 32 bit systems
>> regarding free memory address space.)
>> 
>> The different 'attitude' shows also in the way the commands behave.
>> An 'hg status' will print the usual states and files.  There is no
>> 'B-M' etc. status.  If one wants to see the status of all (to be)
>> snapped files, one can use 'hg status --snapped' to print only those.
>> 
>> The code of snap is in some way a blue print of what one would need to
>> add where in Mercurial's core to handle big files.  I tried hard to
>> avoid reimplementing Mercurial functionality, and searched instead for
>> the single point where the smallest change added support for
>> snapped/big files.
> 
> Okay, but it seems you did not publish your concerns about bfiles'
> approach before embarking on coding your own, correct?
> 
> You may dismiss Adrian's concerns about code size as FUD, but it would
> seriously have helped your case if there was a public conversation
> between you and Greg and Andrei about design for these features. Now,
> we just have your big wad of code (written by just you in five days)
> and their big wad of code (written by a bunch of people over the
> course of a few months, and in actual use at several companies IIUC).
> 
> I'd hoped Greg Ward would comment here about his thoughts on your
> extension, but he hasn't done so yet. Anyway, it would be more
> productive IMO if you talked to him about moving bfiles in the
> direction you want (or snap in the direction of things he needs,
> though I doubt that would be the more productive approach). That way,
> we might actually converge on an extension that works for most people
> and thus at some point something we could support in hgext.
> 
> Cheers,
> 
> Dirkjan

I checked out both bigfiles and bfiles in earnest starting October last year.  At that time I succeeded in getting bigfiles to run and failed with bfiles.  So I stuck with bigfiles.  Over the next months I extended bigfiles and changed some internals.  We tested bigfiles and these changes at the company where I am employed.  We made two observations:
1. It is quite a maintenance effort to duplicate Mercurial functionality.
2. Using extra commands like badd etc. is cumbersome and error-prone.

We have many files in the MiB or even GiB range in our repositories.  Handling 'normal' files and 'big' files with different commands is IMHO not good and a hard sell to our developers, who are used to a central revision control system which handled all files equally well.  Of course, some of our developers were/are not happy with that central system (mainly those who have to use it on Windows), and some learned from experience that a central system is a single point of failure.  After all, there had to be some reason(s) why we as a company wanted to move to Mercurial as fast as possible.

The irony is that Greg Ward started bfiles because he could not get bigfiles running for him: http://markmail.org/message/ikqevkfsodklvd5l

The idea that one could store a pointer to the data of big files in the Mercurial revlog is not new, and I am certainly not the first to propose it.  Several people have pointed that out, on this mailing list as well.
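
To make the pointer idea concrete, here is a minimal sketch.  It is not code from snap, bigfiles, or bfiles; the function names, the dictionary layout, the use of SHA-1 as the blob name, and the `threshold` parameter are all assumptions made for illustration.  Small content is kept inline, while large content is written to a directory outside the repository and only a short pointer record is kept in its place.

```python
import hashlib
import os


def store_revision(data, store_dir, threshold):
    """Return the record a hypothetical history entry would hold for `data`.

    Content up to `threshold` bytes stays inline.  Larger content is written
    to `store_dir` (outside the repository) under its SHA-1 hex digest, and
    only a small pointer record is returned.
    """
    if len(data) <= threshold:
        return {"kind": "inline", "data": data}
    digest = hashlib.sha1(data).hexdigest()
    path = os.path.join(store_dir, digest)
    if not os.path.exists(path):  # identical content is stored only once
        with open(path, "wb") as f:
            f.write(data)
    return {"kind": "pointer", "sha1": digest, "size": len(data)}


def load_revision(entry, store_dir):
    """Resolve a record produced by store_revision back to the file content."""
    if entry["kind"] == "inline":
        return entry["data"]
    with open(os.path.join(store_dir, entry["sha1"]), "rb") as f:
        return f.read()
```

Naming blobs by content hash is only one possible design choice; it lets the external store deduplicate identical snapshots and lets a reader verify what it fetched.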

In March this year, I had some vacation.  I couldn't relax, always thinking that this should be done more simply and having the terrible feeling that we were doomed if we used bigfiles or bfiles in our company.  So I started to code it.  At first, I didn't put it in a repository at all (remember, I was on vacation, no "work" was allowed).

The first draft added a new 'snap' flag to the manifest.  This worked quite well, but was a change in the manifest format.  I checked Mercurial's code base and found, I think, three lines which would have to be changed in order to handle the new 'snap' flag.  So I wrote to this mailing list, asking what the Mercurial developers' opinion was about such a new flag: http://thread.gmane.org/gmane.comp.version-control.mercurial.devel/30630/focus=30634  Benoit Boissinot stated very clearly that such a new flag was not good, but pointed out that I could use the filelog.meta.  So I did, still on my vacation.
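
For readers unfamiliar with filelog metadata: Mercurial frames it as "key: value" lines between two `\1\n` markers placed ahead of the file text (copy information is recorded this way).  The sketch below imitates that framing in standalone Python.  It does not call Mercurial, and the `snap` key and its value are hypothetical; they are not claimed to be what the extension actually writes.

```python
META_MARKER = b"\x01\n"  # delimiter placed before and after the metadata block


def pack_meta(meta, text):
    """Prefix `text` with a metadata block built from a dict of bytes."""
    lines = b"".join(b"%s: %s\n" % (k, v) for k, v in sorted(meta.items()))
    return META_MARKER + lines + META_MARKER + text


def parse_meta(raw):
    """Split stored data into (metadata dict, file text)."""
    if not raw.startswith(META_MARKER):
        return {}, raw  # no metadata block present
    end = raw.index(META_MARKER, 2)  # position of the closing marker
    meta = {}
    for line in raw[2:end].splitlines():
        key, value = line.split(b": ", 1)
        meta[key] = value
    return meta, raw[end + 2:]


# Hypothetical usage: record a pointer to externally stored data, keep no text.
stored = pack_meta({b"snap": b"0123abcd"}, b"")
meta, text = parse_meta(stored)
```

The appeal of this route is that the manifest format stays untouched: the extra information travels inside the file revision itself.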

In that exchange, Greg Ward gave me some kind and helpful comments, and we discussed very briefly the performance of the status function.  So I tried out his bfiles extension again, and this time it worked for me.  OTOH, he had stated in his requirements document and in his proposal that he did not like any automatic/implicit detection of 'big' files.  As I have stated before, he has every right to that 'attitude'.  It simply does not seem practical for our use case and our developers.  (I am guilty of not comparing bfiles' performance with snap's, but at that time it was clear that I wanted 'transparent' behavior with as few special commands as possible, and no special directories in the working directory.  Later, my colleagues stressed the same point.)

Back at work, I presented it to my colleagues.  They liked it.  So I replaced bigfiles with snap in our Mercurial installation.  A couple of days later, the first person checked in several files of several MiB each.  Of course, this was the project which had claimed that they would not check in big files in the coming months since they had just started.

That first version seemed to work, but it did not handle enough special cases and it showed some annoying bugs, although it seemed to keep the data.  So I kept going to get rid of these bugs, revised code, tested it etc., all alongside my normal work.  In the beginning I had decided that I would not implement any protocol for exchanging snapped files, because I am no expert in this at all, and because we most likely do not want to transfer (all) our GiB files.  However, the brief email exchange with Greg Ward led me to skim the source code of his bfiles extension again, and I realized that the protocol might not be so far beyond my reach.  What I implemented then was inspired more by the Mercurial core method, but I think he was certainly the first one who realized how it could be done, and he was the one who implemented it first.

If you want to use snap, you should test it first, whatever the size of its source code.  For sure, snap still has bugs (Martin Geisler found one recently).  It does not work with rebase and mq yet.  It is not perfect.  I consider snap, bigfiles, and bfiles to be temporary solutions until the Mercurial crew decides to support 'big' files directly, whenever they want.

This extension is an offer, just that.  It is open source; everyone can take all or part of the code and reuse it in one's own programs/extensions, use it as a source of ideas, or point to it as a bad example.  Of course, my employer allowed me to publish the source code so that it is 'maintained' in some way like other open source (we do that with/for other open source), but no one is forced to do that or even to use it.

Klaus