RFC: version (big) file snapshots with storage outside a Mercurial repo with snap

Klaus Koch kuk42 at gmx.net
Mon Aug 16 14:10:08 CDT 2010


Martin Geisler <mg <at> aragost.com> writes:

> This is very nice! It is my impression that the bfiles extension by Greg
> also aims at hooking into the normal add, commit, push, pull commands
> but that this just hasn't been done yet.
> 
> Do you know of other differences between your extension and the two
> other extensions? I'm asking since having three extensions for this task
> is two too many and it is a shame to see such duplication of effort.

Yes, I did not want to write another extension for handling big files.
In fact, I tried both bigfiles and hg-bfiles.  I even added new
commands to bigfiles (never published).

The main difference really is the architecture and the IMHO more
transparent use and development.


The bigfiles extension maintains a 'special' file .bigfiles under root
with entries "file_name sha1\n", whereas the hg-bfiles extension
maintains a directory .hgbfiles in order to store the names and their
contents' sha1, e.g. ".hgbfiles/file1" with its sha1 as content.  Snap,
OTOH, stores that information directly in the filelog meta and data.
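
To make the two bookkeeping layouts concrete, here is a minimal sketch
of how one could read them.  It only assumes what is described above
(one "file_name sha1" line per entry, or one small file per big file
holding its sha1); the function names are made up for illustration and
are not taken from either extension.

```python
import os


def read_bigfiles_manifest(path):
    """Parse a single manifest with one 'file_name sha1' entry per line."""
    entries = {}
    with open(path) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split from the right so file names containing spaces survive.
            name, sha1 = line.rsplit(" ", 1)
            entries[name] = sha1
    return entries


def read_hgbfiles_dir(root):
    """Walk a directory in which each file holds the sha1 of one big file."""
    entries = {}
    for dirpath, _dirs, files in os.walk(root):
        for fname in files:
            full = os.path.join(dirpath, fname)
            with open(full) as fh:
                # Use '/' separators so both readers produce the same keys.
                rel = os.path.relpath(full, root).replace(os.sep, "/")
                entries[rel] = fh.read().strip()
    return entries
```

Both readers end up with the same name -> sha1 mapping; the difference
lies only in where that mapping lives and how it has to be kept in sync.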

Now consider how you would implement a move command for bigfiles or
hg-bfiles.  You must first check whether that file is big, i.e., is
mentioned in .bigfiles or is in .hgbfiles.  (Should you check that in
the working directory context, or also in the parents' contexts?)  Then
you have to rename the entries in .bigfiles or .hgbfiles, *and*
perform the move in the working dir.  What you end up with is basically a
re-implementation of functionality which is available in Mercurial
already.

With snap, the move is done by Mercurial as usual.
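
A rough sketch of the extra work such a move command needs when the
bookkeeping lives outside the store.  The dict stands in for .bigfiles
or .hgbfiles, and the function name is hypothetical; it is not code from
either extension.

```python
import os
import shutil


def move_tracked_big_file(root, manifest, src, dst):
    """Move src to dst under root and keep a name -> sha1 mapping in sync.

    `manifest` is a plain dict standing in for .bigfiles or .hgbfiles.
    """
    if src not in manifest:
        raise KeyError("%s is not tracked as a big file" % src)
    if dst in manifest:
        raise ValueError("%s is already tracked" % dst)
    dst_path = os.path.join(root, dst)
    dst_dir = os.path.dirname(dst_path)
    if not os.path.isdir(dst_dir):
        os.makedirs(dst_dir)
    # Step 1: the move in the working directory.
    shutil.move(os.path.join(root, src), dst_path)
    # Step 2: the bookkeeping that Mercurial would otherwise do itself.
    manifest[dst] = manifest.pop(src)
    return manifest
```

Even this small version already has to decide what "tracked" means and
has to keep two places consistent, which is the duplication I wanted to
avoid.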


Greg already discussed the pros and cons of using one .bigfiles
vs. a directory .hgbfiles with files.  I think he is right in that, so
I will not repeat it here.
When you store that information in the Mercurial database/store
instead, you save a lot of added infrastructure for exchanging that
information and manipulating it.  Due to this, the snap extension is
IMHO more feature complete than bigfiles or hg-bfiles.


The caching is done in snap as usual.  Snapped files are stored in the
local clone by default.  When you push the changes to a 'central'
repository, the snapped files' data is pushed as well.  Of course, you
can also configure the snap-store path to use a central store, and
there is also a snap-store-push.  So the approach is quite like the
usual push and pull in Mercurial.  The cache structure in bfiles seems
more complicated to me.


Another difference is the 'attitude'.  As far as I understood, for
Greg it was important to control explicitly what files are checked in
as big files.  In my company, we have many users who want to use the
same command for checking in a file.  (It seems you can now configure
the Mercurial commands to execute the bfiles commands like bfadd,
bfrefresh, bfput etc.)  Neither 'attitude' is more right than the
other, but for me it was important that it is impossible by default to
accidentally commit files with compressed deltas bigger than
Mercurial's limit of 4GiB.  (Actually, snap sets the hard limit to
750000000 bytes, as this is the effective limit on 32-bit systems
given the available memory address space.)
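
The kind of guard meant here could be sketched as follows.  The constant
name and the function are made up for this example, and it assumes the
limit is compared against the plain file size on disk; it is not the
code from snap.

```python
import os

# Hard limit quoted above; the constant name is made up for this sketch.
SNAP_HARD_LIMIT = 750000000  # bytes


def exceeds_hard_limit(path, limit=SNAP_HARD_LIMIT):
    """Return True if the file at `path` is larger than `limit` bytes."""
    return os.path.getsize(path) > limit
```

A commit hook could call such a check for every added or modified file
and either refuse the commit or hand the file over to the snap store.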

The different 'attitude' also shows in the way the commands behave.
An 'hg status' will print the usual states and files.  There is no
'B-M' etc. status.  If one wants to see the status of all (to be)
snapped files, one can use 'hg status --snapped' to print only those.


The code of snap is in some way a blueprint of what one would need to
add where in Mercurial's core to handle big files.  I tried hard to
avoid reimplementing Mercurial functionality, and searched instead for
the single point where the smallest change added support for
snapped/big files.


> > Sha1 collisions may be very unlikely, but they are not impossible.
> 
> As a cryptographer, I will say that they are impossible in practice, at
> least at this time. If you are lucky enough to find a collision, you
> should not work around it, you should instead save the data carefully
> and send a mail to an international crypto conference :)

I'll do that :)  It certainly is practically impossible to provoke such
a collision deliberately, but it may happen accidentally.
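
For a rough sense of scale, the chance of an accidental collision can be
bounded with the usual birthday-style union bound.  This sketch assumes
sha1 behaves like a uniformly random 160-bit function, which is an
idealization; the function name is made up for this example.

```python
from fractions import Fraction


def accidental_collision_bound(n_files, bits=160):
    """Upper bound on the chance that any two of n_files distinct
    contents share a hash: (number of pairs) / 2**bits."""
    pairs = n_files * (n_files - 1) // 2
    return Fraction(pairs, 2 ** bits)
```

Calling float(accidental_collision_bound(10**9)) gives the bound for a
store holding a billion distinct file contents.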


> Finally, I noticed a lot of places in the code that read
> 
>     finally:
>         del(fsrc)
>         del(fdst)
>         del(frep)
> 
> just before the end of the function/method.
> 
> Is that meant to trigger the close method of those file handles? If so,
> then calling close explicitly is better since there is no guarantee
> exactly when the objects are reclaimed.
> 
> The normal CPython implementation will close the files when you delete
> the last reference to them since it uses a reference counting scheme,
> but this is not true for the other Python implementations that use a
> more modern garbage collector.

It follows the advised method in
http://mercurial.selenic.com/wiki/DealingWithDestructors

The reasoning is that we do not know whether fsrc, fdst, or frep were
created successfully, i.e., if they are file objects with a close
method.  For example, fsrc may be opened successfully, but fdst raises
an exception.  Then we could close fsrc, but not fdst and frep.  Of
course, if all three names are set to None before the try block, we
could use

    if fsrc:
        fsrc.close()
    if fdst:
        fdst.close()
    if frep:
        frep.close()

Hm ...
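
A minimal sketch of the explicit-close variant.  The function and its
"report" file are hypothetical and only there to have three handles like
fsrc, fdst and frep; the point is that each name starts as None, so the
finally block can tell which files were actually opened.

```python
def copy_with_report(src, dst, rep):
    """Copy src to dst and write a one-line report, closing whatever
    was opened even if a later open() or write fails."""
    fsrc = fdst = frep = None
    try:
        fsrc = open(src, "rb")
        fdst = open(dst, "wb")
        frep = open(rep, "w")
        data = fsrc.read()
        fdst.write(data)
        frep.write("%d bytes copied\n" % len(data))
    finally:
        # Names that are still None were never opened, so skip them.
        for fh in (fsrc, fdst, frep):
            if fh is not None:
                fh.close()
```

This does not rely on reference counting, so it should behave the same
on Python implementations with a different garbage collector.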


