RFC: version (big) file snapshots with storage outside a Mercurial repo with snap

Klaus Koch kuk42 at gmx.net
Sun Aug 15 14:03:48 CDT 2010


Some files are either too big, or do not change in small enough deltas to
be stored in a source revision system like Mercurial.  They increase the
repository size, memory consumption, and run-time.

Such files have a _current_ size above a threshold, or match a configured
pattern.  Symlinks and files whose names start with `.hg` are excluded.
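
The selection rule above could be sketched in Python as follows.  This is
only an illustration, not the extension's code: the function name
`should_snap` and its parameters `size_threshold` and `patterns` are
hypothetical, and matching patterns against the base name with `fnmatch`
is an assumption.

```python
import fnmatch
import os


def should_snap(path, size_threshold, patterns):
    """Return True if the file at `path` qualifies for snapping.

    Hypothetical helper: `size_threshold` is in bytes, `patterns` is a
    list of shell-style globs matched against the file's base name.
    """
    name = os.path.basename(path)
    # Symlinks and Mercurial's own files (.hgignore, .hgtags, ...) are
    # never snapped, whatever their size or name.
    if os.path.islink(path) or name.startswith(".hg"):
        return False
    # A configured pattern selects the file regardless of its size.
    if any(fnmatch.fnmatch(name, pat) for pat in patterns):
        return True
    # Otherwise only the current size in the working dir decides.
    return os.path.getsize(path) > size_threshold
```

Because the current size is checked, a file that shrinks below the
threshold would no longer be selected by size alone at a later commit.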

With this extension, the data of such files is never seen by a Mercurial
repository.  Instead, their content is saved outside the repository, and
Mercurial gets an additional filelog meta `snap:1` and a string
`.snap://<data_store_filename>.<sha1>\n` as file data, where the sha1 is
computed out of the original data during commit.  The file is `snapped`.
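
As a rough sketch of how such a snap string could be built, the following
Python hashes the original data block by block and formats the result.
The function name, the `blocksize` parameter, and the use of the
hexadecimal digest are assumptions made for illustration; the extension's
actual code may differ.

```python
import hashlib


def snap_string(data_store_filename, fileobj, blocksize=1 << 20):
    """Build the string stored in Mercurial in place of the file data.

    Hypothetical helper: `fileobj` is a binary file-like object holding
    the original data, read in blocks so that a big file never has to
    fit into memory at once.
    """
    sha1 = hashlib.sha1()
    # iter() with a sentinel stops at the first empty read (end of file).
    for block in iter(lambda: fileobj.read(blocksize), b""):
        sha1.update(block)
    return ".snap://%s.%s\n" % (data_store_filename, sha1.hexdigest())
```

The resulting string is short and fixed in form, so the filelog only ever
stores a few dozen bytes per revision of a snapped file.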

In comparison to other extensions like bigfiles or hg-bfiles, this
extension can keep its re-implementation of Mercurial functionality down to
a minimum.  There is no need to handle special files or directories, or to
hide them and the big files from Mercurial.  One uses the usual Mercurial
commands to commit, push or pull snapped files.

When a snapped file is updated, its snap string is first written into the
file in the working directory, then the hook `snapupdate` (default:
python:snap.snapupdate) is executed just before the hook `update`.

If the hook `snapupdate` cannot retrieve a file, this is reported and the
file's clean working dir version containing its snap string is deleted.
Hence, snapped files can be removed from the snap store to save disk space,
and programs will never see files with their content replaced by snap
strings.

When a snapped file is committed, its original data is stored together with
some metadata into a read-only zip file named like the original file in
Mercurial's data store with the sha1 attached, and put into the snap store
path `snap-store` (default: .hg/snap/cache).  The file in the working dir
is not touched in any way.  The compression level can be configured as well
as the file suffixes for files to be stored directly.  The zip files can be
regenerated later with better compression.  One can search their content
with :hg:`cat -r <rev> <file name>`.
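
The storing step could look roughly like the sketch below.  It is not the
extension's implementation: the function name, its parameters, and the
example suffix list are hypothetical, the metadata mentioned above is
omitted, and the `compresslevel` argument needs Python 3.7 or later,
which postdates this extension.

```python
import os
import stat
import zipfile


def write_snap_zip(zip_path, member_name, src_path, compresslevel=6,
                   stored_suffixes=(".zip", ".gz", ".jpg")):
    """Store `src_path` as member `member_name` of a read-only zip file.

    Hypothetical helper: files whose suffix is in `stored_suffixes` are
    assumed to be compressed already and are stored without deflating.
    """
    store_directly = src_path.lower().endswith(tuple(stored_suffixes))
    method = zipfile.ZIP_STORED if store_directly else zipfile.ZIP_DEFLATED
    with zipfile.ZipFile(zip_path, "w") as zf:
        # compresslevel has no effect when the method is ZIP_STORED.
        zf.write(src_path, arcname=member_name, compress_type=method,
                 compresslevel=compresslevel)
    # Make the snapshot read-only; the working dir file is left untouched.
    os.chmod(zip_path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
```

Since the sha1 is computed from the original data and not from the zip
file, a snapshot can be recompressed later without changing its name.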

Sha1 collisions may be very unlikely, but they are not impossible.  So, if
a snapshot is stored with the same sha1 as another file in the snap store,
the data of the two snapshots is checked to ensure that no hash collision
occurred.  If their data differs, an integer suffix '_%d' is added to the
hash of the new snapshot.
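
The collision check can be illustrated with a small in-memory model, where
a dict stands in for the snap store.  The function name and the choice to
start counting suffixes at 1 are assumptions; a real store would compare
the snapshot data on disk, preferably in blocks.

```python
def unique_key(store, sha1_hex, data):
    """Return the key under which `data` belongs in `store`.

    Hypothetical model: `store` maps keys (hash plus optional '_%d'
    suffix) to snapshot data.  A key is reused only if it already holds
    exactly the same data; otherwise the next free suffix is tried.
    """
    key = sha1_hex
    suffix = 0
    while key in store and store[key] != data:
        # Same hash, different data: a genuine collision.
        suffix += 1
        key = "%s_%d" % (sha1_hex, suffix)
    return key
```

Identical data always maps back to the key it was first stored under, so
committing the same content twice does not create a second snapshot.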

When a changeset with snapped files is pulled, the snapped files are
hardlinked or copied from `snap-default` into `snap-store`, just before the
hook `changegroup` is called.  If a changeset is pushed, its snapped files
are hardlinked or copied from `snap-store` to `snap-default-push` or
`snap-default`, right before the remote's hook `changegroup` is executed.
Alternatively, one could synchronize snap stores directly with
:hg:`debugsnappull` or :hg:`debugsnappush`.
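
The "hardlinked or copied" transfer between two snap stores might be
sketched like this.  The function name is hypothetical, and skipping a
destination that already exists is an assumption based on snapshots being
named after their content hash.

```python
import os
import shutil


def link_or_copy(src, dst):
    """Make the snapshot `src` available as `dst` in another snap store.

    Hypothetical helper: a hardlink is tried first because it costs no
    extra disk space; if linking fails (e.g. the stores are on different
    filesystems), the file is copied instead.
    """
    if os.path.exists(dst):
        # Already present in the target store; nothing to do.
        return
    try:
        os.link(src, dst)
    except OSError:
        shutil.copy2(src, dst)
```

Hardlinking is safe here because the stored zip files are read-only and
are not modified in place.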

One can use Mercurial's commands as usual, e.g. :hg:`log <snapped file>`,
:hg:`merge` etc.  For the latter, snapped files are always merged with
`internal:fail`, i.e., they are never merged automatically, unless the
environment variable HGMERGE is set to one of `internal:(local, other,
dump)`, or they are merged by a tool in `merge-tools` with the new
attribute `snap` set.  One can resolve such failed merges as usual with
:hg:`resolve`.

The filelog meta `snap:1` is not kept by the convert extension delivered
with Mercurial.  However, this extension, if activated, adapts convert
so that the meta is kept.  Due to
the stored string it is always possible to repair a repository which has
been converted without this extension.  All other Mercurial commands and
protocols keep the filelog meta as is and ignore `snap:1`.

Of course, this extension is ineffective regarding Mercurial extensions
using their own methods for opening files in the working directory.  One
can set up hooks to store and restore snapped files.

Filters for transforming snapped files on checkout/checkin must be set up
as update/precommit hooks, because Mercurial's `decode/encode` mechanism
reads the entire file data into memory, whereas this extension reads the
data iteratively in blocks.  Since snap cannot know what character
combinations must be kept together for filters to work, it ensures the
filters see only the snap string.  Consequently, such filters should
reconstruct the snap string, or keep it intact.  Mercurial's
`decode/encode` filters are applied to the snap string, not to the snapped
data content.

External diff, merge, and patch programs will directly process the (big)
files in the working dir.

In case a repository with snapped files in its snap store is cloned, the
referenced snap files are pulled after the clone has been created.  For
source repos without pushkey capability, cloning takes at least three times
longer, if the source's snap store contains files, because all index files
are scanned for snapped files.  We maintain a map file in .hg/snap/filemap
and make it listable in pushkey namespace snapfilemap.  It contains the
names of referenced snap files per revision newly referencing them.  So
with Mercurial >= 1.6, cloning is as fast as usual.

Recipients of Mercurial bundles or patches need access to a snap store, or
referenced snapped files must be provided separately.  This could be done
with :hg:`debugsnappull` or :hg:`debugsnappush`.  Patches with snap strings
can be applied only to working dir files which are either not yet snapped
or contain the snap string that is to be replaced.  That is, already
snapped files which are to be patched must first be reverted with
:hg:`revert --nosnapped` so that the snap string is the content of the
working dir file.

The commands `status`, `cat`, `diff`, `revert`, `archive`, `verify`,
`serve`, and `convert` got new options.

With Mercurial >= 1.5, a new template `files_snapped` is provided.

The snap extension should work with Mercurial 1.4 and later.  However, all
unit tests and further development focus on Mercurial 1.6.  The hgshelve
extension is supported, see the source for the tested hgshelve version.
The rebase extension is not yet supported.


Thanks to:

Andrei Vermel for the bigfiles and Greg Ward for the hg-bfiles extension.

Benoit Boissinot for pointing out that Mercurial manifests with a flag for
snapped files cease to be Mercurial manifests.  This caused some
teeth-grinding, but became a non-issue with the new pushkey protocol.

Thomas Arendsen Hein for suggesting that any optimization in recalculating
a file's hash should be optional so that newbies are not confused.


The repository and bug tracking of snap can be found at:

http://bitbucket.org/kuk42/hgsnap/wiki/Home


You may check out bigfiles and hg-bfiles as well:

http://bitbucket.org/avermel/bigfiles/

http://vc.gerg.ca/hg/hg-bfiles/
