Space savings: cherry picking, partial cloning, overlays, history trimming, offline revisions, and obliterate

Jared Hardy jaredhardy at gmail.com
Fri Mar 21 02:36:34 CDT 2008


I'm not a student, but I'm very interested in "partial cloning"
(referring to a related Google SoC thread).

I'm new to Mercurial, but I've been working with Subversion+SVK for
about a year now, so I'm familiar with many of the problems related to
distributed VCS, and both the benefits and problems with coordinating
distributed developers, even with a central shared repository.
Cherry-picking, "partial cloning", and history trimming are all very
important to BIG projects, where not everyone is working on the same
branches of a large folder tree, nor do they have the same amount of
storage available. The "Forest" feature seems interesting, but it
feels a little incomplete, to me.

    When I hit a couple of stumbling blocks with SVK on Windows, I
went searching for more alternatives again. One of my biggest concerns
is repository size, because I frequently deal in art and video
pipelines, which include large binary files (3D, images, sound, raw
video). That would all be considered "source", in our pipeline. Art
pipeline build processes can take a long time, and multiple intricate
steps, so even tracking build outputs and interim files via VCS is
sometimes useful. Yes, we could just shovel over the money for
AlienBrain, or a giant SAN with Perforce. I'm biased towards Open
Source, and no "art" VCS seems sufficiently efficient (yet),
especially in network or storage space use.

So 3 main things caught my eye when I started reading about Mercurial:

 1. Binary diffs (like Subversion) -- a basic necessity for art binary
source tracking.

 2. Central working copy cache, via the local repository "clone" copy (.hg).
        We would be happy with Subversion, except that all recursive
status dependent operations take a long time on Windows, because
traversing all the .svn folders is slow on NTFS. We've found .svn
pollution is a problem for many Windows tools. That is the main reason I
started using SVK. I think SVK is a gateway opiate for DVCS. Mercurial
could accomplish the same opiate status, I think, with an svn
push/pull feature. ;) iNotify on Windows would be a category killer!

 3. Repository layout looks very clean, with a nice separation between
the file store (data) and the tree metadata (index).
        This is the main reason I am so interested in Mercurial. Also,
it seems like each file has its own individual history store (revlog
data file), which I thought would make features like cherry picking
(aka Overlays, Partial Cloning), history trimming, or even
"obliterate" easier to implement. Perforce and AlienBrain are two (of
very few) other SCM systems that have per-file history data storage,
and they already support these features. Perforce is not space
efficient on the repository though, because it doesn't ever use binary
diffs (yet). I don't know if AlienBrain is still as space inefficient,
but the last time I checked (years ago), it had the same problem, and
it was painfully slow on our gigabit LAN. Mercurial snapshots seem
like a good starting point, for determining efficient places to prune
history.

Based on all these aspects, I started reading all discussions on
space-saving features in Mercurial, of all kinds. I was disappointed
by much of the conversation in this thread about the potential for an
"obliterate" feature:

http://www.selenic.com/pipermail/mercurial/2008-March/017802.html

But I did learn a lot about Mercurial checksum internals, from parts
of this thread. In particular, this idea seems VERY interesting to me:

On Wed, Mar 19, 2008 at 2:05 PM, Marcin Kasperski
<Marcin.Kasperski at softax.com.pl> wrote:
> Patrick Mézard <pmezard at gmail.com> writes:
>
>  > It's clear that no solution preserving the hashes is likely to come
>  > up quickly, it's breaking deep invariants which makes it hard to
>  > implement and it would break existing clients. What about discussing
>  > real-world use cases instead, so we can come up with better history
>  > rewriting tools ?
>
>  hashes routines work in a stream basis. I can imagine replacing
>  obliterated file with the information what was its hash and using this
>  value for further calculation
>
>  Say you had changeset [X][B][C] with sum N. X is to be obliterated.
>  So you just save the state of hash calculation after X is processed,
>  and since then just start calculation for B and C with different
>  starting value....
>
>  If it is [B][X][C], then you can preserve separately sum after B for
>  additional verification, then use the partial sum after B and X as
>  above.
>
>  (yeah, I know, that is just very looose idea, not sure how difficult
>  would it be to implement this)
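
To make that quoted idea concrete, here is a rough Python sketch using
the standard hashlib module. It only shows that a streaming hash can be
resumed from a saved intermediate state, so the earlier data is no
longer needed. It is not how Mercurial computes its node IDs, and the
byte strings X, B, and C are made-up stand-ins. Also note that hashlib
only keeps this state in memory (via copy()); writing it to disk would
need a hash implementation that exposes its internal state.

```python
import hashlib

# Hypothetical stand-ins for three chunks of history.
X = b"contents to be obliterated\n"
B = b"second chunk\n"
C = b"third chunk\n"

# Full checksum N over [X][B][C], computed while X is still available.
full = hashlib.sha1(X + B + C).hexdigest()

# Save the hash state right after X has been consumed.
state_after_x = hashlib.sha1(X)

def resume(saved_state, *chunks):
    """Continue hashing from a saved state without the earlier data."""
    h = saved_state.copy()  # leave the saved state itself untouched
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()

# X itself is not used here: only the saved state, B, and C.
resumed = resume(state_after_x, B, C)
print(resumed == full)
```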

Now bear with me for a bit. This is where I substantiate how cherry
picking (Overlays, Partial Cloning), history trimming, and obliterate
are all related, and how their implementations may even overlap...

References:
http://www.selenic.com/mercurial/wiki/index.cgi/TrimmingHistory
http://www.selenic.com/mercurial/wiki/index.cgi/OverlayRepository
http://www.selenic.com/mercurial/wiki/index.cgi/PartialClone
http://www.selenic.com/pipermail/mercurial/2008-March/017888.html

    Let's just say, for sake of argument, that a stand-in hash data
replacement method (hinted at by the thread above) was already
implemented, tested, and verified to work. With this, you can take ANY
revision you want out of a revlog's data file, at any time, as long as
you store each revision's hash data in the revlog's index. Sure, this
would require rewriting the data files, without retaining that nice
append-only feature of most Mercurial IO; but this is a very special
case, so the performance loss is acceptable. Put as many automated
guard dogs around it as you want: somebody is going to risk it, because
it's just that valuable. It's definitely cheaper than
buying a massive SAN for each and every user. ;)
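
As a rough illustration of that premise, here is a toy Python model. It
assumes, as I understand the revlog design, that a revision's ID is a
SHA-1 over its two sorted parent IDs followed by its text. Because a
child's ID depends on the parent's ID rather than the parent's text,
the index entry can act as the stand-in once the text is gone. The
ToyRevlog class and its methods are hypothetical, and the toy stores
full texts, whereas real revlogs store deltas between snapshots, which
is where the hard part would be.

```python
import hashlib

NULLID = b"\0" * 20  # parent ID used when a parent is absent

def node_id(text, p1=NULLID, p2=NULLID):
    """Hash of the sorted parent IDs followed by the revision text."""
    a, b = sorted([p1, p2])
    return hashlib.sha1(a + b + text).digest()

class ToyRevlog:
    """Toy model: an index of node IDs plus a separate store of texts."""

    def __init__(self):
        self.index = []  # list of (node, p1, p2), append-only
        self.data = {}   # node -> full text (real revlogs store deltas)

    def add(self, text, p1=NULLID, p2=NULLID):
        node = node_id(text, p1, p2)
        self.index.append((node, p1, p2))
        self.data[node] = text
        return node

    def drop_data(self, node):
        """Remove the text but keep the index entry as a stand-in hash."""
        del self.data[node]

    def verify(self):
        """Check every revision whose text is still present."""
        checked = skipped = 0
        for node, p1, p2 in self.index:
            text = self.data.get(node)
            if text is None:
                skipped += 1  # only the stand-in hash remains
                continue
            if node_id(text, p1, p2) != node:
                raise ValueError("corrupt revision " + node.hex())
            checked += 1
        return checked, skipped

log = ToyRevlog()
r0 = log.add(b"version 0\n")
r1 = log.add(b"version 1\n", p1=r0)
r2 = log.add(b"version 2\n", p1=r1)
log.drop_data(r1)    # the middle revision's text is gone
print(log.verify())  # (revisions checked, revisions skipped)
```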
    Going further, let's say that this in-place replacement hash data
is always accompanied by another important piece of metadata, in the
revlog index: a *reason* for the missing data. Here are some example
"missing data reasons" (paraphrased for human legibility):

* "I don't have enough disk space, so if you really need this file
version, get it from this other URL: ..."
    Give Mercurial a method to parse this "missing data reason", to
automatically retrieve the data from the remote URL when needed, and
now we've implemented Overlays, and also Externals.

    * Add "I don't want this, personally", to the reason above, and
you even have Partial Clones.

* "My lawyer doesn't like this version of this file. Please forget you
ever saw it."
      That just added History Trimming.

    * Allow the WHOLE revlog data file to be replaced by in-place hash
references, all using this same "missing data reason", in the revlog's
index, from rev 0 to tip. Add a flag to prevent pulling in any more
revisions. The result is a revlog index full of hashes, with empty or
no matching data file. Now you have Obliterate.

 *  "I don't have enough online disk space, so I put this on some
offline storage media, a DVD-R. It's labeled with this text: XXXX1.
Ask john at smith.net for a copy of the media, if you really do need
it."
      That feature would be fairly similar to one used by the
artist-centric AlienBrain SCM. I think they call it "bucketing". Maybe
we call it "offline overlays"? Maybe this "reason" could be combined
with an optional URL, pointing to someone else who is rich enough to
keep it all online, on their giant SAN.
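
To show what such "missing data reason" records might look like, here
is a small Python sketch. Every name in it (Reason, MissingData,
describe, the field names, the example.invalid URL) is invented for
illustration; nothing like this exists in Mercurial as far as I know.
WITHDRAWN is meant to cover both the lawyer case and a fully
obliterated revlog.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Reason(Enum):
    NO_SPACE = "no-space"      # overlay / external: data lives at a URL
    NOT_WANTED = "not-wanted"  # partial clone: this user opted out
    WITHDRAWN = "withdrawn"    # legal or policy removal
    OFFLINE = "offline"        # data moved to labelled offline media

@dataclass(frozen=True)
class MissingData:
    node: str                  # stand-in hash of the removed revision (hex)
    reason: Reason
    url: Optional[str] = None  # where the data can be fetched, if anywhere
    media_label: Optional[str] = None
    contact: Optional[str] = None

def describe(entry):
    """Return a human-readable next step for a missing revision."""
    if entry.url is not None and entry.reason is not Reason.WITHDRAWN:
        return "fetch from " + entry.url
    if entry.reason is Reason.OFFLINE:
        return "ask %s for media %s" % (entry.contact, entry.media_label)
    if entry.reason is Reason.WITHDRAWN:
        return "data withdrawn; not retrievable"
    return "no source recorded"

overlay = MissingData("ab" * 20, Reason.NO_SPACE,
                      url="http://example.invalid/repo")
offline = MissingData("cd" * 20, Reason.OFFLINE,
                      media_label="XXXX1", contact="john at smith.net")
print(describe(overlay))
print(describe(offline))
```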

    Allow other users to parse this "reason metadata" on pull, and
let each decide whether to use the same in-place metadata, for
whatever personal reasons; those don't have to be the same "reason" as
the pull source's. Missing data with URLs in their "reason metadata"
could be interpreted, and pulled from the third-party repository
source, if desired. These choices can be automated, via configuration
settings of some sort, per clone. Or maybe command line options would
be enough. That should really be up to each user, I guess. Defaults
are worth further discussion.
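
One possible shape for such per-clone settings, sketched in Python. The
[missingdata] section, its keys, and the plan_pull function are all
hypothetical; this is not an existing Mercurial configuration, just an
ini-style guess at how a clone could map each "reason" to an action on
pull.

```python
import configparser

# Hypothetical per-clone settings, in an hgrc-like ini format.
HGRC_SKETCH = """
[missingdata]
no-space = fetch
offline = keep
withdrawn = keep
"""

def load_policy(text):
    """Parse the sketch config into a reason -> action mapping."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser["missingdata"])

def plan_pull(entries, policy, default="keep"):
    """Decide, per missing revision, whether to fetch it or keep the stand-in.

    entries is a list of (node, reason, url) tuples read from the source.
    """
    plan = []
    for node, reason, url in entries:
        action = policy.get(reason, default)
        if action == "fetch" and not url:
            action = "keep"  # nowhere to fetch from
        plan.append((node, action, url if action == "fetch" else None))
    return plan

policy = load_policy(HGRC_SKETCH)
entries = [
    ("n1", "no-space", "http://example.invalid/big"),
    ("n2", "offline", None),
    ("n3", "no-space", None),
    ("n4", "not-wanted", "http://example.invalid/x"),
]
for step in plan_pull(entries, policy):
    print(step)
```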

    Does anyone see other huge issues with this? I mean other than the
obvious: difficulties in implementing the "in-place replacement hash"
algorithm, and the destructive revlog data writes. Revlog indices can
still probably be append-only, but would it be desirable to rewrite
them for any reason?

I haven't done any Mercurial coding yet, but this feature would be
exciting enough to get me started.

    Any and all experienced input is greatly appreciated; but I'm not
interested in any religious wars. I obviously disagree with the mantra
"all clones should be forced to store all the same data online". This
proposal is obviously antithetical to that view.
    I actually think each folder in a repository tree should be viewed
as a potential repository in itself, very much like how Subversion
operates. Each "branch" is just another sub-folder, copied from
another point in the same repository tree. Users are allowed to check
out (fetch) any sub-tree they wish, from root all the way down to each
leaf folder. Every in-repository copy is like a hardlink, which only
diverges from the link point when new revisions are added. That
hardlink point is the common base until merge, where the hardlink
re-joins again. I think Mercurial could adopt the same flexible tree
manipulation properties as Subversion, but that's beyond the scope of
this proposal. I don't know enough details about parent folder
relationships in Mercurial (yet) to say if that's possible. I thought
you should know that opinion here, only because it is semi-related to
cherry-picking methods.

    Thanks!
    Jared

