Space savings: cherry picking, partial cloning, overlays, history trimming, offline revisions, and obliterate

Matt Mackall mpm at selenic.com
Sat Mar 22 00:05:22 CDT 2008


On Fri, 2008-03-21 at 18:24 -0700, Jared Hardy wrote:
> On Fri, Mar 21, 2008 at 12:36 PM, Matt Mackall <mpm at selenic.com> wrote:
> >  > Oh yes, we go over 10MB files routinely. Is there any configuration
> >  > setting to set the threshold higher? We would probably bump that up to
> >  > a few hundred MB, in our use case.
> >
> >  It's just a warning, and the number is arbitrary. 10MB files may present
> >  trouble on machines with, say, 64MB of memory, which is not uncommon for
> >  virtual servers.
> 
OK, so I take it this is a calculation based on local system memory:
total or available?

No, the warning happens at 10MB. Always. If you commit on a machine with
4GB and push it to a server with 64MB, you want to know you're breaking
things before the commit actually happens. 10MB was just a number chosen
to be safe across most machines.
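In other words, the check is a fixed constant, not a function of local
memory. A minimal sketch of that kind of threshold warning (the function
name, message text, and 3x memory multiplier here are illustrative
assumptions, not Mercurial's actual internals):

```python
# Fixed threshold: the warning fires at 10MB regardless of local RAM.
LARGE_FILE_WARNING = 10 * 1024 * 1024  # bytes

def check_file_size(path, size, warn):
    """Warn (but do not abort) when a committed file exceeds the
    fixed threshold. 'warn' is any callable taking a message string.
    The 3x factor reflects the rough rule of thumb discussed below:
    two full revisions in memory plus scratch space."""
    if size > LARGE_FILE_WARNING:
        warn("%s: up to %d MB of RAM may be required to manage this file\n"
             % (path, 3 * size // (1024 * 1024)))
        return True
    return False
```

The point of keeping it a constant is exactly the scenario above: a
commit made on a big machine should still warn if it would break a
small server it gets pushed to.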

>  Do you have a rough formula for when this
> warning kicks in, like (revA + revB + scratch-constant > available)?
> Most of our current workstations have 2GB minimum, but Windows uses a
> big chunk of that. If we have financing luck in the next year, some
> stations may have a 2GB Windows-32 VM under Linux-64 4GB native host.
> 
> >  It applies to the actual file. Doing delta storage efficiently means
> >  having two full revisions in memory and a fair amount of scratch space.
> >
> >  http://www.selenic.com/mercurial/wiki/index.cgi/HandlingLargeFiles
> 
> So I take it from this, that our file size is roughly limited to 1GB,
> assuming we have any users with a 32-bit OS around (Win32).

1GB is on the edge of what's possible on a 32-bit machine and assumes
you can back the rest of the 3GB address space with swap (meaning things
get slow at that point). By doing non-delta storage, we could probably
push pretty close to 3GB.
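To make the arithmetic explicit (the scratch-space factor of 1x the file
size is an assumption for illustration; the "two full revisions plus
scratch" rule comes from the earlier message):

```python
GB = 1024 ** 3

def delta_memory_estimate(file_size, scratch_factor=1.0):
    """Rough memory needed to delta a file: two full revisions in
    memory plus scratch space (scratch_factor is an assumed 1x)."""
    return int(file_size * (2 + scratch_factor))

# Usable user address space on a typical 32-bit OS is about 3GB,
# so a 1GB file lands right at the edge:
assert delta_memory_estimate(1 * GB) == 3 * GB
```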

> Changesets
> are also limited to 2GB per rev?

Yes, but only in the sense that your changeset description and your list
of filenames can't exceed 2GB. Hopefully that won't be a problem.

>     Is there anything in the code right now that just skips the whole
> diff/compress/raw size compare step, and just uses raw snapshots, when
> the memory threshold is exceeded? Are data revlogs over the 2GB file
> limit ever split up into multiple *.d (*.1.d, *.2.d, etc.) files?

Individual revlogs can be up to 16TB. Also not a practical limit.

>     If I remember correctly, Subversion uses a window binary diff to
> get past memory size limits. The Wiki above mentions that method, but
> then also mentions simpler streaming or split-then-diff methods. I
> haven't tested this theory, but my knee-jerk reaction is that streamy
> or chunky diffs are fine for streamy or chunky format files, like
> uncompressed tarballs, or raw video. Those are probably the most
> common use case over 1GB. I think 3D files are more like db trees, so
> that kind of binary organization can probably only get optimal diffs
> from sliding window O(n^2) methods, unfortunately. Even raw video file
> diffs require nice boundary definitions, before they approach optimal
> diff. Maybe we can define "diff method hint" entries by file type, in
> a configuration file? That could also be used to bypass size checks on
> pre-compressed file types entirely. I personally think sliding window
> diff methods will give sufficient performance, as long as they are only
> used when the in-memory diff method doesn't have sufficient memory
> available. That should remain the minority of use cases, even in our
> 3D video work.
> 
>     Most of our 3D files are within the 2-digit MB range right now,
> but the largest cinematic animation files can be in the 3-digit MB
> range. Clean separation between scenes and reference mesh instances
> can help keep this down to reasonable sizes. This is bound to increase
> tenfold or more, as our polygon counts rise. Hopefully this won't
> happen until we're all able to fully upgrade to a real 64-bit OS. ;)

I'm interested to hear what kind of performance numbers you get out of
it. Mercurial is primarily tuned for large numbers of small to
medium-sized files.
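For what it's worth, the "chunky" approach from the wiki page can be
sketched in a few lines. This is a naive fixed-window comparison, not
Mercurial's delta code and not Subversion's sliding-window xdelta; the
function and parameter names are made up for illustration. Memory use
stays proportional to the window size rather than the file size, at the
cost of missing deltas that shift data across chunk boundaries:

```python
def windowed_delta(old_path, new_path, window=1 << 20):
    """Compare two files window bytes at a time, recording
    (offset, old_chunk_length, new_chunk) for each chunk that
    differs. Peak memory is O(window), not O(file size)."""
    deltas = []
    with open(old_path, "rb") as fo, open(new_path, "rb") as fn:
        offset = 0
        while True:
            a = fo.read(window)
            b = fn.read(window)
            if not a and not b:
                break
            if a != b:
                deltas.append((offset, len(a), b))
            offset += window
    return deltas
```

As the quoted message notes, this works well for "streamy" formats like
uncompressed tarballs or raw video, and poorly for formats whose
internal layout shifts between revisions.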

-- 
Mathematics is the supreme nostalgia of our time.


