Space savings: cherry picking, partial cloning, overlays, history trimming, offline revisions, and obliterate

Jared Hardy jaredhardy at gmail.com
Fri Mar 21 20:24:01 CDT 2008


On Fri, Mar 21, 2008 at 12:36 PM, Matt Mackall <mpm at selenic.com> wrote:
>  > Oh yes, we go over 10MB files routinely. Is there any configuration
>  > setting to set the threshold higher? We would probably bump that up to
>  > a few hundred MB, in our use case.
>
>  It's just a warning, and the number is arbitrary. 10MB files may present
>  trouble on machines with, say, 64MB of memory, which is not uncommon for
>  virtual servers.

OK, so I take it this is a calculation based on local system memory.
Is that based on total or available memory? Do you have a rough
formula for when this warning kicks in, something like
(revA + revB + scratch-constant > available)? Most of our current
workstations have 2GB minimum, but Windows uses a big chunk of that.
If we have financing luck in the next year, some stations may have a
2GB Windows-32 VM under a 4GB Linux-64 native host.
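
    To make the question concrete, here is a toy Python sketch of the
kind of check I imagine (purely illustrative, not Mercurial's actual
logic; the 1.5x scratch factor is just my guess):

    def needs_memory_warning(rev_a_size, rev_b_size, available_bytes,
                             scratch_factor=1.5):
        # Delta storage holds two full revisions in memory plus some
        # working space; warn when that estimate exceeds what is free.
        estimated = (rev_a_size + rev_b_size) * scratch_factor
        return estimated > available_bytes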

>  It applies to the actual file. Doing delta storage efficiently means
>  having two full revisions in memory and a fair amount of scratch space.
>
>  http://www.selenic.com/mercurial/wiki/index.cgi/HandlingLargeFiles

So I take it from this that our file size is roughly limited to 1GB,
assuming we have any users on a 32-bit OS (Win32). Are changesets also
limited to 2GB per rev? I guess automating commit splits into <2GB
chunks, maybe via a pre-commit hook script, might help us avoid this
limit.
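
    The checking half of such a hook might look something like this
untested sketch (the API details here are my own guesses; it only
rejects oversized commits and leaves the actual splitting to the
committer):

    # .hgrc:
    #   [hooks]
    #   pretxncommit.maxsize = python:/path/to/sizehook.py:checksize

    MAX_COMMIT_BYTES = 2 * 1024 * 1024 * 1024   # stay under ~2GB

    def checksize(ui, repo, node, **kwargs):
        ctx = repo[node]
        total = 0
        for f in ctx.files():
            if f in ctx:               # skip files removed by this commit
                total += ctx[f].size()
        if total >= MAX_COMMIT_BYTES:
            ui.warn("commit adds %d bytes; please split it into <2GB "
                    "chunks\n" % total)
            return True                # True/non-zero aborts the commit
        return False
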
    Is there anything in the code right now that skips the whole
diff/compress/raw-size comparison step and just stores raw snapshots
when the memory threshold is exceeded? Are data revlogs over the 2GB
file limit ever split up into multiple *.d (*.1.d, *.2.d, etc.) files?

    If I remember correctly, Subversion uses a windowed binary diff to
get past memory size limits. The wiki page above mentions that method,
but also mentions simpler streaming or split-then-diff methods. I
haven't tested this theory, but my knee-jerk reaction is that streamy
or chunky diffs are fine for streamy or chunky file formats, like
uncompressed tarballs or raw video. Those are probably the most common
use cases over 1GB. I think 3D files are organized more like database
trees, so that kind of binary layout can probably only get optimal
diffs from sliding-window O(n^2) methods, unfortunately. Even raw
video diffs need good boundary definitions before they approach an
optimal diff. Maybe we could define "diff method hint" entries by file
type in a configuration file? That could also be used to bypass size
checks on pre-compressed file types entirely. I personally think
sliding-window diff methods will give sufficient performance, as long
as they are only used when the in-memory diff method doesn't have
enough memory available. That should remain the minority of use cases,
even in our 3D video work.
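
    To show what I mean by split-then-diff, here is a toy sketch
(purely illustrative, nothing like Mercurial's real code): it compares
two big files block by block, so only one pair of blocks is in memory
at a time. Insertions that shift data across a block boundary defeat
it, which is why I only expect it to suit streamy or chunky formats.

    import difflib

    BLOCK = 8 * 1024 * 1024    # 8MB blocks, an arbitrary choice

    def blockwise_diff(path_a, path_b, blocksize=BLOCK):
        deltas = []
        with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
            offset = 0
            while True:
                a = fa.read(blocksize)
                b = fb.read(blocksize)
                if not a and not b:
                    break
                if a != b:
                    matcher = difflib.SequenceMatcher(None, a, b)
                    deltas.append((offset, matcher.get_opcodes()))
                offset += blocksize
        return deltas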

    Most of our 3D files are in the two-digit MB range right now, but
the largest cinematic animation files can reach the three-digit MB
range. Clean separation between scenes and reference mesh instances
helps keep this down to reasonable sizes. It is bound to increase
tenfold or more as our polygon counts rise. Hopefully that won't
happen until we're all able to fully upgrade to a real 64-bit OS. ;)

    Thanks!
    Jared

