MemoryError during commit

Matt Mackall mpm at selenic.com
Thu May 31 10:01:51 CDT 2007


On Thu, May 31, 2007 at 12:36:57PM +0100, Robert Allen wrote:
> Hello,
> Please find attached a log of a MemoryError I encountered during a
> commit operation. (Just a simple "hg ci".) I was adding three binary
> assets to the repository: an 887MB file, a 134MB file and a 2.9MB file.
> Also, I was committing changes to some small text files. The output from hg st
> looks like:
> 
> M source\InternalUtilities\Profiler.cpp
> M source\unittest_harness\TrialCode.cpp
> A source\testsuite\<LargestBinaryFile>
> A source\testsuite\<LargeBinaryFile>
> A source\testsuite\<SmallBinaryFile>
> 
> Whilst it is a lot of data, I've got 2GB of memory and plenty of hard
> disk space/virtual memory available, so I wouldn't expect it to fail.
> Anyway, hope this helps.

I'm a little surprised it didn't work too. But it's perhaps for the
best, as your next commit would certainly have broken. Checking in
a delta to a ~1G file basically can't be done on a 32-bit machine.
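
Rough arithmetic, using the 887MB file from your log (the 2-3x factor
and the address-space figure are back-of-envelope assumptions, not
measurements):

    # Even a conservative 2-3x working set blows past the roughly
    # 2-3GB of address space a 32-bit process gets (depending on OS),
    # before any interpreter or zlib overhead.
    filesize = 887 * 1024 * 1024
    print(filesize * 3 / (1024.0 ** 3))   # ~2.6GB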

I just did a test of checking in a 10MB file with 1M lines while
measuring memory usage:

Adding a file: 2.5MB RSS, <1s runtime
Initial check-in: 19.8MB RSS, ~3s runtime
Checking in a minor change: 73.1MB RSS, ~90s runtime
Checkout: 19.8MB RSS, <1s runtime
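
Something along these lines is enough to reproduce the shape of those
numbers (a sketch, not the exact script; the file name and contents
are arbitrary, and RSS can be watched from outside with ps or top
while hg runs):

    import subprocess

    f = open("bigfile.txt", "w")
    for i in range(1000000):                  # ~1M short lines, roughly 10MB
        f.write("line %d\n" % i)
    f.close()

    subprocess.call(["hg", "add", "bigfile.txt"])
    subprocess.call(["hg", "commit", "-m", "add big file"])     # ~2x case

    f = open("bigfile.txt", "a")
    f.write("one more line\n")                # a minor change
    f.close()
    subprocess.call(["hg", "commit", "-m", "minor change"])     # ~7x, ~90s case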

An initial check-in of a large file should take up to 2x the file
size in memory because we're compressing it. I was expecting checking
in a change to take 4-6x, but it turns out to be about 7x in this
case (73.1MB of RSS for a 10MB file).

Why so much? Efficient delta algorithms basically all need the
entirety of both versions in memory. Then we also need a list of lines
for each (at a million lines, that can be a lot), and a hash table for
looking up matching lines, and enough space to hold the resulting
delta.
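
To make that concrete, here's what the working set looks like in a
toy line-based differ. This uses difflib purely as a stand-in --
Mercurial's own bdiff code is different -- but it has to keep the
same kinds of things resident:

    import difflib

    old = open("bigfile.txt.orig").read()   # full old text in memory (1x)
    new = open("bigfile.txt").read()        # full new text in memory (1x)
    a = old.splitlines(True)                # line lists: per-line objects
    b = new.splitlines(True)                #   add overhead at 1M lines

    sm = difflib.SequenceMatcher(None, a, b)   # hash table of b's lines
    delta = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != "equal":
            delta.append((i1, i2, "".join(b[j1:j2])))   # delta held too
    print(len(delta))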

We could probably shrink this, but we have other problems: the delta
algorithm we use, like all line-oriented algorithms, is O(n^2), which
is why this took 90 seconds. Checking in a delta to a 100MB file could
have taken hours. On small files, of course, the cost is negligible.
We could use an O(n) delta algorithm on large files instead, but that
hasn't been a priority.
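
The "hours" figure is just the quadratic back-of-envelope, assuming
the line count scales with file size and nothing else changes:

    small_time = 90.0             # seconds for the 10MB / 1M-line case above
    scale = 100.0 / 10.0          # a 100MB file has ~10x the lines
    print(scale ** 2 * small_time / 3600)   # ~2.5 hours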

To get down below 1x so that people can commit 1G and 10G files, we'd
need to rewrite most of the system (which assumes files can fit in
memory). And that would mean making it much slower for everyone,
which, I'm afraid, is not what most people want.

Here's an ugly workaround: break the file into a series of <10MB
chunks and use a makefile to reassemble it.
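
Something along these lines would do for the splitting side (names
are made up; the reassembly could just as well be a makefile rule
running cat):

    CHUNK = 8 * 1024 * 1024              # stay comfortably under 10MB

    def split(path):
        src = open(path, "rb")
        n = 0
        while True:
            data = src.read(CHUNK)
            if not data:
                break
            out = open("%s.%03d" % (path, n), "wb")
            out.write(data)
            out.close()
            n += 1
        src.close()
        return n                         # number of chunks written

    def join(path, count):
        dst = open(path, "wb")
        for n in range(count):
            dst.write(open("%s.%03d" % (path, n), "rb").read())
        dst.close()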

-- 
Mathematics is the supreme nostalgia of our time.

