Handling Large Files

Mercurial's internal file-handling limitations.

1. Mercurial handles files in memory

The primary memory constraint in Mercurial is the size of the largest single file being managed. This is because Mercurial is designed for source code, not bulk data, and it's an order of magnitude faster and easier to handle entire files in memory than to manage them piecewise.

2. Current internal limits on Mercurial file size

Several internal structures store sizes and offsets with signed 32-bit pack/unpack specifiers: the dirstate records file sizes this way, and the revlog index records hunk lengths this way. Together with similar limits in the wire protocol, this imposes 2GB barriers on individual files regardless of platform.

3. Platform limits

A 32-bit process has limited address space (2GB by default on Windows, about 3GB on Linux), and Mercurial regularly needs several copies of a file's contents in memory at once. Thus, 32-bit versions of Mercurial on Windows may run into trouble with single files in the neighborhood of 400MB. 32-bit Linux executables can typically handle files up to around 1GB, given sufficient RAM and swap.

64-bit Mercurial has no such address-space constraints and will instead hit the internal 2GB barriers described above.
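
As an illustration, a quick way to check which set of limits applies first on a given Python build (sys.maxsize reflects the pointer size of the interpreter):

  import sys

  if sys.maxsize > 2**32:
      # 64-bit build: the internal 2GB format barriers bite first
      print("64-bit: internal 2GB barriers apply")
  else:
      # 32-bit build: address space runs out well before 2GB files do
      print("32-bit: address-space limits apply")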

4. Future Directions

With some small changes, the 2GB barriers can probably be pushed back to 4GB. By changing some index and protocol structures, we can push this back to terabytes, but you'll need a corresponding amount of RAM+swap to handle those large files.

To go beyond this limit and handle files much larger than available memory, we would need to do some fairly substantial replumbing of Mercurial's internals. This is desirable for handling extremely large files (video, astronomical data, ASIC design) and reducing requirements for web servers. Possible approaches to handling larger files:

 * mmap the file: this doesn't really help, as we quickly run into a 3GB address-space barrier on 32-bit machines.

 * a magic string-like class that lazily reads from disk: this would require auditing every single use of the string to avoid things like write() that would instantiate the whole string in memory.

 * an iterable of fragments: declare that all file contents are passed around as an iterable (list, tuple, or iterator) of large multi-megabyte string fragments. Every existing user will then break loudly and need replacing with an appropriate loop, which simplifies the audit. The concept can be wrapped in a simple class (see the sketch below), but it can't have any automatic conversion to 'str' type. As a first pass, making everything work with one-element lists should be easy.
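
A minimal sketch of such a wrapper; the class name and helper are hypothetical, not existing Mercurial API:

  class chunkedstring(object):
      """File contents held as a list of multi-megabyte fragments."""

      def __init__(self, chunks):
          self._chunks = list(chunks)

      def __iter__(self):
          return iter(self._chunks)

      def __len__(self):
          # Total size in bytes, computed without joining the fragments.
          return sum(len(c) for c in self._chunks)

      def __str__(self):
          # No automatic conversion: any caller that would instantiate
          # the whole contents in memory breaks loudly instead.
          raise TypeError("chunkedstring must be iterated, not joined")

  def writechunks(fp, contents):
      # Former fp.write(data) call sites become loops over fragments.
      for chunk in contents:
          fp.write(chunk)

As a first pass, existing whole-file strings can simply be wrapped as chunkedstring([data]).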

Fixing up the code:

The mpatch code can be made to work on a window without too much effort, but it may be hard to avoid degrading to O(n²) performance overall as we iterate through the window.

The core delta algorithm could similarly be made to delta corresponding chunks of revisions, or could be extended to support a streaming binary diff.
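
A toy sketch of the chunk-wise idea, using difflib for illustration rather than Mercurial's own bdiff; names and the window size are made up. Note the naive window alignment: an early insertion shifts all later content and defeats matching, which is why a true streaming binary diff would be preferable.

  import difflib

  WINDOW = 4 * 1024 * 1024  # illustrative 4MB windows

  def windoweddelta(old, new, window=WINDOW):
      """Yield (offset, oldlen, replacement) edits, one window at a time."""
      for start in range(0, max(len(old), len(new)), window):
          a = old[start:start + window]
          b = new[start:start + window]
          if a == b:
              continue  # unchanged window: emit nothing
          sm = difflib.SequenceMatcher(None, a, b)
          for tag, i1, i2, j1, j2 in sm.get_opcodes():
              if tag != 'equal':
                  yield (start + i1, i2 - i1, b[j1:j2])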

Changing compression and decompression to work on iterables is trivial. Adjusting most I/O is also trivial. Various operations like annotate will be harder.
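
For example, a sketch of compression and decompression over an iterable of fragments, using zlib's streaming interface (which revlog compression is based on); the function names are hypothetical:

  import zlib

  def compresschunks(chunks):
      """Compress an iterable of byte strings, yielding compressed pieces."""
      z = zlib.compressobj()
      for chunk in chunks:
          data = z.compress(chunk)
          if data:
              yield data
      yield z.flush()

  def decompresschunks(chunks):
      """The inverse: decompress an iterable of compressed pieces."""
      z = zlib.decompressobj()
      for chunk in chunks:
          data = z.decompress(chunk)
          if data:
              yield data
      tail = z.flush()
      if tail:
          yield tail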

Extending dirstate and revlog chunks to 4G means going to unsigned pack/unpack specifiers, which is easy enough. Beyond that, more invasive format changes will be needed.
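
For illustration, the change is a one-character switch in the format string (the exact formats vary by structure):

  import struct

  # '>i' (signed) tops out at 2**31 - 1; '>I' (unsigned) reaches
  # 2**32 - 1 in the same four bytes.
  struct.pack('>i', 2**31 - 1)    # largest value a signed field can hold
  struct.pack('>I', 2**32 - 1)    # unsigned doubles the range for free
  # struct.pack('>i', 2**31)      # raises struct.error: out of range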

If revlog is changed to store the end offset of each hunk, the compressed hunk length needn't be stored. This will let us go to 48-bit uncompressed lengths and 64-bit total revlogs without enlarging the index.
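
A toy illustration of the arithmetic, with made-up offsets: if only end offsets are stored, each compressed length is recovered by subtraction, freeing the bytes previously spent on an explicit length field.

  # Hypothetical end offsets of three hunks in a revlog data file;
  # the leading 0 is a sentinel for "end of hunk -1".
  ends = [0, 1200, 5300, 5900]

  def hunkspan(i):
      """Start offset and compressed length of hunk i, by subtraction."""
      return ends[i], ends[i + 1] - ends[i]

  assert hunkspan(1) == (1200, 4100)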

5. Current Workarounds

A few extensions have been designed specifically for handling big files. The general idea is that the big files are kept outside of the repository, but there is a mechanism to retrieve and snapshot the appropriate version of those files inside your working directory. See:

 * LargefilesExtension (distributed with Mercurial since 2.0)
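
For example, the largefiles extension is enabled in a configuration file, and files are then marked as large when added:

  # in .hg/hgrc or ~/.hgrc:
  [extensions]
  largefiles =

  # then, in the working directory:
  # $ hg add --large bigvideo.avi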


CategoryInternals
