Experiments with LZMA compression of bundles

Sune Foldager cryo at cyanite.org
Fri May 20 16:36:05 CDT 2011


On 2011-05-20 16:00, Matt Mackall wrote:
>On Fri, 2011-05-20 at 22:50 +0200, Sune Foldager wrote:
>> On 20-05-2011 13:34, Benoit Boissinot wrote:
>> > Yes, experimenting is cool (for fast server side compression, with
>> > better compression level than zlib, snappy/zippy would be interesting
>> > too).
>>
>> Maybe... is it LZ-based? Then maybe liblzma can do the same with proper
>> parameters set. I think the key advantage with lzma despite the
>> increased dictionary sizes etc., is the use of arithmetical compression
>> in the entropy stage (like bzip1 did).
>
>I'd be surprised if that were true. bzip2's alternative to arithmetic
>encoding is pretty decent.

Hm, no you mean bzip2's alternative to Lempel-Ziv, namely the Burrows-Wheeler
transform. After applying that, and some other stuff, bzip2 subsequently
uses standard Huffman encoding:

"bzip2's ancestor bzip used arithmetic coding instead of Huffman.
The change was made because of a software patent restriction."

I f*cking hate software patents!... anyway,

Arithmetic is known to be superior; but there could of course be
other changes from bzip1 to bzip2. As for efficiency, the WP article
sums it up as:

"bzip2 compresses most files more effectively than the older LZW (.Z)
and Deflate (.zip and .gz) compression algorithms, but is considerably
slower. LZMA is generally more efficient than bzip2, while having much
faster decompression."

(both quotes cite sources).

>You should also try running rzip on an uncompressed bundle for
>comparison. It's like bzip with an infinite window.

One of LZMA's advantages is an increased window size (compared to zip),
but I don't know how big. I can try doing a comparison.

...ok, I did a comparison before I managed to hit 'Send':

uncompressed bundle: 939M
gzip: 206M (27.5s)
bzip2: 111M (3m18s)
rzip: 27M (19.7s)
xz: 23M (4m7s)

Amazing job by rzip!... unfortunately, it's not a streaming compressor
as its power comes from its initial long-distance check of redudancy.
But it's quite impressive how quick it is, and at the same time almost
as efficient as lzma (xz).

>Usually when the topic of alternate compressors comes up, I point out
>that adding a new dependency is painful and that having clients create
>bundles that can't be read by other clients is also not great.

Well, I have to agree with that. We might toss in an extension that
adds the compression for on-disk bundles only, and use it internally
at work where people shuffle a lot of them around, but that's of couse
another story.

-Sune


More information about the Mercurial-devel mailing list