Native support for lz4?

Gregory Szorc gregory.szorc at gmail.com
Fri Aug 5 17:03:27 EDT 2016


On Fri, Aug 5, 2016 at 12:20 PM, Siddharth Agarwal <sid at less-broken.com>
wrote:

> On 8/5/16 12:09, Augie Fackler wrote:
>
>> On Fri, Aug 05, 2016 at 10:48:03AM -0700, Gregory Szorc wrote:
>>
>>> Facebook introduced an lz4revlog extension a while ago. I think lz4 has
>>> some compelling performance advantages over zlib for revlog storage and
>>> wire protocol compression.
>>>
>>> I'd like to start a discussion about bundling the lz4 C implementation as
>>> part of the Mercurial distribution and supporting lz4 for revlogs and
>>> wire protocol compression out of the box.
>>>
>>> I'm not proposing requiring lz4 or making lz4 the default. I mostly care
>>> about making lz4 accessible to more users. (The 3rd party lz4revlog
>>> extension is difficult to use because you need a separate Python package
>>> providing lz4 support. Plus, lz4revlog isn't using the proper lz4 framing
>>> encoding and I'm hesitant to recommend its use because of this.)
>>>
>> Yes, we should definitely not use the existing python-lz4 in hg itself
>> - the one-off framing format makes me sad.
>>
>
> Agreed.
>
>
>>> I'd also entertain scope bloating the conversation to include other
>>> compression formats. Once you support 2, you need to support N, right?
>>> I've been taking an interest in zstd and I'd be curious if Facebook or
>>> others have any plans to add support to Mercurial.
>>>
>> I've been meaning to at least squint at this, but lack the round
>> tuits. I'm definitely open to this line of inquiry in general,
>> including the idea of bundling lz4 or adding better hooks for it in
>> core.
>>
>
> We may want to wait for zstd. It's just plain better than gzip on every
> axis, and from what I gather it's *extremely* close to being ready.
>
> I agree that going from 1->2 is harder than going from 2->N, but we really
> must avoid recompressing on pulls. It's not clear to me how that would work
> in a world where users can pick between Mercurial repositories compressed
> with any of lz4, zstd or gzip.



Yes, avoiding excessive decompression/compression on the server would be
important. But consider how poorly we currently do things.

Today, when you request a bundle from the server, the server first obtains a
changegroup. The changegroup contains a series of mdiff.textdiff results for
all the changelog, manifest, and filelog data. These are obtained by
decompressing full text revisions from the revlog and generating a new
mdiff (there is no fastpath to reuse deltas from the revlogs AFAICT). The
changegroup is stuffed into a "bundle" container. Over HTTP, the resulting
stream of bits gets zlib compressed by the protocol layer; over SSH, it is
sent uncompressed (we defer compression to SSH itself). So, we're already
incurring a zlib decompress + compress for bundle retrieval on the server
today. We could certainly optimize this, but doing {lz4, zstd, zlib} ->
{lz4, zstd, zlib} on the server in the future would be no worse than zlib
-> zlib today.
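
To make the shape of that argument concrete, here is a minimal sketch of the
decompress-then-recompress path. It is not Mercurial's actual code: the
function name serve_bundle and the CODECS table are hypothetical, delta
generation is omitted, and bz2 and lzma stand in for lz4 and zstd because
those two are not in the Python standard library.

```python
import bz2
import lzma
import zlib

# Hypothetical codec table: name -> (compress, decompress).
# bz2 and lzma are stand-ins for lz4 and zstd (assumption for illustration).
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def serve_bundle(stored_chunks, store_codec, wire_codec):
    """Decompress stored chunks, then recompress the stream for the wire."""
    _, decompress = CODECS[store_codec]
    compress, _ = CODECS[wire_codec]
    # Pass 1: every stored revision is decompressed to its full text.
    raw = b"".join(decompress(chunk) for chunk in stored_chunks)
    # Pass 2: the assembled payload is compressed again for transport.
    return compress(raw)


revisions = [b"line one\n" * 50, b"line two\n" * 50]
store = [zlib.compress(r) for r in revisions]

# zlib -> zlib (roughly today's HTTP case) and zlib -> another codec both
# cost one decompress pass plus one compress pass.
same = serve_bundle(store, "zlib", "zlib")
mixed = serve_bundle(store, "zlib", "lzma")
assert zlib.decompress(same) == lzma.decompress(mixed) == b"".join(revisions)
```

Either way the server performs the same two passes; only the codec used in
each pass changes.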

If you want to trade disk space for CPU time, we could potentially run
side-by-side stores with N representations of data in different compression
formats. I think Durham's proposed generic store API could facilitate that.
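
As a rough sketch of the side-by-side idea (not Durham's proposed API, which
I am not reproducing here), a store could compress each revision once per
format at write time and hand back the precompressed bytes at read time. The
class and method names below are hypothetical, and bz2 and lzma again stand
in for lz4 and zstd.

```python
import bz2
import lzma
import zlib

# Stand-in compressors (assumption: lz4/zstd are not in the stdlib).
COMPRESSORS = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


class SideBySideStore:
    """Hypothetical store keeping one compressed copy per configured format."""

    def __init__(self, formats):
        self.formats = tuple(formats)
        self._data = {fmt: {} for fmt in self.formats}

    def add(self, key, text):
        # Pay the CPU cost once, at write time, for every format.
        for fmt in self.formats:
            self._data[fmt][key] = COMPRESSORS[fmt](text)

    def get_compressed(self, key, fmt):
        # Serving is a lookup; no decompress or recompress is needed.
        return self._data[fmt][key]

    def disk_usage(self):
        # Bytes held per format: the disk space traded for CPU time.
        return {
            fmt: sum(len(blob) for blob in blobs.values())
            for fmt, blobs in self._data.items()
        }


store = SideBySideStore(["zlib", "lzma"])
store.add("rev0", b"some file content\n" * 100)
wire_bytes = store.get_compressed("rev0", "lzma")
assert lzma.decompress(wire_bytes) == b"some file content\n" * 100
```

The cost is that total storage grows with the number of formats, which is
exactly the disk-for-CPU trade mentioned above.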