Convert to parentdelta

Sun Aug 22 11:45:43 CDT 2010

On 22 Aug 2010, at 16:39, Matt Mackall wrote:

> On Sun, 2010-08-22 at 13:41 +0200, Dan Villiom Podlaski Christiansen
> wrote:
>> The resulting 00manifest.d sizes:
>>
>> normal:     1342MB
>> parentdelta: 161MB
>> compressed:  161MB
>> shrunk:       31MB
>> compressed8:  19MB
>
> compressed+shrunk might be interesting too.

> Can you send us results for manifest compression between 2x and 8x?

Sure, here's an updated listing:

normal:            1342MB
parentdelta:        161MB
compressed2:        161MB
compressed3:         49MB
compressed4:         32MB
shrunk:              31MB
compressed2-shrunk:  31MB
compressed4-shrunk:  29MB
compressed6:         24MB
shrunk-compressed2:  22MB
compressed8:         19MB

(In the combined names, shrunk and compressed appear in the order the
steps were performed. I only did shrunk-compressed2 in that order, as
compression is *very* CPU intensive.)
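
For a rough sense of scale, the sizes can be turned into reduction
factors relative to the normal revlog. This is only a small sketch using
the figures from the listing above; the dictionary and variable names are
mine, purely for illustration:

```python
# Sizes in MB, copied from the listing above.
sizes_mb = {
    "normal": 1342,
    "parentdelta": 161,
    "compressed2": 161,
    "compressed3": 49,
    "compressed4": 32,
    "shrunk": 31,
    "compressed2-shrunk": 31,
    "compressed4-shrunk": 29,
    "compressed6": 24,
    "shrunk-compressed2": 22,
    "compressed8": 19,
}

# Reduction factor of each variant relative to the unmodified revlog.
baseline = sizes_mb["normal"]
factors = {name: baseline / size for name, size in sizes_mb.items()}

for name, size in sizes_mb.items():
    label = name + ":"
    print(f"{label:<20}{size:>5}MB  {factors[name]:5.1f}x")
```

By this measure parentdelta alone is roughly 8x smaller than normal, and
compressed8 roughly 70x.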

>> As can be seen, the current implementation results in fairly good
>> compression, but with room for improvement. My guess is this is caused
>> by a slight bias against parent deltas in the current code.
>> Specifically, the distance that is used for comparing against the raw
>> text is calculated like this:
>>
>> dist = l + offset - self.start(base)
>>
>> It seems to me that this isn't terribly meaningful for parent deltas.
>> I suspect calculating the actual distance would be somewhat costly, so
>> perhaps it would be better to store the actual distance to base either
>> alongside or instead of the base revision?
>
> Not sure what you mean here, as this is the "actual distance": the
> amount of the disk needing to be read to pull this data in. Perhaps you
> mean "sum of length of deltas we need to read in". That number is less
> interesting - we really want to read this all in with one read request.
> Otherwise, it'll degrade into possibly thousands of blocking
> seek()/read() ops where we'll be waiting on I/O and getting rescheduled
> between each.
>
> That scale factor is important too. Making retrieval of all files take
> four times longer and four times as much memory isn't something we
> should do lightly. But it's worth investigating, especially in the case
> of the manifest.

Ah, I see. I wasn't aware that Mercurial would have to parse all the  
intermediate revisions :)
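
To make the two notions of distance concrete, here is a toy sketch. It
is not Mercurial's revlog code: the Entry class, the helper names and
the sizes are all made up for illustration. It models a data file where
two branches are interleaved, and compares the contiguous span from the
chain base to the end of a revision (what the dist formula above
measures) with the sum of the delta lengths along the parent chain (what
I had in mind):

```python
from dataclasses import dataclass


@dataclass
class Entry:
    offset: int  # byte offset of this revision's data in the data file
    length: int  # stored length in bytes
    parent: int  # revision the delta is against; -1 means a full text


def chain_base(entries, rev):
    """Follow delta parents back to the full text the chain starts from."""
    while entries[rev].parent != -1:
        rev = entries[rev].parent
    return rev


def span_distance(entries, rev):
    """Bytes covered by one contiguous read from the chain base to rev.

    Mirrors the shape of: dist = l + offset - self.start(base)
    """
    entry = entries[rev]
    base = chain_base(entries, rev)
    return entry.length + entry.offset - entries[base].offset


def chain_length_sum(entries, rev):
    """Sum of the stored lengths of only the entries on rev's chain."""
    total = 0
    while rev != -1:
        total += entries[rev].length
        rev = entries[rev].parent
    return total


# Hypothetical layout: rev 0 is a full text, revs 1 and 3 form one
# branch, revs 2 and 4 another, stored in revision order.
entries = [
    Entry(offset=0, length=100, parent=-1),
    Entry(offset=100, length=50, parent=0),
    Entry(offset=150, length=60, parent=0),
    Entry(offset=210, length=40, parent=1),
    Entry(offset=250, length=70, parent=2),
]

for rev in (3, 4):
    print(rev, span_distance(entries, rev), chain_length_sum(entries, rev))
```

For rev 3 the span is 250 bytes while the chain itself only needs 190;
the difference is rev 2, which sits in between on disk but belongs to
the other branch. As Matt points out, the span is the number that
matters if everything is to be fetched with a single read.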

--

Dan Villiom Podlaski Christiansen
danchr at gmail.com
