[PATCH 3 of 5] sparse-revlog: implement algorithm to write sparse delta chains (issue5480)
Martin von Zweigbergk
martinvonz at google.com
Fri Jul 27 16:37:14 EDT 2018
On Fri, Jul 27, 2018 at 1:24 PM Boris FELD <boris.feld at octobus.net> wrote:
> On 27/07/2018 01:57, Martin von Zweigbergk via Mercurial-devel wrote:
>
>
>
> On Mon, Jul 16, 2018 at 11:50 AM Boris Feld <boris.feld at octobus.net>
> wrote:
>
>> # HG changeset patch
>> # User Paul Morelle <paul.morelle at octobus.net>
>> # Date 1528179575 -7200
>> # Tue Jun 05 08:19:35 2018 +0200
>> # Node ID c34ef3def14e04dca76c43667766496a99636b44
>> # Parent 5ae60e5a705ef273d316e33df401e4c44a4c482a
>> # EXP-Topic write-for-sparse-read
>> # Available At https://bitbucket.org/octobus/mercurial-devel/
>> # hg pull https://bitbucket.org/octobus/mercurial-devel/ -r
>> c34ef3def14e
>> sparse-revlog: implement algorithm to write sparse delta chains
>> (issue5480)
>>
>> The classic behavior of revlog._isgooddeltainfo is to consider the span
>> size of the whole delta chain, and limit it to 4 * textlen.
>> Once sparse-revlog writing is allowed (and enforced with a requirement),
>> revlog._isgooddeltainfo considers the span of the largest chunk as the
>> distance used in the verification, instead of using the span of the
>> whole delta chain.
>>
>> In order to compute the span of the largest chunk, we need to slice
>> into chunks a chain that has the new revision at the top of the revlog,
>> and take the maximal span of these chunks. The sparse read density is a
>> parameter to the slicing, as it will stop when the global read density
>> reaches this threshold. For instance, a density of 50% means that 2 of
>> 4 read bytes are actually used for the reconstruction of the revision
>> (the others are part of other chains).
>>
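For illustration, the density criterion described above can be sketched as follows. This is a simplified toy model, not the actual revlog code; `read_density`, the `start`/`length` accessors, and the toy offsets are all hypothetical stand-ins for the revlog index:

```python
def read_density(revs, start, length):
    """Fraction of a read span that is payload for this chain.

    ``revs`` are sorted revision numbers; ``start(r)``/``length(r)`` give
    each revision's offset and size in the data file. A density of 0.5
    means half of the bytes read are used to reconstruct the revision.
    """
    if not revs:
        return 1.0
    span = start(revs[-1]) + length(revs[-1]) - start(revs[0])
    payload = sum(length(r) for r in revs)
    return payload / float(span) if span else 1.0

# Toy layout: four 10-byte revisions with a large unrelated gap in the middle.
offsets = {0: 0, 1: 10, 2: 100, 3: 110}
sizes = {0: 10, 1: 10, 2: 10, 3: 10}
start, length = offsets.__getitem__, sizes.__getitem__

# Reading [0, 1, 2, 3] in one go spans 120 bytes for 40 bytes of payload,
# so density is ~0.33, below a 50% target...
print(read_density([0, 1, 2, 3], start, length))
# ...while slicing at the gap yields two fully dense chunks.
print(read_density([0, 1], start, length))
print(read_density([2, 3], start, length))
```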
>> This allows a new revision to be potentially stored with a diff against
>> another revision anywhere in the history, instead of forcing it within
>> the last 4 * textlen. The result is a much better compression on
>> repositories that have many concurrent branches. Here is a comparison
>> between using deltas from current upstream (aggressive-merge-deltas on
>> by default) and deltas from a sparse-revlog:
>>
>
> Did you consider simply removing the restriction on span distance? See how
> FB did it in
> https://bitbucket.org/facebook/hg-experimental/commits/f60c6d7b9976ce1be85f51883d49b12b13302b27.
> I tried it on the pypy repo and it shrunk the 00manifest.d a little more
> (from 329MB with this patch to 282MB with Durham's). I couldn't measure any
> operations getting slower. Do you have any operations that you would expect
> would get slower with that patch? I tried setting maxchainlen=1000 and it
> produced an insignificantly larger 00manifest.d (~1MB), in case you're
> thinking that the smaller manifest is at the expense of very long chains.
> Or maybe the difference would mostly be noticed on cold disk? I didn't test
> that at all.
>
> We came to the same conclusion and sent a series last year to allow
> setting it up with `experimental.maxdeltachainspan=0` (see e9d325cfe071 for
> details).
>
Thanks for the pointer.
> Testing it on several repositories in the wild showed that reading the
> full unlimited span introduces memory consumption issues.
>
What operations would use too much memory? Can the effect be seen on some
open-source repo?
> The sparse-read capabilities and delta strategies are the solutions to
> these issues.
>
>
>
>> Comparison of `.hg/store/` size:
>>
>> mercurial (6.74% merges):
>> before: 46,831,873 bytes
>> after: 46,795,992 bytes (no relevant change)
>> pypy (8.30% merges):
>> before: 333,524,651 bytes
>> after: 308,417,511 bytes -8%
>> netbeans (34.21% merges):
>> before: 1,141,847,554 bytes
>> after: 1,131,093,161 bytes -1%
>> mozilla-central (4.84% merges):
>> before: 2,344,248,850 bytes
>> after: 2,328,459,258 bytes -1%
>> large-private-repo-A (merge 19.73%)
>> before: 41,510,550,163 bytes
>> after: 8,121,763,428 bytes -80%
>> large-private-repo-B (23.77%)
>> before: 58,702,221,709 bytes
>> after: 8,351,588,828 bytes -76%
>>
>> Comparison of `00manifest.d` size:
>>
>> mercurial (6.74% merges):
>> before: 6,143,044 bytes
>> after: 6,107,163 bytes
>> pypy (8.30% merges):
>> before: 52,941,780 bytes
>> after: 27,834,082 bytes -48%
>> netbeans (34.21% merges):
>> before: 130,088,982 bytes
>> after: 119,337,636 bytes -10%
>> mozilla-central (4.84% merges):
>> before: 215,096,339 bytes
>> after: 199,496,863 bytes -8%
>> large-private-repo-A (merge 19.73%)
>> before: 33,725,285,081 bytes
>> after: 390,302,545 bytes -99%
>> large-private-repo-B (23.77%)
>> before: 49,457,701,645 bytes
>> after: 1,366,752,187 bytes -97%
>>
>>
>> The better delta chains provide a performance boost in relevant
>> repositories:
>>
>> pypy, bundling 1000 revisions:
>> before: 1.670s
>> after: 1.149s -31%
>>
>> Unbundling got a bit slower, probably because the sparse algorithm is
>> still pure Python.
>>
>> pypy, unbundling 1000 revisions:
>> before: 4.062s
>> after: 4.507s +10%
>>
>> Performance of bundle/unbundle in repositories with few concurrent
>> branches (e.g. mercurial) is unaffected.
>>
>> No significant differences have been noticed when timing `hg push` and
>> `hg pull` locally. More stable timings are being gathered.
>>
>> Same as for aggressive-merge-delta, better deltas come with longer
>> delta chains. Longer chains have a performance impact. For example, the
>> length of the chain needed to get the manifest of pypy's tip moves from
>> 82 items to 1929 items. This moves the restore time from 3.88ms to
>> 11.3ms.
>>
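The restore-time cost of longer chains mentioned above can be illustrated with a toy model. This is not Mercurial's actual delta format or API; `restore` and the `(offset, oldlen, new)` delta triples are hypothetical, but they show why reconstruction cost grows with chain length: every delta in the chain must be applied in order on top of the base text.

```python
def restore(fulltext, deltas):
    """Rebuild a text by applying a chain of (offset, oldlen, new) patches.

    Each delta replaces ``oldlen`` bytes at ``offset`` in the current
    text; a longer chain means more sequential passes over the text.
    """
    text = fulltext
    for offset, oldlen, new in deltas:
        text = text[:offset] + new + text[offset + oldlen:]
    return text

base = "the quick brown fox"
chain = [
    (4, 5, "slow"),    # replaces "quick"
    (9, 5, "green"),   # replaces "brown" (offsets apply to the patched text)
]
print(restore(base, chain))  # the slow green fox
```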
>> Delta chain length is an independent issue that affects repositories
>> without this change. It will be dealt with independently.
>>
>> No significant differences have been observed on repositories where
>> `sparse-revlog` has little effect (mercurial, unity, netbeans). On
>> pypy, small differences have been observed on some operations affected
>> by delta chain building and retrieval.
>>
>>
>> pypy, perfmanifest
>> before: 0.006162s
>> after: 0.017899s +190%
>>
>> pypy, commit:
>> before: 0.382
>> after: 0.376 -1%
>>
>> pypy, status:
>> before: 0.157
>> after: 0.168 +7%
>>
>> More comprehensive and stable timing comparisons are in progress.
>>
>> diff --git a/mercurial/revlog.py b/mercurial/revlog.py
>> --- a/mercurial/revlog.py
>> +++ b/mercurial/revlog.py
>> @@ -216,6 +216,9 @@ class _testrevlog(object):
>> def length(self, rev):
>> return self.end(rev) - self.start(rev)
>>
>> + def __len__(self):
>> + return len(self._data)
>> +
>> def _trimchunk(revlog, revs, startidx, endidx=None):
>> """returns revs[startidx:endidx] without empty trailing revs
>>
>> @@ -293,7 +296,7 @@ def _segmentspan(revlog, revs):
>> return 0
>> return revlog.end(revs[-1]) - revlog.start(revs[0])
>>
>> -def _slicechunk(revlog, revs, targetsize=None):
>> +def _slicechunk(revlog, revs, deltainfo=None, targetsize=None):
>> """slice revs to reduce the amount of unrelated data to be read from
>> disk.
>>
>> ``revs`` is sliced into groups that should be read in one time.
>> @@ -341,20 +344,27 @@ def _slicechunk(revlog, revs, targetsize
>> [[1, 2], [5, 8, 10, 11], [14]]
>>
>> Slicing with a maximum chunk size
>> - >>> list(_slicechunk(revlog, [0, 11, 13, 15], 15))
>> + >>> list(_slicechunk(revlog, [0, 11, 13, 15], targetsize=15))
>> [[0], [11], [13], [15]]
>> - >>> list(_slicechunk(revlog, [0, 11, 13, 15], 20))
>> + >>> list(_slicechunk(revlog, [0, 11, 13, 15], targetsize=20))
>> [[0], [11], [13, 15]]
>> """
>> if targetsize is not None:
>> targetsize = max(targetsize, revlog._srmingapsize)
>> + # targetsize should not be specified when evaluating delta candidates:
>> + # * targetsize is used to ensure we stay within specification when reading,
>> + # * deltainfo is used to pick a good delta chain when writing.
>> + if not (deltainfo is None or targetsize is None):
>> + msg = 'cannot use `targetsize` with a `deltainfo`'
>> + raise error.ProgrammingError(msg)
>> for chunk in _slicechunktodensity(revlog, revs,
>> + deltainfo,
>> revlog._srdensitythreshold,
>> revlog._srmingapsize):
>> for subchunk in _slicechunktosize(revlog, chunk, targetsize):
>> yield subchunk
>>
>> -def _slicechunktosize(revlog, revs, targetsize):
>> +def _slicechunktosize(revlog, revs, targetsize=None):
>> """slice revs to match the target size
>>
>> This is intended to be used on chunk that density slicing selected by that
>> @@ -431,12 +441,16 @@ def _slicechunktosize(revlog, revs, targ
>> endrevidx = idx
>> yield _trimchunk(revlog, revs, startrevidx)
>>
>> -def _slicechunktodensity(revlog, revs, targetdensity=0.5, mingapsize=0):
>> +def _slicechunktodensity(revlog, revs, deltainfo=None, targetdensity=0.5,
>> + mingapsize=0):
>> """slice revs to reduce the amount of unrelated data to be read from
>> disk.
>>
>> ``revs`` is sliced into groups that should be read in one time.
>> Assume that revs are sorted.
>>
>> + ``deltainfo`` is a _deltainfo instance of a revision that we would append
>> + to the top of the revlog.
>> +
>> The initial chunk is sliced until the overall density
>> (payload/chunks-span ratio) is above `targetdensity`. No gap smaller
>> than `mingapsize` is skipped.
>> @@ -487,13 +501,21 @@ def _slicechunktodensity(revlog, revs, t
>> yield revs
>> return
>>
>> - readdata = deltachainspan = _segmentspan(revlog, revs)
>> + nextrev = len(revlog)
>> + nextoffset = revlog.end(nextrev - 1)
>> +
>> + if deltainfo is None:
>> + deltachainspan = _segmentspan(revlog, revs)
>> + chainpayload = sum(length(r) for r in revs)
>> + else:
>> + deltachainspan = deltainfo.distance
>> + chainpayload = deltainfo.compresseddeltalen
>>
>> if deltachainspan < mingapsize:
>> yield revs
>> return
>>
>> - chainpayload = sum(length(r) for r in revs)
>> + readdata = deltachainspan
>>
>> if deltachainspan:
>> density = chainpayload / float(deltachainspan)
>> @@ -504,13 +526,21 @@ def _slicechunktodensity(revlog, revs, t
>> yield revs
>> return
>>
>> + if deltainfo is not None:
>> + revs = list(revs)
>> + revs.append(nextrev)
>> +
>> # Store the gaps in a heap to have them sorted by decreasing size
>> gapsheap = []
>> heapq.heapify(gapsheap)
>> prevend = None
>> for i, rev in enumerate(revs):
>> - revstart = start(rev)
>> - revlen = length(rev)
>> + if rev < nextrev:
>> + revstart = start(rev)
>> + revlen = length(rev)
>> + else:
>> + revstart = nextoffset
>> + revlen = deltainfo.deltalen
>>
>> # Skip empty revisions to form larger holes
>> if revlen == 0:
>> @@ -1980,7 +2010,7 @@ class revlog(object):
>> if not self._withsparseread:
>> slicedchunks = (revs,)
>> else:
>> - slicedchunks = _slicechunk(self, revs, targetsize)
>> + slicedchunks = _slicechunk(self, revs, targetsize=targetsize)
>>
>> for revschunk in slicedchunks:
>> firstrev = revschunk[0]
>> @@ -2393,13 +2423,33 @@ class revlog(object):
>> # deltas we need to apply -- bounding it limits the amount of CPU
>> # we consume.
>>
>> - distance = deltainfo.distance
>> + if self._sparserevlog:
>> + # As sparse-read will be used, we can consider that the distance,
>> + # instead of being the span of the whole chunk,
>> + # is the span of the largest read chunk
>> + base = deltainfo.base
>> +
>> + if base != nullrev:
>> + deltachain = self._deltachain(base)[0]
>> + else:
>> + deltachain = []
>> +
>> + chunks = _slicechunk(self, deltachain, deltainfo)
>> + distance = max(map(lambda revs:_segmentspan(self, revs),
>> chunks))
>> + else:
>> + distance = deltainfo.distance
>> +
>> textlen = revinfo.textlen
>> defaultmax = textlen * 4
>> maxdist = self._maxdeltachainspan
>> if not maxdist:
>> maxdist = distance # ensure the conditional pass
>> maxdist = max(maxdist, defaultmax)
>> + if self._sparserevlog and maxdist < self._srmingapsize:
>> + # In multiple places, we are ignoring irrelevant data ranges
>> + # below a certain size. We also apply this tradeoff here and
>> + # relax the span constraint for small enough content.
>> + maxdist = self._srmingapsize
>> if (distance > maxdist or deltainfo.deltalen > textlen or
>> deltainfo.compresseddeltalen > textlen * 2 or
>> (self._maxchainlen and deltainfo.chainlen > self._maxchainlen)):
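The core of the hunk above can be paraphrased as a sketch. These are simplified stand-ins for `_slicechunk`/`_segmentspan`, not the actual revlog API; the toy offsets are hypothetical. The point is that the sparse check judges a delta by its largest single read, not by the span of the whole chain:

```python
def segment_span(chunk, start, length):
    """On-disk span covered by one sorted chunk of revisions."""
    if not chunk:
        return 0
    return start(chunk[-1]) + length(chunk[-1]) - start(chunk[0])

def sparse_distance(chunks, start, length):
    """Distance used by the sparse check: the span of the largest
    read chunk, rather than the span of the whole delta chain."""
    return max(segment_span(c, start, length) for c in chunks)

offsets = {0: 0, 1: 10, 2: 1000, 3: 1010}
sizes = {0: 10, 1: 10, 2: 10, 3: 10}
start, length = offsets.__getitem__, sizes.__getitem__

# The whole chain spans 1020 bytes, but after density slicing each
# chunk only spans 20 bytes, so that is the distance being bounded.
print(segment_span([0, 1, 2, 3], start, length))          # 1020
print(sparse_distance([[0, 1], [2, 3]], start, length))   # 20
```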
>> _______________________________________________
>> Mercurial-devel mailing list
>> Mercurial-devel at mercurial-scm.org
>> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>>
>
>