[PATCH 3 of 5] sparse-revlog: implement algorithm to write sparse delta chains (issue5480)

Fri Jul 27 16:24:04 EDT 2018

On 27/07/2018 01:57, Martin von Zweigbergk via Mercurial-devel wrote:
>
>
> On Mon, Jul 16, 2018 at 11:50 AM Boris Feld <boris.feld at octobus.net 
> <mailto:boris.feld at octobus.net>> wrote:
>
>     # HG changeset patch
>     # User Paul Morelle <paul.morelle at octobus.net
>     <mailto:paul.morelle at octobus.net>>
>     # Date 1528179575 -7200
>     #      Tue Jun 05 08:19:35 2018 +0200
>     # Node ID c34ef3def14e04dca76c43667766496a99636b44
>     # Parent  5ae60e5a705ef273d316e33df401e4c44a4c482a
>     # EXP-Topic write-for-sparse-read
>     # Available At https://bitbucket.org/octobus/mercurial-devel/
>     #              hg pull
>     https://bitbucket.org/octobus/mercurial-devel/ -r c34ef3def14e
>     sparse-revlog: implement algorithm to write sparse delta chains
>     (issue5480)
>
>     The classic behavior of revlog._isgooddeltainfo is to consider the
>     span size
>     of the whole delta chain, and limit it to 4 * textlen.
>     Once sparse-revlog writing is allowed (and enforced with a
>     requirement),
>     revlog._isgooddeltainfo considers the span of the largest chunk as the
>     distance used in the verification, instead of using the span of
>     the whole
>     delta chain.
>
>     In order to compute the span of the largest chunk, we need to
>     slice into
>     chunks a chain with the new revision at the top of the revlog, and
>     take the
>     maximal span of these chunks. The sparse read density is a
>     parameter to the
>     slicing, as it will stop when the global read density reaches this
>     threshold.
>     For instance, a density of 50% means that 2 of 4 read bytes are
>     actually used
>     for the reconstruction of the revision (the others are part of
>     other chains).
>
>     This allows a new revision to be potentially stored with a diff
>     against
>     another revision anywhere in the history, instead of forcing it in
>     the last 4
>     * textlen. The result is a much better compression on repositories
>     that have
>     many concurrent branches. Here are a comparison between using
>     deltas from
>     current upstream (aggressive-merge-deltas on by default) and
>     deltas from a
>     sparse-revlog
>
>
> Did you consider simply removing the restriction on span distance? See 
> how FB did it in 
> https://bitbucket.org/facebook/hg-experimental/commits/f60c6d7b9976ce1be85f51883d49b12b13302b27. 
> I tried it on the pypy repo and it shrunk the 00manifest.d a little 
> more (from 329MB with this patch to 282MB with Durham's). I couldn't 
> measure any operations getting slower. Do you have any operations that 
> you would expect would get slower with that patch? I tried setting 
> maxchainlen=1000 and it produced an insignificantly larger 
> 00manifest.d (~1MB), in case you're thinking that the smaller manifest 
> is at the expense of very long chains. Or maybe the difference would 
> mostly be noticed on cold disk? I didn't test that at all.

We came to the same conclusion and sent a series last year to allow 
setting it up with `experimental.maxdeltachainspan=0` (see e9d325cfe071 
for details).

Testing it on several repositories in the wild shown that reading the 
full unlimited span introduces memory consumption issues. The 
sparse-read capabilities and delta strategies are the solutions to these 
issues.

>
>
>     Comparison of `.hg/store/` size:
>
>         mercurial (6.74% merges):
>             before:     46,831,873 bytes
>             after:      46,795,992 bytes (no relevant change)
>         pypy (8.30% merges):
>             before:    333,524,651 bytes
>             after:     308,417,511 bytes -8%
>         netbeans (34.21% merges):
>             before:  1,141,847,554 bytes
>             after:   1,131,093,161 bytes -1%
>         mozilla-central (4.84% merges):
>             before:  2,344,248,850 bytes
>             after:   2,328,459,258 bytes -1%
>         large-private-repo-A (merge 19.73%)
>             before: 41,510,550,163 bytes
>             after:   8,121,763,428 bytes -80%
>         large-private-repo-B (23.77%)
>             before: 58,702,221,709 bytes
>             after:   8,351,588,828 bytes -76%
>
>     Comparison of `00manifest.d` size:
>
>         mercurial (6.74% merges):
>             before:      6,143,044 bytes
>             after:       6,107,163 bytes
>         pypy (8.30% merges):
>             before:     52,941,780 bytes
>             after:      27,834,082 bytes -48%
>         netbeans (34.21% merges):
>             before:    130,088,982 bytes
>             after:     119,337,636 bytes -10%
>         mozilla-central (4.84% merges):
>             before:    215,096,339 bytes
>             after:     199,496,863 bytes -8%
>         large-private-repo-A (merge 19.73%)
>             before: 33,725,285,081 bytes
>             after:     390,302,545 bytes -99%
>         large-private-repo-B (23.77%)
>             before: 49,457,701,645 bytes
>             after:   1,366,752,187 bytes -97%
>
>
>     The better delta chains provide a performance boost in relevant
>     repositories:
>
>         pypy, bundling 1000 revisions:
>             before: 1.670s
>             after:  1.149s -31%
>
>     Unbundling got a bit slower. probably because the sparse algorithm
>     is still
>     pure
>     python.
>
>         pypy, unbundling 1000 revisions:
>             before: 4.062s
>             after:  4.507s +10%
>
>     Performance of bundle/unbundle in repository with few concurrent
>     branches (eg:
>     mercurial) are unaffected.
>
>     No significant differences have been noticed then timing `hg push`
>     and `hg
>     pull` locally. More state timings are being gathered.
>
>     Same as for aggressive-merge-delta, better delta comes with longer
>     delta
>     chains. Longer chains have a performance impact. For example. The
>     length of
>     the chain needed to get the manifest of pypy's tip moves from 82
>     item to 1929
>     items. This moves the restore time from 3.88ms to 11.3ms.
>
>     Delta chain length is an independent issue that affects repository
>     without
>     this changes. It will be dealt with independently.
>
>     No significant differences have been observed on repositories where
>     `sparse-revlog` have not much effect (mercurial, unity, netbeans).
>     On pypy,
>     small differences have been observed on some operation affected by
>     delta chain
>     building and retrieval.
>
>
>         pypy, perfmanifest
>             before: 0.006162s
>             after:  0.017899s +190%
>
>         pypy, commit:
>             before: 0.382
>             after:  0.376 -1%
>
>         pypy, status:
>             before: 0.157
>             after:  0.168 +7%
>
>     More comprehensive and stable timing comparisons are in progress.
>
>     diff --git a/mercurial/revlog.py b/mercurial/revlog.py
>     --- a/mercurial/revlog.py
>     +++ b/mercurial/revlog.py
>     @@ -216,6 +216,9 @@ class _testrevlog(object):
>          def length(self, rev):
>              return self.end(rev) - self.start(rev)
>
>     +    def __len__(self):
>     +        return len(self._data)
>     +
>      def _trimchunk(revlog, revs, startidx, endidx=None):
>          """returns revs[startidx:endidx] without empty trailing revs
>
>     @@ -293,7 +296,7 @@ def _segmentspan(revlog, revs):
>              return 0
>          return revlog.end(revs[-1]) - revlog.start(revs[0])
>
>     -def _slicechunk(revlog, revs, targetsize=None):
>     +def _slicechunk(revlog, revs, deltainfo=None, targetsize=None):
>          """slice revs to reduce the amount of unrelated data to be
>     read from disk.
>
>          ``revs`` is sliced into groups that should be read in one time.
>     @@ -341,20 +344,27 @@ def _slicechunk(revlog, revs, targetsize
>          [[1, 2], [5, 8, 10, 11], [14]]
>
>          Slicing with a maximum chunk size
>     -    >>> list(_slicechunk(revlog, [0, 11, 13, 15], 15))
>     +    >>> list(_slicechunk(revlog, [0, 11, 13, 15], targetsize=15))
>          [[0], [11], [13], [15]]
>     -    >>> list(_slicechunk(revlog, [0, 11, 13, 15], 20))
>     +    >>> list(_slicechunk(revlog, [0, 11, 13, 15], targetsize=20))
>          [[0], [11], [13, 15]]
>          """
>          if targetsize is not None:
>              targetsize = max(targetsize, revlog._srmingapsize)
>     +    # targetsize should not be specified when evaluating delta
>     candidates:
>     +    # * targetsize is used to ensure we stay within specification
>     when reading,
>     +    # * deltainfo is used to pick are good delta chain when writing.
>     +    if not (deltainfo is None or targetsize is None):
>     +        msg = 'cannot use `targetsize` with a `deltainfo`'
>     +        raise error.ProgrammingError(msg)
>          for chunk in _slicechunktodensity(revlog, revs,
>     +                                      deltainfo,
>      revlog._srdensitythreshold,
>      revlog._srmingapsize):
>              for subchunk in _slicechunktosize(revlog, chunk, targetsize):
>                  yield subchunk
>
>     -def _slicechunktosize(revlog, revs, targetsize):
>     +def _slicechunktosize(revlog, revs, targetsize=None):
>          """slice revs to match the target size
>
>          This is intended to be used on chunk that density slicing
>     selected by that
>     @@ -431,12 +441,16 @@ def _slicechunktosize(revlog, revs, targ
>                  endrevidx = idx
>          yield _trimchunk(revlog, revs, startrevidx)
>
>     -def _slicechunktodensity(revlog, revs, targetdensity=0.5,
>     mingapsize=0):
>     +def _slicechunktodensity(revlog, revs, deltainfo=None,
>     targetdensity=0.5,
>     +                         mingapsize=0):
>          """slice revs to reduce the amount of unrelated data to be
>     read from disk.
>
>          ``revs`` is sliced into groups that should be read in one time.
>          Assume that revs are sorted.
>
>     +    ``deltainfo`` is a _deltainfo instance of a revision that we
>     would append
>     +    to the top of the revlog.
>     +
>          The initial chunk is sliced until the overall density
>     (payload/chunks-span
>          ratio) is above `targetdensity`. No gap smaller than
>     `mingapsize` is
>          skipped.
>     @@ -487,13 +501,21 @@ def _slicechunktodensity(revlog, revs, t
>              yield revs
>              return
>
>     -    readdata = deltachainspan = _segmentspan(revlog, revs)
>     +    nextrev = len(revlog)
>     +    nextoffset = revlog.end(nextrev - 1)
>     +
>     +    if deltainfo is None:
>     +        deltachainspan = _segmentspan(revlog, revs)
>     +        chainpayload = sum(length(r) for r in revs)
>     +    else:
>     +        deltachainspan = deltainfo.distance
>     +        chainpayload = deltainfo.compresseddeltalen
>
>          if deltachainspan < mingapsize:
>              yield revs
>              return
>
>     -    chainpayload = sum(length(r) for r in revs)
>     +    readdata = deltachainspan
>
>          if deltachainspan:
>              density = chainpayload / float(deltachainspan)
>     @@ -504,13 +526,21 @@ def _slicechunktodensity(revlog, revs, t
>              yield revs
>              return
>
>     +    if deltainfo is not None:
>     +        revs = list(revs)
>     +        revs.append(nextrev)
>     +
>          # Store the gaps in a heap to have them sorted by decreasing size
>          gapsheap = []
>          heapq.heapify(gapsheap)
>          prevend = None
>          for i, rev in enumerate(revs):
>     -        revstart = start(rev)
>     -        revlen = length(rev)
>     +        if rev < nextrev:
>     +            revstart = start(rev)
>     +            revlen = length(rev)
>     +        else:
>     +            revstart = nextoffset
>     +            revlen = deltainfo.deltalen
>
>              # Skip empty revisions to form larger holes
>              if revlen == 0:
>     @@ -1980,7 +2010,7 @@ class revlog(object):
>              if not self._withsparseread:
>                  slicedchunks = (revs,)
>              else:
>     -            slicedchunks = _slicechunk(self, revs, targetsize)
>     +            slicedchunks = _slicechunk(self, revs,
>     targetsize=targetsize)
>
>              for revschunk in slicedchunks:
>                  firstrev = revschunk[0]
>     @@ -2393,13 +2423,33 @@ class revlog(object):
>              #   deltas we need to apply -- bounding it limits the
>     amount of CPU
>              #   we consume.
>
>     -        distance = deltainfo.distance
>     +        if self._sparserevlog:
>     +            # As sparse-read will be used, we can consider that
>     the distance,
>     +            # instead of being the span of the whole chunk,
>     +            # is the span of the largest read chunk
>     +            base = deltainfo.base
>     +
>     +            if base != nullrev:
>     +                deltachain = self._deltachain(base)[0]
>     +            else:
>     +                deltachain = []
>     +
>     +            chunks = _slicechunk(self, deltachain, deltainfo)
>     +            distance = max(map(lambda revs:_segmentspan(self,
>     revs), chunks))
>     +        else:
>     +            distance = deltainfo.distance
>     +
>              textlen = revinfo.textlen
>              defaultmax = textlen * 4
>              maxdist = self._maxdeltachainspan
>              if not maxdist:
>                  maxdist = distance # ensure the conditional pass
>              maxdist = max(maxdist, defaultmax)
>     +        if self._sparserevlog and maxdist < self._srmingapsize:
>     +            # In multiple place, we are ignoring irrelevant data
>     range below a
>     +            # certain size. Be also apply this tradeoff here and
>     relax span
>     +            # constraint for small enought content.
>     +            maxdist = self._srmingapsize
>              if (distance > maxdist or deltainfo.deltalen > textlen or
>                  deltainfo.compresseddeltalen > textlen * 2 or
>                  (self._maxchainlen and deltainfo.chainlen >
>     self._maxchainlen)):
>     _______________________________________________
>     Mercurial-devel mailing list
>     Mercurial-devel at mercurial-scm.org
>     <mailto:Mercurial-devel at mercurial-scm.org>
>     https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>
>
>
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel at mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.mercurial-scm.org/pipermail/mercurial-devel/attachments/20180727/6c3cdc77/attachment.html>