[PATCH 3 of 5] sparse-revlog: implement algorithm to write sparse delta chains (issue5480)

Martin von Zweigbergk martinvonz at google.com
Thu Jul 26 19:57:36 EDT 2018


On Mon, Jul 16, 2018 at 11:50 AM Boris Feld <boris.feld at octobus.net> wrote:

> # HG changeset patch
> # User Paul Morelle <paul.morelle at octobus.net>
> # Date 1528179575 -7200
> #      Tue Jun 05 08:19:35 2018 +0200
> # Node ID c34ef3def14e04dca76c43667766496a99636b44
> # Parent  5ae60e5a705ef273d316e33df401e4c44a4c482a
> # EXP-Topic write-for-sparse-read
> # Available At https://bitbucket.org/octobus/mercurial-devel/
> #              hg pull https://bitbucket.org/octobus/mercurial-devel/ -r c34ef3def14e
> sparse-revlog: implement algorithm to write sparse delta chains (issue5480)
>
> The classic behavior of revlog._isgooddeltainfo is to consider the span
> size of the whole delta chain, and limit it to 4 * textlen. Once
> sparse-revlog writing is allowed (and enforced with a requirement),
> revlog._isgooddeltainfo considers the span of the largest chunk as the
> distance used in the verification, instead of the span of the whole
> delta chain.
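>
> In rough pseudocode (`sparsedistance` is a hypothetical name used only
> for illustration; the helpers are the ones touched by the patch below,
> and the next paragraph explains the slicing step):
>
>     def sparsedistance(revlog, deltainfo):
>         # delta chain the candidate delta would extend
>         if deltainfo.base != nullrev:
>             deltachain = revlog._deltachain(deltainfo.base)[0]
>         else:
>             deltachain = []
>         # slice the chain as sparse-read would, candidate included
>         chunks = _slicechunk(revlog, deltachain, deltainfo=deltainfo)
>         # the distance we verify is the span of the largest read chunk
>         return max(_segmentspan(revlog, revs) for revs in chunks)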
>
> In order to compute the span of the largest chunk, we need to slice a
> chain, with the new revision at the top of the revlog, into chunks, and
> take the maximal span of these chunks. The sparse read density is a
> parameter to the slicing: it stops once the global read density reaches
> this threshold. For instance, a density of 50% means that 2 of 4 read
> bytes are actually used for the reconstruction of the revision (the
> others are part of other chains).
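>
> As a self-contained toy model of that criterion (`slicebydensity` and
> its (start, length) pairs are invented for this example; the real code
> works on revision numbers and a heap of gaps):
>
>     def slicebydensity(segments, targetdensity=0.5):
>         """Split sorted (start, length) pairs at the widest gap until
>         each slice is dense enough to be read in one go."""
>         span = segments[-1][0] + segments[-1][1] - segments[0][0]
>         payload = sum(length for _, length in segments)
>         if span == 0 or payload / float(span) >= targetdensity:
>             return [segments]
>         # find the widest gap between consecutive segments and cut there
>         gaps = [(segments[i + 1][0] - (segments[i][0] + segments[i][1]), i)
>                 for i in range(len(segments) - 1)]
>         _, i = max(gaps)
>         return (slicebydensity(segments[:i + 1], targetdensity) +
>                 slicebydensity(segments[i + 1:], targetdensity))
>
>     # Reading 10 bytes at offset 0 and 10 bytes at offset 40 is only
>     # 20/50 = 40% dense, below the 50% threshold, so it is split in two:
>     print(slicebydensity([(0, 10), (40, 10)]))
>     # -> [[(0, 10)], [(40, 10)]]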
>
> This allows a new revision to be potentially stored with a diff against
> another revision anywhere in the history, instead of forcing it within
> the last 4 * textlen. The result is much better compression on
> repositories that have many concurrent branches. Here is a comparison
> between deltas from current upstream (aggressive-merge-deltas on by
> default) and deltas from a sparse-revlog:
>

Did you consider simply removing the restriction on span distance? See how
FB did it in
https://bitbucket.org/facebook/hg-experimental/commits/f60c6d7b9976ce1be85f51883d49b12b13302b27.
I tried it on the pypy repo and it shrunk the 00manifest.d a little more
(from 329MB with this patch to 282MB with Durham's). I couldn't measure any
operations getting slower. Do you have any operations that you would expect
would get slower with that patch? I tried setting maxchainlen=1000 and it
produced an insignificantly larger 00manifest.d (~1MB), in case you're
thinking that the smaller manifest is at the expense of very long chains.
Or maybe the difference would mostly be noticed on cold disk? I didn't test
that at all.


> Comparison of `.hg/store/` size:
>
>     mercurial (6.74% merges):
>         before:     46,831,873 bytes
>         after:      46,795,992 bytes (no relevant change)
>     pypy (8.30% merges):
>         before:    333,524,651 bytes
>         after:     308,417,511 bytes -8%
>     netbeans (34.21% merges):
>         before:  1,141,847,554 bytes
>         after:   1,131,093,161 bytes -1%
>     mozilla-central (4.84% merges):
>         before:  2,344,248,850 bytes
>         after:   2,328,459,258 bytes -1%
>     large-private-repo-A (19.73% merges):
>         before: 41,510,550,163 bytes
>         after:   8,121,763,428 bytes -80%
>     large-private-repo-B (23.77% merges):
>         before: 58,702,221,709 bytes
>         after:   8,351,588,828 bytes -76%
>
> Comparison of `00manifest.d` size:
>
>     mercurial (6.74% merges):
>         before:      6,143,044 bytes
>         after:       6,107,163 bytes
>     pypy (8.30% merges):
>         before:     52,941,780 bytes
>         after:      27,834,082 bytes -48%
>     netbeans (34.21% merges):
>         before:    130,088,982 bytes
>         after:     119,337,636 bytes -10%
>     mozilla-central (4.84% merges):
>         before:    215,096,339 bytes
>         after:     199,496,863 bytes -8%
>     large-private-repo-A (19.73% merges):
>         before: 33,725,285,081 bytes
>         after:     390,302,545 bytes -99%
>     large-private-repo-B (23.77% merges):
>         before: 49,457,701,645 bytes
>         after:   1,366,752,187 bytes -97%
>
>
> The better delta chains provide a performance boost in relevant
> repositories:
>
>     pypy, bundling 1000 revisions:
>         before: 1.670s
>         after:  1.149s -31%
>
> Unbundling got a bit slower, probably because the sparse algorithm is
> still pure Python.
>
>     pypy, unbundling 1000 revisions:
>         before: 4.062s
>         after:  4.507s +10%
>
> Performance of bundle/unbundle in repositories with few concurrent
> branches (e.g. mercurial) is unaffected.
>
> No significant differences have been noticed when timing `hg push` and
> `hg pull` locally. More stable timings are being gathered.
>
> As with aggressive-merge-delta, better deltas come with longer delta
> chains, and longer chains have a performance impact. For example, the
> length of the chain needed to get the manifest of pypy's tip moves from
> 82 items to 1929 items. This moves the restore time from 3.88ms to
> 11.3ms.
>
> Delta chain length is an independent issue that affects repositories
> without this change. It will be dealt with separately.
>
> No significant differences have been observed on repositories where
> `sparse-revlog` has little effect (mercurial, unity, netbeans). On pypy,
> small differences have been observed on some operations affected by
> delta chain building and retrieval.
>
>
>     pypy, perfmanifest
>         before: 0.006162s
>         after:  0.017899s +190%
>
>     pypy, commit:
>         before: 0.382s
>         after:  0.376s -1%
>
>     pypy, status:
>         before: 0.157s
>         after:  0.168s +7%
>
> More comprehensive and stable timing comparisons are in progress.
>
> diff --git a/mercurial/revlog.py b/mercurial/revlog.py
> --- a/mercurial/revlog.py
> +++ b/mercurial/revlog.py
> @@ -216,6 +216,9 @@ class _testrevlog(object):
>      def length(self, rev):
>          return self.end(rev) - self.start(rev)
>
> +    def __len__(self):
> +        return len(self._data)
> +
>  def _trimchunk(revlog, revs, startidx, endidx=None):
>      """returns revs[startidx:endidx] without empty trailing revs
>
> @@ -293,7 +296,7 @@ def _segmentspan(revlog, revs):
>          return 0
>      return revlog.end(revs[-1]) - revlog.start(revs[0])
>
> -def _slicechunk(revlog, revs, targetsize=None):
> +def _slicechunk(revlog, revs, deltainfo=None, targetsize=None):
>      """slice revs to reduce the amount of unrelated data to be read from
> disk.
>
>      ``revs`` is sliced into groups that should be read in one time.
> @@ -341,20 +344,27 @@ def _slicechunk(revlog, revs, targetsize
>      [[1, 2], [5, 8, 10, 11], [14]]
>
>      Slicing with a maximum chunk size
> -    >>> list(_slicechunk(revlog, [0, 11, 13, 15], 15))
> +    >>> list(_slicechunk(revlog, [0, 11, 13, 15], targetsize=15))
>      [[0], [11], [13], [15]]
> -    >>> list(_slicechunk(revlog, [0, 11, 13, 15], 20))
> +    >>> list(_slicechunk(revlog, [0, 11, 13, 15], targetsize=20))
>      [[0], [11], [13, 15]]
>      """
>      if targetsize is not None:
>          targetsize = max(targetsize, revlog._srmingapsize)
> +    # targetsize should not be specified when evaluating delta candidates:
> +    # * targetsize is used to ensure we stay within specification when reading,
> +    # * deltainfo is used to pick a good delta chain when writing.
> +    if not (deltainfo is None or targetsize is None):
> +        msg = 'cannot use `targetsize` with a `deltainfo`'
> +        raise error.ProgrammingError(msg)
>      for chunk in _slicechunktodensity(revlog, revs,
> +                                      deltainfo,
>                                        revlog._srdensitythreshold,
>                                        revlog._srmingapsize):
>          for subchunk in _slicechunktosize(revlog, chunk, targetsize):
>              yield subchunk
>
> -def _slicechunktosize(revlog, revs, targetsize):
> +def _slicechunktosize(revlog, revs, targetsize=None):
>      """slice revs to match the target size
>
>      This is intended to be used on chunk that density slicing selected by that
> @@ -431,12 +441,16 @@ def _slicechunktosize(revlog, revs, targ
>              endrevidx = idx
>      yield _trimchunk(revlog, revs, startrevidx)
>
> -def _slicechunktodensity(revlog, revs, targetdensity=0.5, mingapsize=0):
> +def _slicechunktodensity(revlog, revs, deltainfo=None, targetdensity=0.5,
> +                         mingapsize=0):
>      """slice revs to reduce the amount of unrelated data to be read from
> disk.
>
>      ``revs`` is sliced into groups that should be read in one time.
>      Assume that revs are sorted.
>
> +    ``deltainfo`` is a _deltainfo instance of a revision that we would append
> +    to the top of the revlog.
> +
>      The initial chunk is sliced until the overall density (payload/chunks-span
>      ratio) is above `targetdensity`. No gap smaller than `mingapsize` is
>      skipped.
> @@ -487,13 +501,21 @@ def _slicechunktodensity(revlog, revs, t
>          yield revs
>          return
>
> -    readdata = deltachainspan = _segmentspan(revlog, revs)
> +    nextrev = len(revlog)
> +    nextoffset = revlog.end(nextrev - 1)
> +
> +    if deltainfo is None:
> +        deltachainspan = _segmentspan(revlog, revs)
> +        chainpayload = sum(length(r) for r in revs)
> +    else:
> +        deltachainspan = deltainfo.distance
> +        chainpayload = deltainfo.compresseddeltalen
>
>      if deltachainspan < mingapsize:
>          yield revs
>          return
>
> -    chainpayload = sum(length(r) for r in revs)
> +    readdata = deltachainspan
>
>      if deltachainspan:
>          density = chainpayload / float(deltachainspan)
> @@ -504,13 +526,21 @@ def _slicechunktodensity(revlog, revs, t
>          yield revs
>          return
>
> +    if deltainfo is not None:
> +        revs = list(revs)
> +        revs.append(nextrev)
> +
>      # Store the gaps in a heap to have them sorted by decreasing size
>      gapsheap = []
>      heapq.heapify(gapsheap)
>      prevend = None
>      for i, rev in enumerate(revs):
> -        revstart = start(rev)
> -        revlen = length(rev)
> +        if rev < nextrev:
> +            revstart = start(rev)
> +            revlen = length(rev)
> +        else:
> +            revstart = nextoffset
> +            revlen = deltainfo.deltalen
>
>          # Skip empty revisions to form larger holes
>          if revlen == 0:
> @@ -1980,7 +2010,7 @@ class revlog(object):
>          if not self._withsparseread:
>              slicedchunks = (revs,)
>          else:
> -            slicedchunks = _slicechunk(self, revs, targetsize)
> +            slicedchunks = _slicechunk(self, revs, targetsize=targetsize)
>
>          for revschunk in slicedchunks:
>              firstrev = revschunk[0]
> @@ -2393,13 +2423,33 @@ class revlog(object):
>          #   deltas we need to apply -- bounding it limits the amount of CPU
>          #   we consume.
>
> -        distance = deltainfo.distance
> +        if self._sparserevlog:
> +            # As sparse-read will be used, we can consider that the distance,
> +            # instead of being the span of the whole chunk,
> +            # is the span of the largest read chunk
> +            base = deltainfo.base
> +
> +            if base != nullrev:
> +                deltachain = self._deltachain(base)[0]
> +            else:
> +                deltachain = []
> +
> +            chunks = _slicechunk(self, deltachain, deltainfo)
> +            distance = max(map(lambda revs: _segmentspan(self, revs), chunks))
> +        else:
> +            distance = deltainfo.distance
> +
>          textlen = revinfo.textlen
>          defaultmax = textlen * 4
>          maxdist = self._maxdeltachainspan
>          if not maxdist:
>              maxdist = distance # ensure the conditional pass
>          maxdist = max(maxdist, defaultmax)
> +        if self._sparserevlog and maxdist < self._srmingapsize:
> +            # In multiple places, we are ignoring irrelevant data ranges
> +            # below a certain size. We also apply this tradeoff here and
> +            # relax the span constraint for small enough content.
> +            maxdist = self._srmingapsize
>          if (distance > maxdist or deltainfo.deltalen > textlen or
>              deltainfo.compresseddeltalen > textlen * 2 or
>              (self._maxchainlen and deltainfo.chainlen > self._maxchainlen)):
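>
> (For readers skimming the hunk above: the distance half of the test,
> condensed into a hypothetical standalone helper, amounts to
>
>     def acceptabledistance(distance, textlen, maxdeltachainspan,
>                            srmingapsize, sparserevlog):
>         maxdist = maxdeltachainspan or distance  # no cap configured
>         maxdist = max(maxdist, textlen * 4)
>         if sparserevlog:
>             # relax the constraint for content smaller than the gap
>             # size we already tolerate when reading
>             maxdist = max(maxdist, srmingapsize)
>         return distance <= maxdist
>
> so, under sparse-revlog, small content is never rejected on distance
> alone while the distance stays within _srmingapsize.)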