Testing very long delta chains

Wed Dec 23 01:30:20 CST 2015

On Tue, Dec 22, 2015 at 9:41 PM, Matt Mackall <mpm at selenic.com> wrote:

> On Tue, 2015-12-22 at 17:27 -0800, Gregory Szorc wrote:
> > https://www.mercurial-scm.org/wiki/BigRepositories has been updated
> with a
> > link to
> >
> https://hg.mozilla.org/users/gszorc_mozilla.com/mozilla-central-aggressivemerg
> > edeltas,
> > which is a generaldelta clone of mozilla-central with
> > format.aggressivemergedeltas enabled.
> >
> > The last manifest delta chain in this repo is over 45,000 entries deep
> and
> > it makes for a good benchmark for testing revlog reading performance.
> >
> > Remember: `hg clone --uncompressed` to preserve the delta chains from the
> > server or your client will recompute them as part of applying the
> > changegroup.
>
> Without my threaded zlib hack:
>
> $ hg perfmanifest 277045
> ! wall 0.749929 comb 0.740000 user 0.730000 sys 0.010000 (best of 13)
>
> (25% CPU usage on a CPU with 4 threads)
>
> With my threaded zlib hack (threads = 4):
>
> $ hg perfmanifest 277045
> ! wall 0.480251 comb 1.090000 user 0.990000 sys 0.100000
> (best of 20)
>
> (50% CPU usage on a CPU with 4 threads)
>

Assuming 100% CPU usage, that's still ~240ms, which feels a bit steep. I
think 100ms should be the upper limit. Preferably 50ms if we want people to
barely notice the pause. We /might/ get another 2x speedup (~120ms) with
all the changes below. I'm skeptical of getting 4x without limiting delta
chain length or without possibly more things (like I/O and bdiff patching)
moved out of Python (even if Python is just proxying data between C
function calls). (For the record, I share your hesitation to not establish
a delta chain length cap.)

>
> Things we can do better:
>
> - add a C decompress helper
> - that works on lists of buffers
> - that calls zlib directly
> - that uses threads
> - that uses larger buffers
> - that uses a faster zlib
>
> (For this last, the cloudflare fork of zlib has a faster CRC function that
> seems to be worth about 20%)
>

>From C, this will not be fun because Windows.

Half serious question: what are your thoughts on writing this in Rust? We
can write a Rust library that provides a C ABI which Python can interface
with (either via a Python C extension or via ctypes/cffi). Rust should be
supported on all the platforms we care about. For building/distribution, we
can provide wheels for Windows and OS X so `pip install` "just works" [and
doesn't require a working compiler]. That leaves Linux and other Unixen.
Most distributions these days offer a Rust package or are in the process of
offering one. For the laggards, I assume we'll still have the pure Python
fallback.

>
>
> # HG changeset patch
> # User Matt Mackall <mpm at selenic.com>
> # Date 1450727921 21600
> #      Mon Dec 21 13:58:41 2015 -0600
> # Node ID b56bc1676b5d4a14167be2498921b57f06ddcd69
> # Parent  3dea4eae4eebac11741f0c1dc5dcd9c88d8f4554
> revlog: thread decompress
>
> diff -r 3dea4eae4eeb -r b56bc1676b5d mercurial/revlog.py
> --- a/mercurial/revlog.py       Mon Dec 21 14:52:18 2015 -0600
> +++ b/mercurial/revlog.py       Mon Dec 21 13:58:41 2015 -0600
> @@ -17,6 +17,8 @@
>  import errno
>  import os
>  import struct
> +import threading
> +import Queue
>  import zlib
>
>  # import stuff from node for others to import from revlog
> @@ -1132,14 +1134,38 @@
>              # 2G on Windows
>              return [self._chunk(rev, df=df) for rev in revs]
>
> -        for rev in revs:
> +        slots = [None] * len(revs)
> +
> +        work = []
> +        done = Queue.Queue()
> +
> +        for slot, rev in enumerate(revs):
>              chunkstart = start(rev)
>              if inline:
>                  chunkstart += (rev + 1) * iosize
>              chunklength = length(rev)
> -            ladd(decompress(buffer(data, chunkstart - offset,
> chunklength)))
> +            buf = buffer(data, chunkstart - offset, chunklength)
> +            if buf and buf[0] == 'x':
> +                work.append((slot, buf))
> +            else:
> +                slots[slot] = decompress(buf)
>
> -        return l
> +        def worker():
> +            try:
> +                while True:
> +                    slot, buf = work.pop()
> +                    slots[slot] = _decompress(buf)
> +            except:
> +                done.put(1)
> +
> +        tcount = 4
> +        for w in xrange(tcount - 1):
> +            threading.Thread(target=worker).start()
> +        worker()
> +        for w in xrange(tcount):
> +            done.get()
> +
> +        return slots
>
>      def _chunkclear(self):
>          """Clear the raw chunk cache."""
>
> --
> Mathematics is the supreme nostalgia of our time.
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://selenic.com/pipermail/mercurial-devel/attachments/20151222/125936d9/attachment-0001.html>