Speeding up Mercurial on NFS

Matt Mackall mpm at selenic.com
Thu Jan 13 12:35:08 CST 2011


On Wed, 2011-01-12 at 13:40 +0100, Martin Geisler wrote:
> Matt Mackall <mpm at selenic.com> writes:
> 
> > On Tue, 2011-01-11 at 12:11 +0100, Martin Geisler wrote:
> >
> >> I agree with you that a single disk should max out like you
> >> describe... but the above numbers are for a normal 7200 RPM 1 TB SATA
> >> disk and a quad core i7 930.
> >
> > Then you're getting lucky with disk layout and read-ahead. It's very
> > easy for an 'aged' directory tree to take much longer than 5 seconds
> > to walk on spinning media.
> 
> Yeah, the directory is not aged in any way, it's just a clone of
> OpenOffice that I made at some point.
> 
> >> > By comparison, threaded NFS lookups is all about saturating the
> >> > pipe because the (cached on server) lookup is much faster than the
> >> > request round trip.
> >> >
> >> > How many files are you walking here?
> >> 
> >> There are 70k files -- it is the working copy of OpenOffice,
> >> changeset 67e476e04669. You sometimes talk about walking a repo with
> >> 207k files, is that a public repo?
> >
> > Might have been Netbeans?
> 
> I just cloned it and they "only" have 98k files and 186k changesets.

Hmm, no idea then.

> Oh, I just looked at the graph in the Gnome System Monitor and saw that
> the spikes went no further than ~50% or so. It shows 8 curves, one for
> each "virtual" core.

Yeah, I don't think that's actually meaningful. With hyperthreading,
you can never get both threads on a core to 100%, so there's no way to
tell how close to saturation you are: it might be reached when all
threads sit at 50%, 40%, or 60%, depending on the workload.

> Okay, as a start I have timings for a cache-hot local walk here:
> 
>    threads  pywalker  walker
>      1       565 ms   259 ms
>      2      1330 ms   204 ms
>      4      1707 ms   440 ms
>      8      1834 ms   636 ms
>     16      1947 ms   739 ms
>     32      1969 ms   765 ms

Huh. This hits a performance wall before you even reach the number of
cores, and the scaling is already bad at 2 threads. That wall is
probably due to cacheline bouncing as the various lookups touch shared
state in the dcache. In the not-yet-released kernel, dcache lookup is
now 'store-free', so this effect should go away and yield vastly better
numbers for high N.

But it means you're probably at the limit of what you can do with this
kind of testing on a single system: if you've got 8 threads trying to
fill the loopback 'pipe', the kernel NFS server is probably going to
process the requests on the same core as the submitting thread (or,
just as likely, on a random core), with the same cache ping-pong
effects.
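
For concreteness, a threaded walk of the sort being benchmarked here
might look roughly like the sketch below. This is not the actual
pywalker/walker code; the queue-based structure, the function name, and
the default thread count are all assumptions for illustration:

    import os
    import stat
    import threading
    import queue

    def threaded_walk(root, nthreads=8):
        # Minimal sketch of a multithreaded stat walk: each worker pulls
        # a directory off the queue, lstat()s every entry (the call that
        # becomes an NFS LOOKUP/GETATTR round trip), and queues any
        # subdirectories it finds.
        todo = queue.Queue()
        todo.put(root)
        results = []
        lock = threading.Lock()

        def worker():
            while True:
                d = todo.get()
                if d is None:              # sentinel: shut down
                    todo.task_done()
                    return
                try:
                    names = os.listdir(d)
                except OSError:
                    names = []
                for name in names:
                    path = os.path.join(d, name)
                    try:
                        st = os.lstat(path)
                    except OSError:
                        continue           # file vanished mid-walk
                    if stat.S_ISDIR(st.st_mode):
                        todo.put(path)
                    else:
                        with lock:
                            results.append((path, st.st_size, int(st.st_mtime)))
                todo.task_done()

        threads = [threading.Thread(target=worker) for _ in range(nthreads)]
        for t in threads:
            t.start()
        todo.join()                        # all queued directories processed
        for _ in threads:
            todo.put(None)                 # wake and stop the workers
        for t in threads:
            t.join()
        return results

With the loopback setup, every one of those lstat() calls is served by
an nfsd thread on the same box, which is presumably why the client and
server end up fighting over the same caches.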

If we subtract the local walk time from the NFS walk time, we end up
with something like (all times in ms):

   threads  local   nfs  diff
      1       259  1931  1672
      2       204  1164   960
      4       440   818   378
      8       636   833   197
     16       739   991   252

Here 'diff' is effectively the overhead of going over NFS. If we could
combine the best-case NFS communication overhead of 197 ms with the
best-case local walk result of 204 ms, we'd be down to about 401 ms.

That'd be more like the result you could achieve with a well-tuned
multithreaded client saturating a dedicated NFS server.
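
Spelled out as a quick calculation (just the numbers from the tables
above, nothing new):

    # NFS overhead per thread count = NFS walk time minus local walk time (ms).
    local = {1: 259, 2: 204, 4: 440, 8: 636, 16: 739}
    nfs   = {1: 1931, 2: 1164, 4: 818, 8: 833, 16: 991}

    overhead = {n: nfs[n] - local[n] for n in local}     # 1672, 960, 378, 197, 252

    # Best case: fastest local walk plus smallest NFS overhead.
    print(min(local.values()) + min(overhead.values()))  # 204 + 197 = 401 ms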

-- 
Mathematics is the supreme nostalgia of our time.