Speeding up Mercurial on NFS

Matt Mackall mpm at selenic.com
Tue Jan 11 09:23:30 CST 2011


On Tue, 2011-01-11 at 12:11 +0100, Martin Geisler wrote:
> > Also, it's just not very interesting. On spinning media, we know that
> > we're going to have a seek bottleneck that multithreading can only
> > exacerbate. On SSD, we're either going to hit a req/s barrier that's
> > lower than our syscall throughput (no scaling), or we're going to
> > saturate the syscall interface (scaling up to number of cores), or
> > we're going to saturate the filesystem lookup locks (but not on modern
> > Linux with a desktop machine!). What you have above appears to be an
> > SSD and two cores, yes?
> 
> I agree with you that a single disk should max out like you describe...
> but the above numbers are for a normal 7200 RPM 1 TB SATA disk and a
> quad core i7 930.

Then you're getting lucky with disk layout and read-ahead. It's very
easy for an 'aged' directory tree to take much longer than 5 seconds to
walk on spinning media.

> > By comparison, threaded NFS lookups is all about saturating the pipe
> > because the (cached on server) lookup is much faster than the request
> > round trip.
> >
> > How many files are you walking here?
> 
> There are 70k files -- it is the working copy of OpenOffice, changeset
> 67e476e04669. You sometimes talk about walking a repo with 207k files,
> is that a public repo?

Might have been Netbeans?

> >> Running over an artificially slow (0.2 ms delay) NFS link back to
> >> localhost gives:
> >> 
> >>   threads  pywalker  walker
> >>     1       9.0 s     8.2 s
> >>     2       6.3 s     4.5 s
> >>     4       6.1 s     2.7 s
> >>     8       5.9 s     1.5 s
> >>    16       6.0 s     1.7 s
> >>    32       6.0 s     1.9 s
> >
> > Interesting. Looks like you're getting about 6-8 parallel requests per
> > round trip time. But that seems way too slow, that'd only be ~ 15k
> > requests per second or 22.5k - 30k files total. Or, if that .2ms is
> > round-trip delay, 45k - 60k files total.
> 
> Yes, it's round-trip delay -- I add 0.1 ms delay to the link and the
> ping time goes to 0.2 ms. I use
> 
>   sudo tc qdisc change dev lo root netem delay 0.1ms
> 
> to add a simple constant delay to all packets.
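
For reference, spelling out the back-of-envelope arithmetic above (a
rough Python sketch of my numbers, not exact accounting):

  # 6-8 requests in flight per round trip, over the 1.5s walker run above
  for rtt in (0.4e-3, 0.2e-3):   # 0.2ms read as one-way vs. round-trip delay
      for in_flight in (6, 8):
          req_per_sec = in_flight / rtt
          print("rtt %.1fms, %d in flight: %.0f req/s, %.0f files in 1.5s"
                % (rtt * 1e3, in_flight, req_per_sec, req_per_sec * 1.5))

With the confirmed 0.2ms round trip that's 30-40k req/s, i.e. 45k-60k
files over 1.5s, roughly in line with the 70k files actually walked.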
> 
> > You should also run this test without the delay. Again, this will give
> > you a target baseline for what you can hope to get out of NFS. It
> > should saturate around threads = cores, but should probably be
> > marginally faster than that 1.5s number there.
> 
> Okay, here are tests without the delay -- raw speed on the local
> loopback. I unmount the NFS filesystem after each test but do not clear
> any other caches:
> 
>   threads  pywalker  walker
>     1      2230 ms   1931 ms
>     2      1857 ms   1164 ms
>     4      2594 ms    818 ms
>     8      2757 ms    833 ms
>    16      2796 ms    991 ms
>    32      2776 ms    987 ms
> 
> The eight (hyper-threading) cores were never maxed out while I ran the
> tests; they only peaked at about 50% utilization.

Hmmm, how are you measuring that utilization? HT can make such numbers
muddled. You might try comparing to a raw CPU eater (e.g. python -c 'while
1: pass') to make sure you're measuring what you think you're measuring.
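
Something along these lines (a rough sketch, untested) parks a busy loop
on every logical CPU, so you can see what your monitoring tool reports
as genuine 100% utilization:

  # one pure-Python busy loop per logical CPU; kill with Ctrl-C when done
  import multiprocessing

  def burn():
      while True:
          pass

  if __name__ == '__main__':
      for _ in range(multiprocessing.cpu_count()):
          multiprocessing.Process(target=burn).start()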

In principle, you should be able to match those numbers even with
the .2ms delay. In practice, details like lock contention[1] and CPU
cache footprint will start to matter. But not until you're at 100%
utilization.

This is where you compare to a local cache-hot walk to see how much more
it's possible to squeeze out of this. If the local walk is 10x faster,
there's probably a lot of room to squeeze. If it's only 2x faster, it
may be all 'network overhead'. And the resulting CPU load may give you
some indication of whether the slack utilization with NFS is your fault or
not. You should also measure user vs. sys times.
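
If the walker stays in one process, something like this (just a sketch;
resource is in the stdlib) gives the user/sys split, and running the
walker under time(1) works just as well:

  # user vs. sys CPU time consumed so far by this process (all threads)
  import resource

  def cpu_times(label):
      ru = resource.getrusage(resource.RUSAGE_SELF)
      print("%s: user %.2fs, sys %.2fs" % (label, ru.ru_utime, ru.ru_stime))

Call it before and after the walk and subtract.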

> At first I thought this was because of how I walk the tree: each worker
> thread scans a directory and inserts each subdirectory into a queue. It
> then returns and grabs the next directory from the queue. This gives a
> breadth-first traversal of the directory tree.

...and if your queue is ever empty, you're losing. Should be simple to
instrument the queue length.
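
As a sketch of what I mean (not your walker, but the same shape: a shared
FIFO of directories plus N worker threads), recording the queue depth at
each dequeue:

  # threaded breadth-first walk; a depth of 0 at get() means a worker
  # found the queue empty, i.e. the workers are starving
  import os, threading, Queue          # the module is 'queue' on Python 3

  def walk(root, nthreads=8):
      dirs = Queue.Queue()
      dirs.put(root)
      depths = []                      # queue length seen at each dequeue

      def worker():
          while True:
              d = dirs.get()
              depths.append(dirs.qsize())
              try:
                  for name in os.listdir(d):
                      p = os.path.join(d, name)
                      if os.path.isdir(p) and not os.path.islink(p):
                          dirs.put(p)
              except OSError:
                  pass                 # unreadable directory, skip it
              finally:
                  dirs.task_done()

      for _ in range(nthreads):
          t = threading.Thread(target=worker)
          t.daemon = True              # idle workers die with the main thread
          t.start()
      dirs.join()                      # returns once every directory is done
      print("starved dequeues: %d of %d" % (depths.count(0), len(depths)))

If that starved count is much above zero at 8+ threads, that's probably
where the lost parallelism is going.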

-- 
Mathematics is the supreme nostalgia of our time.



