[issue1300] hg client hangs when ssh server reports error

Mon Sep 15 20:04:34 CDT 2008

New submission from Tom Karzes <Tom.Karzes at magnumsemi.com>:

Using version 1.0.1 of Mercurial, with OpenSSH:

On certain platforms (e.g. Redhat Linux), when accessing a repository via an 
ssh server, the local hg client often hangs if the server reports an error 
(e.g. a bad repository path or a hook failure).  Even worse, it is 
usually not at all obvious that there's an error until you finally get tired of 
waiting and interrupt it, and even then the error messages are often truncated 
or even entirely absent.

When we added some server-side hooks, it became clear that we needed a solution 
to this problem, so I spent quite a bit of time tracking it down, and I finally 
know exactly what's causing it.  I believe the hg client is ultimately to 
blame, and that a proper fix needs to be made there to truly solve the 
problem.  But having said that, there are some contributing factors that make 
this problem vastly more likely on some systems (e.g. Redhat) than on others 
(e.g. Ubuntu).

Here's what the hg client is doing, which I believe should be changed:  In 
sshrepo.py, it communicates with the ssh client via pipes to the client's stdin, 
stdout, and stderr.  When reading the ssh client's stdout, it uses blocking 
readline() calls.  When reading the ssh client's stderr, it uses fstat to see 
if there's any data available, then readline() to read it.  It effectively 
alternates between reading stdout and stderr this way, but this is clearly very 
vulnerable to deadlock situations.  In particular, if the hg client is blocked 
trying to read stdout, while the ssh client is blocked trying to write to 
stderr (i.e., waiting for some data to be consumed), there is a deadlock 
condition.  This could happen under any of a number of conditions.  The most 
obvious is when one of the pipe buffers fills up, but on some systems (such as 
Redhat) this can happen even if the buffer is not full, making this problem 
vastly more likely.

The correct way for the hg client to handle all of this on a Linux platform is 
to use the select() system call to choose among the ssh client's stdin, stdout, 
and stderr, servicing them as they become ready (in the case of stdin, only 
when it has data to write).  This is exactly what the ssh client does, and it 
eliminates deadlock problems:  As long as there's *something* it can do, it 
does it, and if the thing it's talking to does the same thing, there will be no 
deadlock.  That is really the only correct way to truly fix this.
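
As a rough illustration of the multiplexing described above (this is a sketch, 
not the actual Mercurial code), the following Python 3 function runs a child 
process and services its stdin, stdout, and stderr through select().  The name 
run_multiplexed and its arguments are hypothetical, and it assumes a Unix-like 
platform where select() works on pipes.

```python
import os
import select
import subprocess
import sys


def run_multiplexed(argv, stdin_data=b""):
    """Run argv, feeding stdin_data and draining stdout/stderr via select().

    Hypothetical helper for illustration only; not Mercurial's API.
    Returns (returncode, stdout_bytes, stderr_bytes).
    """
    proc = subprocess.Popen(argv, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    in_fd = proc.stdin.fileno()
    out_fd = proc.stdout.fileno()
    err_fd = proc.stderr.fileno()
    chunks = {out_fd: [], err_fd: []}
    readers = [out_fd, err_fd]
    pending = stdin_data
    if not pending:
        proc.stdin.close()  # nothing to send, so signal EOF right away

    while readers or pending:
        # Only ask about stdin while there is still data to write.
        writers = [in_fd] if pending else []
        rready, wready, _ = select.select(readers, writers, [])
        for fd in rready:
            data = os.read(fd, 4096)
            if data:
                chunks[fd].append(data)
            else:
                readers.remove(fd)  # EOF on this stream
        if wready:
            try:
                # Writing at most PIPE_BUF bytes will not block once
                # select() has reported the pipe as writable.
                n = os.write(in_fd, pending[:select.PIPE_BUF])
                pending = pending[n:]
            except BrokenPipeError:
                pending = b""  # child stopped reading its stdin
            if not pending:
                proc.stdin.close()

    proc.stdout.close()
    proc.stderr.close()
    returncode = proc.wait()
    return returncode, b"".join(chunks[out_fd]), b"".join(chunks[err_fd])
```

Because neither output stream is ever read with a blocking call while the 
other is ignored, a child that writes a large amount to stderr before touching 
stdout is drained without stalling.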

So the question is this:  Why is this causing deadlock when the only thing the 
hg server is doing is printing a one-line error message to stderr and then 
exiting?  Here is the answer:  First, even when doing a single "print" from 
Python, the resulting string will often be broken into two parts (Python often 
sends the terminating newline in a second string).  So it is not uncommon to 
get multiple back-to-back writes to stderr.
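
The split can be observed at the file-object level with a small sketch.  This 
one uses Python 3's print() rather than the Python 2 print statement that 
Mercurial 1.0.1 ran under, and RecordingStream is a hypothetical stand-in for 
stderr.  Whether the two write() calls reach the operating system as separate 
writes depends on how the real stream is buffered.

```python
class RecordingStream:
    """Hypothetical file-like object that records each write() call."""

    def __init__(self):
        self.writes = []

    def write(self, text):
        self.writes.append(text)
        return len(text)


stream = RecordingStream()
# A single print() call: the message and the trailing newline are
# handed to write() separately.
print("abort: repository not found", file=stream)
# stream.writes is now ["abort: repository not found", "\n"]
```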

Back on the client side, the client ssh is doing selects to see when it can 
write to stderr.  When it reads the first stderr string from the server, it 
does the select, Linux tells it that stderr is ready to be written, and it 
writes it.  Then the second string arrives, and it again does a select to see 
when it can write more to stderr.  Here's where the OS-specific sensitivity 
arises:  On Ubuntu, select() on a pipe will return ready status even if the 
pipe already contains some data.  So on Ubuntu systems, the second string gets 
written right away, provided the pipe buffer isn't overly full.  Then, if that 
was the last thing the hg server did before exiting, the ssh client will also 
exit, causing any pending readline() on its stdout to immediately exit as well.

On older Linux kernels (such as those shipped with Redhat), however, select() on a pipe behaves 
quite differently.  If there is any unread data in the pipe, even as little as 
a single byte, select() will not flag the pipe as ready to receive more data, 
even if writing to it would succeed.  And this is what was triggering all of 
the deadlocks we were actually seeing.  Note that, if the server wrote 
something else to stdout after its last write to stderr, that would often be 
enough to avoid the problem, since the pending readline() on stdout would 
complete, then stderr would be drained.  But in a typical error situation, the 
last thing the server does is write an error message to stderr, then exit.  So 
here's what happens:

1.  The hg server writes an error message to stderr, which ends up going 
through as multiple writes (often due to Python itself splitting the trailing 
newline from the string).

2.  On the client side, ssh receives the error message as two or more strings.  
It does a write select() with stderr, and is told it can write to it, so it 
writes the first string.

3.  It does another write select() with stderr, but because there is data in 
the pipe (even as little as a single byte), the select() call does not indicate 
that stderr is ready for more data, so it just keeps waiting (e.g. on Redhat 
Linux).

4.  Meanwhile, the hg client is doing a readline() on the ssh client's stdout, 
waiting for either some data, or an empty string when the ssh client exits 
(which is how it should be terminated in this case, since there was an error).

So they're deadlocked:  The hg client will not read from the ssh client's 
stderr until the ssh client exits, causing the hg client's readline() on stdout 
to complete, and the ssh client will not exit until its stderr has been read, 
causing it to write the remaining stderr text before exiting.

See the attached file "select.c" for a simple test of whether your Linux kernel 
has a conservative implementation of select() for pipes (e.g., Redhat) or an 
aggressive implementation (e.g., Ubuntu).  All this test does is create a pipe, 
do a write select on it (which always succeeds), write a single byte to it, 
then do a second write select, and exit when the select returns.  On Redhat, 
the second select never returns and it hangs.  On Ubuntu, the second select 
completes immediately and it exits.
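
For readers without the attachment, the same probe can be approximated in 
Python 3.  This is a sketch of the test as described above, not a copy of 
select.c.  The function name second_write_select_ready is hypothetical, and a 
timeout is added so that the conservative case returns False instead of 
hanging.

```python
import os
import select


def second_write_select_ready(timeout=1.0):
    """Return True if a pipe holding one unread byte is still reported
    writable by select() within the timeout, False otherwise."""
    read_fd, write_fd = os.pipe()
    try:
        # First write select on the empty pipe: expected to report ready.
        _, wready, _ = select.select([], [write_fd], [], timeout)
        if not wready:
            raise RuntimeError("empty pipe was not reported writable")
        # Leave a single unread byte in the pipe.
        os.write(write_fd, b"x")
        # Second write select: the behavior under test.
        _, wready, _ = select.select([], [write_fd], [], timeout)
        return bool(wready)
    finally:
        os.close(read_fd)
        os.close(write_fd)
```

A False result corresponds to the conservative behavior described for Redhat, 
and True to the aggressive behavior described for Ubuntu.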

But the real problem is that the hg client is doing blocking i/o when 
communicating with the ssh client via multiple streams.  This is just not safe, 
and even with an Ubuntu kernel, it can still deadlock.  It needs to multiplex 
all of the relevant streams via select().  Then there will be no chance of 
deadlock even if the server writes huge amounts of data to stderr.

----------
files: select.c
messages: 7144
nosy: tkarzes
priority: bug
status: unread
title: hg client hangs when ssh server reports error

____________________________________________________
Mercurial issue tracker <mercurial-bugs at selenic.com>
<http://www.selenic.com/mercurial/bts/issue1300>
____________________________________________________
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: select.c
Url: http://selenic.com/pipermail/mercurial-devel/attachments/20080916/14075629/attachment.txt