[issue1300] hg client hangs when ssh server reports error
Tom Karzes
mercurial-bugs at selenic.com
Mon Sep 15 20:04:34 CDT 2008
New submission from Tom Karzes <Tom.Karzes at magnumsemi.com>:
Using version 1.0.1 of Mercurial, with OpenSSH:
On certain platforms (e.g. Redhat Linux), when accessing a repository via an
ssh server, the local hg client often hangs if the server reports an error
(e.g. a bad repository path or a hook failure). Even worse, it is
usually not at all obvious that there's an error until you finally get tired of
waiting and interrupt it, and even then the error messages are often truncated
or even entirely absent.
When we added some server-side hooks, it became clear that we needed a solution
to this problem, so I spent quite a bit of time tracking it down, and I finally
know exactly what's causing it. I believe the hg client is ultimately to
blame, and that a proper fix needs to be made there to truly solve the
problem. But having said that, there are some contributing factors that make
this problem vastly more likely on some systems (e.g. Redhat) than on others
(e.g. Ubuntu).
Here's what the hg client is doing, which I believe should be changed: In
sshrepo.py, it communicates with the ssh client via pipes to the client's stdin,
stdout, and stderr. When reading the ssh client's stdout, it uses blocking
readline() calls. When reading the ssh client's stderr, it uses fstat to see
if there's any data available, then readline() to read it. It effectively
alternates between reading stdout and stderr this way, but this is clearly very
vulnerable to deadlock situations. In particular, if the hg client is blocked
trying to read stdout, while the ssh client is blocked trying to write to
stderr (i.e., waiting for some data to be consumed), there is a deadlock
condition. This could happen under any of a number of conditions. The most
obvious is when one of the pipe buffers fills up, but on some systems (such as
Redhat) this can happen even if the buffer is not full, making this problem
vastly more likely.
The correct way for the hg client to handle all of this on a Linux platform is
to use the select() system call to choose among the ssh client's stdin, stdout,
and stderr, servicing them as they become ready (in the case of stdin, only
when it has data to write). This is exactly what the ssh client does, and it
eliminates deadlock problems: As long as there's *something* it can do, it
does it, and if the thing it's talking to does the same thing, there will be no
deadlock. That is really the only correct way to truly fix this.
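As a rough illustration of the select()-based approach described above, here is a minimal Python sketch. It is not the actual Mercurial code: the function name drain_child and the 4096-byte read size are made up for the example, it only covers the read side (stdout and stderr, not writes to stdin), and it assumes a POSIX platform where select() works on pipes.

```python
import os
import select
import subprocess
import sys


def drain_child(proc):
    """Collect a child's stdout and stderr until both reach EOF.

    Hypothetical helper: it waits on both pipes at once, so it never
    blocks on one stream while the child is stuck writing to the other.
    """
    out_fd = proc.stdout.fileno()
    err_fd = proc.stderr.fileno()
    chunks = {out_fd: [], err_fd: []}
    open_fds = [out_fd, err_fd]
    while open_fds:
        # Block until at least one of the still-open pipes is readable.
        readable, _, _ = select.select(open_fds, [], [])
        for fd in readable:
            data = os.read(fd, 4096)
            if data:
                chunks[fd].append(data)
            else:
                # An empty read means EOF on this pipe.
                open_fds.remove(fd)
    proc.wait()
    return b"".join(chunks[out_fd]), b"".join(chunks[err_fd])


if __name__ == "__main__":
    # Stand-in for a failing server: a large write to stderr, nothing on
    # stdout, then a nonzero exit.
    child_code = "import sys; sys.stderr.write('x' * 200000); sys.exit(1)"
    proc = subprocess.Popen(
        [sys.executable, "-c", child_code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        bufsize=0,
    )
    out, err = drain_child(proc)
    print(len(out), len(err), proc.returncode)
```

A reader that did a blocking readline() on stdout first would be at risk of hanging in this scenario once the stderr pipe filled up; the loop above simply services whichever stream has data.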
So the question is this: Why is this causing deadlock when the only thing the
hg server is doing is printing a one-line error message to stderr and then
exiting? Here is the answer: First, even when doing a single "print" from
Python, the resulting string will often be broken into two parts (Python often
sends the terminating newline in a second string). So it is not uncommon to
get multiple back-to-back writes to stderr.
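The splitting can be observed with a small sketch like the following. It uses the Python 3 print() function rather than the Python 2 print statement that hg used at the time, and RecordingStream is a made-up class for the illustration. Whether the two write() calls reach the pipe as two separate system calls depends on how the underlying stream is buffered, so treat this as a demonstration of the mechanism only.

```python
class RecordingStream:
    """Hypothetical file-like object that records every write() call."""

    def __init__(self):
        self.writes = []

    def write(self, text):
        self.writes.append(text)
        return len(text)

    def flush(self):
        pass


stream = RecordingStream()
# One print call, with a made-up error message.
print("abort: repository not found", file=stream)

# The message and the trailing newline arrive as separate write() calls.
print(stream.writes)
```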
Back on the client side, the ssh client is doing selects to see when it can
write to stderr. When it reads the first stderr string from the server, it
does the select, Linux tells it that stderr is ready to be written, and it
writes it. Then the second string arrives, and it again does a select to see
when it can write more to stderr. Here's where the os-specific sensitivity
arises: On Ubuntu, select() on a pipe will return ready status even if the
pipe already contains some data. So on Ubuntu systems, the second string gets
written right away, provided the pipe buffer isn't overly full. Then, if that
was the last thing the hg server did before exiting, the ssh client will also
exit, causing any pending readline() on its stdout to immediately exit as well.
On older Linux kernels (such as the one Redhat ships), however, select() on a pipe behaves
quite differently. If there is any unread data in the pipe, even as little as
a single byte, select() will not flag the pipe as ready to receive more data,
even if writing to it would succeed. And this is what was triggering all of
the deadlocks we were actually seeing. Note that, if the server wrote
something else to stdout after its last write to stderr, that would often be
enough to avoid the problem, since the pending readline() on stdout would
complete, then stderr would be drained. But in a typical error situation, the
last thing the server does is write an error message to stderr and then exit. So
here's what happens:
1. The hg server writes an error message to stderr, which ends up going
through as multiple writes (often due to Python itself splitting the trailing
newline from the string).
2. On the client side, ssh receives the error message as two or more strings.
It does a write select() with stderr, and is told it can write to it, so it
writes the first string.
3. It does another write select() with stderr, but because there is data in
the pipe (even as little as a single byte), the select() call does not indicate
that stderr is ready for more data, so it just keeps waiting (e.g. on Redhat
Linux).
4. Meanwhile, the hg client is doing a readline() on the ssh client's stdout,
waiting for either some data, or an empty string when the ssh client exits
(which is how it should be terminated in this case, since there was an error).
So they're deadlocked: The hg client will not read from the ssh client's
stderr until the ssh client exits, causing the hg client's readline() on stdout
to complete, and the ssh client will not exit until its stderr has been read,
causing it to write the remaining stderr text before exiting.
See the attached file "select.c" for a simple test of whether your Linux kernel
has a conservative implementation of select() for pipes (e.g., Redhat) or an
aggressive implementation (e.g., Ubuntu). All this test does is create a pipe,
do a write select on it (which always succeeds), write a single byte to it,
do a second write select, and exit when that select returns. On Redhat,
the second select never returns and it hangs. On Ubuntu, the second select
completes immediately and it exits.
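For readers without the attachment, the same check can be approximated in Python. This is a sketch written from the description above, not a copy of select.c: the function names are made up, and it adds a timeout (an assumption of the sketch) so that it reports the conservative behavior instead of hanging.

```python
import os
import select


def pipe_writable(write_fd, timeout):
    """Return True if select() reports write_fd writable within timeout seconds."""
    _, writable, _ = select.select([], [write_fd], [], timeout)
    return write_fd in writable


def probe(timeout=2.0):
    """Return (writable while empty, writable with one unread byte)."""
    read_fd, write_fd = os.pipe()
    try:
        empty_ready = pipe_writable(write_fd, timeout)
        os.write(write_fd, b"x")  # leave one unread byte in the pipe
        one_byte_ready = pipe_writable(write_fd, timeout)
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return empty_ready, one_byte_ready


if __name__ == "__main__":
    empty_ready, one_byte_ready = probe()
    print("empty pipe writable:", empty_ready)
    if one_byte_ready:
        print("aggressive select(): pipe with unread data is still writable")
    else:
        print("conservative select(): pipe with unread data is not writable")
```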
But the real problem is that the hg client is doing blocking i/o when
communicating with the ssh client via multiple streams. This is just not safe,
and even with an Ubuntu kernel, it can still deadlock. It needs to multiplex
all of the relevant streams via select(). Then there will be no chance of
deadlock even if the server writes huge amounts of data to stderr.
----------
files: select.c
messages: 7144
nosy: tkarzes
priority: bug
status: unread
title: hg client hangs when ssh server reports error
____________________________________________________
Mercurial issue tracker <mercurial-bugs at selenic.com>
<http://www.selenic.com/mercurial/bts/issue1300>
____________________________________________________
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: select.c
Url: http://selenic.com/pipermail/mercurial-devel/attachments/20080916/14075629/attachment.txt