[PATCH STABLE] changegroup: retry read() when partial data returned
Yuya Nishihara
yuya at tcha.org
Mon Mar 27 11:13:18 EDT 2017
On Sun, 26 Mar 2017 11:56:59 -0700, Gregory Szorc wrote:
> # HG changeset patch
> # User Gregory Szorc <gregory.szorc at gmail.com>
> # Date 1490554582 25200
> # Sun Mar 26 11:56:22 2017 -0700
> # Branch stable
> # Node ID 414128ddd876eb33f70b0dd95d110e29a9308c93
> # Parent ed5b25874d998ababb181a939dd37a16ea644435
> changegroup: retry read() when partial data returned
>
> We've seen a rash of "stream ended unexpectedly" errors in the
> wild. This occurs when changegroup.readexactly() fails to retrieve
> the exact number of requested bytes in a single stream.read()
> operation.
>
> There are several plausible theories that explain this behavior.
> Many include underlying APIs not retrying automatically when
> interrupted by a signal. While Python is supposed to do this,
> it could be buggy. There could also be a bug in the various
> stream reading layers between changegroup.readexactly() and the
> underlying file descriptor.
>
> Unfortunately, getting this error to reproduce has been extremely
> difficult and no single cause has been identified.
>
> Since networks (and even filesystems) are unreliable, it seems
> wise to have some form of retry in place to mitigate this failure.
> This patch adds that retry logic.
>
> There is already basic test coverage that the exception in this
> function is raised. The tests didn't change with this patch.
>
> The only obvious negative impact to this change I can see is that
> in cases where the read operation fails, there is some overhead
> involved with retrying the operation. In the worst case, this is
> pure overhead: we'd retry an operation and get the same outcome.
> But in the best case, it avoids an intermittent/random failure.
> My assumption is that the overhead of retrying is small and is
> justified by avoiding a random failure.
>
> There are other changes I'd like to make to stream reading code
> to help flush out this general failure of partial stream reads.
> See issue4923 for some suggestions with regards to swallowing
> exceptions and masking the underlying error. However, these
> changes aren't suitable for the stable branch. This patch feels
> minimally invasive and if successful would let us narrow our focus
> for finding the underlying bug.
>
> diff --git a/mercurial/changegroup.py b/mercurial/changegroup.py
> --- a/mercurial/changegroup.py
> +++ b/mercurial/changegroup.py
> @@ -35,12 +35,42 @@ from . import (
>
> def readexactly(stream, n):
> '''read n bytes from stream.read and abort if less was available'''
> - s = stream.read(n)
> - if len(s) < n:
> + # Most of the time, stream.read() returns exactly the number of bytes
> + # requested. Various Python APIs will perform multiple system calls
> + # to retrieve more bytes if the first call did not return enough. This
> + # includes cases where the system call is interrupted.
> + #
> + # Even with this retry logic in the Python APIs, we still see failure
> + # to retrieve the requested number of bytes. So, we build in our own
> + # retry layer here.
> + left = n
> + chunk = stream.read(n)
> + left -= len(chunk)
> + assert left >= 0
> +
> + # Common case is it just works.
> + if not left:
> + return chunk
> +
> + chunks = [chunk]
> + # Retry to get remaining data. In cases where the stream has errored or
> + # is at EOF, this will do a bit of redundant work. But it helps prevent
> + # intermittent failures and isn't common. So the overhead is acceptable.
> + for i in range(3):
> + chunk = stream.read(left)
> + chunks.append(chunk)
> + left -= len(chunk)
> + assert left >= 0
> +
> + if not left:
> + break
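(For reference, the retry approach in the quoted hunk can be condensed into a standalone sketch. The function name and structure follow the patch, but the EOFError and its message are illustrative stand-ins for Mercurial's own error type, and the retry count is simply the patch's fixed three attempts.)

```python
import io

def readexactly(stream, n):
    """Read exactly n bytes from stream.read(); raise if fewer are available.

    Simplified sketch of the patch's retry logic: one initial read plus
    up to three retries for the remaining bytes.
    """
    chunks = []
    left = n
    for _ in range(4):  # initial read plus up to 3 retries
        chunk = stream.read(left)
        chunks.append(chunk)
        left -= len(chunk)
        assert left >= 0
        if not left:
            return b''.join(chunks)
    raise EOFError('stream ended unexpectedly (got %d bytes, expected %d)'
                   % (n - left, n))
```

Against a well-behaved stream this returns on the first iteration; against a stream that returns short reads, the loop accumulates chunks until the requested count is reached or the retries are exhausted.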
Do you have any numbers showing that this change actually mitigates the
problem?
If the underlying functions have bugs, I don't think we can assume the
retries will always yield contiguous chunks with no data loss. If a few
bytes were lost, for example, we could read an arbitrary part of the
stream as a length field, and then call readexactly() with that invalid
length value.
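(To illustrate that concern: changegroup chunks are length-prefixed, so a stream that silently drops bytes can cause payload bytes to be reinterpreted as a length header. This is a simplified sketch of the framing hazard only; the real changegroup format has additional structure beyond a bare 4-byte big-endian length prefix.)

```python
import struct

def read_length(prefix):
    """Parse a 4-byte big-endian length field, as changegroup framing does."""
    return struct.unpack('>l', prefix[:4])[0]

# A well-formed frame: length 4, then payload b'good'.
frame = struct.pack('>l', 4) + b'good'
assert read_length(frame) == 4

# If the stream silently drops one byte, a payload byte shifts into the
# length field and we parse a garbage length (0x467 == 1127 here), which
# would then be passed to readexactly() as an "expected" byte count.
corrupt = frame[1:]
bad = read_length(corrupt)
```

The retry loop cannot detect this: the bytes it receives are internally consistent, just misaligned, so the failure surfaces later as a bogus read request rather than at the point of loss.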