[PATCH STABLE] changegroup: retry read() when partial data returned
Yuya Nishihara
yuya at tcha.org
Mon Mar 27 11:13:18 EDT 2017
On Sun, 26 Mar 2017 11:56:59 -0700, Gregory Szorc wrote:
> # HG changeset patch
> # User Gregory Szorc <gregory.szorc at gmail.com>
> # Date 1490554582 25200
> # Sun Mar 26 11:56:22 2017 -0700
> # Branch stable
> # Node ID 414128ddd876eb33f70b0dd95d110e29a9308c93
> # Parent ed5b25874d998ababb181a939dd37a16ea644435
> changegroup: retry read() when partial data returned
>
> We've seen a rash of "stream ended unexpectedly" errors in the
> wild. This occurs when changegroup.readexactly() fails to retrieve
> the exact number of requested bytes in a single stream.read()
> operation.
>
> There are several plausible theories that explain this behavior.
> Many include underlying APIs not retrying automatically when
> interrupted by a signal. While Python is supposed to do this,
> it could be buggy. There could also be a bug in the various
> stream reading layers between changegroup.readexactly() and the
> underlying file descriptor.
>
> Unfortunately, getting this error to reproduce has been extremely
> difficult and no single cause has been identified.
>
> Since networks (and even filesystems) are unreliable, it seems
> wise to have some form of retry in place to mitigate this failure.
> This patch adds that retry logic.
>
> There is already basic test coverage that the exception in this
> function is raised. The tests didn't change with this patch.
>
> The only obvious negative impact to this change I can see is that
> in cases where the read operation fails, there is some overhead
> involved with retrying the operation. In the worst case, this is
> pure overhead: we'd retry an operation and get the same outcome.
> But in the best case, it avoids an intermittent/random failure.
> My assumption is that the overhead of retrying is small and is
> justified by avoiding a random failure.
>
> There are other changes I'd like to make to stream reading code
> to help flush out this general failure of partial stream reads.
> See issue4923 for some suggestions with regards to swallowing
> exceptions and masking the underlying error. However, these
> changes aren't suitable for the stable branch. This patch feels
> minimally invasive and if successful would let us narrow our focus
> for finding the underlying bug.
>
> diff --git a/mercurial/changegroup.py b/mercurial/changegroup.py
> --- a/mercurial/changegroup.py
> +++ b/mercurial/changegroup.py
> @@ -35,12 +35,42 @@ from . import (
>
> def readexactly(stream, n):
> '''read n bytes from stream.read and abort if less was available'''
> - s = stream.read(n)
> - if len(s) < n:
> + # Most of the time, stream.read() returns exactly the number of bytes
> + # requested. Various Python APIs will perform multiple system calls
> + # to retrieve more bytes if the first call did not return enough. This
> + # includes cases where the system call is interrupted.
> + #
> + # Even with this retry logic in the Python APIs, we still see failure
> + # to retrieve the requested number of bytes. So, we build in our own
> + # retry layer here.
> + left = n
> + chunk = stream.read(n)
> + left -= len(chunk)
> + assert left >= 0
> +
> + # Common case is it just works.
> + if not left:
> + return chunk
> +
> + chunks = [chunk]
> + # Retry to get remaining data. In cases where the stream has errored or
> + # is at EOF, this will do a bit of redundant work. But it helps prevent
> + # intermittent failures and isn't common. So the overhead is acceptable.
> + for i in range(3):
> + chunk = stream.read(left)
> + chunks.append(chunk)
> + left -= len(chunk)
> + assert left >= 0
> +
> + if not left:
> + break
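(For reference, the retry approach in the quoted hunk can be condensed into a standalone sketch. The function name and structure follow the patch, but the EOFError and its message are illustrative stand-ins for Mercurial's own error type, and the retry count is simply the patch's fixed three attempts.)

```python
import io

def readexactly(stream, n):
    """Read exactly n bytes from stream.read(); raise if fewer are available.

    Simplified sketch of the patch's retry logic: one initial read plus
    up to three retries for the remaining bytes.
    """
    chunks = []
    left = n
    for _ in range(4):  # initial read plus up to 3 retries
        chunk = stream.read(left)
        chunks.append(chunk)
        left -= len(chunk)
        assert left >= 0
        if not left:
            return b''.join(chunks)
    raise EOFError('stream ended unexpectedly (got %d bytes, expected %d)'
                   % (n - left, n))
```

Against a well-behaved stream this returns on the first iteration; against a stream that returns short reads, the loop accumulates chunks until the requested count is reached or the retries are exhausted.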
Do you have any numbers showing that this change actually mitigates the
problem?
If the underlying functions have bugs, I don't think we can assume the
retries will always yield contiguous chunks with no data loss. If a few
bytes were lost, for example, we could read an arbitrary part of the
stream as a length field, and then call readexactly() with that invalid
length value.
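(To illustrate that concern: changegroup chunks are length-prefixed, so a stream that silently drops bytes can cause payload bytes to be reinterpreted as a length header. This is a simplified sketch of the framing hazard only; the real changegroup format has additional structure beyond a bare 4-byte big-endian length prefix.)

```python
import struct

def read_length(prefix):
    """Parse a 4-byte big-endian length field, as changegroup framing does."""
    return struct.unpack('>l', prefix[:4])[0]

# A well-formed frame: length 4, then payload b'good'.
frame = struct.pack('>l', 4) + b'good'
assert read_length(frame) == 4

# If the stream silently drops one byte, a payload byte shifts into the
# length field and we parse a garbage length (0x467 == 1127 here), which
# would then be passed to readexactly() as an "expected" byte count.
corrupt = frame[1:]
bad = read_length(corrupt)
```

The retry loop cannot detect this: the bytes it receives are internally consistent, just misaligned, so the failure surfaces later as a bogus read request rather than at the point of loss.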