[PATCH STABLE] changegroup: retry read() when partial data returned

Gregory Szorc gregory.szorc at gmail.com
Mon Mar 27 11:25:41 EDT 2017



> On Mar 27, 2017, at 08:13, Yuya Nishihara <yuya at tcha.org> wrote:
> 
>> On Sun, 26 Mar 2017 11:56:59 -0700, Gregory Szorc wrote:
>> # HG changeset patch
>> # User Gregory Szorc <gregory.szorc at gmail.com>
>> # Date 1490554582 25200
>> #      Sun Mar 26 11:56:22 2017 -0700
>> # Branch stable
>> # Node ID 414128ddd876eb33f70b0dd95d110e29a9308c93
>> # Parent  ed5b25874d998ababb181a939dd37a16ea644435
>> changegroup: retry read() when partial data returned
>> 
>> We've seen a rash of "stream ended unexpectedly" errors in the
>> wild. This occurs when changegroup.readexactly() fails to retrieve
>> the exact number of requested bytes in a single stream.read()
>> operation.
>> 
>> There are several plausible theories that explain this behavior.
>> Many involve underlying APIs not retrying automatically when
>> interrupted by a signal. While Python is supposed to do this,
>> it could be buggy. There could also be a bug in the various
>> stream reading layers between changegroup.readexactly() and the
>> underlying file descriptor.
>> 
>> Unfortunately, getting this error to reproduce has been extremely
>> difficult and no single cause has been identified.
>> 
>> Since networks (and even filesystems) are unreliable, it seems
>> wise to have some form of retry in place to mitigate this failure.
>> This patch adds that retry logic.
>> 
>> There is already basic test coverage that the exception in this
>> function is raised. The tests didn't change with this patch.
>> 
>> The only obvious negative impact of this change I can see is that
>> in cases where the read operation fails, there is some overhead
>> involved in retrying the operation. In the worst case, this is
>> pure overhead: we'd retry an operation and get the same outcome.
>> But in the best case, it avoids an intermittent/random failure.
>> My assumption is that the retry overhead is small and is justified
>> by the random failures it avoids.
>> 
>> There are other changes I'd like to make to stream reading code
>> to help flush out this general failure of partial stream reads.
>> See issue4923 for some suggestions with regard to swallowing
>> exceptions and masking the underlying error. However, these
>> changes aren't suitable for the stable branch. This patch feels
>> minimally invasive and if successful would let us narrow our focus
>> for finding the underlying bug.
>> 
>> diff --git a/mercurial/changegroup.py b/mercurial/changegroup.py
>> --- a/mercurial/changegroup.py
>> +++ b/mercurial/changegroup.py
>> @@ -35,12 +35,42 @@ from . import (
>> 
>> def readexactly(stream, n):
>>     '''read n bytes from stream.read and abort if less was available'''
>> -    s = stream.read(n)
>> -    if len(s) < n:
>> +    # Most of the time, stream.read() returns exactly the number of bytes
>> +    # requested. Various Python APIs will perform multiple system calls
>> +    # to retrieve more bytes if the first call did not return enough. This
>> +    # includes cases where the system call is interrupted.
>> +    #
>> +    # Even with this retry logic in the Python APIs, we still see failure
>> +    # to retrieve the requested number of bytes. So, we build in our own
>> +    # retry layer here.
>> +    left = n
>> +    chunk = stream.read(n)
>> +    left -= len(chunk)
>> +    assert left >= 0
>> +
>> +    # Common case is it just works.
>> +    if not left:
>> +        return chunk
>> +
>> +    chunks = [chunk]
>> +    # Retry to get remaining data. In cases where the stream has errored or
>> +    # is at EOF, this will do a bit of redundant work. But it helps prevent
>> +    # intermittent failures and isn't common. So the overhead is acceptable.
>> +    for i in range(3):
>> +        chunk = stream.read(left)
>> +        chunks.append(chunk)
>> +        left -= len(chunk)
>> +        assert left >= 0
>> +
>> +        if not left:
>> +            break
> 
> Do you have some number that supports this change actually mitigates the
> problem?

No. I could try to get that this week.

> 
> Supposing the underlying functions have bugs, I don't think we can
> always get contiguous chunks with no data loss. If a few bytes are
> lost, for example, we could read an arbitrary part of the stream as a
> length field and call readexactly() with that invalid length value.
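(Yuya's concern can be illustrated with the kind of 4-byte big-endian length prefix changegroup chunks use; the frame contents here are made up for the example. Losing a single byte shifts the framing so that payload bytes are misread as a length:)

```python
import struct

# A chunk-style frame: 4-byte big-endian length prefix, then payload.
frame = struct.pack('>I', 5) + b'hello'

# If the first byte is silently dropped, the next four bytes are
# misinterpreted as the length field.
corrupted = frame[1:]
(bogus_len,) = struct.unpack('>I', corrupted[:4])

# The reader would now try to readexactly() 1384 bytes instead of 5,
# which no amount of retrying can satisfy.
```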

Hmm. My assumption is that any error that would result in data loss effectively poisons the stream and prevents future reads.
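For what it's worth, the retry loop in the patch can be exercised outside Mercurial with a small harness. FlakyStream and the IOError message below are hypothetical stand-ins (the real readexactly() aborts with "stream ended unexpectedly"); this is only a sketch of the pattern, not the shipped code:

```python
import io

class FlakyStream:
    """Hypothetical stream returning at most `cap` bytes per read(),
    simulating the partial reads seen in the wild."""
    def __init__(self, data, cap):
        self._buf = io.BytesIO(data)
        self._cap = cap

    def read(self, n):
        # Deliberately return fewer bytes than requested when n > cap.
        return self._buf.read(min(n, self._cap))

def readexactly(stream, n, retries=3):
    """Sketch of the patch's approach: read n bytes, retrying a bounded
    number of times before declaring the stream ended."""
    chunk = stream.read(n)
    left = n - len(chunk)

    # Common case: one read() returns everything.
    if not left:
        return chunk

    chunks = [chunk]
    for _ in range(retries):
        chunk = stream.read(left)
        chunks.append(chunk)
        left -= len(chunk)
        if not left:
            break

    if left:
        raise IOError('stream ended unexpectedly (got %d bytes, expected %d)'
                      % (n - left, n))
    return b''.join(chunks)
```

A stream capped at 3 bytes per read still satisfies an 8-byte request after two retries, while a stream that genuinely ran dry exhausts the retry budget and fails as before.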


More information about the Mercurial-devel mailing list