[PATCH STABLE] changegroup: retry read() when partial data returned

Sun Mar 26 18:56:59 UTC 2017

# HG changeset patch
# User Gregory Szorc <gregory.szorc at gmail.com>
# Date 1490554582 25200
#      Sun Mar 26 11:56:22 2017 -0700
# Branch stable
# Node ID 414128ddd876eb33f70b0dd95d110e29a9308c93
# Parent  ed5b25874d998ababb181a939dd37a16ea644435
changegroup: retry read() when partial data returned

We've seen a rash of "stream ended unexpectedly" errors in the
wild. This occurs when changegroup.readexactly() fails to retrieve
the exact number of requested bytes in a single stream.read()
operation.

There are several plausible theories that explain this behavior.
Many include underlying APIs not retrying automatically when
interrupted by a signal. While Python is supposed to do this,
it could be buggy. There could also be a bug in the various
stream reading layers between changegroup.readexactly() and the
underlying file descriptor.

Unfortunately, getting this error to reproduce has been extremely
difficult and no single cause has been identified.

Since networks (and even filesystems) are unreliable, it seems
wise to have some form of retry in place to mitigate this failure.
This patch adds that retry logic.

There is already basic test coverage that the exception in this
function is raised. The tests didn't change with this patch.

The only obvious negative impact to this change I can see is that
in cases where the read operation fails, there is some overhead
involved with retrying the operation. In the worst case, this is
pure overhead: we'd retry an operation and get the same outcome.
But in the best case, it avoids an intermittent/random failure.
My assumption is that the overhead to retry is small and that
the cost to avoiding a random failure is justified.

There are other changes I'd like to make to stream reading code
to help flush out this general failure of partial stream reads.
See issue4923 for some suggestions with regards to swallowing
exception and masking the underlying error. However, these
changes aren't suitable for the stable branch. This patch feels
minimally invasive and if successful would let us narrow our focus
for finding the underlying bug.

diff --git a/mercurial/changegroup.py b/mercurial/changegroup.py
--- a/mercurial/changegroup.py
+++ b/mercurial/changegroup.py
@@ -35,12 +35,42 @@ from . import (
 
 def readexactly(stream, n):
     '''read n bytes from stream.read and abort if less was available'''
-    s = stream.read(n)
-    if len(s) < n:
+    # Most of the time, stream.read() returns exactly the number of bytes
+    # requested. Various Python APIs will perform multiple system calls
+    # to retrieve more bytes if the first call did not return enough. This
+    # includes cases where the system call is interrupted.
+    #
+    # Even with this retry logic in the Python APIs, we still see failure
+    # to retrieve the requested number of bytes. So, we build in our own
+    # retry layer here.
+    left = n
+    chunk = stream.read(n)
+    left -= len(chunk)
+    assert left >= 0
+
+    # Common case is it just works.
+    if not left:
+        return chunk
+
+    chunks = [chunk]
+    # Retry to get remaining data. In cases where the stream has errored or
+    # is at EOF, this will do a bit of redundant work. But it helps prevent
+    # intermittent failures and isn't common. So the overhead is acceptable.
+    for i in range(3):
+        chunk = stream.read(left)
+        chunks.append(chunk)
+        left -= len(chunk)
+        assert left >= 0
+
+        if not left:
+            break
+
+    if left:
         raise error.Abort(_("stream ended unexpectedly"
                            " (got %d bytes, expected %d)")
-                          % (len(s), n))
-    return s
+                          % (n - left, n))
+
+    return b''.join(chunks)
 
 def getchunk(stream):
     """return the next chunk from stream as a string"""