A possible explanation for random "stream ended unexpectedly (got m bytes, expected n)"

Gregory Szorc gregory.szorc at gmail.com
Sat Apr 8 02:26:13 UTC 2017


On Sat, Mar 25, 2017 at 10:34 AM, Gregory Szorc <gregory.szorc at gmail.com>
wrote:

> On Sat, Mar 25, 2017 at 4:19 AM, Mike Hommey <mh at glandium.org> wrote:
>
>> Hi,
>>
>> I don't know about you, but occasionally, I've hit "stream ended
>> unexpectedly (got m bytes, expected n)" errors that didn't make sense.
>> Retrying would always work.
>>
>> Recently, I was trying to use signal.setitimer and a signal handler for
>> some memory profiling on git-cinnabar, which uses mercurial as a
>> library, and got "stream ended unexpectedly (got m bytes, expected n)"
>> *very* reproducibly. Like, with an interval timer firing every second,
>> it took only a few seconds to hit the error during a clone.
>>
>> I'm pretty sure this can be reproduced with a similar setup in mercurial
>> itself.
>>
>> Now, the reason this happens in this case is that the code that fails
>> does:
>>
>> def readexactly(stream, n):
>>     '''read n bytes from stream.read and abort if less was available'''
>>     s = stream.read(n)
>>     if len(s) < n:
>>         raise error.Abort(_("stream ended unexpectedly"
>>                            " (got %d bytes, expected %d)")
>>                           % (len(s), n))
>>     return s
>>
>> ... and thanks to POSIX, interrupted reads can lead to short reads. So
>> you request n bytes and get fewer, just because something interrupted the
>> process. The problem is that Python doesn't tell you why you got a short
>> read; you have to figure that out on your own.
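>>
>> To illustrate (a minimal sketch, not a proposed patch, reusing the same
>> error and _ names as the snippet above): one way to tolerate short reads
>> is to loop until either n bytes have arrived or the stream genuinely hits
>> EOF:
>>
>> def readexactly(stream, n):
>>     '''read n bytes from stream.read and abort if less was available'''
>>     chunks = []
>>     remaining = n
>>     while remaining:
>>         s = stream.read(remaining)
>>         if not s:
>>             # genuine EOF: fewer than n bytes will ever arrive
>>             break
>>         chunks.append(s)
>>         remaining -= len(s)
>>     s = ''.join(chunks)
>>     if len(s) < n:
>>         raise error.Abort(_("stream ended unexpectedly"
>>                             " (got %d bytes, expected %d)")
>>                           % (len(s), n))
>>     return s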
>>
>> The same kind of problem also exists, to some extent, for writes.
>>
>> Now, this piece of code is probably the most visible place where the
>> issue shows up, but many other places in the mercurial code base are
>> likely affected as well.
>>
>> And while the signal.setitimer case is a corner case (I ended up using a
>> separate thread to work around the problem; my code wasn't
>> interruption-safe anyway), I wonder whether those random "stream ended
>> unexpectedly (got m bytes, expected n)" errors I was getting under normal
>> circumstances are just another manifestation of the same underlying
>> issue: the code doesn't tolerate interrupted reads.
>>
>> Disclaimer: I'm not going to work on fixing this issue, but I figured
>> I'd let you know, in case someone wants to look into it more deeply.
>>
>
> Thank you for writing this up. This "stream ended unexpectedly" error has
> been a thorn in my side for a while: it comes up in Mozilla's CI at a rate
> somewhere between 1 in 100 and 1 in 1,000 operations. Even retrying failed
> operations multiple times isn't enough to overcome it.
>
> I have long suspected interrupted system calls as a likely culprit.
> However, when I initially investigated this a few months ago, I found that
> Python's I/O APIs automatically retry on EINTR. See
> https://hg.python.org/cpython/file/54c93e0fe79b/Lib/socket.py#l365 for
> example. This /should/ make e.g. socket._fileobject.read() resilient
> against signal interruption. (If Python's I/O APIs didn't do this, tons of
> programs would break. Also, the semantics of .read() are such that it is
> always supposed to retrieve all available bytes until EOF, at least for
> some io ABCs. read1() exists to perform at most one system call.)
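>
> For reference, this is roughly the retry pattern used there (paraphrased,
> not a verbatim copy of the linked socket.py code; recv_with_retry is just
> an illustrative name):
>
> from errno import EINTR
> import socket
>
> def recv_with_retry(sock, n):
>     # keep calling recv() until we get data, EOF or a real error;
>     # an EINTR caused by a signal is swallowed and the call retried
>     while True:
>         try:
>             return sock.recv(n)
>         except socket.error as e:
>             if e.args[0] == EINTR:
>                 continue
>             raise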
>
> But you provide compelling evidence that signal interruptions do in fact
> result in truncated reads. So something is obviously going wrong.
>
> The stream.read() in changegroup.readexactly() actually follows a pretty
> complicated path. Here is what happens when reading from an HTTP response:
>
> 1. stream is a mercurial.httppeer.readerproxy
> 2. The readerproxy is constructed from util._zlibengine.decompressorreader
> and a mercurial.keepalive.HTTPResponse instance
> 3. decompressorreader takes the file object interface from the
> HTTPResponse, constructs an util.filechunkiter, feeds that into a
> zlib.decompressobj and turns that into a util.chunkbuffer
> 4. When you stream.read(), it has to traverse through a util.chunkbuffer,
> a generator of zlib decompressed chunks, a util.filechunkiter, and HTTP
> chunked transfer decoding before it finally gets to a recv() in
> socket._fileobject.read().
>
> Any of those layers could mishandle a signal interruption or fail to retry
> a truncated read. I suspect our bug is somewhere in this chain.
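>
> To make the layering concrete, here is a heavily simplified, hypothetical
> sketch of that chain (not Mercurial's actual implementations; the names
> just mirror the roles above). Every layer trusts the one below it, so an
> unretried short read or premature EOF at the socket surfaces as truncation
> at the very top:
>
> import zlib
>
> def filechunkiter(fh, size=65536):
>     # yield raw chunks from the underlying file object until EOF
>     while True:
>         chunk = fh.read(size)
>         if not chunk:
>             break
>         yield chunk
>
> def decompressed(chunks):
>     # lazily decompress a stream of zlib-compressed chunks
>     d = zlib.decompressobj()
>     for chunk in chunks:
>         yield d.decompress(chunk)
>     tail = d.flush()
>     if tail:
>         yield tail
>
> class chunkbuffer(object):
>     '''buffer an iterator of chunks and serve read(n) requests'''
>     def __init__(self, it):
>         self._it = it
>         self._buf = ''
>
>     def read(self, n):
>         # if the iterator ends early (EOF below), this silently
>         # returns fewer than n bytes to the caller
>         while len(self._buf) < n:
>             try:
>                 self._buf += next(self._it)
>             except StopIteration:
>                 break
>         data, self._buf = self._buf[:n], self._buf[n:]
>         return data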
>
> I have some other wrinkles to add.
>
> Mozilla sees these stream failures on Linux and OS X in the wild, but they
> are more than 10x as frequent on Windows. I'm pretty sure we're running
> Python 2.7.12 on Windows (at my insistence, to help rule out an old Python
> as the source of the bug).
>
> Mozilla also sees "unexpected negative part header" errors, e.g.
> https://bugzilla.mozilla.org/show_bug.cgi?id=1333348. The code in bundle2
> that emits this error also uses changegroup.readexactly. I've long
> suspected the cause is a bit flip or a read of a truncated or
> uninitialized 4-byte field. I would not at all be surprised if "stream
> ended unexpectedly" and "unexpected negative part header" share the same
> root cause.
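>
> For what it's worth, assuming the part header length is decoded as a
> signed 4-byte big-endian integer (which is what the error message
> suggests), a flipped high bit or garbage bytes in that field are enough to
> produce a negative value:
>
> import struct
>
> struct.unpack('>i', '\x00\x00\x01\x00')[0]   # intact header: 256
> struct.unpack('>i', '\x80\x00\x01\x00')[0]   # high bit set: -2147483392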
>
> Anyway, since you can reproduce this pretty reliably with a constant
> barrage of signals, I reckon the next step is for someone to record program
> execution with rr and find out how signals result in data truncation. I was
> tentatively planning to spend some time next week looking into this general
> problem (because it is plaguing Mozilla), so I guess that person will be
> me.
>
> Thanks again for the write-up. It is very helpful.
>

Mozilla started seeing "stream ended unexpectedly" a lot more frequently
after upgrading server repos to generaldelta. Specifically, clients with
non-generaldelta clones started seeing the failure more often.

I packet traced my way to glory:
https://bugzilla.mozilla.org/show_bug.cgi?id=1291926#c78

tl;dr: it looks like the hg client consumes/buffers server output, goes
idle on the network for a while because it is busy applying that data, and
the server then drops the connection, leading to "stream ended
unexpectedly."
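
Here is a tiny, self-contained demo of why a server-side drop looks like
truncation to the client (hypothetical sketch, not Mercurial code, POSIX
only): once the peer closes the socket, recv() returns '' and every layer
above sees EOF, so a readexactly-style check comes up short.

import socket

server, client = socket.socketpair()    # stand-ins for HTTP server/client
server.sendall('partial response')
server.close()                          # server drops the idle connection

data = ''
expected = 100
while len(data) < expected:
    chunk = client.recv(expected - len(data))
    if not chunk:                       # EOF: the peer went away
        break
    data += chunk

print('got %d bytes, expected %d' % (len(data), expected))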

What makes this confusing is that the OSError from the server-side drop is
masked because bundle2 tries to recover by reading to EOF. Expect some
patches from me in the 4.2 cycle.

I'm still trying to reproduce the signal-based failure. So many root causes
all lead to the same "stream ended unexpectedly." The least we can do in 4.2
is make the error reporting better.

