A possible explanation for random "stream ended unexpectedly (got m bytes, expected n)"

Matt Harbison mharbison72 at gmail.com
Wed Jun 7 02:30:23 UTC 2017


On Fri, 07 Apr 2017 22:26:13 -0400, Gregory Szorc  
<gregory.szorc at gmail.com> wrote:

> On Sat, Mar 25, 2017 at 10:34 AM, Gregory Szorc <gregory.szorc at gmail.com>
> wrote:
>
>> On Sat, Mar 25, 2017 at 4:19 AM, Mike Hommey <mh at glandium.org> wrote:
>>
>>> Hi,
>>>
>>> I don't know about you, but occasionally, I've hit "stream ended
>>> unexpectedly (got m bytes, expected n)" errors that didn't make sense.
>>> Retrying would always work.
>>>
>>> Recently, I was trying to use signal.setitimer and a signal handler for
>>> some memory profiling on git-cinnabar, which uses mercurial as a
>>> library, and got "stream ended unexpectedly (got m bytes, expected n)"
>>> *very* reproducibly. Like, with an interval timer firing every second,
>>> it took only a few seconds to hit the error during a clone.
>>>
>>> I'm pretty sure this can be reproduced with a similar setup in  
>>> mercurial
>>> itself.
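
A minimal sketch of that kind of setup (an illustration, not Mike's actual
code, and assuming a plain no-op SIGALRM handler driven by an interval timer
is enough to interrupt the blocking reads):

# Hypothetical repro sketch: install a do-nothing SIGALRM handler and fire
# it every second so that blocking reads during a clone/pull get interrupted
# and, on Python 2, may come back short.
import signal

def tick(signum, frame):
    pass  # the handler only needs to exist for the read syscall to be interrupted

signal.signal(signal.SIGALRM, tick)
signal.setitimer(signal.ITIMER_REAL, 1.0, 1.0)  # first fire after 1s, then every 1s
# ... then run the clone/pull; readexactly() trips over the resulting short reads ...
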
>>>
>>> Now, the reason this happens in this case is that, the code that fails
>>> does:
>>>
>>> def readexactly(stream, n):
>>>     '''read n bytes from stream.read and abort if less was available'''
>>>     s = stream.read(n)
>>>     if len(s) < n:
>>>         raise error.Abort(_("stream ended unexpectedly"
>>>                            " (got %d bytes, expected %d)")
>>>                           % (len(s), n))
>>>     return s
>>>
>>> ... and thanks to POSIX, interrupted reads can lead to short reads. So,
>>> you request n bytes, and get less, just because something interrupted
>>> the process.  The problem then is that python doesn't let you know why
>>> you just got a short read, and you have to figure that out on your own.
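
For what it's worth, a minimal sketch of a short-read-tolerant variant
(illustration only, not the upstream fix; it assumes the same error and _
names the quoted snippet uses):

def readexactly(stream, n):
    '''read n bytes from stream.read, retrying short reads, abort on real EOF'''
    chunks = []
    remaining = n
    while remaining > 0:
        s = stream.read(remaining)
        if not s:
            # genuine EOF: nothing more is coming
            raise error.Abort(_("stream ended unexpectedly"
                                " (got %d bytes, expected %d)")
                              % (n - remaining, n))
        chunks.append(s)
        remaining -= len(s)
    return ''.join(chunks)
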
>>>
>>> The same kind of problem also exists, to some extent, for writes.
>>>
>>> Now, the problem is that this piece of code is likely the most visible
>>> place where the issue exists, but there are many other places in the
>>> mercurial code base that are likely affected.
>>>
>>> And while the signal.setitimer case is a corner case (I ended up
>>> using a separate thread to work around the problem; my code wasn't
>>> interruption-safe anyway), I wonder whether the random "stream ended
>>> unexpectedly (got m bytes, expected n)" errors I was getting under
>>> normal circumstances are just a manifestation of the same underlying
>>> issue: the code doesn't tolerate interrupted reads.
>>>
>>> Disclaimer: I'm not going to work on fixing this issue, but I figured
>>> I'd let you know, in case someone wants to look into it more deeply.
>>>
>>
>> Thank you for writing this up. This "stream ended unexpectedly" error has
>> been a thorn in my side for a while, as it comes up in Mozilla's CI at a
>> rate somewhere between 1 in 100 and 1 in 1000. Even retrying failed
>> operations multiple times isn't enough to overcome it.
>>
>> I have long suspected interrupted system calls as a likely culprit.
>> However, when I initially investigated this a few months ago, I found  
>> that
>> Python's I/O APIs retry automatically for EINTR. See
>> https://hg.python.org/cpython/file/54c93e0fe79b/Lib/socket.py#l365 for
>> example. This /should/ make e.g. socket._fileobject.read() resilient
>> against signal interruption. (If Python's I/O APIs didn't do this, tons  
>> of
>> programs would break. Also, the semantics of .read() are such that it is
>> always supposed to retrieve all available bytes until EOF - at least for
>> some io ABCs. read1() exists to perform at most 1 system call.)
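
The retry in question is essentially this pattern, repeated inside Python's
own I/O layers (a sketch of the idea using a raw socket; PEP 475 later made
it automatic in Python 3.5+):

# Any layer in the read chain that calls the raw syscall without a loop like
# this can surface an interruption as a short read to its caller.
import errno
import socket

def recv_retrying_eintr(sock, n):
    while True:
        try:
            return sock.recv(n)
        except socket.error as e:
            if e.errno != errno.EINTR:
                raise
            # interrupted by a signal before any data arrived: retry
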
>>
>> But you provide compelling evidence that signal interruptions do in fact
>> result in truncated reads. So something is obviously going wrong.
>>
>> The stream.read() in changegroup.readexactly() actually follows a pretty
>> complicated path. Here is what happens when reading from an HTTP  
>> response:
>>
>> 1. stream is a mercurial.httppeer.readerproxy
>> 2. The readerproxy is constructed from  
>> util._zlibengine.decompressorreader
>> and a mercurial.keepalive.HTTPResponse instance
>> 3. decompressorreader takes the file object interface from the
>> HTTPResponse, constructs an util.filechunkiter, feeds that into a
>> zlib.decompressobj and turns that into a util.chunkbuffer
>> 4. When you stream.read(), it has to traverse through a  
>> util.chunkbuffer,
>> a generator of zlib decompressed chunks, a util.filechunkiter, and HTTP
>> chunked transfer decoding before it finally gets to a recv() in
>> socket._fileobject.read().
>>
>> Any of those layers could have a bug where a signal interrupts a read or a
>> truncated read isn't retried. I suspect our bug is somewhere in this
>> chain.
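
A toy illustration of why the layering matters (not Mercurial's actual
classes): any layer that assumes read(size) only comes back short at
end-of-stream will silently truncate everything above it.

def fixed_size_records(fileobj, size=4096):
    while True:
        data = fileobj.read(size)
        if len(data) < size:
            # short read treated as "last record + EOF"; if the shortness was
            # really a signal interruption, the stream ends early right here
            if data:
                yield data
            return
        yield data
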
>>
>> I have some other wrinkles to add.
>>
>> Mozilla sees these stream failures much more frequently on Windows. They
>> do happen on Linux and OS X in the wild, but they are >10x more frequent on
>> Windows. I'm pretty sure we're running Python 2.7.12 on Windows (at my
>> insistence, to help rule out an old Python as the source of the bug).
>>
>> Mozilla also sees "unexpected negative part header" errors. e.g.
>> https://bugzilla.mozilla.org/show_bug.cgi?id=1333348. The code in  
>> bundle2
>> that is emitting this error also uses changegroup.readexactly. I've long
>> suspected the cause of this is a bit flip or reading a truncated or
>> uninitialized 4 byte field. I would not at all be surprised if "stream
>> ended unexpectedly" and "unexpected negative part header" share the same
>> root cause.
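
A toy illustration of the "negative part header" theory (if memory serves,
the part header size is a big-endian signed 32-bit field, so garbage,
misaligned, or bit-flipped bytes easily decode as a negative value):

import struct

assert struct.unpack('>i', b'\x00\x00\x01\x00')[0] == 256  # a plausible size
assert struct.unpack('>i', b'\x80\x00\x01\x00')[0] < 0     # flipped high bit: negative
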
>>
>> Anyway, since you can reproduce this pretty reliably with a constant
>> barrage of signals, I reckon the next step is for someone to record  
>> program
>> execution with rr and find out how signals are resulting in data
>> truncation. I was tentatively planning to spend some time next week to  
>> look
>> into this general problem (because it is plaguing Mozilla). So I guess  
>> that
>> person will be me.
>>
>> Thanks again for the write-up. It is very helpful.
>>
>
> Mozilla started seeing "stream ended unexpectedly" a lot more frequently
> after upgrading server repos to generaldelta. Specifically, clients with
> non-generaldelta clones started seeing the failure more often.
>
> I packet traced my way to glory:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1291926#c78
>
> tl;dr it looks like the hg client consumes/buffers server output, goes idle
> on the network for a while because it is busy applying the data, and the
> server drops the connection, leading to "stream ended unexpectedly."
>
> What makes this confusing is that the OSError from the server-side drop is
> masked by bundle2 trying to recover and read to EOF. Expect some patches
> from me for the 4.2 cycle.
>
> I'm still trying to reproduce the signal-based failure. So many root causes
> all lead to the same "stream ended unexpectedly." The least we can do in
> 4.2 is make the error reporting better.

Any more info on this?

We are starting to see this for a handful of developers on various repos.
So far, every instance has involved cloning on Windows 7 from a CentOS 7
server.  The strange aspect is that it is *very* consistent for the people
it strikes.  One developer had it happen every time over a period of days,
with both http and https.  But then, when he tried to capture a Wireshark
trace, the problem stopped.

Today, a new guy on a fresh install of Windows 7 + updates + Visual Studio
got the "0 bytes, 4 expected" message on a very small repo (fewer than 50
commits, I'd bet).  I upgraded him to 4.2.1 because I remembered there was
work in this area to stop swallowing the actual error, and it coughed up
these hairballs:

d:\dev\sw>hg clonevcs sw/toolchain/tpos --traceback
destination directory: tpos
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
Traceback (most recent call last):
   File "mercurial\scmutil.pyo", line 146, in callcatch
   File "mercurial\dispatch.pyo", line 285, in _runcatchfunc
   File "mercurial\dispatch.pyo", line 912, in _dispatch
   File "mercurial\dispatch.pyo", line 648, in runcommand
   File "mercurial\dispatch.pyo", line 920, in _runcommand
   File "mercurial\dispatch.pyo", line 909, in <lambda>
   File "mercurial\util.pyo", line 1080, in check
   File "mercurial\dispatch.pyo", line 506, in __call__
   File "mercurial\util.pyo", line 1080, in check
   File "mercurial\commands.pyo", line 1561, in clone
   File "mercurial\hg.pyo", line 619, in clone
   File "mercurial\exchange.pyo", line 1238, in pull
   File "mercurial\exchange.pyo", line 1378, in _pullbundle2
   File "mercurial\bundle2.pyo", line 387, in processbundle
SSLError: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1752)
abort: error: WRONG_VERSION_NUMBER

d:\dev\sw>hg clonevcs sw/toolchain/tpos --traceback --config  
extensions.mercurial_keyring=!
http authorization required for
https://vcs.attotech.com/hg/sw/toolchain/tpos
realm: SONIA :: SCM Manager
user: ssiwinski
password:
destination directory: tpos
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
Traceback (most recent call last):
   File "mercurial\scmutil.pyo", line 146, in callcatch
   File "mercurial\dispatch.pyo", line 285, in _runcatchfunc
   File "mercurial\dispatch.pyo", line 912, in _dispatch
   File "mercurial\dispatch.pyo", line 648, in runcommand
   File "mercurial\dispatch.pyo", line 920, in _runcommand
   File "mercurial\dispatch.pyo", line 909, in <lambda>
   File "mercurial\util.pyo", line 1080, in check
   File "mercurial\dispatch.pyo", line 506, in __call__
   File "mercurial\util.pyo", line 1080, in check
   File "mercurial\commands.pyo", line 1561, in clone
   File "mercurial\hg.pyo", line 619, in clone
   File "mercurial\exchange.pyo", line 1238, in pull
   File "mercurial\exchange.pyo", line 1378, in _pullbundle2
   File "mercurial\bundle2.pyo", line 387, in processbundle
SSLError: [SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or  
bad record mac (_ssl.c:1752)
abort: error: DECRYPTION_FAILED_OR_BAD_RECORD_MAC

For him, it worked fine over http.  (The schemes extension is hiding the  
https:// here.)

The thing I'm going to try tomorrow is to run `hg serve` with a certificate,
like the test suite does, to try to eliminate tomcat and scmmanager as
suspects.  I forgot about the generaldelta tidbit, so I'll have to check
that tomorrow too.  The repos were created with 3.9 IIRC, and everybody
should be on 4.0.2 locally.  (Is there a debug command for that with 3.9?)
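
One low-tech check that doesn't depend on any particular debug command being
available in 3.9: a repository records "generaldelta" in .hg/requires if it
uses that format, e.g.

with open('.hg/requires') as f:
    requires = f.read().split()
print('generaldelta' in requires)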

Any other suggestions?  If I can find replacement hardware, I might
commandeer the machine.  But since the problem tends to disappear, I don't
know if it will help.

