A possible explanation for random "stream ended unexpectedly (got m bytes, expected n)"

Sun Jun 11 21:53:28 EDT 2017

On Sun, 11 Jun 2017 16:01:39 -0400, Sean Farley <sean at farley.io> wrote:

> Matt Harbison <mharbison72 at gmail.com> writes:
>
>> On Tue, 06 Jun 2017 22:30:23 -0400, Matt Harbison <mharbison72 at gmail.com> wrote:
>>>
>>
>> Today, we got it to fail over https with a simple `hg serve` from the
>> same server (the output there indicated a broken pipe in
>> self._sslobj.write(data) at ssl.py:689).  The server side has a
>> Mercurial 3.9 install, and the repo is not generaldelta.  The server-side
>> output suggested that it got further in one attempt than in the other.
>>
>> It also failed over http, served from my Linux development machine and
>> from two of my Windows 7 machines.  The client side indicated zlib
>> errors.  I'm starting to wonder if this is a hardware problem of some
>> sort.  That doesn't seem like a satisfying answer, though, because
>> apparently they were remoted into this same client machine for several
>> hours without an issue, and could also clone a much larger repo.  I would
>> think that if the problem is bad enough to fail every time on this repo,
>> other things would be affected too.
>
> I haven't tried this with 'hg serve', but I must note that we've seen
> this with Windows, and we also run CentOS 7 on our servers.  Not a
> smoking gun, of course.

The latest is that we installed another Ethernet adapter, and it works
fine.  Swap the cable back to the old one, and it fails the same way.  So
that pretty much confirms that this instance is a hardware problem.

I'm a bit puzzled by the behavior, though.  I get that undefined behavior
is undefined.  But I would have thought that once the network adapter
transfers the (garbage?) data into memory, TCP's checksum would catch the
bad packets and drop them.  If things get so bad that the connection is
shut down (as the broken pipe on the server side would seem to indicate),
I would expect the read to be short, and the application layer would
notice.
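For context on "TCP would drop bad packets": TCP's corruption check is the
16-bit Internet checksum from RFC 1071, computed over a pseudo-header plus
the segment (this sketch omits the pseudo-header).  It is a fairly weak
check.  It catches any single flipped bit, but some multi-bit corruptions,
such as two aligned 16-bit words trading places, produce the same sum and
would pass.  A minimal Python sketch of the arithmetic only; the payload is
made up, and it doesn't show what actually went wrong with this adapter.

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 Internet checksum: one's complement of the one's complement
    sum of the data, taken as big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad an odd-length input with one zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry
    return ~total & 0xFFFF


if __name__ == "__main__":
    payload = b"Mercurial bundle"  # 16 bytes, so 8 aligned 16-bit words
    csum = internet_checksum(payload)

    # Checksumming the data with its checksum appended yields 0 when
    # nothing was altered in transit.
    print(internet_checksum(payload + csum.to_bytes(2, "big")))  # 0

    # A single flipped bit always changes the checksum, so it is caught.
    flipped = bytes([payload[0] ^ 0x01]) + payload[1:]
    print(internet_checksum(flipped) != csum)  # True

    # Swapping two aligned 16-bit words leaves the sum unchanged, so this
    # corruption would pass the checksum undetected.
    swapped = payload[2:4] + payload[0:2] + payload[4:]
    print(internet_checksum(swapped) == csum)  # True
```

Because the sum is commutative, word reordering is invisible to it, which
is one way garbage can get past TCP and only surface later, e.g. as zlib
errors.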

The *.msi bundles Python 2.7.10, but it also fails with the 2.7.13 in the
Inno installer.  Something else interesting: while Mercurial failed to
transfer a clonebundle, wget was able to fetch it.

I've got a couple of high-priority fires to deal with, so I probably won't
get back to this week.  But if anyone wants to propose patches that might
help with diagnostics and/or make the network I/O more robust, I can try
those out.
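The error in the subject line is the kind a length-checked read raises
when fewer bytes arrive than a length prefix promised.  As one possible
diagnostic aid, here is a minimal Python sketch of such a read that also
reports how many bytes had been consumed in total when the stream ended,
which could be lined up against server-side logs.  Everything here is
hypothetical and for illustration: `CountingReader`, `read_exactly`,
`read_frame`, `StreamEndedError`, and the 4-byte length framing are not
Mercurial's actual changegroup code.

```python
import io
import struct


class StreamEndedError(Exception):
    """Hypothetical error type for this sketch (not Mercurial's error.Abort)."""


class CountingReader:
    """Wraps a binary file-like object and counts the bytes consumed so far."""

    def __init__(self, stream):
        self._stream = stream
        self.offset = 0  # total bytes successfully read from the stream

    def read_exactly(self, n):
        """Return exactly n bytes, or raise with a diagnostic if EOF comes first."""
        chunks = []
        got = 0
        # A raw (unbuffered) stream's read() may return fewer bytes than
        # requested without being at EOF, so keep reading until n bytes
        # arrive or read() returns b"" (EOF).
        while got < n:
            chunk = self._stream.read(n - got)
            if not chunk:
                break
            chunks.append(chunk)
            got += len(chunk)
        self.offset += got
        if got < n:
            raise StreamEndedError(
                "stream ended unexpectedly (got %d bytes, expected %d) "
                "after %d bytes total" % (got, n, self.offset)
            )
        return b"".join(chunks)


def read_frame(reader):
    """Read one hypothetical frame: a 4-byte big-endian length, then payload."""
    (length,) = struct.unpack(">I", reader.read_exactly(4))
    return reader.read_exactly(length)


if __name__ == "__main__":
    ok = CountingReader(io.BytesIO(struct.pack(">I", 5) + b"hello"))
    print(read_frame(ok))  # b'hello'

    # The length prefix promises 10 bytes, but the stream only carries 5.
    short = CountingReader(io.BytesIO(struct.pack(">I", 10) + b"hello"))
    try:
        read_frame(short)
    except StreamEndedError as exc:
        print(exc)
```

The second case prints "stream ended unexpectedly (got 5 bytes, expected
10) after 9 bytes total".  Note that a short read like this only catches
truncation; data that arrives at full length but corrupted would sail
through and fail later, for example in zlib.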