6033 – Weird lfs test failure: reliable on FreeBSD, flaky on linux

Bug 6033 - Weird lfs test failure: reliable on FreeBSD, flaky on linux

Summary: Weird lfs test failure: reliable on FreeBSD, flaky on linux

Status:	RESOLVED FIXED

Alias:	None

Product:	Mercurial
Classification:	Unclassified
Component:	hgweb (show other bugs)
Version:	default branch
Hardware:	PC Mac OS

Importance:	wish bug
Assignee:	Bugzilla

URL:
Keywords:

Depends on:
Blocks:

Reported:	2018-12-05 10:55 UTC by Augie Fackler
Modified:	2019-01-17 00:00 UTC (History)
CC List:	3 users (show)

See Also:
Python Version:	---

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Augie Fackler 2018-12-05 10:55 UTC

I can't make sense of this failure, but it's keeping the BSD buildbot from being green, and it occasionally shows up on Linux too.

--- /usr/home/hg/buildslave/FreeBSD_hg_tests/build/tests/test-lfs-serve-access.t
+++ /usr/home/hg/buildslave/FreeBSD_hg_tests/build/tests/test-lfs-serve-access.t.err
@@ -161,11 +161,13 @@
 
   $ cat $TESTTMP/access.log $TESTTMP/errors.log
   $LOCALIP - - [$LOGDATE$] "POST /missing/objects/batch HTTP/1.1" 404 - (glob)
+  172.16.18.2 - - [05/Dec/2018 13:01:14] "{"objects": [{"oid": "f03217a32529a28a42d03b1244fe09b6e0f9fd06d7b966d4d50567be2abe6c0e", "size": 20}], "operation": "download"}" 400 -
   $LOCALIP - - [$LOGDATE$] "GET /subdir/mount/point?cmd=capabilities HTTP/1.1" 200 - (glob)
   $LOCALIP - - [$LOGDATE$] "GET /subdir/mount/point?cmd=batch HTTP/1.1" 200 - x-hgarg-1:cmds=heads+%3Bknown+nodes%3D x-hgproto-1:0.1 0.2 comp=$USUAL_COMPRESSIONS$ partial-pull (glob)
   $LOCALIP - - [$LOGDATE$] "GET /subdir/mount/point?cmd=getbundle HTTP/1.1" 200 - x-hgarg-1:bookmarks=1&bundlecaps=HG20%2Cbundle2%3DHG20%250Abookmarks%250Achangegroup%253D01%252C02%252C03%250Adigests%253Dmd5%252Csha1%252Csha512%250Aerror%253Dabort%252Cunsupportedcontent%252Cpushraced%252Cpushkey%250Ahgtagsfnodes%250Alistkeys%250Aphases%253Dheads%250Apushkey%250Aremote-changegroup%253Dhttp%252Chttps%250Arev-branch-cache%250Astream%253Dv2&cg=1&common=0000000000000000000000000000000000000000&heads=525251863cad618e55d483555f3d00a2ca99597e&listkeys=bookmarks&phases=1 x-hgproto-1:0.1 0.2 comp=$USUAL_COMPRESSIONS$ partial-pull (glob)
   $LOCALIP - - [$LOGDATE$] "POST /subdir/mount/point/.git/info/lfs/objects/batch HTTP/1.1" 200 - (glob)
   $LOCALIP - - [$LOGDATE$] "GET /subdir/mount/point/.hg/lfs/objects/f03217a32529a28a42d03b1244fe09b6e0f9fd06d7b966d4d50567be2abe6c0e HTTP/1.1" 200 - (glob)
+  172.16.18.2 - - [05/Dec/2018 13:01:14] code 400, message Bad request syntax ('{"objects": [{"oid": "f03217a32529a28a42d03b1244fe09b6e0f9fd06d7b966d4d50567be2abe6c0e", "size": 20}], "operation": "download"}')
 
 Blobs that already exist in the usercache are linked into the repo store, even
 though the client doesn't send the blob.
@@ -193,6 +195,7 @@
   server2/.hg/store/lfs/objects/f0/3217a32529a28a42d03b1244fe09b6e0f9fd06d7b966d4d50567be2abe6c0e
   $ "$PYTHON" $RUNTESTDIR/killdaemons.py $DAEMON_PIDS
   $ cat $TESTTMP/errors.log
+  172.16.18.2 - - [05/Dec/2018 13:01:14] code 400, message Bad request syntax ('{"objects": [{"oid": "f03217a32529a28a42d03b1244fe09b6e0f9fd06d7b966d4d50567be2abe6c0e", "size": 20}], "operation": "download"}')
 
   $ cat >> $TESTTMP/lfsstoreerror.py <<EOF
   > import errno

ERROR: test-lfs-serve-access.t output changed

Comment 1 Augie Fackler 2018-12-05 10:58 UTC

+mharbison in case he's got ideas

Comment 2 Augie Fackler 2018-12-05 10:58 UTC

Note that if access to a FreeBSD system (where this is a reliable failure, not a flake) would help, that can almost certainly be arranged.

Comment 3 Matt Harbison 2018-12-05 12:49 UTC

Looks like something I broke recently, though I have no idea what FreeBSD objects to, or why it would be flaky on Linux.

The first bad revision is:
changeset:   40660:9f78d10742af
user:        Matt Harbison <matt_harbison@yahoo.com>
date:        Thu Nov 15 17:58:59 2018 -0500
summary:     lfs: improve the hints for common errors in the Batch API

Any ideas?

As an aside, I ran 9f78d10742af^ with --loop to make sure it wasn't also flaky on this FreeBSD system, and I ended up having to Ctrl+C twice.

$ ./run-tests.py --local test-lfs-serve-access.t --loop
running 1 tests using 1 parallel processes
.............^CINTERRUPTED: test-lfs-serve-access.t (after 1 seconds)
^C
# Ran 14 tests, 0 skipped, 0 failed.

Comment 4 Augie Fackler 2018-12-05 12:51 UTC

(In reply to Matt Harbison from comment #3)
My experience is that usually flake-on-linux-fail-on-BSD is a result of a subtle difference in the IO subsystems, and there's a real race condition _somewhere_ that FreeBSD is lucky enough to turn into a legit failure.

Comment 5 Matt Harbison 2018-12-05 15:53 UTC

(In reply to Augie Fackler from comment #4)

I think that you're on to something with the race, but I don't understand why.  It looks like this is a problem in hgweb.server, as the LFS server isn't even getting notified.  Here's a simplified test showing the origin of the problem.  (It's accessing an non served path):

diff --git a/tests/test-lfs-serve-access.t b/tests/test-lfs-serve-access.t
--- a/tests/test-lfs-serve-access.t
+++ b/tests/test-lfs-serve-access.t
@@ -81,6 +81,12 @@ Reasonable hint for a misconfigured blob
   (the "lfs.url" config may be used to override http://localhost:$HGPORT/missing)
   [255]

+  $ cat $TESTTMP/access.log
+  $LOCALIP - - [$LOGDATE$] "POST /missing/objects/batch HTTP/1.1" 404 -
+  $LOCALIP - - [05/Dec/2018 20:30:56] "{"objects": [{"oid": "f03217a32529a28a42d03b1244fe09b6e0f9fd06d7b966d4d50567be2abe6c0e", "size": 20}], "operation": "download"}" 400 -
+  $ cat $TESTTMP/errors.log
+  $LOCALIP - - [05/Dec/2018 20:30:56] code 400, message Bad request syntax ('{"objects": [{"oid": "f03217a32529a28a42d03b1244fe09b6e0f9fd06d7b966d4d50567be2abe6c0e", "size": 20}], "operation": "download"}')
+
 Blob URIs are correct when --prefix is used

   $ hg clone --debug http://localhost:$HGPORT/subdir/mount/point cloned2

Notice the client *does* get the 404.  The JSON is that request's body.  So it seems that it's processing the body of the request as the next request?

I copied the exception handler in do_POST() to implement the missing path logic in hgweb.server.do_hgweb(), so I wonder if that would be affected too.  I tried reading 32K from self.rfile here, but the test hung.  I also tried forcing the use of socketserver.ForkingMixIn, which created other errors.  Forcing the empty _mixin class plain hung.  So I wonder if this cruft should go- even Windows seems to have the threading mixin (though for some reason I thought that we didn't use it on Windows).

Updated the branch and component info to reflect my current understanding of things.  I'm not sure how this is supposed to be handled properly in the server.

Comment 6 Matt Harbison 2018-12-26 19:15 UTC

Seemingly important extra info here:

https://www.mercurial-scm.org/pipermail/mercurial-devel/2018-December/126287.html

I still have no idea, though it still seems webserver related.  Any ideas @indygreg?

Comment 7 Augie Fackler 2019-01-06 15:01 UTC

Still flaky on Linux for me. I noticed the error message is weird for hg ("Bad request syntax") and that led me to figure out where it comes from: it's coming from BaseHTTPServer (line 286 on my install), presumably because the client started sending the POST body and the server (somehow) was too quick to respond and treated the POST body line it was a new request line.

Looking further, there's a spot where we bypass most of hgweb, and therefore don't drain the POST body when sending a 404, and it looks like only this LFS test was testing invalid-endpoint behavior that tickled this. I've got a patch I'll send out soon.

Comment 8 Matt Harbison 2019-01-06 23:23 UTC

(In reply to Augie Fackler from comment #7)

Nice find, and thanks for the explanation.

Any thoughts on the various mixin uses mentioned toward the end of comment #5?  I don't want to cleanup things that I don't understand.

Comment 9 HG Bot 2019-01-07 18:30 UTC

Fixed by https://mercurial-scm.org/repo/hg/rev/197f092b2cd9
Augie Fackler <raf@durin42.com>
server: always close http socket if responding with an error (issue6033)

It's possible for hgweb to respond _very_ early with an error if we're
catching certain types of errors. When we do, we need to tell the client
the socket is toast when there's a POST involved because otherwise there
can be lingering POST data on the socket that will confuse any future
requests on the socket. This manifested as a flaky failure on Linux in an
lfs extension test and a reliable failure on FreeBSD. With this patch
applied, test-lfs-serve-access.t now passes for me on FreeBSD.

Differential Revision: https://phab.mercurial-scm.org/D5498

(please test the fix)

Comment 10 HG Bot 2019-01-07 18:38 UTC

Fixed by https://mercurial-scm.org/repo/hg/rev/0a7f582f6f1f
Matt Harbison <matt_harbison@yahoo.com>
largefiles: port wrapped functions to exthelper

Things get interesting in the commit.  I hadn't seen issue6033 on Windows, and
yet it is now reproducible 100% of the time on Windows 10 with this commit.  I
didn't test Linux.  (For comparison, after seeing this issue, I tested on the
parent with --loop, and it failed 5 times out of over 1300 tests.)

The strange thing is that largefiles has nothing to do with that test (it's not
even mentioned there).  It isn't autoloading run amuck- it occurs even if
largefiles is explicitly disabled, and also if the entry in afterhgrcload() is
commented out.  It's also not the import of lfutil- I disabled that by copying
the function into lfs and removing the import, and the problem still occurs.

Experimenting further, it seems that the problem is isolated to 3 entries:
exchange.pushoperation, hg.clone, and cmdutil.revert.  If those decorators are
all commented out, the test passes when run in a loop for awhile.  (Obviously,
some largefiles tests will fail.)  But if any one is commented back in, the test
fails immediately.

I left one method related to wrapping the wire protocol, because it seemed more
natural with the TODO.  Also, exthelper doesn't support wrapping functions from
another extension, only commands in another extension.  I didn't try to figure
out why rebase is both command wrapped and function wrapped.

(please test the fix)

Comment 11 HG Bot 2019-01-07 18:38 UTC

Fixed by https://mercurial-scm.org/repo/hg/rev/70ca0e846d25
Matt Harbison <matt_harbison@yahoo.com>
exthelper: reintroduce the ability to register templates

Same mechanism as revsets earlier in this series.  The LFS extension is updated
to provide test coverage.

This also seems to make the test failure around issue6033 mentioned in
0a7f582f6f1f much less regular on Windows.

(please test the fix)

Comment 12 Augie Fackler 2019-01-09 15:49 UTC

(In reply to Matt Harbison from comment #8)
I honestly have no idea. The mixins confuse me too.

Comment 13 Bugzilla 2019-01-17 00:00 UTC

Bug was set to TESTING for 7 days, resolving