
runtime: fatal error: bad g->status in ready #17007

Closed
bradfitz opened this issue Sep 6, 2016 · 11 comments
Labels: FrozenDueToAge, NeedsInvestigation

bradfitz (Contributor) commented Sep 6, 2016

Seen on a number of trybot runs in the past few weeks:

https://storage.googleapis.com/go-build-log/02de4ef8/windows-amd64-gce_80a1a98e.log

ok      net 22.767s
runtime: gp: gp=0xc0422dcb60, goid=2299, gp->atomicstatus=8
runtime:  g:  g=0xa5cfe0, goid=0,  g->atomicstatus=0
fatal error: bad g->status in ready

runtime stack:
runtime.throw(0x867c02, 0x16)
    C:/workdir/go/src/runtime/panic.go:566 +0x9c
runtime.ready(0xc0422dcb60, 0x3, 0xc04237f501)
    C:/workdir/go/src/runtime/proc.go:540 +0x14b
runtime.goready.func1()
    C:/workdir/go/src/runtime/proc.go:270 +0x3f
runtime.systemstack(0xa5d020)
    C:/workdir/go/src/runtime/asm_amd64.s:298 +0x7e
runtime.mstart()
    C:/workdir/go/src/runtime/proc.go:1079

goroutine 2304 [running]:
runtime.systemstack_switch()
    C:/workdir/go/src/runtime/asm_amd64.s:252 fp=0xc0420eba08 sp=0xc0420eba00
runtime.goready(0xc0422dcb60, 0x3)
    C:/workdir/go/src/runtime/proc.go:271 +0x52 fp=0xc0420eba38 sp=0xc0420eba08
runtime.closechan(0xc04233b800)
    C:/workdir/go/src/runtime/chan.go:371 +0x1c5 fp=0xc0420ebab8 sp=0xc0420eba38
net/http.(*persistConn).closeLocked(0xc04233c6c0, 0xa2e460, 0xc0420f0010)
    C:/workdir/go/src/net/http/transport.go:1933 +0xb1 fp=0xc0420ebb00 sp=0xc0420ebab8
net/http.(*persistConn).close(0xc04233c6c0, 0xa2e460, 0xc0420f0010)
    C:/workdir/go/src/net/http/transport.go:1913 +0x8c fp=0xc0420ebb30 sp=0xc0420ebb00
net/http.(*persistConn).readLoop.func1(0xc04233c6c0, 0xc0420ebc40)
    C:/workdir/go/src/net/http/transport.go:1397 +0x48 fp=0xc0420ebb58 sp=0xc0420ebb30
net/http.(*persistConn).readLoop(0xc04233c6c0)
    C:/workdir/go/src/net/http/transport.go:1470 +0x11d7 fp=0xc0420ebf98 sp=0xc0420ebb58
runtime.goexit()
    C:/workdir/go/src/runtime/asm_amd64.s:2104 +0x1 fp=0xc0420ebfa0 sp=0xc0420ebf98
created by net/http.(*Transport).dialConn
    C:/workdir/go/src/net/http/transport.go:1072 +0x4f2

goroutine 1 [chan receive]:
testing.(*T).Run(0xc04205a0c0, 0x8653e3, 0x13, 0x881ac8, 0xc04206dc01)
    C:/workdir/go/src/testing/testing.go:647 +0x31c
testing.RunTests.func1(0xc04205a0c0)
    C:/workdir/go/src/testing/testing.go:793 +0x6e
testing.tRunner(0xc04205a0c0, 0xc0420ede00)
    C:/workdir/go/src/testing/testing.go:610 +0x88
testing.RunTests(0x880f40, 0xa58660, 0x188, 0x188, 0x64)
    C:/workdir/go/src/testing/testing.go:799 +0x2fc
testing.(*M).Run(0xc0420edee8, 0x404212)
    C:/workdir/go/src/testing/testing.go:743 +0x8c
net/http_test.TestMain(0xc0420edee8)
    C:/workdir/go/src/net/http/main_test.go:19 +0x32
main.main()
    net/http/_test/_testmain.go:880 +0xcd

goroutine 1837 [chan receive]:
testing.(*T).Parallel(0xc0424460c0)
    C:/workdir/go/src/testing/testing.go:552 +0x156
net/http_test.setParallel(0xc0424460c0)
    C:/workdir/go/src/net/http/main_test.go:87 +0x4a
net/http_test.TestServerTimeouts(0xc0424460c0)
    C:/workdir/go/src/net/http/serve_test.go:462 +0xa8
testing.tRunner(0xc0424460c0, 0x8820e0)
    C:/workdir/go/src/testing/testing.go:610 +0x88
created by testing.(*T).Run
    C:/workdir/go/src/testing/testing.go:646 +0x2f2

goroutine 2138 [runnable]:
time.Sleep(0xf4240)
    C:/workdir/go/src/runtime/time.go:59 +0xef
net/http_test.TestTimeoutHandlerRace.func1(0xa31720, 0xc04225d400, 0xc0421325a0)
    C:/workdir/go/src/net/http/serve_test.go:1888 +0xe0
net/http.HandlerFunc.ServeHTTP(0x8821d8, 0xa31720, 0xc04225d400, 0xc0421325a0)
    C:/workdir/go/src/net/http/server.go:1726 +0x4b
net/http.(*timeoutHandler).ServeHTTP.func1(0xc042303080, 0xc04225d400, 0xc0421325a0, 0xc04221b620)
    C:/workdir/go/src/net/http/server.go:2531 +0x5a
created by net/http.(*timeoutHandler).ServeHTTP
    C:/workdir/go/src/net/http/server.go:2533 +0x174

goroutine 2140 [runnable]:
time.Sleep(0xf4240)
    C:/workdir/go/src/runtime/time.go:59 +0xef
net/http_test.TestTimeoutHandlerRace.func1(0xa31720, 0xc04225d540, 0xc042132870)
    C:/workdir/go/src/net/http/serve_test.go:1888 +0xe0
net/http.HandlerFunc.ServeHTTP(0x8821d8, 0xa31720, 0xc04225d540, 0xc042132870)
    C:/workdir/go/src/net/http/server.go:1726 +0x4b
net/http.(*timeoutHandler).ServeHTTP.func1(0xc042303080, 0xc04225d540, 0xc042132870, 0xc04221b860)
    C:/workdir/go/src/net/http/server.go:2531 +0x5a
created by net/http.(*timeoutHandler).ServeHTTP
    C:/workdir/go/src/net/http/server.go:2533 +0x174

goroutine 2305 [select]:
net/http.(*persistConn).writeLoop(0xc04233c6c0)
    C:/workdir/go/src/net/http/transport.go:1659 +0x3c4
created by net/http.(*Transport).dialConn
    C:/workdir/go/src/net/http/transport.go:1073 +0x517

goroutine 2300 [IO wait]:
net.runtime_pollWait(0x223bc28, 0x72, 0xc0420f0848)
    C:/workdir/go/src/runtime/netpoll.go:160 +0x60
net.(*pollDesc).wait(0xc04237f4f0, 0x72, 0xc0420f0848, 0x0)
    C:/workdir/go/src/net/fd_poll_runtime.go:75 +0x3f
net.(*ioSrv).ExecIO(0xc042022020, 0xc04237f3e0, 0x85ea59, 0x8, 0xc04233f120, 0xc042352c00, 0x0, 0x0)
    C:/workdir/go/src/net/fd_windows.go:171 +0x18a
net.(*netFD).acceptOne(0xc04237f380, 0xc0422d1180, 0x2, 0x2, 0xc04237f3e0, 0xc04217d050, 0xc0420bde40, 0xc0420bddc0)
    C:/workdir/go/src/net/fd_windows.go:529 +0x1f8
net.(*netFD).accept(0xc04237f380, 0x0, 0x0, 0x0)
    C:/workdir/go/src/net/fd_windows.go:559 +0x131
net.(*TCPListener).accept(0xc042022368, 0x434c8e, 0xc0420bde70, 0x4c1a24)
    C:/workdir/go/src/net/tcpsock_posix.go:132 +0x35
net.(*TCPListener).Accept(0xc042022368, 0x881158, 0xc04202ab00, 0xa32060, 0xc04217cfc0)
    C:/workdir/go/src/net/tcpsock.go:225 +0x50
net/http.(*Server).Serve(0xc042417880, 0xa31520, 0xc042022368, 0x0, 0x0)
    C:/workdir/go/src/net/http/server.go:2273 +0x1d5
net/http/httptest.(*Server).goServe.func1(0xc04233b5c0)
    C:/workdir/go/src/net/http/httptest/server.go:236 +0x74
created by net/http/httptest.(*Server).goServe
    C:/workdir/go/src/net/http/httptest/server.go:237 +0x63

goroutine 2122 [runnable]:
time.Sleep(0xf4240)
    C:/workdir/go/src/runtime/time.go:59 +0xef
net/http_test.TestTimeoutHandlerRace.func1(0xa31720, 0xc04204b9a0, 0xc04225aa50)
    C:/workdir/go/src/net/http/serve_test.go:1888 +0xe0
net/http.HandlerFunc.ServeHTTP(0x8821d8, 0xa31720, 0xc04204b9a0, 0xc04225aa50)
    C:/workdir/go/src/net/http/server.go:1726 +0x4b
net/http.(*timeoutHandler).ServeHTTP.func1(0xc042303080, 0xc04204b9a0, 0xc04225aa50, 0xc04213ec60)
    C:/workdir/go/src/net/http/server.go:2531 +0x5a
created by net/http.(*timeoutHandler).ServeHTTP
    C:/workdir/go/src/net/http/server.go:2533 +0x174

goroutine 2299 [semacquire]:
sync.runtime_Semacquire(0xc04233c794)
    C:/workdir/go/src/runtime/sema.go:47 +0x37
sync.(*Mutex).Lock(0xc04233c790)
    C:/workdir/go/src/sync/mutex.go:85 +0xa3
net/http.(*persistConn).canceled(0xc04233c6c0, 0x0, 0x0)
    C:/workdir/go/src/net/http/transport.go:1290 +0x56
net/http.(*persistConn).mapRoundTripErrorAfterClosed(0xc04233c6c0, 0x0, 0x0, 0x85d801)
    C:/workdir/go/src/net/http/transport.go:1369 +0x36
net/http.(*persistConn).roundTrip(0xc04233c6c0, 0xc04230a5a0, 0x0, 0x0, 0x0)
    C:/workdir/go/src/net/http/transport.go:1869 +0xa41
net/http.(*Transport).RoundTrip(0xc04221d2c0, 0xc04221d3b0, 0xc04221d2c0, 0x0, 0x0)
    C:/workdir/go/src/net/http/transport.go:381 +0x4f5
net/http.send(0xc04221d3b0, 0xa2c760, 0xc04221d2c0, 0x0, 0x0, 0x0, 0x8, 0xc0422e1c38, 0xc042022380)
    C:/workdir/go/src/net/http/client.go:256 +0x166
net/http.(*Client).send(0xc042145080, 0xc04221d3b0, 0x0, 0x0, 0x0, 0xc042022380, 0x0, 0x1)
    C:/workdir/go/src/net/http/client.go:146 +0x109
net/http.(*Client).doFollowingRedirects(0xc042145080, 0xc04221d3b0, 0x8814e0, 0x16, 0x0, 0x0)
    C:/workdir/go/src/net/http/client.go:529 +0x5e8
net/http.(*Client).Get(0xc042145080, 0xc04230a4e0, 0x16, 0xc04205b080, 0xc04239a9a0, 0x0)
    C:/workdir/go/src/net/http/client.go:419 +0x9a
net/http_test.testHandlerPanic(0xc04205b080, 0xc0422e0000, 0x7d8260, 0xc042283c40)
    C:/workdir/go/src/net/http/serve_test.go:2238 +0x304
net/http_test.TestHandlerPanic_h1(0xc04205b080)
    C:/workdir/go/src/net/http/serve_test.go:2178 +0x85
testing.tRunner(0xc04205b080, 0x881ac8)
    C:/workdir/go/src/testing/testing.go:610 +0x88
created by testing.(*T).Run
    C:/workdir/go/src/testing/testing.go:646 +0x2f2

goroutine 2136 [runnable]:
time.Sleep(0xf4240)
    C:/workdir/go/src/runtime/time.go:59 +0xef
net/http_test.TestTimeoutHandlerRace.func1(0xa31720, 0xc04225d360, 0xc0421324b0)
    C:/workdir/go/src/net/http/serve_test.go:1888 +0xe0
net/http.HandlerFunc.ServeHTTP(0x8821d8, 0xa31720, 0xc04225d360, 0xc0421324b0)
    C:/workdir/go/src/net/http/server.go:1726 +0x4b
net/http.(*timeoutHandler).ServeHTTP.func1(0xc042303080, 0xc04225d360, 0xc0421324b0, 0xc04221b3e0)
    C:/workdir/go/src/net/http/server.go:2531 +0x5a
created by net/http.(*timeoutHandler).ServeHTTP
    C:/workdir/go/src/net/http/server.go:2533 +0x174

goroutine 1971 [chan receive]:
testing.(*T).Parallel(0xc04205bc80)
    C:/workdir/go/src/testing/testing.go:552 +0x156
net/http_test.setParallel(0xc04205bc80)
    C:/workdir/go/src/net/http/main_test.go:87 +0x4a
net/http_test.TestTLSHandshakeTimeout(0xc04205bc80)
    C:/workdir/go/src/net/http/serve_test.go:1026 +0x5d
testing.tRunner(0xc04205bc80, 0x8821a8)
    C:/workdir/go/src/testing/testing.go:610 +0x88
created by testing.(*T).Run
    C:/workdir/go/src/testing/testing.go:646 +0x2f2
FAIL    net/http    3.258s
2016/09/06 21:33:51 Failed: exit status 1
bradfitz added this to the Go1.8 milestone Sep 6, 2016
bradfitz (Contributor, Author) commented Sep 6, 2016

/cc @mdempsky (it was his most recent trybot failure)

quentinmit added the NeedsInvestigation label Oct 11, 2016
rsc (Contributor) commented Oct 27, 2016

The relevant part seems to be that goroutine 2304 is closing pc.closech and thinks it should ready goroutine 2299 to run. In the trace, goroutine 2299 is executing the body of a 'case <-pc.closech' in a select, but it clearly started that code before goroutine 2304 thought it should. Maybe the synchronization is wrong for the channel-close case: the select might end up both queueing on the channel and, separately, noticing the channel is closed and returning without dequeueing. It's easy to imagine that being a race that triggers only rarely.

Oddly, I can't find a single instance of this crash or any "status in ready" crash on the dashboard logs, back to the beginning of the year.

atomicstatus=8 is actually _Gcopystack, but that only happens when the goroutine is copying its own stack: it flips from _Grunning to _Gcopystack and then back. Oddly, the stack trace shows the goroutine blocked in semacquire inside a Mutex.Lock, not actually running. Maybe it ran ahead and got to the lock between the start of the panic and the dump, but that seems like a stretch. So the dump shows semacquire, we somehow caught it in _Gcopystack, and between those it must have been _Grunning. A lot of ducks in a row. Hard to believe.
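
For reference, the shape of the close/select interaction described above is roughly the following. This is a minimal, self-contained sketch rather than the actual net/http code; the names only mirror the trace.

	package main

	import "sync"

	func main() {
	    closech := make(chan struct{})
	    var wg sync.WaitGroup
	    wg.Add(1)

	    // Plays the role of goroutine 2299, parked in (*persistConn).roundTrip:
	    // it blocks in a select that includes a case on the close channel.
	    go func() {
	        defer wg.Done()
	        select {
	        case <-closech:
	            // Body of the 'case <-pc.closech' arm. In the crash, the
	            // selecting goroutine was already running this code while
	            // the closer still thought it had to ready it.
	        }
	    }()

	    // Plays the role of goroutine 2304 in (*persistConn).closeLocked:
	    // close(pc.closech) readies every goroutine queued on the channel.
	    close(closech)
	    wg.Wait()
	}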

aclements modified the milestones: Go1.9, Go1.8 Dec 13, 2016
bradfitz modified the milestones: Go1.8, Go1.9 Dec 20, 2016
bradfitz (Contributor, Author) commented:
I've marked this Go 1.8 again. This seems not good.

aclements (Member) commented:
@bradfitz, do you know if this is only happening on windows-amd64?

Some observations:

This has happened once on the builders (non-trybot): 2016-11-11T19:46:17-39e3cbf/windows-amd64-gce

Logs 1 and 2 fail in closechan, but 3 and 4 fail in send from a select.

Log 1 received from pc.closech on goroutine 2299 just before this panic, which is the channel that's being closed by goroutine 2304. The select is in roundTrip (based on parent commit 8259cf3). According to the traceback, 2299 is in semacquire, which means it had to be in _Grunning (2) a moment ago and is in _Gwaiting (4) as of the traceback. But when the panic was printed, it was in _Gcopystack (8), which is only used for synchronous stack growth. Perhaps something in the select path caused stack growth while the goroutine was visible to the close?

Log 3 panics on goroutine 6749 when sending on sc.wantWriteFrameCh from writeFrameFromHandler (based on parent commit 9def857). The offending goroutine is 6730. I'm not quite sure what 6730 is doing; there are a lot of mentions of "close" in the traceback, which might have closed sc.wantWriteFrameCh, but as far as I can tell this channel never gets closed. Unwinding all the way to serve shows we're in a receive case on sc.wantWriteFrameCh. 6730 is also in the process of panicking, but I'm not sure what that's due to.

bradfitz (Contributor, Author) commented Jan 5, 2017

I've only seen it happen on windows-amd64, but you have the greppable logs.

I suppose we could run the net/http tests in a loop on some Windows VMs.
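
A rough sketch of that kind of stress loop (the command and flags here are assumptions and would need adjusting for the builder environment):

	// stress.go: run the net/http tests in a loop until they fail.
	package main

	import (
	    "fmt"
	    "os"
	    "os/exec"
	)

	func main() {
	    for i := 1; ; i++ {
	        out, err := exec.Command("go", "test", "-count=1", "net/http").CombinedOutput()
	        if err != nil {
	            fmt.Printf("failure on iteration %d:\n%s\n", i, out)
	            os.Exit(1)
	        }
	        fmt.Printf("iteration %d: ok\n", i)
	    }
	}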

rsc (Contributor) commented Jan 5, 2017

Austin and I found this. CL to follow.

rsc (Contributor) commented Jan 5, 2017

Oh, and the bug was introduced in Go 1.4.

gopherbot commented:

CL https://golang.org/cl/34835 mentions this issue.

gopherbot commented:

CL https://golang.org/cl/35637 mentions this issue.

gopherbot pushed a commit that referenced this issue Jan 25, 2017
…ect and stack growth

Fixes #18708 (backport)

To implement the blocking of a select, a goroutine builds a list of
offers to communicate (pseudo-g's, aka sudog), one for each case,
queues them on the corresponding channels, and waits for another
goroutine to complete one of those cases and wake it up. Obviously it
is not OK for two other goroutines to complete multiple cases and both
wake the goroutine blocked in select. To make sure that only one
branch of the select is chosen, all the sudogs contain a pointer to a
shared (single) 'done uint32', which is atomically cas'ed by any
interested goroutines. The goroutine that wins the cas race gets to
wake up the select. A complication is that 'done uint32' is stored on
the stack of the goroutine running the select, and that stack can move
during the select due to stack growth or stack shrinking.
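
To make the single-winner mechanism concrete, here is a simplified illustration in ordinary Go using sync/atomic; it is only an analogy, not the runtime's sudog code.

	package main

	import (
	    "fmt"
	    "sync"
	    "sync/atomic"
	)

	func main() {
	    var done uint32 // plays the role of the shared 'done uint32' the sudogs point at
	    var wins int32
	    var wg sync.WaitGroup

	    // Several goroutines race to complete a case of the same select;
	    // only the one that wins the CAS may wake the selecting goroutine.
	    for i := 0; i < 4; i++ {
	        wg.Add(1)
	        go func() {
	            defer wg.Done()
	            if atomic.CompareAndSwapUint32(&done, 0, 1) {
	                atomic.AddInt32(&wins, 1) // the real code would ready the selecting goroutine here
	            }
	        }()
	    }
	    wg.Wait()
	    fmt.Println("winners:", wins) // always exactly 1
	}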

The relevant ordering to block and unblock in select is:

	1. Lock all channels.
	2. Create list of sudogs and queue sudogs on all channels.
	3. Switch to system stack, mark goroutine as asleep,
	   unlock all channels.
	4. Sleep until woken.
	5. Wake up on goroutine stack.
	6. Lock all channels.
	7. Dequeue sudogs from all channels.
	8. Free list of sudogs.
	9. Unlock all channels.

There are two kinds of stack moves: stack growth and stack shrinking.
Stack growth happens while the original goroutine is running.
Stack shrinking happens asynchronously, during garbage collection.

While a channel listing a sudog is locked by select in this process,
no other goroutine can attempt to complete communication on that
channel, because that other goroutine doesn't hold the lock and can't
find the sudog. If the stack moves while all the channel locks are
held or when the sudogs are not yet or no longer queued in the
channels, no problem, because no goroutine can get to the sudogs and
therefore to selectdone. We only need to worry about the stack (and
'done uint32') moving with the sudogs queued in unlocked channels.

Stack shrinking can happen any time the goroutine is stopped.
That code already acquires all the channel locks before doing the
stack move, so it avoids this problem.

Stack growth can happen essentially any time the original goroutine is
running on its own stack (not the system stack). In the first half of
the select, all the channels are locked before any sudogs are queued,
and the channels are not unlocked until the goroutine has stopped
executing on its own stack and is asleep, so that part is OK. In the
second half of the select, the goroutine wakes up on its own goroutine
stack and immediately locks all channels. But the actual call to lock
might grow the stack, before acquiring any locks. In that case, the
stack is moving with the sudogs queued in unlocked channels. Not good.
One goroutine has already won a cas on the old stack (that goroutine
woke up the selecting goroutine, moving it out of step 4), and the
fact that done = 1 now should prevent any other goroutines from
completing any other select cases. During the stack move, however,
sudog.selectdone is moved from pointing to the old done variable on
the old stack to a new memory location on the new stack. Another
goroutine might observe the moved pointer before the new memory
location has been initialized. If the new memory word happens to be
zero, that goroutine might win a cas on the new location, thinking it
can now complete the select (again). It will then complete a second
communication (reading from or writing to the goroutine stack
incorrectly) and then attempt to wake up the selecting goroutine,
which is already awake.
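
The failure window can be mimicked outside the runtime with a contrived, single-threaded sketch; the names oldDone, newDone, and selectdone are purely illustrative, and real stack copying is done by the runtime, not by user code.

	package main

	import (
	    "fmt"
	    "sync/atomic"
	)

	func main() {
	    oldDone := new(uint32) // the done word as it lives on the old stack
	    selectdone := oldDone  // plays the role of sudog.selectdone

	    // A first completer wins the CAS and wakes the selecting goroutine.
	    first := atomic.CompareAndSwapUint32(selectdone, 0, 1)

	    // "Stack growth": the done word is relocated, and the pointer is
	    // observed after it moves but before the new word is initialized.
	    newDone := new(uint32) // still zero
	    selectdone = newDone

	    // A second completer finds a zero word behind the moved pointer and
	    // also "wins": it would now complete a second communication and try
	    // to wake the already-awake goroutine.
	    second := atomic.CompareAndSwapUint32(selectdone, 0, 1)

	    fmt.Println(first, second) // true true: two winners for one select
	}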

The scribbling over the goroutine stack unexpectedly is already bad,
but likely to go unnoticed, at least immediately. As for the second
wakeup, there are a variety of ways it might play out.

* The goroutine might not be asleep.
That will produce a runtime crash (throw) like in #17007:

	runtime: gp: gp=0xc0422dcb60, goid=2299, gp->atomicstatus=8
	runtime:  g:  g=0xa5cfe0, goid=0,  g->atomicstatus=0
	fatal error: bad g->status in ready

Here, atomicstatus=8 is copystack; the second, incorrect wakeup is
observing that the selecting goroutine is in state "Gcopystack"
instead of "Gwaiting".

* The goroutine might be sleeping in a send on a nil chan.
If it wakes up, it will crash with 'fatal error: unreachable'.

* The goroutine might be sleeping in a send on a non-nil chan.
If it wakes up, it will crash with 'fatal error: chansend:
spurious wakeup'.

* The goroutine might be sleeping in a receive on a nil chan.
If it wakes up, it will crash with 'fatal error: unreachable'.

* The goroutine might be sleeping in a receive on a non-nil chan.
If it wakes up, it will silently (incorrectly!) continue as if it
received a zero value from a closed channel, leaving a sudog queued on
the channel pointing at that zero value on the goroutine's stack; that
space will be reused as the goroutine executes, and when some other
goroutine finally completes the receive, it will do a stray write into
the goroutine's stack memory, which may cause problems. Then it will
attempt the real wakeup of the goroutine, leading recursively to any
of the cases in this list.

* The goroutine might have been running a select in a finalizer
(I hope not!) and might now be sleeping waiting for more things to
finalize. If it wakes up, as long as it goes back to sleep quickly
(before the real GC code tries to wake it), the spurious wakeup does
no harm (but the stack was still scribbled on).

* The goroutine might be sleeping in gcParkAssist.
If it wakes up, that will let the goroutine continue executing a bit
earlier than we would have liked. Eventually the GC will attempt the
real wakeup of the goroutine, leading recursively to any of the cases
in this list.

* The goroutine cannot be sleeping in bgsweep, because the background
sweepers never use select.

* The goroutine might be sleeping in netpollblock.
If it wakes up, it will crash with 'fatal error: netpollblock:
corrupted state'.

* The goroutine might be sleeping in main as another thread crashes.
If it wakes up, it will exit(0) instead of letting the other thread
crash with a non-zero exit status.

* The goroutine cannot be sleeping in forcegchelper,
because forcegchelper never uses select.

* The goroutine might be sleeping in an empty select - select {}.
If it wakes up, it will return to the next line in the program!

* The goroutine might be sleeping in a non-empty select (again).
In this case, it will wake up spuriously, with gp.param == nil (no
reason for wakeup), but that was fortuitously overloaded for handling
wakeup due to a closing channel and the way it is handled is to rerun
the select, which (accidentally) handles the spurious wakeup
correctly:

	if cas == nil {
		// This can happen if we were woken up by a close().
		// TODO: figure that out explicitly so we don't need this loop.
		goto loop
	}

Before looping, it will dequeue all the sudogs on all the channels
involved, so that no other goroutine will attempt to wake it.
Since the goroutine was blocked in select before, being blocked in
select again when the spurious wakeup arrives may be quite likely.
In this case, the spurious wakeup does no harm (but the stack was
still scribbled on).

* The goroutine might be sleeping in semacquire (mutex slow path).
If it wakes up, that is taken as a signal to try for the semaphore
again, not a signal that the semaphore is now held, but the next
iteration around the loop will queue the sudog a second time, causing
a cycle in the wakeup list for the given address. If that sudog is the
only one in the list, when it is eventually dequeued, it will
(due to the precise way the code is written) leave the sudog on the
queue inactive with the sudog broken. But the sudog will also be in
the free list, and that will eventually cause confusion.

* The goroutine might be sleeping in notifyListWait, for sync.Cond.
If it wakes up, (*Cond).Wait returns. The docs say "Unlike in other
systems, Wait cannot return unless awoken by Broadcast or Signal,"
so the spurious wakeup is incorrect behavior, but most callers do not
depend on that fact. Eventually the condition will happen, attempting
the real wakeup of the goroutine and leading recursively to any of the
cases in this list.

* The goroutine might be sleeping in timeSleep aka time.Sleep.
If it wakes up, it will continue running, leaving a timer ticking.
When that time bomb goes off, it will try to ready the goroutine
again, leading to any one of the cases in this list.

* The goroutine cannot be sleeping in timerproc,
because timerproc never uses select.

* The goroutine might be sleeping in ReadTrace.
If it wakes up, it will print 'runtime: spurious wakeup of trace
reader' and return nil. All future calls to ReadTrace will print
'runtime: ReadTrace called from multiple goroutines simultaneously'.
Eventually, when trace data is available, a true wakeup will be
attempted, leading to any one of the cases in this list.

None of these fatal errors appear in any of the trybot or dashboard
logs. The 'bad g->status in ready' that happens if the goroutine is
running (the most likely scenario anyway) has happened once on the
dashboard and eight times in trybot logs. Of the eight, five were
atomicstatus=8 during net/http tests, so almost certainly this bug.
The other three were atomicstatus=2, all near code in select,
but in a draft CL by Dmitry that was rewriting select and may or may
not have had its own bugs.

This bug has existed since Go 1.4. Until then the select code was
implemented in C, 'done uint32' was a C stack variable 'uint32 done',
and C stacks never moved. I believe it has become more common recently
because of Brad's work to run more and more tests in net/http in
parallel, which lengthens race windows.

The fix is to run step 6 on the system stack,
avoiding possibility of stack growth.

Fixes #17007 and possibly other mysterious failures.

Change-Id: I9d6575a51ac96ae9d67ec24da670426a4a45a317
Reviewed-on: https://go-review.googlesource.com/34835
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-on: https://go-review.googlesource.com/35637
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
golang locked and limited conversation to collaborators Jan 24, 2018