
runtime: error message: P has cached GC work at end of mark termination #27993

Open
ianlancetaylor opened this issue Oct 3, 2018 · 59 comments
Labels
early-in-cycle A change that should be done early in the 3 month dev cycle. GarbageCollector NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@ianlancetaylor
Contributor

I just saw this in a trybot run on OpenBSD amd64: https://storage.googleapis.com/go-build-log/e4418337/openbsd-amd64-62_9bec130b.log

greplogs shows:

2018-10-03T02:09:38-06ff477/freebsd-386-10_4
2018-10-03T17:40:17-048de7b/freebsd-386-11_2
2018-10-03T19:54:23-9dac0a8/linux-amd64-nocgo

CC @aclements @RLH

@griesemer
Contributor

griesemer commented Nov 5, 2018 via email

@odeke-em
Member

odeke-em commented Nov 9, 2018

Hey @griesemer, in your comment #27993 (comment)

See #27993 .

you self-referenced this same issue, 27993. Which issue did you mean to refer to?

@griesemer
Contributor

@odeke-em This was just an e-mail reply - I missed that the mail already referred to this issue. Please ignore.

@aclements
Member

$ greplogs -e 'P has cached GC work' -dashboard -md -l
2018-11-16T23:30:19-6797b32/netbsd-amd64-8_0
2018-11-16T19:46:17-6157dda/linux-amd64-stretch
2018-11-16T17:49:55-53ed92d/nacl-386
2018-11-16T17:33:54-a1025ba/nacl-386
2018-11-14T21:47:50-0a40d45/freebsd-amd64-11_1
2018-11-14T20:32:15-f36e92d/android-arm-wiko-fever
2018-11-13T20:46:39-af07f77/windows-386-2008
2018-11-13T15:08:13-e51b19a/freebsd-amd64-10_3
2018-11-12T20:46:39-7f1dd3a/linux-386-387
2018-11-12T20:46:25-52b2220/windows-amd64-2008
2018-11-12T20:43:55-106db71/linux-386-387
2018-11-12T20:27:21-ec4ae29/linux-mipsle
2018-11-12T20:27:14-4f3604d/android-arm64-wiko-fever
2018-11-10T16:04:18-e4c1fee/solaris-amd64-oraclerel
...
2018-10-11T16:31:24-689321e/freebsd-386-11_2
2018-10-10T14:55:17-29907b1/freebsd-386-10_4
2018-10-10T04:29:55-8f9902d/freebsd-amd64-10_4
2018-10-09T18:20:23-416804f/windows-386-2008
2018-10-09T18:19:59-7d2f46d/freebsd-amd64-10_3
2018-10-08T17:47:49-26d2260/windows-arm
2018-10-06T19:18:34-2bb91e0/nacl-amd64p32
2018-10-06T15:40:03-f90e89e/linux-amd64-nocgo
2018-10-05T21:53:34-9d90716/freebsd-amd64-race
2018-10-03T22:50:25-1961d8d/freebsd-amd64-10_4
2018-10-03T19:54:23-9dac0a8/linux-amd64-nocgo
2018-10-03T17:40:17-048de7b/freebsd-386-11_2
2018-10-03T02:09:38-06ff477/freebsd-386-10_4

It clearly happens on all GOOSes:

      1 darwin
      1 js
      1 solaris
      2 android
      2 netbsd
      3 plan9
      4 openbsd
      5 nacl
      8 windows
     19 linux
     23 freebsd

And has shown up on plenty of GOARCHes:

      1 arm64
      1 wasm
      2 amd64p32
      3 mipsle
      5 arm
     24 386
     33 amd64

The failure is usually, though not always, in cmd/go, but that might just be sampling bias, and the exact subcommand that's running during the failure varies.

@gopherbot

Change https://golang.org/cl/149968 mentions this issue: runtime: improve "P has cached GC work" debug info

@gopherbot

Change https://golang.org/cl/149969 mentions this issue: runtime: debug code to catch bad gcWork.puts

gopherbot pushed a commit that referenced this issue Nov 21, 2018
For #27993.

Change-Id: I20127e8a9844c2c488f38e1ab1f8f5a27a5df03e
Reviewed-on: https://go-review.googlesource.com/c/149968
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@gopherbot

Change https://golang.org/cl/150778 mentions this issue: Revert "runtime: debug code to catch bad gcWork.puts"

gopherbot pushed a commit that referenced this issue Nov 21, 2018
This adds a debug check to throw immediately if any pointers are added
to the gcWork buffer after the mark completion barrier. The intent is
to catch the source of the cached GC work that occasionally produces
"P has cached GC work at end of mark termination" failures.

The result should be that we get "throwOnGCWork" throws instead of "P
has cached GC work at end of mark termination" throws, but with useful
stack traces.

This should be reverted before the release. I've been unable to
reproduce this issue locally, but this issue appears fairly regularly
on the builders, so the intent is to catch it on the builders.

This probably slows down the GC slightly.

For #27993.

Change-Id: I5035e14058ad313bfbd3d68c41ec05179147a85c
Reviewed-on: https://go-review.googlesource.com/c/149969
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@aclements
Member

I got a little data over the weekend with the added debug code.

$ greplogs -dashboard -E "flushedWork|throwOnGCWork" -md
2018-11-26T14:13:53-9fe9853/darwin-amd64-10_12:

runtime: P 5 flushedWork false wbuf1.n=1 wbuf2.n=0
fatal error: P has cached GC work at end of mark termination

2018-11-23T21:21:35-c6d4939/linux-amd64-noopt:

fatal error: throwOnGCWork

runtime stack:
runtime.throw(0x6f386c, 0xd)
	/workdir/go/src/runtime/panic.go:608 +0x72 fp=0x7f034b7fdd48 sp=0x7f034b7fdd18 pc=0x42f4a2
runtime.(*gcWork).putBatch(0xc000021770, 0xc0000217a8, 0x13, 0x200)
	/workdir/go/src/runtime/mgcwork.go:182 +0x1e0 fp=0x7f034b7fdd90 sp=0x7f034b7fdd48 pc=0x424aa0
runtime.wbBufFlush1(0xc000020500)
	/workdir/go/src/runtime/mwbbuf.go:277 +0x1ba fp=0x7f034b7fdde8 sp=0x7f034b7fdd90 pc=0x42b87a
runtime.gcMark(0xa4a16143b5)
	/workdir/go/src/runtime/mgc.go:1932 +0x10b fp=0x7f034b7fde80 sp=0x7f034b7fdde8 pc=0x41e22b
runtime.gcMarkTermination.func1()
	/workdir/go/src/runtime/mgc.go:1501 +0x2a fp=0x7f034b7fde98 sp=0x7f034b7fde80 pc=0x45c57a
runtime.systemstack(0x1)
	/workdir/go/src/runtime/asm_amd64.s:351 +0x66 fp=0x7f034b7fdea0 sp=0x7f034b7fde98 pc=0x45f336
runtime.mstart()
	/workdir/go/src/runtime/proc.go:1153 fp=0x7f034b7fdea8 sp=0x7f034b7fdea0 pc=0x4339c0

goroutine 181 [garbage collection]:
runtime.systemstack_switch()
	/workdir/go/src/runtime/asm_amd64.s:311 fp=0xc000342d60 sp=0xc000342d58 pc=0x45f2c0
runtime.gcMarkTermination(0x3fe295012d50d6c8)
	/workdir/go/src/runtime/mgc.go:1500 +0x178 fp=0xc000342f20 sp=0xc000342d60 pc=0x41d218
runtime.gcMarkDone()
	/workdir/go/src/runtime/mgc.go:1475 +0x168 fp=0xc000342f60 sp=0xc000342f20 pc=0x41cff8
runtime.gcBgMarkWorker(0xc00001e000)
	/workdir/go/src/runtime/mgc.go:1858 +0x294 fp=0xc000342fd8 sp=0xc000342f60 pc=0x41df14
runtime.goexit()
	/workdir/go/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc000342fe0 sp=0xc000342fd8 pc=0x461391
created by runtime.gcBgMarkStartWorkers
	/workdir/go/src/runtime/mgc.go:1679 +0x77

2018-11-21T16:28:17-2a7f904/linux-386-clang:

runtime: P 0 flushedWork false wbuf1.n=1 wbuf2.n=0
fatal error: P has cached GC work at end of mark termination

2018-11-21T16:28:17-2a7f904/linux-mipsle:

runtime: P 0 flushedWork false wbuf1.n=2 wbuf2.n=0
fatal error: P has cached GC work at end of mark termination

@aclements
Member

It's really interesting that in three of the four failures, the check on adding work to the gcWork was never tripped. There's a pretty narrow window, while we're still running the termination detection algorithm, in which work can be added without triggering this check. Perhaps for debugging we should do a second round after setting throwOnGCWork to see if anything was added during this window.

In the one throwOnGCWork failure, it was flushing a write barrier buffer that apparently had 19 (0x13) unmarked pointers in it. Perhaps for debugging we should disable the write barrier buffers, too, when we set throwOnGCWork.

@aclements
Member

aclements commented Nov 26, 2018

FWIW, I was able to reproduce this once out of 1,334 runs of all.bash on my linux/amd64 box.

##### ../test
# go run run.go -- recover2.go
signal: aborted
runtime: P 5 flushedWork false wbuf1.n=1 wbuf2.n=0
fatal error: P has cached GC work at end of mark termination

@andybons
Member

@aclements how serious is this and how long do you think it will take to fix (worst question to ask I know)? We’re asking because we may delay the beta due to it so getting more insight into the scope of the issue will help us with that decision. Thanks.

@bcmills
Contributor

bcmills commented Nov 28, 2018

Another repro in https://storage.googleapis.com/go-build-log/f64c1069/windows-386-2008_34d5c37b.log, if it helps:

            > go list -f '{{.ImportPath}}: {{.Match}}' all ... example.com/m/... ./... ./xyz...
            [stderr]
            runtime: P 0 flushedWork false wbuf1.n=1 wbuf2.n=0
            fatal error: P has cached GC work at end of mark termination

@gopherbot

Change https://golang.org/cl/156017 mentions this issue: runtime: don't spin in checkPut if non-preemptible

@FiloSottile
Contributor

@aclements We're assuming we should wait on this for RC1, let us know if you feel otherwise.

gopherbot pushed a commit that referenced this issue Jan 2, 2019
Currently it's possible for the runtime to deadlock if checkPut is
called in a non-preemptible context. In this case, checkPut may spin,
so it won't leave the non-preemptible context, but the thread running
gcMarkDone needs to preempt all of the goroutines before it can
release the checkPut spin loops.

Fix this by returning from checkPut if it's called under any of the
conditions that would prevent gcMarkDone from preempting it. In this
case, it leaves a note behind that this happened; if the runtime does
later detect left-over work it can at least indicate that it was
unable to catch it in the act.

For #27993.
Updates #29385 (may fix it).

Change-Id: Ic71c10701229febb4ddf8c104fb10e06d84b122e
Reviewed-on: https://go-review.googlesource.com/c/156017
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
@mark-rushakoff
Contributor

CL 156017 didn't prevent the late gcWork put error (but I think that was expected).

Using go version devel +64096dbb69 Wed Jan 2 21:21:53 2019 +0000 darwin/amd64, I repeatedly ran go test -race -run=Task github.com/influxdata/platform/http from https://github.com/influxdata/platform/commit/278f67925cc197f57a32244546fc660df2712e6e (a case where I happened to see a late gcWork put in the past). Here are 20+ more stack traces from about 16 hours of running that in a loop.

1546522489-22255.txt
1546521051-27381.txt
1546515641-1038.txt
1546515404-17480.txt
1546510336-4510.txt
1546506868-9166.txt
1546506362-31361.txt
1546504512-19560.txt
1546502232-32045.txt
1546501261-30542.txt
1546499267-31881.txt
1546495909-24024.txt
1546493582-31537.txt
1546493514-32457.txt
1546491250-1139.txt
1546491175-20329.txt
1546490394-25780.txt
1546490041-8027.txt
1546488386-29902.txt
1546484304-6055.txt
1546483677-8850.txt
1546481108-15098.txt
1546478442-22548.txt
1546471442-16184.txt

@gopherbot

Change https://golang.org/cl/156140 mentions this issue: runtime: work around "P has cached GC work" failures

@aclements
Member

@mark-rushakoff, thanks for the pointer to the influx tests. I'll see if I can reproduce it as easily using those.

For reference, I had to install bzr and fetch that package in module mode, since the HEAD of some of its dependencies is broken:

mkdir /tmp/z
cd /tmp/z
echo "module m" > go.mod
go get -d github.com/influxdata/platform/http@278f679
go test -c -race github.com/influxdata/platform/http

gopherbot pushed a commit that referenced this issue Jan 4, 2019
We still don't understand what's causing there to be remaining GC work
when we enter mark termination, but in order to move forward on this
issue, this CL implements a work-around for the problem.

If debugCachedWork is false, this CL does a second check for remaining
GC work as soon as it stops the world for mark termination. If it
finds any work, it starts the world again and re-enters concurrent
mark. This will increase STW time by a small amount proportional to
GOMAXPROCS, but fixes a serious correctness issue.

This works around #27993.

Change-Id: Ia23b85dd6c792ee8d623428bd1a3115631e387b8
Reviewed-on: https://go-review.googlesource.com/c/156140
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Rick Hudson <rlh@golang.org>
@aclements
Member

CL 156140 has been submitted, which works around this issue, so I'm going to remove release-blocker. We still don't understand the root cause, however, so the issue will remain open.

@gopherbot

Change https://golang.org/cl/156318 mentions this issue: doc/go1.12: remove known issue note

gopherbot pushed a commit that referenced this issue Jan 4, 2019
A workaround has been submitted.

Updates #27993

Change-Id: Ife6443c32673b38000b90dd2efb2985db37ab773
Reviewed-on: https://go-review.googlesource.com/c/156318
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@aclements
Member

I think I've figured it out.

I've been running the influx test with a lot of extra runtime debug logging and have been getting some interesting output. Here's an annotated subset of a failure (go-stress-20190104T162318-083562928; I've posted just the last GC cycle since the log is huge).

The unmarked object is 0xc000575800, which is a _defer.

log (with notes)

[1.915567099 P 1] setGCPhase 1
    Start last GC. 0xc000575800 is 2nd in some G's defer list. WB buffer is full size right now, and we only log on flush, so WB logging can be delayed.
... Lots of stuff on all Ps
[1.922270499 P 2] deferreturn 0xc000602740 link 0xc000575800
    Now head of list. gp._defer = d.link should have added to WB buf as new value.
[1.922273799 P 2] deferreturn 0xc000575800 link 0x0
    Running defer. gp._defer = d.link should have added to WB buf as old value.
[1.922274257 P 2] freedefer 0xc000575800
    Should have added to WB buf as new value when inserting into pool.
[1.922275588 P 2] newdefer from pool 0xc000575800
    Defer gotten from pool. Again should have added to WB buf.
[1.922276023 P 2] newdefer link defer 0x0
    Now head of list. Should have added to WB buf.
... No logs from P 2. Notably, no WB marks from P 2.
[1.922292123 P 1] gcMarkDone flush
    Starting mark completion. WB buffering disabled.
[1.922293357 P 1] gcMarkDone flushing 1
[1.922293869 P 3] gcMarkDone flushing 3
[1.922294506 P 0] gcMarkDone flushing 0
... No logs from P 2, but plenty from other Ps
[1.922574537 P 3] deferreturn 0xc000575800 link 0x0
    Running defer.
[1.922575525 P 3] wb mark 0xc000575800
    Finally we see it actually getting marked. This mark spins and ultimately fails.
... A bunch of stuff on P 0
[1.922863641 P 2] gcMarkDone flushing 2
    This flushes P 2's WB buffer, which presumably contains 0xc000575800, but P 3 already marked it above, so it doesn't get marked here, and gcMarkDone considers P 2 to be clean.
[1.922902570 P 1] gcMarkDone flush succeeded
    But it shouldn't have, because P 3 has pending work or is dirty.

The high-level sequence of events is:

  1. Object x is white. Suppose there are three Ps (and thus three write barrier buffers). All Ps are clean and have no pending GC work buffers. Write barrier buffer 0 contains x.
  2. P 1 initiates mark completion. Ps 1 and 2 pass the completion barrier successfully.
  3. P 2 performs a write barrier on object x and greys it. (As a result, P 2 now has pending GC work.)
  4. P 0 enters the completion barrier, which flushes its write barrier buffer. Since x was greyed in step 3, this does not grey x. Hence, P 0 passes the completion barrier successfully.

Now all of the Ps have passed the barrier, but in fact P 2 has pending work. If we enter mark termination at this point, there may even be white objects reachable from x.

I'm not sure why this affects defers so disproportionately. It may simply be that defers depend more heavily on write barrier marking than most objects because of their access patterns.

The workaround I committed a couple days ago will fix this problem. I'm still thinking about how to fix it "properly".

I also need to go back to the proof and figure out where reality diverged from the math. I think the core of the problem is that pointers in the write barrier buffer are only "sort of" grey: they haven't been marked yet, so other Ps can't observe their grey-ness. As a result, you wind up with a black-to-"sort-of-grey" pointer, and that isn't strong enough for the proof's assumptions.

@aclements
Member

The proof missed a case. It assumed that new GC work could only be created by consuming GC work from a local work queue. But, in fact, the write barrier allows new work to be created even if the local work queue is empty, as long as some queue is non-empty. In particular, if some queue is non-empty, that means some object is grey. Since that's just in the heap, any P can walk from that object to a white object reachable from it and mark that object using a write barrier, causing work to be added to its local queue (which may have been empty). This implicit communication of work isn't tracked by the "flushed" bit in the algorithm/proof, which is what causes the algorithm to fail.

I'm still thinking about how to fix the algorithm. A stronger algorithm could ensure that all work queues were empty at the beginning of the successful completion round. I think a two-round algorithm could do this. However, there's some cost to this since the GC would be stuck in limbo for the whole second round with no work to do, but unable to enter mark termination.

I think a weaker algorithm is still possible where the write barrier sets a global flag if it creates work during termination detection, and we check that flag after the ragged barrier. I haven't convinced myself that this is right yet.

@aclements
Member

Since we have a workaround and I'm not going to fix the root cause for 1.12, I'm bumping this to 1.13.

@aclements aclements modified the milestones: Go1.12, Go1.13 Jan 8, 2019
@aclements aclements added NeedsFix The path to resolution is known, but the work has not been done. early-in-cycle A change that should be done early in the 3 month dev cycle. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Jan 8, 2019
charlievieth pushed a commit to charlievieth/gostats that referenced this issue Apr 19, 2019
It looks like this may have been hitting the below issue:
  golang/go#27993

These changes appear to mitigate it (though I'm not really sure)
@andybons andybons modified the milestones: Go1.13, Go1.14 Jul 8, 2019
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@gopherbot

Change https://golang.org/cl/262350 mentions this issue: runtime: remove debugCachedWork

gopherbot pushed a commit that referenced this issue Oct 15, 2020
debugCachedWork and all of its dependent fields and code were added to
aid in debugging issue #27993. Now that the source of the problem is
known and mitigated (via the extra work check after STW in gcMarkDone),
these extra checks are no longer required and simply make the code more
difficult to follow.

Remove it all.

Updates #27993

Change-Id: I594beedd5ca61733ba9cc9eaad8f80ea92df1a0d
Reviewed-on: https://go-review.googlesource.com/c/go/+/262350
Trust: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>