
runtime: error message: P has cached GC work at end of mark termination #27993

Open
ianlancetaylor opened this issue Oct 3, 2018 · 59 comments
Labels
early-in-cycle A change that should be done early in the 3 month dev cycle. GarbageCollector NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@ianlancetaylor
Contributor

I just saw this in a trybot run on OpenBSD amd64: https://storage.googleapis.com/go-build-log/e4418337/openbsd-amd64-62_9bec130b.log

greplogs shows:

2018-10-03T02:09:38-06ff477/freebsd-386-10_4
2018-10-03T17:40:17-048de7b/freebsd-386-11_2
2018-10-03T19:54:23-9dac0a8/linux-amd64-nocgo

CC @aclements @RLH

@griesemer
Contributor

griesemer commented Nov 5, 2018 via email

@odeke-em
Member

odeke-em commented Nov 9, 2018

Hey @griesemer, in your comment #27993 (comment)

See #27993 .

you self-referenced this same issue, 27993. Which issue did you mean to refer to?

@griesemer
Contributor

@odeke-em This was just an e-mail reply - I missed that the mail already referred to this issue. Please ignore.

@aclements
Member

$ greplogs -e 'P has cached GC work' -dashboard -md -l
2018-11-16T23:30:19-6797b32/netbsd-amd64-8_0
2018-11-16T19:46:17-6157dda/linux-amd64-stretch
2018-11-16T17:49:55-53ed92d/nacl-386
2018-11-16T17:33:54-a1025ba/nacl-386
2018-11-14T21:47:50-0a40d45/freebsd-amd64-11_1
2018-11-14T20:32:15-f36e92d/android-arm-wiko-fever
2018-11-13T20:46:39-af07f77/windows-386-2008
2018-11-13T15:08:13-e51b19a/freebsd-amd64-10_3
2018-11-12T20:46:39-7f1dd3a/linux-386-387
2018-11-12T20:46:25-52b2220/windows-amd64-2008
2018-11-12T20:43:55-106db71/linux-386-387
2018-11-12T20:27:21-ec4ae29/linux-mipsle
2018-11-12T20:27:14-4f3604d/android-arm64-wiko-fever
2018-11-10T16:04:18-e4c1fee/solaris-amd64-oraclerel
...
2018-10-11T16:31:24-689321e/freebsd-386-11_2
2018-10-10T14:55:17-29907b1/freebsd-386-10_4
2018-10-10T04:29:55-8f9902d/freebsd-amd64-10_4
2018-10-09T18:20:23-416804f/windows-386-2008
2018-10-09T18:19:59-7d2f46d/freebsd-amd64-10_3
2018-10-08T17:47:49-26d2260/windows-arm
2018-10-06T19:18:34-2bb91e0/nacl-amd64p32
2018-10-06T15:40:03-f90e89e/linux-amd64-nocgo
2018-10-05T21:53:34-9d90716/freebsd-amd64-race
2018-10-03T22:50:25-1961d8d/freebsd-amd64-10_4
2018-10-03T19:54:23-9dac0a8/linux-amd64-nocgo
2018-10-03T17:40:17-048de7b/freebsd-386-11_2
2018-10-03T02:09:38-06ff477/freebsd-386-10_4

It clearly happens on all GOOSes:

      1 darwin
      1 js
      1 solaris
      2 android
      2 netbsd
      3 plan9
      4 openbsd
      5 nacl
      8 windows
     19 linux
     23 freebsd

And has shown up on plenty of GOARCHes:

      1 arm64
      1 wasm
      2 amd64p32
      3 mipsle
      5 arm
     24 386
     33 amd64

The failure is usually, though not always, in cmd/go, but that might just be sampling bias, and the exact subcommand that's running during the failure varies.

@gopherbot

Change https://golang.org/cl/149968 mentions this issue: runtime: improve "P has cached GC work" debug info

@gopherbot

Change https://golang.org/cl/149969 mentions this issue: runtime: debug code to catch bad gcWork.puts

gopherbot pushed a commit that referenced this issue Nov 21, 2018
For #27993.

Change-Id: I20127e8a9844c2c488f38e1ab1f8f5a27a5df03e
Reviewed-on: https://go-review.googlesource.com/c/149968
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@gopherbot

Change https://golang.org/cl/150778 mentions this issue: Revert "runtime: debug code to catch bad gcWork.puts"

gopherbot pushed a commit that referenced this issue Nov 21, 2018
This adds a debug check to throw immediately if any pointers are added
to the gcWork buffer after the mark completion barrier. The intent is
to catch the source of the cached GC work that occasionally produces
"P has cached GC work at end of mark termination" failures.

The result should be that we get "throwOnGCWork" throws instead of "P
has cached GC work at end of mark termination" throws, but with useful
stack traces.

This should be reverted before the release. I've been unable to
reproduce this issue locally, but this issue appears fairly regularly
on the builders, so the intent is to catch it on the builders.

This probably slows down the GC slightly.

For #27993.

Change-Id: I5035e14058ad313bfbd3d68c41ec05179147a85c
Reviewed-on: https://go-review.googlesource.com/c/149969
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@aclements
Member

I got a little data over the weekend with the added debug code.

$ greplogs -dashboard -E "flushedWork|throwOnGCWork" -md
2018-11-26T14:13:53-9fe9853/darwin-amd64-10_12:

runtime: P 5 flushedWork false wbuf1.n=1 wbuf2.n=0
fatal error: P has cached GC work at end of mark termination

2018-11-23T21:21:35-c6d4939/linux-amd64-noopt:

fatal error: throwOnGCWork

runtime stack:
runtime.throw(0x6f386c, 0xd)
	/workdir/go/src/runtime/panic.go:608 +0x72 fp=0x7f034b7fdd48 sp=0x7f034b7fdd18 pc=0x42f4a2
runtime.(*gcWork).putBatch(0xc000021770, 0xc0000217a8, 0x13, 0x200)
	/workdir/go/src/runtime/mgcwork.go:182 +0x1e0 fp=0x7f034b7fdd90 sp=0x7f034b7fdd48 pc=0x424aa0
runtime.wbBufFlush1(0xc000020500)
	/workdir/go/src/runtime/mwbbuf.go:277 +0x1ba fp=0x7f034b7fdde8 sp=0x7f034b7fdd90 pc=0x42b87a
runtime.gcMark(0xa4a16143b5)
	/workdir/go/src/runtime/mgc.go:1932 +0x10b fp=0x7f034b7fde80 sp=0x7f034b7fdde8 pc=0x41e22b
runtime.gcMarkTermination.func1()
	/workdir/go/src/runtime/mgc.go:1501 +0x2a fp=0x7f034b7fde98 sp=0x7f034b7fde80 pc=0x45c57a
runtime.systemstack(0x1)
	/workdir/go/src/runtime/asm_amd64.s:351 +0x66 fp=0x7f034b7fdea0 sp=0x7f034b7fde98 pc=0x45f336
runtime.mstart()
	/workdir/go/src/runtime/proc.go:1153 fp=0x7f034b7fdea8 sp=0x7f034b7fdea0 pc=0x4339c0

goroutine 181 [garbage collection]:
runtime.systemstack_switch()
	/workdir/go/src/runtime/asm_amd64.s:311 fp=0xc000342d60 sp=0xc000342d58 pc=0x45f2c0
runtime.gcMarkTermination(0x3fe295012d50d6c8)
	/workdir/go/src/runtime/mgc.go:1500 +0x178 fp=0xc000342f20 sp=0xc000342d60 pc=0x41d218
runtime.gcMarkDone()
	/workdir/go/src/runtime/mgc.go:1475 +0x168 fp=0xc000342f60 sp=0xc000342f20 pc=0x41cff8
runtime.gcBgMarkWorker(0xc00001e000)
	/workdir/go/src/runtime/mgc.go:1858 +0x294 fp=0xc000342fd8 sp=0xc000342f60 pc=0x41df14
runtime.goexit()
	/workdir/go/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc000342fe0 sp=0xc000342fd8 pc=0x461391
created by runtime.gcBgMarkStartWorkers
	/workdir/go/src/runtime/mgc.go:1679 +0x77

2018-11-21T16:28:17-2a7f904/linux-386-clang:

runtime: P 0 flushedWork false wbuf1.n=1 wbuf2.n=0
fatal error: P has cached GC work at end of mark termination

2018-11-21T16:28:17-2a7f904/linux-mipsle:

runtime: P 0 flushedWork false wbuf1.n=2 wbuf2.n=0
fatal error: P has cached GC work at end of mark termination

@aclements
Member

It's really interesting that in three of the four failures, the check on adding work to the gcWork was never tripped. There's a pretty narrow window, while we're still running the termination detection algorithm, in which work can be added without triggering this check. Perhaps for debugging we should do a second round after setting throwOnGCWork to see if anything was added during this window.

In the one throwOnGCWork failure, it was flushing a write barrier buffer that apparently had 19 (0x13) unmarked pointers in it. Perhaps for debugging we should disable the write barrier buffers, too, when we set throwOnGCWork.

@aclements
Member

aclements commented Nov 26, 2018

FWIW, I was able to reproduce this once out of 1,334 runs of all.bash on my linux/amd64 box.

##### ../test
# go run run.go -- recover2.go
signal: aborted
runtime: P 5 flushedWork false wbuf1.n=1 wbuf2.n=0
fatal error: P has cached GC work at end of mark termination

@andybons
Member

@aclements how serious is this and how long do you think it will take to fix (worst question to ask I know)? We’re asking because we may delay the beta due to it so getting more insight into the scope of the issue will help us with that decision. Thanks.

@bcmills
Contributor

bcmills commented Nov 28, 2018

Another repro in https://storage.googleapis.com/go-build-log/f64c1069/windows-386-2008_34d5c37b.log, if it helps:

            > go list -f '{{.ImportPath}}: {{.Match}}' all ... example.com/m/... ./... ./xyz...
            [stderr]
            runtime: P 0 flushedWork false wbuf1.n=1 wbuf2.n=0
            fatal error: P has cached GC work at end of mark termination

@gopherbot

Change https://golang.org/cl/156017 mentions this issue: runtime: don't spin in checkPut if non-preemptible

@FiloSottile
Contributor

@aclements We're assuming we should wait on this for RC1, let us know if you feel otherwise.

gopherbot pushed a commit that referenced this issue Jan 2, 2019
Currently it's possible for the runtime to deadlock if checkPut is
called in a non-preemptible context. In this case, checkPut may spin,
so it won't leave the non-preemptible context, but the thread running
gcMarkDone needs to preempt all of the goroutines before it can
release the checkPut spin loops.

Fix this by returning from checkPut if it's called under any of the
conditions that would prevent gcMarkDone from preempting it. In this
case, it leaves a note behind that this happened; if the runtime does
later detect left-over work it can at least indicate that it was
unable to catch it in the act.

For #27993.
Updates #29385 (may fix it).

Change-Id: Ic71c10701229febb4ddf8c104fb10e06d84b122e
Reviewed-on: https://go-review.googlesource.com/c/156017
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
@mark-rushakoff
Contributor

CL 156017 didn't prevent the late gcWork put error (but I think that was expected).

Using go version devel +64096dbb69 Wed Jan 2 21:21:53 2019 +0000 darwin/amd64, I repeatedly ran go test -race -run=Task github.com/influxdata/platform/http from https://github.com/influxdata/platform/commit/278f67925cc197f57a32244546fc660df2712e6e (a case where I happened to see a late gcWork put in the past). Here are 20+ more stack traces from about 16 hours of running that in a loop.

1546522489-22255.txt
1546521051-27381.txt
1546515641-1038.txt
1546515404-17480.txt
1546510336-4510.txt
1546506868-9166.txt
1546506362-31361.txt
1546504512-19560.txt
1546502232-32045.txt
1546501261-30542.txt
1546499267-31881.txt
1546495909-24024.txt
1546493582-31537.txt
1546493514-32457.txt
1546491250-1139.txt
1546491175-20329.txt
1546490394-25780.txt
1546490041-8027.txt
1546488386-29902.txt
1546484304-6055.txt
1546483677-8850.txt
1546481108-15098.txt
1546478442-22548.txt
1546471442-16184.txt

@gopherbot

Change https://golang.org/cl/156140 mentions this issue: runtime: work around "P has cached GC work" failures

@aclements
Member

@mark-rushakoff, thanks for the pointer to the influx tests. I'll see if I can reproduce it as easily using those.

For reference, I had to install bzr and fetch that package in module mode, since the HEAD of some of its dependencies is broken:

mkdir /tmp/z
cd /tmp/z
echo "module m" > go.mod
go get -d github.com/influxdata/platform/http@278f679
go test -c -race github.com/influxdata/platform/http

gopherbot pushed a commit that referenced this issue Jan 4, 2019
We still don't understand what's causing there to be remaining GC work
when we enter mark termination, but in order to move forward on this
issue, this CL implements a work-around for the problem.

If debugCachedWork is false, this CL does a second check for remaining
GC work as soon as it stops the world for mark termination. If it
finds any work, it starts the world again and re-enters concurrent
mark. This will increase STW time by a small amount proportional to
GOMAXPROCS, but fixes a serious correctness issue.

This works around #27993.

Change-Id: Ia23b85dd6c792ee8d623428bd1a3115631e387b8
Reviewed-on: https://go-review.googlesource.com/c/156140
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Rick Hudson <rlh@golang.org>
@aclements
Member

CL 156140 has been submitted, which works around this issue, so I'm going to remove release-blocker. We still don't understand the root cause, however, so the issue will remain open.

@gopherbot

Change https://golang.org/cl/156318 mentions this issue: doc/go1.12: remove known issue note

gopherbot pushed a commit that referenced this issue Jan 4, 2019
A workaround has been submitted.

Updates #27993

Change-Id: Ife6443c32673b38000b90dd2efb2985db37ab773
Reviewed-on: https://go-review.googlesource.com/c/156318
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@aclements
Member

I think I've figured it out.

I've been running the influx test with a lot of extra runtime debug logging and have been getting some interesting output. Here's an annotated subset of a failure (go-stress-20190104T162318-083562928; I've posted just the last GC cycle since the log is huge).

The unmarked object is 0xc000575800, which is a _defer.

log (with notes)

[1.915567099 P 1] setGCPhase 1
    Start last GC. 0xc000575800 is 2nd in some G's defer list. WB buffer is full size right now, and we only log on flush, so WB logging can be delayed.
... Lots of stuff on all Ps
[1.922270499 P 2] deferreturn 0xc000602740 link 0xc000575800
    Now head of list. gp._defer = d.link should have added to WB buf as new value.
[1.922273799 P 2] deferreturn 0xc000575800 link 0x0
    Running defer. gp._defer = d.link should have added to WB buf as old value.
[1.922274257 P 2] freedefer 0xc000575800
    Should have added to WB buf as new value when inserting into pool.
[1.922275588 P 2] newdefer from pool 0xc000575800
    Defer gotten from pool. Again should have added to WB buf.
[1.922276023 P 2] newdefer link defer 0x0
    Now head of list. Should have added to WB buf.
... No logs from P 2. Notably, no WB marks from P 2.
[1.922292123 P 1] gcMarkDone flush
    Starting mark completion. WB buffering disabled.
[1.922293357 P 1] gcMarkDone flushing 1
[1.922293869 P 3] gcMarkDone flushing 3
[1.922294506 P 0] gcMarkDone flushing 0
... No logs from P 2, but plenty from other Ps
[1.922574537 P 3] deferreturn 0xc000575800 link 0x0
    Running defer.
[1.922575525 P 3] wb mark 0xc000575800
    Finally we see it actually getting marked. This mark spins and ultimately fails.
... A bunch of stuff on P 0
[1.922863641 P 2] gcMarkDone flushing 2
    This flushes P 2's WB buffer, which presumably contains 0xc000575800, but P 3 already marked it above, so it doesn't get marked here, and gcMarkDone considers P 2 to be clean.
[1.922902570 P 1] gcMarkDone flush succeeded
    But it shouldn't have, because P 3 has pending work or is dirty.

The high-level sequence of events is:

  1. Object x is white. Suppose there are three Ps (and thus three write barrier buffers). All Ps are clean and have no pending GC work buffers. Write barrier buffer 0 contains x.
  2. P 1 initiates mark completion. Ps 1 and 2 pass the completion barrier successfully.
  3. P 2 performs a write barrier on object x and greys it. (As a result, P 2 now has pending GC work.)
  4. P 0 enters the completion barrier, which flushes its write barrier buffer. Since x was greyed in step 3, this does not grey x. Hence, P 0 passes the completion barrier successfully.

Now all of the Ps have passed the barrier, but in fact P 2 has pending work. If we enter mark termination at this point, there may even be white objects reachable from x.

I'm not sure why this affects defers so disproportionately. It may simply be that defers depend more heavily on write barrier marking than most objects because of their access patterns.

The workaround I committed a couple days ago will fix this problem. I'm still thinking about how to fix it "properly".

I also need to go back to the proof and figure out where reality diverged from the math. I think the core of the problem is that pointers in the write barrier buffer are only "sort of" grey: they haven't been marked yet, so other Ps can't observe their grey-ness. As a result, you wind up with a black-to-"sort-of-grey" pointer, and that isn't strong enough for the proof's assumptions.

@aclements
Member

The proof missed a case. It assumed that new GC work could only be created by consuming GC work from a local work queue. But, in fact, the write barrier allows new work to be created even if the local work queue is empty, as long as some queue is non-empty. In particular, if some queue is non-empty, that means some object is grey. Since that's just in the heap, any P can walk from that object to a white object reachable from it and mark that object using a write barrier, causing work to be added to its local queue (which may have been empty). This implicit communication of work isn't tracked by the "flushed" bit in the algorithm/proof, which is what causes the algorithm to fail.

I'm still thinking about how to fix the algorithm. A stronger algorithm could ensure that all work queues were empty at the beginning of the successful completion round. I think a two-round algorithm could do this. However, there's some cost to this since the GC would be stuck in limbo for the whole second round with no work to do, but unable to enter mark termination.

I think a weaker algorithm is still possible where the write barrier sets a global flag if it creates work during termination detection, and we check that flag after the ragged barrier. I haven't convinced myself that this is right yet.

@aclements
Member

Since we have a workaround and I'm not going to fix the root cause for 1.12, I'm bumping this to 1.13.

@aclements aclements modified the milestones: Go1.12, Go1.13 Jan 8, 2019
@aclements aclements added NeedsFix The path to resolution is known, but the work has not been done. early-in-cycle A change that should be done early in the 3 month dev cycle. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Jan 8, 2019
charlievieth pushed a commit to charlievieth/gostats that referenced this issue Apr 19, 2019
It looks like this may have been hitting the below issue:
  golang/go#27993

These changes appear to mitigate it (though I'm not really sure)
@andybons andybons modified the milestones: Go1.13, Go1.14 Jul 8, 2019
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@gopherbot

Change https://golang.org/cl/262350 mentions this issue: runtime: remove debugCachedWork

gopherbot pushed a commit that referenced this issue Oct 15, 2020
debugCachedWork and all of its dependent fields and code were added to
aid in debugging issue #27993. Now that the source of the problem is
known and mitigated (via the extra work check after STW in gcMarkDone),
these extra checks are no longer required and simply make the code more
difficult to follow.

Remove it all.

Updates #27993

Change-Id: I594beedd5ca61733ba9cc9eaad8f80ea92df1a0d
Reviewed-on: https://go-review.googlesource.com/c/go/+/262350
Trust: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>