Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: runtime.GC can return without finishing sweep #45315

Closed
aclements opened this issue Mar 31, 2021 · 14 comments
Closed

runtime: runtime.GC can return without finishing sweep #45315

aclements opened this issue Mar 31, 2021 · 14 comments
Milestone

Comments

@aclements
Copy link
Member

aclements commented Mar 31, 2021

runtime.TestGCTestIsReachable has been flaky since I added it. The failures look like:

##### GOMAXPROCS=2 runtime -cpu=1,2,4 -quick
runtime: object 0 was not swept
fatal error: IsReachable failed

goroutine 176934 [running]:
runtime.throw(0x688d74, 0x12)
	/workdir/go/src/runtime/panic.go:1126 +0x74 fp=0xc000276628 sp=0xc0002765f8 pc=0x43e334
runtime.gcTestIsReachable(0xc00108e200, 0x10, 0x10, 0x4)
	/workdir/go/src/runtime/mgc.go:2397 +0x2ad fp=0xc000276690 sp=0xc000276628 pc=0x423a0d
runtime.GCTestIsReachable(...)
	/workdir/go/src/runtime/export_test.go:1256
runtime_test.TestGCTestIsReachable(0xc00090d380)
	/workdir/go/src/runtime/gc_test.go:249 +0x25c fp=0xc000276780 sp=0xc000276690 pc=0x58a8fc
testing.tRunner(0xc00090d380, 0x69c980)
	/workdir/go/src/testing/testing.go:1235 +0x102 fp=0xc0002767d0 sp=0xc000276780 pc=0x4eee02
runtime.goexit()
	/workdir/go/src/runtime/asm_amd64.s:1454 +0x1 fp=0xc0002767d8 sp=0xc0002767d0 pc=0x47b861
created by testing.(*T).Run
	/workdir/go/src/testing/testing.go:1280 +0x3ab

Some examples:

https://build.golang.org/log/6664ed5c503173b3f29eb85cb093baa61fb04838
https://build.golang.org/log/5608852a242f93bb2f36b6a56bebdefda3797e82
https://build.golang.org/log/231e6d68ccc2b858f8b002b45596356d8beb856c

I'm sure this has something to do with GC scheduling.

I haven't had any luck reproducing this locally. I'm now trying on the gomotes.

gopool create -setup 'gomote push $VM && gomote run $VM go/src/make.bash && gomote run $VM go/bin/go test -c -o ../src/runtime/runtime.test runtime' linux-amd64-sid
stress2 -p 5 gopool run 'gomote run $VM go/src/runtime/runtime.test -test.short -test.cpu=1,2,4'
@aclements aclements added this to the Go1.17 milestone Mar 31, 2021
@aclements aclements self-assigned this Mar 31, 2021
@aclements
Copy link
Member Author

aclements commented Apr 1, 2021

I've added some debuglogs and am running with

gopool create -setup 'gomote push $VM && gomote run $VM go/src/make.bash && gomote run $VM go/bin/go test -c -o ../src/runtime/runtime.test -tags debuglog runtime' linux-amd64-sid
stress2 -p 5 -timeout 20m gopool run 'gomote run $VM go/src/runtime/runtime.test -quick -test.cpu=1,2,4'

Here's a debug log from a test failure:

[405.756910307 P 0] GC STW
[405.756935357 P 0] GC cycle 29670
[405.758808137 P 1] mark term release gcsema
[405.760135621 P 1] gcTest:semacquire
[405.760137842 P 1] gcTest:addSpecial  0xc00028c000
[405.760139981 P 1] gcTest:addSpecial  0xc00028c028
[405.760140720 P 1] gcTest:addSpecial  0xc00028c030
[405.760141367 P 1] gcTest:addSpecial  0xc00028c038
[405.760141901 P 1] gcTest:addSpecial  0xc00028c040
[405.760142377 P 1] gcTest:addSpecial  0xc00028c048
[405.760142987 P 1] gcTest:addSpecial  0xc00028c050
[405.760143558 P 1] gcTest:addSpecial  0xc00028c058
[405.760144140 P 1] gcTest:addSpecial  0xc00028c060
[405.760144730 P 1] gcTest:addSpecial  0xc00028c068
[405.760145278 P 1] gcTest:addSpecial  0xc00028c070
[405.760145665 P 1] gcTest:addSpecial  0xc00028c078
[405.760146214 P 1] gcTest:addSpecial  0xc00028c080
[405.760146700 P 1] gcTest:addSpecial  0xc00028c088
[405.760147205 P 1] gcTest:addSpecial  0xc00028c090
[405.760147760 P 1] gcTest:addSpecial  0xc00028c098
[405.760148314 P 1] gcTest:semrelease
[405.760148930 P 1] GC n= 29670
[405.760149453 P 1] GC:gcStart  29671
[405.760150119 P 1] acquire gcsema
[405.760387061 P 1] GC STW
[405.760405580 P 1] GC cycle 29671
[405.760424513 P 1] GC:gcWaitOnMark  29671
[405.762090357 P 0] GC:sweepone  29671
[405.762322549 P 0] gcTest:check
[405.762383414 P 1] freeSpecial 0xc00028c000 29671
[405.762384649 P 1] freeSpecial 0xc00028c028 29671
[405.762385050 P 1] freeSpecial 0xc00028c030 29671
[405.762385481 P 1] freeSpecial 0xc00028c038 29671
[405.762385777 P 1] freeSpecial 0xc00028c040 29671
[405.762386091 P 1] freeSpecial 0xc00028c048 29671
[405.762386348 P 1] freeSpecial 0xc00028c050 29671
[405.762386697 P 1] freeSpecial 0xc00028c058 29671
[405.762387014 P 1] freeSpecial 0xc00028c060 29671
[405.762387346 P 1] freeSpecial 0xc00028c068 29671
[405.762387692 P 1] freeSpecial 0xc00028c070 29671
[405.762387966 P 1] freeSpecial 0xc00028c078 29671
[405.762388174 P 1] freeSpecial 0xc00028c080 29671
[405.762388718 P 1] freeSpecial 0xc00028c088 29671
[405.762389014 P 1] freeSpecial 0xc00028c090 29671
[405.762389380 P 1] freeSpecial 0xc00028c098 29671

So the test is triggering a GC that ultimately frees the objects, but runtime.GC() is returning before the GC cycle is complete.

@aclements
Copy link
Member Author

Some progress. runtime.GC() is successfully waiting for the mark to complete, but returning before the sweep is complete. We could work around this in IsReachable, but this is actually a bug inruntime.GC().

@aclements
Copy link
Member Author

I think I understand what's going on. Sweeping tracks two different types of completion: "we've swept or started sweeping all spans" (confusingly called mheap_.sweepdone) and "all spans have been swept". GC correctness depends on mheap_.sweepdone and we get that one right, but runtime.GC() and a few other minor things depend on the latter condition, and we get that wrong.

For "all spans have been swept", we track that we've exited all calls to sweepone and there are no more spans that haven't started sweeping, but that's not the only way that spans get swept. Hence, the final sweepone could exit with no more spans in the queue and declare that all spans have been swept, but there could still be concurrent sweeps from other sources working on the last few spans. In TestGCTestIsReachable, the result is that runtime.GC() returns while the span with all the test's objects is actively being swept.

@aclements aclements changed the title runtime: TestGCTestIsReachable flakes runtime: runtime.GC can return without finishing sweep Apr 1, 2021
@bcmills
Copy link
Contributor

bcmills commented Apr 6, 2021

Do you expect to have a fix soon, or can the test be disabled until it is fixed?
I'm concerned that the failure rate on the builders could mask regressions that lead to other low-frequency failures.

@gopherbot
Copy link

Change https://golang.org/cl/307809 mentions this issue: runtime: deflake TestGCTestIsReachable

@aclements
Copy link
Member Author

I just committed a workaround in TestGCTestIsReachable that should stop the test flakes.

gopherbot pushed a commit that referenced this issue Apr 6, 2021
This is a simple workaround for a bug where runtime.GC() can return
before finishing a full sweep, causing gcTestIsReachable to throw. The
right thing is to fix runtime.GC(), but this should get this test
passing reliably in the meantime.

Updates #45315.

Change-Id: Iae141e6dbb26a9c2649497c1feedd4aaeaf540c7
Reviewed-on: https://go-review.googlesource.com/c/go/+/307809
Trust: Austin Clements <austin@google.com>
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
@gopherbot
Copy link

Change https://golang.org/cl/307916 mentions this issue: runtime: consolidate "is sweep done" conditions

@gopherbot
Copy link

Change https://golang.org/cl/307915 mentions this issue: runtime: block sweep completion on all sweep paths

gopherbot pushed a commit that referenced this issue Apr 12, 2021
The runtime currently has two different notions of sweep completion:

1. All spans are either swept or have begun sweeping.

2. The sweeper has *finished* sweeping all spans.

Having both is confusing (it doesn't help that the documentation is
often unclear or wrong). Condition 2 is stronger and the theoretical
slight optimization that condition 1 could impact is never actually
useful. Hence, this CL consolidates both conditions down to condition 2.

Updates #45315.

Change-Id: I55c84d767d74eb31a004a5619eaba2e351162332
Reviewed-on: https://go-review.googlesource.com/c/go/+/307916
Trust: Austin Clements <austin@google.com>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
@networkimprov
Copy link

Backport candidate?

@aclements
Copy link
Member Author

I'm not sure this is really important enough to backport. It's incredibly difficult to trigger and difficult to observe even if you do trigger it. The most likely user effect is that you may observe a heap profile after calling runtime.GC() that's wrong by a few spans worth of memory.

@mknyszek
Copy link
Contributor

mknyszek commented Jul 9, 2021

As per #46500 (comment) this is unfortunately still possible. I will fix #46500 with an extra runtime.GC call, but this should be resolved.

#46500 (comment) sketches out a fix. I'll re-upload the fix sketch CL too with this issue attached.

@mknyszek mknyszek reopened this Jul 9, 2021
@gopherbot
Copy link

Change https://golang.org/cl/333389 mentions this issue: runtime: fix sweep termination condition

@gopherbot
Copy link

Change https://golang.org/cl/333549 mentions this issue: runtime/pprof: call runtime.GC twice in memory profile test

@mknyszek mknyszek modified the milestones: Go1.17, Go1.18 Jul 9, 2021
gopherbot pushed a commit that referenced this issue Jul 9, 2021
This change fixes #46500 by working around #45315 which may cause freed
objects to get missed in the heap profile published for the test.

By calling runtime.GC one more time this change ensures that all freed
objects are accounted for.

Fixes #46500.

Change-Id: Iedcd0b37dbaffa688b0ff8631a8b79f7a1169634
Reviewed-on: https://go-review.googlesource.com/c/go/+/333549
Trust: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
@aclements
Copy link
Member Author

@mknyszek , I think your analysis in #46500 (comment) is exactly right. sweepLocker covers the window while the span is locked for sweeping, but that window doesn't include the actual termination check (sweep list is empty), so we can still race.

Combining sweepers and sweepDrained, as you proposed seems like a reasonable solution, though I haven't looked at your code draft yet. I will say that if you make sweepDrained the low bit, then you don't need as many CAS operations.

@golang golang locked and limited conversation to collaborators Jun 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants