runtime: latency in sweep assists when there's no garbage #18155

Open
rhysh opened this issue Dec 2, 2016 · 39 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@rhysh
Contributor

rhysh commented Dec 2, 2016

go version devel +ae9712c766 Thu Dec 1 22:20:31 2016 +0000 darwin/amd64, and linux/amd64

I wrote a test case for the GC latency described by @WillSewell in https://blog.pusher.com/golangs-real-time-gc-in-theory-and-practice/ . I recorded the timestamps of when pauses (during allocation) were observed, and matched them against GODEBUG=gctrace=1 output.

The pauses come after the GC cycle, during the concurrent sweep phase. Also, they happen as the heap grows and do not happen once the heap size is stable. And the pauses are often longer than the clock time of any single gctrace line.

A goroutine that needs to allocate memory right after a GC cycle is required to help with the sweep work. This means looking through the heap until suitable memory is found. If the heap is growing rapidly with little or no garbage created, this mutator goroutine will need to search for a long time: it can end up sweeping half of the heap (shared with the background sweep worker).

The GC is pretty good at controlling the mark assist burden on any single goroutine, and at starting the GC cycle before the allowable heap growth is exhausted, allowing tradeoffs between assist burden and heap growth during the cycle. These protections don't seem to be in place for sweep assists.

The execution tracer's view of this is confusing since it emits events for each call to runtime.(*mspan).sweep, which is called in a loop from runtime.(*mcentral).cacheSpan. This makes it look like the user goroutine is making a lot of progress on sweep work, since a single call to runtime.mallocgc can lead to thousands of tiny "SWEEP" blocks.
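A toy model of this loop may make the shape of the problem clearer. This is a sketch with made-up types, not the runtime's actual code; it only mimics the "sweep until a free slot turns up" behavior described above.

package main

import "fmt"

// span is a stand-in for a heap span: after sweeping, it either has a free
// object slot or it is completely full.
type span struct{ hasFree bool }

// cacheSpan mimics the allocation slow path sketched above: sweep candidate
// spans one at a time until one yields a free slot, counting how many spans
// were swept. Each iteration would show up as one tiny "SWEEP" block in the
// execution trace.
func cacheSpan(spans []span) (swept int, found bool) {
	for _, s := range spans {
		swept++
		if s.hasFree {
			return swept, true
		}
	}
	// No span had free space: the caller falls back to growing the heap.
	return swept, false
}

func main() {
	// A rapidly growing heap with no garbage: every swept span is full, so
	// a single allocation can end up walking a huge number of spans.
	full := make([]span, 100000)
	swept, found := cacheSpan(full)
	fmt.Printf("swept %d spans, found a free slot: %v\n", swept, found)
}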

Frustrated assists like this are perhaps more likely to happen in microbenchmarks than in production applications, but the way sweep work is shown in the execution tracer makes it hard to say for sure. This could be a contributing factor to #14812 — one of the goroutines in the recent execution trace spent over 17ms in sweep assists.

The test is below, followed by some example runs.

package gc_test

import (
	"flag"
	"testing"
	"time"
)

var (
	start       = time.Now()
	allocSize   = flag.Int("alloc-size", 1000, "Bytes per allocation")
	retainCount = flag.Int("retain-count", 1000000, "Number of allocations to retain")
	loopCount   = flag.Int("loop-count", 10, "Number of times to execute loop")
	preallocate = flag.Bool("preallocate", false, "Preallocate retention slice")
)

func TestHeapGrowthStalls(t *testing.T) {
	var mem [][]byte
	if *preallocate {
		mem = make([][]byte, *retainCount)
	}

	var slowDelta time.Duration

	for i := 0; i < (*loopCount)*(*retainCount); i++ {
		// Allow the program's heap to start small, instead of pre-allocating
		// the entire slice.
		if len(mem) < *retainCount {
			mem = append(mem, nil)
		}

		j := i % len(mem)
		before := time.Now()
		mem[j] = make([]byte, *allocSize)
		delta := time.Since(before)
		if delta > slowDelta || (delta > time.Millisecond && delta > slowDelta/2) {
			slowDelta = delta
			at := time.Since(start)
			t.Logf("i=%-8d at=%.3fs   delta=%.3fms",
				i,
				at.Seconds(),
				delta.Seconds()*float64(time.Second/time.Millisecond))
		}
	}
}

On a darwin/amd64 laptop:

$ GODEBUG=gctrace=1 /tmp/gc.test -test.v
=== RUN   TestHeapGrowthStalls
gc 1 @0.002s 1%: 0.035+0.14+0.019 ms clock, 0.10+0.047/0.12/0.12+0.059 ms cpu, 4->4->3 MB, 5 MB goal, 8 P
gc 2 @0.003s 2%: 0.003+0.63+0.028 ms clock, 0.024+0.054/0.17/0.14+0.22 ms cpu, 6->7->6 MB, 7 MB goal, 8 P
gc 3 @0.004s 3%: 0.003+0.29+0.034 ms clock, 0.026+0.16/0.16/0.095+0.27 ms cpu, 12->12->12 MB, 13 MB goal, 8 P
gc 4 @0.007s 4%: 0.004+1.2+0.048 ms clock, 0.033+0.94/0.059/0.29+0.39 ms cpu, 23->23->22 MB, 24 MB goal, 8 P
gc 5 @0.012s 3%: 0.005+0.59+0.024 ms clock, 0.044+0.054/0.59/0.58+0.19 ms cpu, 44->44->42 MB, 45 MB goal, 8 P
gc 6 @0.020s 2%: 0.005+5.9+0.033 ms clock, 0.040+0.093/1.1/5.7+0.26 ms cpu, 84->84->80 MB, 85 MB goal, 8 P
gc 7 @0.045s 2%: 0.006+1.7+0.036 ms clock, 0.051+0.25/2.2/2.3+0.29 ms cpu, 156->157->152 MB, 160 MB goal, 8 P
gc 8 @0.077s 1%: 0.006+2.1+0.027 ms clock, 0.055+1.0/2.9/7.3+0.21 ms cpu, 297->300->286 MB, 305 MB goal, 8 P
gc 9 @0.154s 1%: 0.008+4.4+0.044 ms clock, 0.070+2.3/4.6/18+0.35 ms cpu, 558->563->537 MB, 573 MB goal, 8 P
gc 10 @0.268s 1%: 0.008+7.5+0.044 ms clock, 0.067+3.7/12/32+0.35 ms cpu, 1047->1056->1004 MB, 1074 MB goal, 8 P
gc 11 @0.428s 1%: 0.008+7.8+0.043 ms clock, 0.070+1.9/13/36+0.34 ms cpu, 1959->1974->1018 MB, 2009 MB goal, 8 P
gc 12 @0.924s 0%: 0.008+8.6+0.17 ms clock, 0.071+1.8/15/38+1.3 ms cpu, 1985->2004->1022 MB, 2036 MB goal, 8 P
gc 13 @1.426s 0%: 0.006+9.0+0.034 ms clock, 0.053+1.8/17/41+0.27 ms cpu, 1993->2000->1010 MB, 2044 MB goal, 8 P
gc 14 @1.658s 0%: 0.008+9.4+0.026 ms clock, 0.070+1.8/15/45+0.21 ms cpu, 1969->1982->1015 MB, 2020 MB goal, 8 P
gc 15 @1.905s 0%: 0.015+9.5+0.075 ms clock, 0.12+1.9/18/41+0.60 ms cpu, 1980->1994->1017 MB, 2031 MB goal, 8 P
gc 16 @2.143s 0%: 0.016+11+0.045 ms clock, 0.12+1.8/21/54+0.36 ms cpu, 1983->1999->1019 MB, 2034 MB goal, 8 P
gc 17 @2.385s 0%: 0.015+10+0.066 ms clock, 0.12+2.4/20/49+0.53 ms cpu, 1987->2002->1018 MB, 2038 MB goal, 8 P
gc 18 @2.616s 0%: 0.015+11+0.077 ms clock, 0.12+2.4/15/59+0.61 ms cpu, 1985->2004->1021 MB, 2036 MB goal, 8 P
--- PASS: TestHeapGrowthStalls (2.85s)
	gc_test.go:42: i=0        at=0.000s   delta=0.000ms
	gc_test.go:42: i=2        at=0.000s   delta=0.001ms
	gc_test.go:42: i=18       at=0.000s   delta=0.002ms
	gc_test.go:42: i=26       at=0.001s   delta=0.002ms
	gc_test.go:42: i=114      at=0.001s   delta=0.003ms
	gc_test.go:42: i=170      at=0.001s   delta=0.004ms
	gc_test.go:42: i=218      at=0.001s   delta=0.014ms
	gc_test.go:42: i=3226     at=0.002s   delta=0.157ms
	gc_test.go:42: i=12016    at=0.005s   delta=0.385ms
	gc_test.go:42: i=22534    at=0.009s   delta=0.408ms
	gc_test.go:42: i=41997    at=0.014s   delta=0.797ms
	gc_test.go:42: i=79541    at=0.028s   delta=1.451ms
	gc_test.go:42: i=152493   at=0.051s   delta=3.489ms
	gc_test.go:42: i=286213   at=0.084s   delta=5.107ms
	gc_test.go:42: i=286229   at=0.087s   delta=2.922ms
	gc_test.go:42: i=536053   at=0.169s   delta=10.281ms
	gc_test.go:42: i=1001709  at=0.295s   delta=20.094ms
PASS

The at=0.169s delta=10.281ms line matches up with gc 9 @0.154s 1%: 0.008+4.4+0.044 ms clock. The GC cycle began at 0.154s and completed around 0.158s. The record-breaking allocation delay finished at 0.169s and began at 0.159s.

On a linux/amd64 server:

$ GOMAXPROCS=8 GODEBUG=gctrace=1 ./gc.test.linux_amd64 -test.v -retain-count=20000000 -preallocate=true -loop-count=3
=== RUN   TestHeapGrowthStalls
gc 1 @0.045s 11%: 0.11+21+0.072 ms clock, 0.22+20/43/106+0.14 ms cpu, 458->459->458 MB, 459 MB goal, 8 P
gc 2 @0.068s 14%: 0.004+27+0.10 ms clock, 0.036+0.063/46/93+0.87 ms cpu, 459->513->513 MB, 917 MB goal, 8 P
gc 3 @0.172s 10%: 0.019+30+0.084 ms clock, 0.15+0.20/54/135+0.67 ms cpu, 739->785->784 MB, 1027 MB goal, 8 P
gc 4 @0.389s 6%: 0.014+33+0.039 ms clock, 0.11+0.71/64/161+0.31 ms cpu, 1315->1388->1388 MB, 1569 MB goal, 8 P
gc 5 @0.795s 4%: 0.026+33+0.044 ms clock, 0.21+1.6/62/164+0.35 ms cpu, 2486->2557->2557 MB, 2777 MB goal, 8 P
gc 6 @1.589s 3%: 0.025+84+0.046 ms clock, 0.20+2.3/159/406+0.37 ms cpu, 4780->4915->4915 MB, 5115 MB goal, 8 P
gc 7 @3.202s 2%: 0.015+63+0.041 ms clock, 0.12+4.3/126/313+0.32 ms cpu, 9377->9499->9499 MB, 9830 MB goal, 8 P
gc 8 @6.350s 1%: 0.018+204+0.091 ms clock, 0.14+55/395/998+0.72 ms cpu, 18438->18685->18685 MB, 18999 MB goal, 8 P
gc 9 @13.058s 1%: 0.019+300+0.066 ms clock, 0.15+112/565/1496+0.52 ms cpu, 36436->36706->20259 MB, 37370 MB goal, 8 P
gc 10 @34.001s 0%: 0.017+265+0.11 ms clock, 0.13+83/507/1313+0.91 ms cpu, 39506->39732->20215 MB, 40519 MB goal, 8 P
--- PASS: TestHeapGrowthStalls (53.37s)
        gc_test.go:42: i=0        at=0.045s   delta=0.581ms
        gc_test.go:42: i=57055    at=0.097s   delta=2.220ms
        gc_test.go:42: i=331266   at=0.200s   delta=2.323ms
        gc_test.go:42: i=334871   at=0.221s   delta=19.590ms
        gc_test.go:42: i=952975   at=0.454s   delta=32.232ms
        gc_test.go:42: i=2150327  at=0.889s   delta=61.852ms
        gc_test.go:42: i=4564455  at=1.799s   delta=126.545ms
        gc_test.go:42: i=9258703  at=3.533s   delta=268.414ms
        gc_test.go:42: i=18664799 at=7.354s   delta=799.231ms
PASS

/cc @aclements @RLH

@aclements
Member

Very interesting. I knew this problem existed for large allocations, but hadn't really thought about it with small allocations.

Assuming I understand what's going on, this isn't really a "sweep assist" problem, since it would happen even if the sweep ratio was 0. Sweep assists don't care if they found more space or not; you get credit for sweeping a span regardless of the outcome. The problem is that allocation ultimately has to find a span with some free space on it and currently it will keep sweeping until it does.

I might have a simple fix: put a threshold on how many spans allocation is willing to sweep before it gives up and just allocates a fresh span from the heap. The heap overhead of this is bounded at 1/threshold.
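A minimal standalone sketch of that threshold idea (toy types and invented names, not an actual patch); the overhead bound falls out of the fact that the threshold is only reached after that many provably full spans have been swept. The change that eventually landed (CL 229998, quoted later in this thread) bounds the sweep to 100 spans in a similar spirit.

package main

import "fmt"

type span struct{ hasFree bool }

// maxSweep caps how many full spans an allocating goroutine will sweep
// before giving up and taking a fresh span from the heap.
const maxSweep = 100

func cacheSpanCapped(spans []span, grow func() span) span {
	for i, s := range spans {
		if i == maxSweep {
			// maxSweep spans are known to be completely full, so the fresh
			// span adds at most ~1/maxSweep (here 1%) of heap overhead.
			break
		}
		if s.hasFree {
			return s
		}
	}
	return grow()
}

func main() {
	full := make([]span, 1000000) // a dense heap: no free slots anywhere
	s := cacheSpanCapped(full, func() span { return span{hasFree: true} })
	fmt.Println("fell back to a fresh span:", s.hasFree)
}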

@aclements
Member

There are actually a few distinct causes of this.

  • One is when sweep has to process a lot of full spans from an mcache before finding one that has space to allocate from. My proposed fix targets this, but doesn't actually work because it just shifts the time into the next problem. (All + timestamps below are in ns.)
long cacheSpan @0.286 10ms, 83577 swept, 0 reclaimed, 1 grow
+27 (p 0) lock mcentral
+71 (p 0) locked; scanning lists
+10539832 (p 0) grow()
+10539993 (p 0) mheap.alloc()
+10540090 (p 0) mheap.alloc_m()
+10540149 (p 0) got lock
+10540220 (p 0) reclaim
+10542490 (p 0) reclaim done
+10542536 (p 0) allocSpanLocked
+10542977 (p 0) initialize span
+10543065 (p 0) unlock
+10543110 (p 0) mheap.alloc_m done
+10543166 (p 0) mheap.alloc done
+10543204 (p 0) initSpan
+10543524 (p 0) grow done
+10543588 (p 0) have span
  • When we go to allocate a fresh span, either because of the fallback in my fix or because we're genuinely out of spans of that size class, the heap also tries to sweep. This is not to find individual free objects, but to free up whole spans before allocating a new one and potentially growing the RSS. This is, in a sense, even more aggressive since it's trying to find completely empty spans. I've been aware of this one for a while and have various schemes for how to fix it, but none that are simple. This doesn't happen very often, but when it does it's really expensive.
long cacheSpan @2.890 22ms, 0 swept, 1 reclaimed, 1 grow
+25 (p 0) lock mcentral
+70 (p 0) locked; scanning lists
+121 (p 0) grow()
+263 (p 0) mheap.alloc()
+380 (p 0) mheap.alloc_m()
+480 (p 0) got lock
+525 (p 0) reclaim
+22056552 (p 0) reclaim done
+22056982 (p 0) allocSpanLocked
+22057164 (p 0) initialize span
+22057377 (p 0) unlock
+22057418 (p 0) mheap.alloc_m done
+22057638 (p 0) memclr
+22058327 (p 0) mheap.alloc done
+22058448 (p 0) initSpan
+22058975 (p 0) grow done
+22059177 (p 0) have span

With my proposed fix, we fall into this case any time we would otherwise have fallen into the previous case. For this particular application, the reclaim sweep has just as much trouble finding free space as the mcentral sweep. This may be less true in a more heterogeneous application.

long cacheSpan @0.161 5ms, 50+0 swept, 0 skipped, 1 grow
+27 (p 0) lock mcentral
+66 (p 0) locked; scanning lists
+5336 (p 0) retry failed
+5375 (p 0) grow()
+5425 (p 0) mheap.alloc()
+5485 (p 0) mheap.alloc_m()
+5525 (p 0) got lock
+5556 (p 0) reclaim
+5092799 (p 0) reclaim done
+5093019 (p 0) allocSpanLocked
+5093269 (p 0) initialize span
+5093403 (p 0) unlock
+5093441 (p 0) mheap.alloc_m done
+5093631 (p 0) memclr
+5098431 (p 0) mheap.alloc done
+5098462 (p 0) initSpan
+5100228 (p 0) grow done
+5100424 (p 0) have span
  • Every once in a while when I run the benchmark, I get dozens of ~10ms stalls. When this happens, it's because memclr in mheap.alloc_m takes 5--14ms. This almost certainly has something to do with page faults and kernel state, though I don't know what. I have a hunch it has to do with transparent huge page scavenging (not clearing; clearing a huge page takes 40 us), but that's purely a hunch.
long cacheSpan @1.643 9ms, 0 swept, 0 reclaimed, 1 grow
+28 (p 0) lock mcentral
+74 (p 0) locked; scanning lists
+194 (p 0) grow()
+247 (p 0) mheap.alloc()
+307 (p 0) mheap.alloc_m()
+350 (p 0) got lock
+391 (p 0) allocSpanLocked
+701 (p 0) initialize span
+761 (p 0) unlock
+804 (p 0) mheap.alloc_m done
+843 (p 0) memclr
+9365603 (p 0) mheap.alloc done
+9366045 (p 0) initSpan
+9366973 (p 0) grow done
+9367295 (p 0) have span

@aclements aclements added the NeedsFix The path to resolution is known, but the work has not been done. label Dec 6, 2016
@aclements aclements added this to the Go1.8Maybe milestone Dec 6, 2016
@gopherbot

CL https://golang.org/cl/34291 mentions this issue.

@aclements
Member

@rhysh, care to give CL 34291 a shot? I don't think that's entirely the right approach, but it definitely makes things better.

@aclements aclements modified the milestones: Go1.9, Go1.8Maybe Dec 13, 2016
@rhysh
Contributor Author

rhysh commented Jan 20, 2017

Hi @aclements , yes CL 34291 PS 1 makes things better. It's particularly apparent when the heap is large, and the stalls of over 500ms are reduced to less than 100ms.

Here's the output I got from go1.8rc2 with CL 34291 PS 1 cherry-picked on linux/amd64:

$ GOMAXPROCS=8 GODEBUG=gctrace=1 ./gc.test.go18rc2.cl34291ps1.linux_amd64 -test.v -retain-count=20000000 -preallocate=true -loop-count=3
=== RUN   TestHeapGrowthStalls
gc 1 @0.043s 13%: 0.17+25+0.070 ms clock, 0.35+23/49/122+0.14 ms cpu, 458->459->458 MB, 459 MB goal, 8 P
gc 2 @0.069s 15%: 0.007+34+0.13 ms clock, 0.057+0.12/57/119+1.1 ms cpu, 459->524->524 MB, 917 MB goal, 8 P
gc 3 @0.182s 11%: 0.008+29+0.054 ms clock, 0.070+0.29/57/144+0.43 ms cpu, 748->800->800 MB, 1048 MB goal, 8 P
gc 4 @0.397s 7%: 0.016+29+0.043 ms clock, 0.13+0.75/56/142+0.34 ms cpu, 1332->1375->1375 MB, 1601 MB goal, 8 P
gc 5 @0.801s 4%: 0.013+33+0.066 ms clock, 0.10+1.5/64/162+0.53 ms cpu, 2482->2555->2555 MB, 2750 MB goal, 8 P
gc 6 @1.604s 3%: 0.013+41+0.033 ms clock, 0.11+2.3/79/202+0.26 ms cpu, 4792->4876->4876 MB, 5111 MB goal, 8 P
gc 7 @3.199s 2%: 0.020+85+0.063 ms clock, 0.16+4.3/158/368+0.50 ms cpu, 9367->9524->9524 MB, 9752 MB goal, 8 P
gc 8 @6.374s 1%: 0.019+149+0.10 ms clock, 0.15+19/280/682+0.80 ms cpu, 18515->18767->18767 MB, 19049 MB goal, 8 P
gc 9 @12.683s 1%: 0.018+169+0.054 ms clock, 0.14+31/334/834+0.43 ms cpu, 36597->36799->20191 MB, 37535 MB goal, 8 P
gc 10 @37.305s 0%: 0.026+190+0.040 ms clock, 0.21+39/363/912+0.32 ms cpu, 39373->39596->20212 MB, 40382 MB goal, 8 P
--- PASS: TestHeapGrowthStalls (55.54s)
	gc_test.go:42: i=0        at=0.043s   delta=0.710ms
	gc_test.go:42: i=68087    at=0.107s   delta=0.870ms
	gc_test.go:42: i=331280   at=0.202s   delta=2.343ms
	gc_test.go:42: i=351240   at=0.230s   delta=1.670ms
	gc_test.go:42: i=940256   at=0.456s   delta=4.094ms
	gc_test.go:42: i=2150448  at=0.898s   delta=7.048ms
	gc_test.go:42: i=4529512  at=1.789s   delta=22.685ms
	gc_test.go:42: i=9294784  at=3.549s   delta=39.208ms
	gc_test.go:42: i=18770456 at=7.122s   delta=88.855ms
PASS

The longest stall, 89ms, started around (7.122s-0.089s = 7.033s), which is not immediately after the previous GC, which ended around (6.374s+0.149s = 6.523s). It's likely still during the concurrent sweep phase (which takes around 560ms, based on the go1.8rc2 run below), but it doesn't start at the beginning of the sweep phase.

Compared with the output from vanilla go1.8rc2:

$ GOMAXPROCS=8 GODEBUG=gctrace=1 ./gc.test.go18rc2.linux_amd64 -test.v -retain-count=20000000 -preallocate=true -loop-count=3
=== RUN   TestHeapGrowthStalls
gc 1 @0.038s 13%: 0.17+21+0.080 ms clock, 0.35+20/42/105+0.16 ms cpu, 458->459->458 MB, 459 MB goal, 8 P
gc 2 @0.061s 15%: 0.006+29+0.076 ms clock, 0.055+0.21/49/99+0.61 ms cpu, 459->502->502 MB, 917 MB goal, 8 P
gc 3 @0.171s 10%: 0.009+26+0.051 ms clock, 0.073+0.25/48/121+0.41 ms cpu, 729->785->785 MB, 1005 MB goal, 8 P
gc 4 @0.369s 6%: 0.013+29+0.10 ms clock, 0.10+0.69/57/143+0.83 ms cpu, 1311->1347->1347 MB, 1570 MB goal, 8 P
gc 5 @0.785s 4%: 0.015+48+0.062 ms clock, 0.12+1.6/82/175+0.49 ms cpu, 2441->2506->2505 MB, 2694 MB goal, 8 P
gc 6 @1.594s 2%: 0.011+43+0.060 ms clock, 0.090+2.5/82/211+0.48 ms cpu, 4716->4805->4805 MB, 5011 MB goal, 8 P
gc 7 @3.143s 2%: 0.016+73+0.030 ms clock, 0.13+4.8/145/364+0.24 ms cpu, 9238->9366->9365 MB, 9610 MB goal, 8 P
gc 8 @6.224s 1%: 0.019+139+0.037 ms clock, 0.15+20/267/697+0.30 ms cpu, 18241->18472->18472 MB, 18731 MB goal, 8 P
gc 9 @12.331s 1%: 0.020+152+0.038 ms clock, 0.16+30/302/755+0.30 ms cpu, 36020->36228->20196 MB, 36944 MB goal, 8 P
gc 10 @32.834s 0%: 0.013+209+0.058 ms clock, 0.10+36/397/1053+0.46 ms cpu, 39383->39624->20229 MB, 40393 MB goal, 8 P
--- PASS: TestHeapGrowthStalls (55.93s)
	gc_test.go:42: i=0        at=0.039s   delta=1.323ms
	gc_test.go:42: i=45710    at=0.091s   delta=1.879ms
	gc_test.go:42: i=331286   at=0.195s   delta=2.284ms
	gc_test.go:42: i=335222   at=0.206s   delta=9.855ms
	gc_test.go:42: i=910622   at=0.446s   delta=48.938ms
	gc_test.go:42: i=2097248  at=0.904s   delta=70.692ms
	gc_test.go:42: i=4451669  at=1.777s   delta=140.077ms
	gc_test.go:42: i=9121877  at=3.487s   delta=271.325ms
	gc_test.go:42: i=18446653 at=6.923s   delta=560.603ms
PASS

I'm not aware of any production applications that this issue affects: observing the issue (if I understand it correctly) seems to require the unusual combination of 1) a growing heap, 2) very little garbage to reclaim, and 3) caring about latency in individual goroutines.

@WillSewell

WillSewell commented Feb 14, 2017

I'm trying to understand this problem, and there are a few things that are confusing me. In my original benchmark, when I visualise a run of the program with go tool trace, the pauses look like this:

[trace screenshot]

i.e. all processors are running a combination of runtime.gcBgMarkWorker and MARK and GC.

When I visualise the benchmark listed at the top of this issue, the pauses look like this:

[trace screenshot]

This also has all processors running runtime.gcBgMarkWorker at the start, and processors running MARK at the end, but only for a very short time (<0.05ms). However, one processor runs runtime.gcBgMarkWorker and another runs GC for considerably longer.

A couple of things about this confuses me:

  1. Why is the work done during the pauses not quite the same? I appreciate this could just be due to a larger heap size.
  2. From what I can tell (just from the names of the events), all the work is to do with the mark phase. What about this indicates it's pausing to do sweep assist work? Is it to do with the mysteriously named GC event?
  3. I came to the conclusion that the problem we are seeing in the visualisation is #16528 (runtime: idle mark workers run when there's work to do). Is that not correct? Or am I seeing #16528 in the above screenshots, and missing the issue discussed here in the trace visualisations?

Happy to take this offline if someone would be willing to explain this to me.

@gopherbot

Change https://golang.org/cl/60790 mentions this issue: cmd/trace: add minimum mutator utilization (MMU) plot

@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017
@cespare
Contributor

cespare commented Feb 24, 2018

I have what I think is an acute case of this problem. (If that's not the case -- apologies; please ignore.)

Over on #23044 I described my current workaround for the "small heap problem" -- where a program generates a lot of garbage but keeps a very small amount of live data, leading to extremely frequent GCs.

As I mentioned in #23044 I'm currently using heap ballast as a horrible workaround to force the garbage collector to set a much higher size target.
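For readers unfamiliar with the trick, here is a minimal sketch of what "heap ballast" means in this context (illustrative only; the 1 GiB size is made up):

package main

import "runtime"

func main() {
	// A large allocation that is kept reachable but never written to. The
	// GC sets its next heap goal proportionally to the live heap (scaled by
	// GOGC), so holding this extra live memory raises the goal and makes
	// collections much less frequent for a program with a tiny live heap.
	// Because the pages are never touched, they typically cost little RSS.
	ballast := make([]byte, 1<<30) // 1 GiB

	// ... real work goes here ...

	runtime.KeepAlive(ballast) // keep the ballast reachable for the program's lifetime
}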

The latest system where I added ballast is described in #23044 (comment):

I ran into this again today with another service that was doing 15 collections/sec. This one is a fairly simple webserver that receives 10-15k requests per second and turns them into protobuf messages that are sent along the network to other systems for processing.

So my program does a bunch of small allocations that become garbage right away (its work) and a few giant allocations every once in a while which tend to stay live for a long time (ballast). It seems like sometimes, after more ballast is added, my program runs into the situation that @rhysh described at the top, and all the mutators can spend all their time in sweep assists.

See this unpleasant trace:

[trace screenshot]

In this case no goroutine did productive work for about 600ms.

This particular case isn't a big deal for us -- I can mitigate this in a few different ways -- but I thought it might be interesting as an extreme example.

@aclements
Member

aclements commented Feb 24, 2018 via email

@cespare
Contributor

cespare commented Feb 28, 2018

Quick question: could you sample a few of those sweeps and report the metadata associated with them? I'm particularly interested in their duration, how many pages were swept, and how many bytes were recovered.

Here are 6 sequential sweeps for one of the procs:

Start (ns)      Wall duration (ns)  Swept bytes  Reclaimed bytes
4,116,135,261   13,317,631          5201920      5104656
4,129,453,818   5,827               8192         8064
4,129,479,153   13,572,150          6684672      6553744
4,143,126,160   13,116,497          6479872      6353504
4,156,243,473   6,180               8192         8096
4,156,321,795   14,072,471          7069696      6929616

Also, is there anything between the obvious, long sweeps that are visible in the screenshot? (E.g., if you select one and hit the right arrow key, is the next event just the next big sweep or something else?)

Yeah -- as shown above, between the big sweeps there are often interspersed some small sweeps. (But not always -- sometimes there are two big sweeps next to each other.)

I also mailed you the trace (this is private code).

@gopherbot gopherbot modified the milestones: Go1.11, Unplanned May 23, 2018
@danieltahara

danieltahara commented Aug 27, 2018

We're writing a MySQL wire-protocol compatible proxy at Dropbox, and it looks like we're hitting this same issue in production on go 1.10.3.

Some "oddities" about the service -- being effectively a MySQL server, there is no connection multiplexing, and there is a ton of per-connection buffers and state (see vitessio/vitess#4138). We're working on reducing them, but something like 90% of the in-use heap memory at any given time is tied up in unreclaimable, effectively constant memory, and 50+% of the in-use objects come from the same sources.

Our traces look like this:

[trace screenshot]

Zooming in, the concurrent SWEEP happens on the first call to MakeString(...) (copying cgo memory into a go string). The gap between GC finishing and the SWEEPs is due to the fact that most of the goroutines (one per connection) don't have any data waiting on their sockets, so they don't require any allocations:

[trace screenshot]

GC happens every few seconds (GOGC=75, since we don't have much memory to spare), and concurrent mark and scan is itself expensive (and borne mostly by mutators), but it still pales in comparison to the concurrent sweep stalls, which seem to tie up every core for about 100-200ms.

gc 62645 @227962.290s 4%: 0.28+223+0.23 ms clock, 5.6+2319/1090/7.2+4.6 ms cpu, 6995->7052->4089 MB, 7146 MB goal, 20 P
gc 62646 @227965.313s 4%: 0.30+243+0.26 ms clock, 6.1+2406/1188/0.97+5.2 ms cpu, 7004->7061->4083 MB, 7156 MB goal, 20 P
gc 62647 @227968.497s 4%: 0.14+239+0.25 ms clock, 2.8+2290/1170/0.024+5.1 ms cpu, 6993->7051->4083 MB, 7145 MB goal, 20 P
gc 62648 @227971.571s 4%: 0.26+251+0.26 ms clock, 5.3+2557/1233/0.29+5.2 ms cpu, 6994->7050->4085 MB, 7146 MB goal, 20 P
gc 62649 @227974.337s 4%: 0.16+242+0.24 ms clock, 3.2+2376/1175/1.8+4.8 ms cpu, 6997->7056->4086 MB, 7148 MB goal, 20 P
gc 62650 @227977.309s 4%: 0.35+291+0.31 ms clock, 7.1+2466/1422/1.3+6.3 ms cpu, 6999->7055->4084 MB, 7151 MB goal, 20 P

@RLH
Contributor

RLH commented Aug 29, 2018 via email

@danieltahara

The objects are in the range of a few hundred bytes to single digit KB. The best specific data I can get is for row size (as opposed to column size). That distribution (in bytes) is here:

p99            p90       p75       p50
84511.8414031  6485.345  1126.004  379.099280115

@RLH
Contributor

RLH commented Aug 29, 2018 via email

@danieltahara

danieltahara commented Aug 29, 2018

Queries in last hour, in bytes:

p99            p95            p90            p75            p50            p25            p10
72700.6023324  10662.0823733  6609.87482699  1130.83317386  379.689311459  101.527534113  46.6143729713

To reiterate -- these are stats for row size, not column/value size.

@RLH
Contributor

RLH commented Aug 29, 2018 via email

@danieltahara

danieltahara commented Aug 29, 2018 via email

@LK4D4
Contributor

LK4D4 commented Aug 29, 2018

@RLH Actually, we took another look and it looks like MakeString doesn't allocate that much:
https://github.com/dropbox/godropbox/blob/master/database/sqltypes/sqltypes.go#L69
It's just a structure with an interface field, where the interface implementation is:

type String struct {
	data   []byte
	isUtf8 bool
}

@gopherbot

Change https://golang.org/cl/138957 mentions this issue: runtime: record in-use spans in a page-indexed bitmap

gopherbot pushed a commit that referenced this issue Nov 5, 2018
This adds an endpoint to the trace tool that plots the minimum mutator
utilization curve using information on mark assists and GC pauses from
the trace.

This commit implements a fairly straightforward O(nm) algorithm for
computing the MMU (and tests against an even more direct but slower
algorithm). Future commits will extend and optimize this algorithm.

This should be useful for debugging and understanding mutator
utilization issues like #14951, #14812, #18155. #18534, #21107,
particularly once follow-up CLs add trace cross-referencing.

Change-Id: Ic2866869e7da1e6c56ba3e809abbcb2eb9c4923a
Reviewed-on: https://go-review.googlesource.com/c/60790
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Hyang-Ah Hana Kim <hyangah@gmail.com>
gopherbot pushed a commit that referenced this issue Nov 15, 2018
Currently, there's no efficient way to iterate over the Go heap. We're
going to need this for fast free page sweeping, so this CL adds a
slice of all allocated heap arenas. This will also be useful for
generational GC.

For #18155.

Change-Id: I58d126cfb9c3f61b3125d80b74ccb1b2169efbcc
Reviewed-on: https://go-review.googlesource.com/c/138076
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
gopherbot pushed a commit that referenced this issue Nov 15, 2018
This adds a bitmap indexed by page number that marks the starts of
in-use spans. This will be used to quickly find in-use spans with no
marked objects for sweeping.

For #18155.

Change-Id: Icee56f029cde502447193e136fa54a74c74326dd
Reviewed-on: https://go-review.googlesource.com/c/138957
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
gopherbot pushed a commit that referenced this issue Nov 15, 2018
This adds a mark bit for each span that is set if any objects on the
span are marked. This will be used for sweeping.

For #18155.

The impact of this is negligible for most benchmarks, and < 1% for
GC-heavy benchmarks.

name                       old time/op  new time/op  delta
Garbage/benchmem-MB=64-12  2.18ms ± 0%  2.20ms ± 1%  +0.88%  (p=0.000 n=16+18)

(https://perf.golang.org/search?q=upload:20180928.1)

name                      old time/op    new time/op    delta
BinaryTree17-12              2.68s ± 1%     2.68s ± 1%    ~     (p=0.707 n=17+19)
Fannkuch11-12                2.28s ± 0%     2.39s ± 0%  +4.95%  (p=0.000 n=19+18)
FmtFprintfEmpty-12          40.3ns ± 4%    39.4ns ± 2%  -2.27%  (p=0.000 n=17+18)
FmtFprintfString-12         67.9ns ± 1%    68.3ns ± 1%  +0.55%  (p=0.000 n=18+19)
FmtFprintfInt-12            75.7ns ± 1%    76.1ns ± 1%  +0.44%  (p=0.005 n=18+19)
FmtFprintfIntInt-12          123ns ± 1%     121ns ± 1%  -1.00%  (p=0.000 n=18+18)
FmtFprintfPrefixedInt-12     150ns ± 0%     148ns ± 0%  -1.33%  (p=0.000 n=16+13)
FmtFprintfFloat-12           208ns ± 0%     204ns ± 0%  -1.92%  (p=0.000 n=13+17)
FmtManyArgs-12               501ns ± 1%     498ns ± 0%  -0.55%  (p=0.000 n=19+17)
GobDecode-12                6.24ms ± 0%    6.25ms ± 1%    ~     (p=0.113 n=20+19)
GobEncode-12                5.33ms ± 0%    5.29ms ± 1%  -0.72%  (p=0.000 n=20+18)
Gzip-12                      220ms ± 1%     218ms ± 1%  -1.02%  (p=0.000 n=19+19)
Gunzip-12                   35.5ms ± 0%    35.7ms ± 0%  +0.45%  (p=0.000 n=16+18)
HTTPClientServer-12         77.9µs ± 1%    77.7µs ± 1%  -0.30%  (p=0.047 n=20+19)
JSONEncode-12               8.82ms ± 0%    8.93ms ± 0%  +1.20%  (p=0.000 n=18+17)
JSONDecode-12               47.3ms ± 0%    47.0ms ± 0%  -0.49%  (p=0.000 n=17+18)
Mandelbrot200-12            3.69ms ± 0%    3.68ms ± 0%  -0.25%  (p=0.000 n=19+18)
GoParse-12                  3.13ms ± 1%    3.13ms ± 1%    ~     (p=0.640 n=20+20)
RegexpMatchEasy0_32-12      76.2ns ± 1%    76.2ns ± 1%    ~     (p=0.818 n=20+19)
RegexpMatchEasy0_1K-12       226ns ± 0%     226ns ± 0%  -0.22%  (p=0.001 n=17+18)
RegexpMatchEasy1_32-12      71.9ns ± 1%    72.0ns ± 1%    ~     (p=0.653 n=18+18)
RegexpMatchEasy1_1K-12       355ns ± 1%     356ns ± 1%    ~     (p=0.160 n=18+19)
RegexpMatchMedium_32-12      106ns ± 1%     106ns ± 1%    ~     (p=0.325 n=17+20)
RegexpMatchMedium_1K-12     31.1µs ± 2%    31.2µs ± 0%  +0.59%  (p=0.007 n=19+15)
RegexpMatchHard_32-12       1.54µs ± 2%    1.53µs ± 2%  -0.78%  (p=0.021 n=17+18)
RegexpMatchHard_1K-12       46.0µs ± 1%    45.9µs ± 1%  -0.31%  (p=0.025 n=17+19)
Revcomp-12                   391ms ± 1%     394ms ± 2%  +0.80%  (p=0.000 n=17+19)
Template-12                 59.9ms ± 1%    59.9ms ± 1%    ~     (p=0.428 n=20+19)
TimeParse-12                 304ns ± 1%     312ns ± 0%  +2.88%  (p=0.000 n=20+17)
TimeFormat-12                318ns ± 0%     326ns ± 0%  +2.64%  (p=0.000 n=20+17)

(https://perf.golang.org/search?q=upload:20180928.2)

Change-Id: I336b9bf054113580a24103192904c8c76593e90e
Reviewed-on: https://go-review.googlesource.com/c/138958
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
gopherbot pushed a commit that referenced this issue Nov 15, 2018
When we attempt to allocate an N page span (either for a large
allocation or when an mcentral runs dry), we first try to sweep spans
to release N pages. Currently, this can be extremely expensive:
sweeping a span to emptiness is the hardest thing to ask for and the
sweeper generally doesn't know where to even look for potentially
fruitful results. Since this is on the critical path of many
allocations, this is unfortunate.

This CL changes how we reclaim empty spans. Instead of trying lots of
spans and hoping for the best, it uses the newly introduced span marks
to efficiently find empty spans. The span marks (and in-use bits) are
in a dense bitmap, so these spans can be found with an efficient
sequential memory scan. This approach can scan for unmarked spans at
about 300 GB/ms and can free unmarked spans at about 32 MB/ms. We
could probably significantly improve the rate at which is can free
unmarked spans, but that's a separate issue.

Like the current reclaimer, this is still linear in the number of
spans that are swept, but the constant factor is now so vanishingly
small that it doesn't matter.

The benchmark in #18155 demonstrates both significant page reclaiming
delays, and object reclaiming delays. With "-retain-count=20000000
-preallocate=true -loop-count=3", the benchmark demonstrates several
page reclaiming delays on the order of 40ms. After this change, the
page reclaims are insignificant. The longest sweeps are still ~150ms,
but are object reclaiming delays. We'll address those in the next
several CLs.

Updates #18155.

Fixes #21378 by completely replacing the logic that had that bug.

Change-Id: Iad80eec11d7fc262d02c8f0761ac6998425c4064
Reviewed-on: https://go-review.googlesource.com/c/138959
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
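A standalone sketch of the scan this commit describes (invented names and a simplified layout, not the runtime's actual data structures): walk two parallel, page-indexed bitmaps and report spans that are in use but have no marked objects, i.e. spans the reclaimer can free wholesale.

package main

import (
	"fmt"
	"math/bits"
)

// emptySpanPages scans two parallel bitmaps with one bit per page: inUse is
// set where an in-use span starts, and marked is set if any object on that
// span was marked during the last GC. Pages that are in use but unmarked
// correspond to completely empty spans.
func emptySpanPages(inUse, marked []uint64) []int {
	var pages []int
	for i := range inUse {
		word := inUse[i] &^ marked[i] // in use AND not marked
		for word != 0 {
			bit := bits.TrailingZeros64(word)
			pages = append(pages, i*64+bit)
			word &^= 1 << bit
		}
	}
	return pages
}

func main() {
	inUse := []uint64{0b1111}  // four in-use spans on pages 0..3
	marked := []uint64{0b0101} // pages 0 and 2 still have live objects
	fmt.Println(emptySpanPages(inUse, marked)) // [1 3]
}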
bradfitz pushed 4 commits that referenced this issue Nov 21, 2018
(Their commit messages are identical to the four gopherbot-pushed commits above: CLs 138076, 138957, 138958, and 138959.)
@aclements
Member

Go 1.12 fixed a major source of this latency. While there are still allocation patterns that can in theory lead to long sweep latency, we haven't seen these in our internal tests. Does anybody on this issue still see long sweep latency with Go 1.12? If not, I'll go ahead and call this resolved.

@un000

un000 commented Apr 2, 2019

@aclements Some pauses still happen when one goroutine allocates a huge block of memory. Is this related to the latest changes?

We have a cache inside an application. The cache consists of structs:

type Rates struct {
	a float64
	b float64
	c float64
	d float64
	e float64
}

type Cache struct {
	index []int
	// like map[positionFromindex][geo(static 255 items)]*Rates
	// size: 3M rows with 255 items
	rates [][]*Rates
}

It's OK, but it has a lot of pointers (3M). To reduce them we made the slice of slices single-dimensional, because geo is statically defined with 255 items:

type Rates struct {
	Initialized bool
	a float64
	b float64
	c float64
	d float64
	e float64
}
type Cache struct {
	index []int
	rates []Rates
}
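Presumably the flattened slice is then indexed with something like the following helper (hypothetical, not shown in the original comment; it builds on the Cache and Rates types just above), which is what removes the millions of per-row slice headers and *Rates pointers the GC previously had to scan:

const geoCount = 255 // the statically sized geo dimension mentioned above

// rate maps a (position, geo) pair onto the single flattened rates slice.
func (c *Cache) rate(pos, geo int) *Rates {
	return &c.rates[pos*geoCount+geo]
}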

And when the application allocates the whole slice (3M*255) in the background, the latency inside all response handlers increases:

	c.rates = make([]Rates, size)

Nothing changed between 1.11 and 1.12

@un000

un000 commented Apr 2, 2019

Got an example:

package main

import (
	"fmt"
	"math/rand"
	"os"
	"runtime/trace"
	"time"
)

type Rates struct {
	a, b, c, d float64
}

var maxWait time.Duration

func main() {
	const size = 100e6

	out, err := os.OpenFile("prof", os.O_WRONLY|os.O_CREATE|os.O_TRUNC, os.ModePerm)
	if err != nil {
		panic(err)
	}
	defer func() {
		_ = out.Sync()
		_ = out.Close()
	}()
	err = trace.Start(out)
	if err != nil {
		panic(err)
	}

	go loop()

	time.Sleep(3 * time.Second)
	go func() {
		_ = make([]Rates, size)		// << huge alloc
	}()
	time.Sleep(3 * time.Second)

	trace.Stop()

	fmt.Println("maxWait =", maxWait.String())
}

func loop() {
	for {
		now := time.Now()
		r := make([]Rates, 0)
		for i := 0; i < 100; i++ {
			r = append(r, Rates{
				rand.Float64(), rand.Float64(), rand.Float64(), rand.Float64(),
			})
		}
		d := time.Since(now)
		if maxWait < d {
			maxWait = d
		}
	}
}
~ ᐅ go version
go version go1.12.1 darwin/amd64

[trace screenshot]

So in production we have even bigger pauses.

@danieltahara

Our upgrade cycle takes a few months. So feel free to close if no one else surfaces problems, and we can revisit this thread if our 1.12 upgrade reveals anything further.

@aclements
Member

@un000, thanks for the reproducer. That's actually a different issue, since it's not sweep assists causing the delay but rather sweep termination. Could you open a new issue? I started digging into it and found that it's taking a long time to actually stop the world (rather than anything that's happening during sweep termination), but it wasn't obvious what was delaying it by so much.

@un000

un000 commented Apr 3, 2019

@aclements got it #31222

@WillSewell

When I run the benchmark originally posted by @rhysh, I consistently get pause times of ~15ms with Go 1.12.

@gopherbot

Change https://golang.org/cl/187817 mentions this issue: runtime: bound small object sweeping to 100 spans

@gopherbot

Change https://golang.org/cl/229998 mentions this issue: runtime: bound small object sweeping to 100 spans when allocating

gopherbot pushed a commit that referenced this issue Apr 27, 2020
Currently, the small object sweeper will sweep until it finds a free
slot or there are no more spans of that size class to sweep. In dense
heaps, this can cause sweeping for a given size class to take
unbounded time, and gets worse with larger heaps.

This CL limits the small object sweeper to try at most 100 spans
before giving up and allocating a fresh span. Since it's already shown
that 100 spans are completely full at that point, the space overhead
of this fresh span is at most 1%.

This CL is based on an experimental CL by Austin Clements (CL 187817)
and is updated to be part of the mcentral implementation, gated by
go115NewMCentralImpl.

Updates #18155.

Change-Id: I37a72c2dcc61dd6f802d1d0eac3683e6642b6ef8
Reviewed-on: https://go-review.googlesource.com/c/go/+/229998
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Austin Clements <austin@google.com>
xujianhai666 pushed a commit to xujianhai666/go-1 that referenced this issue May 21, 2020
Currently, the small object sweeper will sweep until it finds a free
slot or there are no more spans of that size class to sweep. In dense
heaps, this can cause sweeping for a given size class to take
unbounded time, and gets worse with larger heaps.

This CL limits the small object sweeper to try at most 100 spans
before giving up and allocating a fresh span. Since it's already shown
that 100 spans are completely full at that point, the space overhead
of this fresh span is at most 1%.

This CL is based on an experimental CL by Austin Clements (CL 187817)
and is updated to be part of the mcentral implementation, gated by
go115NewMCentralImpl.

Updates golang#18155.

Change-Id: I37a72c2dcc61dd6f802d1d0eac3683e6642b6ef8
Reviewed-on: https://go-review.googlesource.com/c/go/+/229998
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Austin Clements <austin@google.com>
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 7, 2022