
runtime: CPU bound goroutines cause unnecessary timer latency #38860

Closed
ChrisHines opened this issue May 4, 2020 · 29 comments
Labels
FrozenDueToAge, NeedsInvestigation (someone must examine and confirm this is a valid issue and not a duplicate of an existing one)
Milestone

Comments

@ChrisHines
Contributor

ChrisHines commented May 4, 2020

What version of Go are you using (go version)?

$ go version

go version go1.14.2 darwin/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using?

  • macOS 10.14.6 (Darwin Kernel Version 18.7.0 x86_64) with 2.8 GHz Intel Core i7 (4 cores w/ Hyper-Threading for 8 CPUs)
  • CentOS Linux release 7.4.1708 (Core) (Linux 3.10.0-693.21.1.el7.x86_64) with Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz (28 cores w/ Hyper-Threading for 56 CPUs)

What did you do?

Run the following test built with Go 1.13.9 and Go 1.14.2:

package timertest

import (
    "context"
    "fmt"
    "runtime"
    "runtime/trace"
    "strconv"
    "sync"
    "testing"
    "time"
)

func TestTimerScheduling(t *testing.T) {
    const (
        highFreq  = time.Millisecond
        highDur   = highFreq / 10
        highGrace = highFreq * 4

        lowFreq  = time.Second
        lowDur   = lowFreq / 50
        lowGrace = lowFreq * 2
    )

    gmp := runtime.GOMAXPROCS(0)

    testCases := []struct {
        name       string
        doHighFreq bool
        doLowFreq  bool
    }{
        {"lowFreq", false, true},
        {"highFreq", true, false},
        {"mixedFreq", true, true},
    }

    for _, tc := range testCases {
        t.Run(tc.name, func(t *testing.T) {
            ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
            defer cancel()

            var wg sync.WaitGroup
            for i := 0; i < gmp; i++ {
                if tc.doHighFreq {
                    wg.Add(1)
                    go func(i int) {
                        defer wg.Done()
                        timedWork(ctx, t, fmt.Sprintf("highFreq-%d", i), highFreq, highDur, highGrace)
                    }(i)
                }

                if tc.doLowFreq && i < gmp/4 {
                    wg.Add(1)
                    go func(i int) {
                        defer wg.Done()
                        timedWork(ctx, t, fmt.Sprintf("lowFreq-%d", 0), lowFreq, lowDur, lowGrace)
                    }(i)
                }
            }
            wg.Wait()
        })
    }
}

// timedWork wakes on a timer every interval, spins the CPU for dur after each
// wakeup (catching up if it has fallen behind), and reports wakeups that miss
// the grace deadline.
func timedWork(ctx context.Context, t *testing.T, name string, interval, dur, grace time.Duration) {
    timer := time.NewTimer(interval)
    defer timer.Stop()

    now := time.Now()
    expectedWakeup := now.Add(interval)
    deadline := now.Add(grace)

    overdueCount := 0
    var latest time.Duration

    done := ctx.Done()
    for {
        runCtx, task := trace.NewTask(ctx, name)

        trace.Log(runCtx, "expectedWakeup", expectedWakeup.String())
        trace.Log(runCtx, "deadline", deadline.String())

        sleepRgn := trace.StartRegion(runCtx, "sleep")

        select {
        case <-timer.C:
            sleepRgn.End()

            if overdue := time.Since(deadline); overdue > 0 {
                trace.Log(runCtx, "sleepOverdueMicros", strconv.FormatInt(int64(overdue/time.Microsecond), 10))
                overdueCount++
                if overdue > latest {
                    latest = overdue
                }
            }
            sleepDur := time.Until(expectedWakeup)
            for sleepDur <= 0 {
                workRgn := trace.StartRegion(runCtx, "doWork")
                doWork(dur)
                workRgn.End()

                expectedWakeup = expectedWakeup.Add(interval)
                deadline = deadline.Add(interval)
                sleepDur = time.Until(expectedWakeup)
            }
            timer.Reset(sleepDur)

            task.End()

        case <-done:
            sleepRgn.End()

            if overdueCount > 0 {
                t.Errorf("%s: wakeup overdue %v times, latest: %v", name, overdueCount, latest)
            }
            task.End()
            return
        }
    }
}

// doWork busy-loops for the requested duration to simulate CPU-bound work.
func doWork(dur time.Duration) {
    start := time.Now()
    for time.Since(start) < dur {
    }
}

What did you expect to see?

All tests pass on both versions of Go.

What did you see instead?

On the 56 CPU Linux server with nothing else of significance running:

$ GOMAXPROCS=8 ./go1.13-late-timer-test.test -test.v
=== RUN   TestTimerScheduling
=== RUN   TestTimerScheduling/lowFreq
=== RUN   TestTimerScheduling/highFreq
=== RUN   TestTimerScheduling/mixedFreq
--- PASS: TestTimerScheduling (15.02s)
    --- PASS: TestTimerScheduling/lowFreq (5.00s)
    --- PASS: TestTimerScheduling/highFreq (5.00s)
    --- PASS: TestTimerScheduling/mixedFreq (5.02s)
PASS
$ GOMAXPROCS=8 ./go1.14-late-timer-test.test -test.v
=== RUN   TestTimerScheduling
=== RUN   TestTimerScheduling/lowFreq
=== RUN   TestTimerScheduling/highFreq
=== RUN   TestTimerScheduling/mixedFreq
    TestTimerScheduling/mixedFreq: late-timer-test_test.go:114: highFreq-5: wakeup overdue 1 times, latest: 5.560066ms
    TestTimerScheduling/mixedFreq: late-timer-test_test.go:114: highFreq-7: wakeup overdue 1 times, latest: 10.296257ms
    TestTimerScheduling/mixedFreq: late-timer-test_test.go:114: highFreq-0: wakeup overdue 1 times, latest: 7.205679ms
    TestTimerScheduling/mixedFreq: late-timer-test_test.go:114: highFreq-6: wakeup overdue 1 times, latest: 10.195891ms
    TestTimerScheduling/mixedFreq: late-timer-test_test.go:114: highFreq-1: wakeup overdue 2 times, latest: 5.554398ms
--- FAIL: TestTimerScheduling (15.02s)
    --- PASS: TestTimerScheduling/lowFreq (5.02s)
    --- PASS: TestTimerScheduling/highFreq (5.00s)
    --- FAIL: TestTimerScheduling/mixedFreq (5.00s)
FAIL

I get similar results on macOS although we don't deploy the application on that platform.

$ go1.13.9 test -v
=== RUN   TestTimerScheduling
=== RUN   TestTimerScheduling/lowFreq
=== RUN   TestTimerScheduling/highFreq
=== RUN   TestTimerScheduling/mixedFreq
--- PASS: TestTimerScheduling (15.05s)
    --- PASS: TestTimerScheduling/lowFreq (5.03s)
    --- PASS: TestTimerScheduling/highFreq (5.00s)
    --- PASS: TestTimerScheduling/mixedFreq (5.02s)
PASS
ok      late-timer-test        15.052s
$ go test -v
=== RUN   TestTimerScheduling
=== RUN   TestTimerScheduling/lowFreq
=== RUN   TestTimerScheduling/highFreq
=== RUN   TestTimerScheduling/mixedFreq
    TestTimerScheduling/mixedFreq: late-timer-test_test.go:114: highFreq-5: wakeup overdue 1 times, latest: 3.310788ms
    TestTimerScheduling/mixedFreq: late-timer-test_test.go:114: highFreq-7: wakeup overdue 2 times, latest: 9.31654ms
    TestTimerScheduling/mixedFreq: late-timer-test_test.go:114: highFreq-4: wakeup overdue 3 times, latest: 9.350977ms
    TestTimerScheduling/mixedFreq: late-timer-test_test.go:114: highFreq-6: wakeup overdue 3 times, latest: 10.29899ms
    TestTimerScheduling/mixedFreq: late-timer-test_test.go:114: highFreq-2: wakeup overdue 1 times, latest: 10.362095ms
    TestTimerScheduling/mixedFreq: late-timer-test_test.go:114: highFreq-1: wakeup overdue 2 times, latest: 9.594361ms
    TestTimerScheduling/mixedFreq: late-timer-test_test.go:114: highFreq-0: wakeup overdue 2 times, latest: 9.607532ms
--- FAIL: TestTimerScheduling (15.04s)
    --- PASS: TestTimerScheduling/lowFreq (5.02s)
    --- PASS: TestTimerScheduling/highFreq (5.00s)
    --- FAIL: TestTimerScheduling/mixedFreq (5.02s)
FAIL

Additional Background

I initially discovered this problem while testing a Go 1.14 build of our application to see if we could upgrade to the newest version of Go on the project. This is the same application described in #28808 and in my Gophercon 2019 talk: Death by 3,000 Timers. Unfortunately I discovered that this regression in Go 1.14 forces us to stay with Go 1.13 for that application.

Note that our application can tolerate late wake ups as long as they are not too late. The grace period of four intervals on the high frequency timer implements that idea. I derived the constants in the test from our real world requirements, rounded off to keep the numbers simple.

The mixedFreq test case above best models our application's work load. It spends most of its time sending UDP packets on the network at a target bitrate (the high frequency work) but must also perform some heavier tasks to prepare each new stream of UDP packets (the low frequency work).

I added the test cases with homogeneous workloads to verify that the problem isn't a fundamental performance limitation, but rather an emergent behavior of the scheduler.

Analysis

As you can see, the test code above uses runtime/trace annotations. We have similar annotations in our production code. Analyzing execution traces from either source leads to the same conclusion. The runtime scheduler in Go 1.14 introduces unnecessary timer latency for this kind of application by not stealing timers from non-preemptable P's as described in the following code comment.

go/src/runtime/proc.go

Lines 2164 to 2171 in 96745b9

// Consider stealing timers from p2.
// This call to checkTimers is the only place where
// we hold a lock on a different P's timers.
// Lock contention can be a problem here, so avoid
// grabbing the lock if p2 is running and not marked
// for preemption. If p2 is running and not being
// preempted we assume it will handle its own timers.
if i > 2 && shouldStealTimers(p2) {

Evidence

Here is an execution trace from the Linux host:

$ GOMAXPROCS=8 ./go1.14-late-timer-test.test -test.v -test.trace go1.14.trace -test.run=/mixedFreq
=== RUN   TestTimerScheduling
=== RUN   TestTimerScheduling/mixedFreq
    TestTimerScheduling/mixedFreq: late-timer-test_test.go:114: highFreq-5: wakeup overdue 2 times, latest: 8.667704ms
    TestTimerScheduling/mixedFreq: late-timer-test_test.go:114: highFreq-7: wakeup overdue 1 times, latest: 2.17004ms
    TestTimerScheduling/mixedFreq: late-timer-test_test.go:114: highFreq-6: wakeup overdue 1 times, latest: 1.411159ms
    TestTimerScheduling/mixedFreq: late-timer-test_test.go:114: highFreq-2: wakeup overdue 2 times, latest: 2.403263ms
    TestTimerScheduling/mixedFreq: late-timer-test_test.go:114: highFreq-1: wakeup overdue 1 times, latest: 6.429298ms
    TestTimerScheduling/mixedFreq: late-timer-test_test.go:114: highFreq-0: wakeup overdue 2 times, latest: 11.728868ms
    TestTimerScheduling/mixedFreq: late-timer-test_test.go:114: highFreq-4: wakeup overdue 1 times, latest: 11.563101ms
--- FAIL: TestTimerScheduling (5.00s)
    --- FAIL: TestTimerScheduling/mixedFreq (5.00s)
FAIL

go1.14.trace.zip

If you open the tracer UI for the trace and look at the "User-defined tasks" page, you see that most of the timings for the highFreq work fall in the two buckets straddling the 1ms interval it runs at, but there are a small number of outliers on the high side, as seen below.

[Screenshot: user-defined task duration histogram for the highFreq work, showing a few high-side outliers]

If we drill into the outlier in the 15ms bucket, we see the trace log below, which shows the task did not wake up until more than 11ms after its deadline, more than 14ms past when the timer was set to fire. After it finally woke up, it had to do a lot of work to make up for the missed events. In our real application we cannot afford to miss more than four events like this. Go 1.13 achieves that goal far more consistently, as we will see later.

When	Elapsed	Goroutine ID	Events
4.006214439s	16.86265ms		Task 31866 (complete)
4.006214439	 .         	9	task highFreq-0 (id 31866, parent 0) created
4.006225415	 .    10976	9	expectedWakeup=2020-05-04 18:46:13.026124086 +0000 UTC m=+4.009840522
4.006231304	 .     5889	9	deadline=2020-05-04 18:46:13.029124086 +0000 UTC m=+4.012840522
4.006234152	 .     2848	9	region sleep started (duration: 15.045051ms)
4.021282883	 . 15048731	9	sleepOverdueMicros=11728
4.021285795	 .     2912	9	region doWork started (duration: 103.362µs)
4.021391909	 .   106114	9	region doWork started (duration: 102.722µs)
4.021497255	 .   105346	9	region doWork started (duration: 102.753µs)
4.021602600	 .   105345	9	region doWork started (duration: 102.658µs)
4.021707818	 .   105218	9	region doWork started (duration: 102.69µs)
4.021813100	 .   105282	9	region doWork started (duration: 102.722µs)
4.021918446	 .   105346	9	region doWork started (duration: 102.593µs)
4.022023663	 .   105217	9	region doWork started (duration: 102.69µs)
4.022128913	 .   105250	9	region doWork started (duration: 102.626µs)
4.022234291	 .   105378	9	region doWork started (duration: 102.722µs)
4.022339733	 .   105442	9	region doWork started (duration: 103.041µs)
4.022445398	 .   105665	9	region doWork started (duration: 102.818µs)
4.022550840	 .   105442	9	region doWork started (duration: 102.658µs)
4.022656090	 .   105250	9	region doWork started (duration: 102.786µs)
4.022761500	 .   105410	9	region doWork started (duration: 102.657µs)
4.022866717	 .   105217	9	region doWork started (duration: 102.498µs)
4.022971551	 .   104834	9	region doWork started (duration: 102.658µs)
4.023077089	 .   105538	9	task end
GC:0s

From the above trace, note the goroutine ID (9) and that the sleep started 4.006234152 seconds into the trace. Opening the "View trace" page from the trace UI home page and navigating to that time, we can see what happened in the program to cause the timer latency, and that the scheduler missed many opportunities to do a better job.

[Screenshot: trace timeline around t=4.006s showing the late wakeup of G9]

In the above screenshot, G9 goes to sleep just before G10 is scheduled on the same P. G10 monopolizes the P for several milliseconds before yielding, and only then does the scheduler check the timers and schedule G9 again, albeit very late.

The arrow labeled "another late wakeup" pointing to G11 shows another example. In this case G12 runs long enough to get preempted by the sysmon thread and allow G11 time to run, also late. But it also shows that another P steals G12 almost immediately to complete its processing.

Throughout this time there are plenty of idle P's available to do work, but the scheduler rules prevent them from stealing the already-expired timers.

Comparison to Go 1.13

Here is a trace file using the Go 1.13 build and collected the same way.

go1.13.trace.zip

The task histograms do not have the same outliers. It is also noteworthy that all of the highFreq goroutines performed exactly 5,000 sleeps during the 5 second trace, an indication that the inner loop never had to perform any catch-up iterations.

[Screenshot: Go 1.13 task duration histogram with no high-side outliers]

The trace timeline shows goroutines sleeping just before a longer task but getting picked up by a different P on schedule.

[Screenshot: Go 1.13 trace timeline showing sleeping goroutines picked up on schedule by other P's]

Summary

The new timer implementation in Go 1.14 does a great job reducing the overhead of servicing the timer queues, but the decision not to steal timers from P's not marked for preemption introduces opportunities for timer latencies of several milliseconds even when there are idle P's available to do the work.

We would like the runtime to give goroutines waiting on an expired timer the same priority for work stealing as runnable goroutines already in the run queue. Prior versions of Go exhibited that property because the timer queue was serviced by a goroutine on par with all other goroutines.


cc @ianlancetaylor

Also, could this behavior be part of what's going on in #36298?

@ianlancetaylor
Contributor

Thanks for the very detailed analysis. I think this has been fixed on tip by https://golang.org/cl/216198. Would it be possible to see if you get better results when running tip rather than 1.14? Thanks.

@ianlancetaylor ianlancetaylor added the NeedsInvestigation label May 4, 2020
@ianlancetaylor ianlancetaylor added this to the Go1.15 milestone May 4, 2020
@ChrisHines
Contributor Author

The above test fails on tip as of 4c003f6, built just now. I am not sure I understand how https://go-review.googlesource.com/c/go/+/216198/ would help, since it fixed an issue with goroutines returned by the poller, but the test code above does not involve the poller as far as I can see.

@ianlancetaylor
Contributor

Well, it was a more general fix really, but thanks for testing it.

@gopherbot

Change https://golang.org/cl/232199 mentions this issue: runtime: steal timers from running P's

@ianlancetaylor
Contributor

The obvious fix would be something like https://golang.org/cl/232199. Your test fails for me on my system with both 1.13 and tip, but with this change the delays seem to be smaller with tip. Can you see how that change does on your system? Thanks.

@ChrisHines
Contributor Author

Yes, I have a fix similar to yours staged here; it is going through OSS contribution review and I hope I can share it soon.

The test passes consistently with your change on my linux server (10 out of 10 tries) and passes with roughly the same frequency as Go 1.13 does on my Mac (9 out of 10 for both just now). More definitively, the execution trace shows that the timers are no longer ignored while the longer task runs.

Here is what it looks like. I used the search bar to highlight all instances of one G; around the 2,000ms mark you can see it get stolen from the first P by the second P, and then on its next wakeup stolen by the third P, both times while its former P runs a longer task. After two more iterations it ends up on the last P for several ticks in a row.

[Screenshot: trace timeline with one goroutine highlighted, showing it stolen across successive P's while its former P runs a longer task]

Here is one of the timing histograms.

[Screenshot: task duration histogram with the patched runtime]

This change makes a clear improvement.

@gopherbot

Change https://golang.org/cl/232298 mentions this issue: runtime: reduce timer latency

@ChrisHines
Contributor Author

My attempt to fix this issue is in https://golang.org/cl/232298. @ianlancetaylor, it may have less overhead than your first draft because it only calls checkTimers once per P in each attempt at work stealing.

@ChrisHines
Contributor Author

Also, I copied your sentence referencing the CL that reduced timer lock contention into my commit message. I was wondering when that optimization was added, but didn't know how to find it.

@ianlancetaylor
Contributor

@aclements @mknyszek There are two slightly different CLs to address this issue, in https://golang.org/cl/232199 and https://golang.org/cl/232298. Any opinions on the right approach here? Thanks.

@aclements
Member

https://golang.org/cl/232199 (which I just +2d) seems much simpler and lower risk than https://golang.org/cl/232298. I'm not clear what the benefits of https://golang.org/cl/232298's approach are.

@ChrisHines
Contributor Author

FWIW, I've been working on this issue for the last couple of weeks and have found that it is not fully resolved by either of the two CLs linked above. I will soon have an update to my CL with a more comprehensive solution that includes benchmarks to demonstrate the benefit.

@ChrisHines
Contributor Author

Brief explanation of why neither CL fixes this issue: both still allow scenarios in which a timer can starve while there are idle P's until sysmon wakes up and checks timers, and for applications that have idle P's most of the time, sysmon only wakes up every 10ms. A 10ms timer latency is a significant regression from previous versions of Go and is insufficient for our application.

I believe the update to my CL I plan to post soon eliminates the scenarios in which a timer can starve while there are idle P's.

@ianlancetaylor
Contributor

Thanks for looking into this. Note that at this stage of the release process any change to the runtime has to be very minimal and very safe.

@gopherbot

Change https://golang.org/cl/235037 mentions this issue: runtime: reduce timer latency

@ChrisHines
Contributor Author

@ianlancetaylor I understand that this may not make it for Go 1.15. I was a bit surprised when it was added to that milestone given the issue was filed right at code freeze time.

That does not change my motivation to get this resolved. If it gets held for Go 1.16, so be it.

Also, I got my wires crossed updating my CL. The link in the preceding comment is for the wrong CL (my mistake), but I did get the proper CL updated eventually. So for anyone reading this, the proper CL is https://golang.org/cl/232298.

@networkimprov

@ChrisHines said the fix that was pushed doesn't resolve his issue, so I think this should be reopened?

Also, should the new fix be backported to 1.14?

@networkimprov

Chris, it would be helpful to have the analysis you posted to Gerrit on May 22nd copied into this thread. It's hard to find things in Gerrit. Thanks!

@ChrisHines
Contributor Author

ChrisHines commented Jun 3, 2020

Sure @networkimprov. Here is a copy/paste of that from https://golang.org/cl/232298:

Patch set 4 is a more comprehensive fix for issue #38860. I converted the test in patch set 3 to a benchmark and added an additional benchmark.

I spent a lot of time studying the history of proc.go and its predecessors proc1.go and proc.c to do my best to maintain the proper invariants. I found the workings of sysmon particularly mysterious in the beginning. I had a basic idea of what it was for, but didn't understand (and maybe I still don't) the choices behind when to reset the idle and delay variables. Reading through the history, it seems to me that not resetting them when waking for a timer is appropriate, but I would be happy to learn more about that if anyone has references I can look at. I added a comment block in sysmon to try to help future readers avoid having to dig through the history to understand the rules sysmon follows, but I won't profess that I got it all right; please double-check that carefully. The commit message on 156e8b3 in particular helped me understand that part of the scheduler, so I tried to capture those ideas as best I could.

I view the choice of whether to steal timers before or after looking at runnext as flexible. It seems intuitive to me to steal timers first and I tried to describe my rationale in the comment there, but would happily restore it to checking timers last if someone makes the case for it.

Adding the call to wakep in wakeNetPoller is necessary to avoid starving timers if there is no thread in the netpoller to wake. It should also have the same or less cost as the call to netpollBreak in the other half of the function.

Here are the results of running the timer-related benchmarks from the time package and the new benchmarks added here with Go 1.13.9, 1.14.2, and this patch, all on the same machine back to back. All benchmarks were run with -count=10.

 name \ time/op                                    go1.13.9time.bench  go1.14.2time.bench  gofixtime.bench
 AfterFunc-8                                              2.94ms ± 2%         0.13ms ± 2%      0.12ms ± 4%
 After-8                                                  3.04ms ± 2%         0.20ms ± 2%      0.19ms ± 4%
 Stop-8                                                    162µs ± 9%          161µs ± 9%       153µs ±11%
 SimultaneousAfterFunc-8                                   328µs ± 3%          264µs ± 2%       253µs ± 2%
 StartStop-8                                              64.2µs ±13%         68.4µs ± 9%      63.2µs ± 2%
 Reset-8                                                  15.4µs ±10%         13.5µs ± 2%       9.2µs ± 3%
 Sleep-8                                                   471µs ± 1%          309µs ± 5%       303µs ± 3%
 Ticker-8                                                 7.87ms ± 2%         0.06ms ± 9%      0.05ms ± 6%
 ParallelTimerLatency-8                                    198µs ±10%        17029µs ± 4%       254µs ± 2%
 StaggeredTickerLatency/workDur300µs/8-tickers-8          95.7µs ± 4%         53.9µs ±21%     255.4µs ± 7%
 StaggeredTickerLatency/workDur300µs/16-tickers-8         68.4µs ± 6%        204.9µs ±12%     160.2µs ±24%
 StaggeredTickerLatency/workDur300µs/24-tickers-8         63.8µs ± 6%       1167.6µs ± 0%     111.5µs ± 2%
 StaggeredTickerLatency/workDur300µs/32-tickers-8         58.0µs ±10%       2424.5µs ± 0%     100.6µs ± 3%
 StaggeredTickerLatency/workDur300µs/40-tickers-8         56.9µs ± 5%       3368.3µs ± 1%      85.4µs ± 8%
 StaggeredTickerLatency/workDur300µs/48-tickers-8         56.0µs ±13%       2815.5µs ± 2%      65.3µs ± 7%
 StaggeredTickerLatency/workDur300µs/56-tickers-8         55.1µs ± 7%       2594.0µs ± 4%      58.3µs ± 8%
 StaggeredTickerLatency/workDur300µs/64-tickers-8         52.5µs ± 6%       2723.5µs ± 3%      58.2µs ± 6%
 StaggeredTickerLatency/workDur300µs/72-tickers-8         57.4µs ± 7%       3232.7µs ± 2%      64.0µs ±10%
 StaggeredTickerLatency/workDur300µs/80-tickers-8         7.77ms ± 6%         8.23ms ±21%      4.25ms ±17%
 StaggeredTickerLatency/workDur2ms/8-tickers-8            95.8µs ±12%       1020.1µs ± 1%     196.7µs ± 1%

From these results we can see that Go 1.14 significantly improved the existing time package benchmarks compared to Go 1.13, but it also got much worse on the latency benchmarks I have provided. This patch does not recover all of the latency that was lost, but most of it.

My anecdotal observation here is that it appears the Linux netpoller implementation has more latency when waking after a timeout than the Go 1.13 timerproc implementation. Most of these benchmark numbers replicate on my Mac laptop, but the darwin netpoller seems to suffer less of that particular latency by comparison, and is also worse in other ways. So it may not be possible to close the gap with Go 1.13 purely in the scheduler code. Relying on the netpoller for timers changes the behavior as well, but these new numbers are at least in the same order of magnitude as Go 1.13.

Somewhat surprisingly, the existing benchmarks also improved over Go 1.14.2. I don't know if that's a result of this patch or of other improvements to the runtime or compiler since 1.14.2. Here is just the 1.14.2 -> patch comparison:

 name                                              old time/op  new time/op   delta
 AfterFunc-8                                        129µs ± 2%    124µs ± 4%    -3.90%  (p=0.004 n=10+10)
 After-8                                            204µs ± 2%    191µs ± 4%    -6.31%  (p=0.000 n=9+10)
 Stop-8                                             161µs ± 9%    153µs ±11%      ~     (p=0.052 n=10+10)
 SimultaneousAfterFunc-8                            264µs ± 2%    253µs ± 2%    -4.37%  (p=0.000 n=10+10)
 StartStop-8                                       68.4µs ± 9%   63.2µs ± 2%    -7.55%  (p=0.002 n=10+10)
 Reset-8                                           13.5µs ± 2%    9.2µs ± 3%   -31.74%  (p=0.000 n=10+10)
 Sleep-8                                            309µs ± 5%    303µs ± 3%    -2.13%  (p=0.043 n=10+10)
 Ticker-8                                          55.3µs ± 9%   54.1µs ± 6%      ~     (p=0.165 n=10+10)
 ParallelTimerLatency-8                            17.0ms ± 4%    0.3ms ± 2%   -98.51%  (p=0.000 n=10+8)
 StaggeredTickerLatency/workDur300µs/8-tickers-8   53.9µs ±21%  255.4µs ± 7%  +374.11%  (p=0.000 n=10+10)
 StaggeredTickerLatency/workDur300µs/16-tickers-8   205µs ±12%    160µs ±24%   -21.83%  (p=0.000 n=9+10)
 StaggeredTickerLatency/workDur300µs/24-tickers-8  1.17ms ± 0%   0.11ms ± 2%   -90.45%  (p=0.000 n=10+10)
 StaggeredTickerLatency/workDur300µs/32-tickers-8  2.42ms ± 0%   0.10ms ± 3%   -95.85%  (p=0.000 n=10+10)
 StaggeredTickerLatency/workDur300µs/40-tickers-8  3.37ms ± 1%   0.09ms ± 8%   -97.46%  (p=0.000 n=9+9)
 StaggeredTickerLatency/workDur300µs/48-tickers-8  2.82ms ± 2%   0.07ms ± 7%   -97.68%  (p=0.000 n=10+10)
 StaggeredTickerLatency/workDur300µs/56-tickers-8  2.59ms ± 4%   0.06ms ± 8%   -97.75%  (p=0.000 n=10+10)
 StaggeredTickerLatency/workDur300µs/64-tickers-8  2.72ms ± 3%   0.06ms ± 6%   -97.86%  (p=0.000 n=9+10)
 StaggeredTickerLatency/workDur300µs/72-tickers-8  3.23ms ± 2%   0.06ms ±10%   -98.02%  (p=0.000 n=10+10)
 StaggeredTickerLatency/workDur300µs/80-tickers-8  8.23ms ±21%   4.25ms ±17%   -48.40%  (p=0.000 n=10+10)
 StaggeredTickerLatency/workDur2ms/8-tickers-8     1.02ms ± 1%   0.20ms ± 1%   -80.72%  (p=0.000 n=10+10)
 
 name                                              old maxNS    new maxNS     delta
 ParallelTimerLatency-8                             18.2M ± 0%     0.5M ± 4%   -97.26%  (p=0.000 n=9+9)
 StaggeredTickerLatency/workDur300µs/8-tickers-8     135k ±16%    1139k ± 7%  +742.18%  (p=0.000 n=10+10)
 StaggeredTickerLatency/workDur300µs/16-tickers-8   2.89M ± 5%    1.26M ± 7%   -56.32%  (p=0.000 n=10+10)
 StaggeredTickerLatency/workDur300µs/24-tickers-8   6.06M ± 5%    0.88M ±45%   -85.42%  (p=0.000 n=10+10)
 StaggeredTickerLatency/workDur300µs/32-tickers-8   9.25M ± 4%    0.81M ±34%   -91.21%  (p=0.000 n=10+10)
 StaggeredTickerLatency/workDur300µs/40-tickers-8   11.8M ± 5%     0.8M ±33%   -93.39%  (p=0.000 n=10+10)
 StaggeredTickerLatency/workDur300µs/48-tickers-8   11.6M ± 5%     0.8M ±55%   -93.45%  (p=0.000 n=10+10)
 StaggeredTickerLatency/workDur300µs/56-tickers-8   12.0M ± 8%     0.8M ±41%   -93.33%  (p=0.000 n=10+10)
 StaggeredTickerLatency/workDur300µs/64-tickers-8   11.9M ± 7%     0.9M ±18%   -92.50%  (p=0.000 n=10+9)
 StaggeredTickerLatency/workDur300µs/72-tickers-8   11.9M ± 5%     0.9M ±32%   -92.43%  (p=0.000 n=10+10)
 StaggeredTickerLatency/workDur300µs/80-tickers-8   31.7M ±17%    24.1M ±23%   -24.04%  (p=0.000 n=10+10)
 StaggeredTickerLatency/workDur2ms/8-tickers-8      2.11M ± 1%    1.09M ± 3%   -48.32%  (p=0.000 n=10+9)

I don't know why the 300µs/8-tickers case contradicts the rest of the trend. That's something to investigate.

With regard to issue #25471, which Ian expressed concern about making worse: I get the results below using the methodology described in that issue.
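
The sleep binaries below were built from a trivial sleep-loop program following that methodology. Roughly, it has this shape; the exact source is not reproduced here, so treat the details (including the meaning of the 100 argument) as an illustrative assumption rather than the literal program:

package main

import (
    "os"
    "strconv"
    "time"
)

// main sleeps for 1ms the requested number of times so that strace -f -c
// can count the syscalls the runtime issues around each sleep.
func main() {
    n := 100 // default iteration count
    if len(os.Args) > 1 {
        if v, err := strconv.Atoi(os.Args[1]); err == nil {
            n = v
        }
    }
    for i := 0; i < n; i++ {
        time.Sleep(time.Millisecond)
    }
}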

Go 1.13.9

$ strace -f -c ./go13sleep 100
strace: Process 38459 attached
strace: Process 38460 attached
strace: Process 38461 attached
strace: Process 38462 attached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------

 87.79    0.049621          76       656       102 futex
  8.95    0.005058          39       131           nanosleep
  0.81    0.000459           4       114           rt_sigaction
  0.67    0.000381          11        34           sched_yield
  0.43    0.000241          17        14           rt_sigprocmask
  0.37    0.000208          21        10           sigaltstack
  0.31    0.000177          20         9           gettid
  0.20    0.000114          23         5           arch_prctl
  0.14    0.000078          20         4           clone
  0.12    0.000069           9         8           mmap
  0.12    0.000068          23         3           fcntl
  0.09    0.000049          49         1           readlinkat
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           sched_getaffinity
  0.00    0.000000           0         1         1 openat
------ ----------- ----------- --------- --------- ----------------
100.00    0.056523                   992       103 total

Go 1.14.2

$ strace -f -c ./go14sleep 100
strace: Process 39013 attached
strace: Process 39014 attached
strace: Process 39015 attached
strace: Process 39016 attached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------

 59.15    0.032858          65       504       101 futex
 21.75    0.012084          61       199           epoll_pwait
 15.92    0.008841          44       200           nanosleep
  1.04    0.000575           5       114           rt_sigaction
  0.41    0.000227          16        14           rt_sigprocmask
  0.34    0.000188          21         9           gettid
  0.32    0.000177           9        20           mmap
  0.30    0.000167          17        10           sigaltstack
  0.20    0.000110          22         5           arch_prctl
  0.11    0.000062          16         4           clone
  0.10    0.000056          56         1           getpid
  0.09    0.000049          49         1           readlinkat
  0.08    0.000043          14         3           fcntl
  0.03    0.000019          19         1         1 openat
  0.03    0.000019          19         1           epoll_create1
  0.03    0.000016          16         1           execve
  0.03    0.000015          15         1           epoll_ctl
  0.03    0.000014          14         1           tgkill
  0.02    0.000009           9         1           pipe2
  0.01    0.000008           8         1           rt_sigreturn
  0.01    0.000007           7         1           sched_getaffinity
  0.01    0.000005           5         1           uname
------ ----------- ----------- --------- --------- ----------------
100.00    0.055549                  1093       102 total

Patch set 4

$ strace -f -c ./gofixsleep 100
strace: Process 39028 attached
strace: Process 39029 attached
strace: Process 39030 attached
strace: Process 39031 attached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------

 40.69    0.018274          62       297        89 futex
 29.90    0.013429          97       139           nanosleep
 25.57    0.011482          57       200           epoll_pwait
  1.56    0.000701           6       114           rt_sigaction
  0.49    0.000222          16        14           rt_sigprocmask
  0.41    0.000184          20         9           gettid
  0.41    0.000183          18        10           sigaltstack
  0.27    0.000123          31         4           clone
  0.24    0.000106          21         5           arch_prctl
  0.14    0.000061          61         1           readlinkat
  0.12    0.000054          18         3           fcntl
  0.08    0.000036          36         1           epoll_create1
  0.06    0.000027           1        19           mmap
  0.04    0.000017          17         1           pipe2
  0.02    0.000008           8         1           epoll_ctl
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         1           sched_getaffinity
  0.00    0.000000           0         1         1 openat
------ ----------- ----------- --------- --------- ----------------
100.00    0.044907                   822        90 total

Not a watershed improvement, but somewhat better and certainly no worse. I believe most of the improvement here is due to not forcing sysmon back into high frequency mode after each wakeup from a timer.

@ChrisHines
Contributor Author

Also, please reopen this issue. The CL that claims to have fixed it is only a partial fix.

@ianlancetaylor
Contributor

Sorry, that was my mistake. I meant to update the CL description, but forgot.

@ianlancetaylor ianlancetaylor reopened this Jun 3, 2020
@networkimprov

@gopherbot please backport to 1.14. This is a regression with a simple fix. (It may have a more complex fix, too, but that won't land until 1.16.)

@gopherbot

Backport issue(s) opened: #39383 (for 1.14).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@odeke-em
Member

odeke-em commented Jun 4, 2020

Kindly chiming in to ask whether we should re-milestone this: Ian's CL landed and he re-opened the issue, but I thought perhaps the remaining work will be completed in Go1.16 rather than Go1.15?

@ianlancetaylor
Contributor

Thanks, this is for 1.16 now.

@nvanbenschoten
Contributor

CockroachDB upgraded to Go 1.14 about three weeks ago, and since then, we've seen what appears to be some concerning fallout from this issue. In a few of our nightly tests running on GCP VMs with 4 vCPUs, we push machines up to around 80%-85% of their compute capacity. We've run this kind of test for years without issue. Since upgrading to Go 1.14, we've seen that in this state, tickers can be starved for excessively long durations, orders of magnitude longer than has been discussed thus far in this issue.

For context, each node in a CockroachDB cluster heartbeats a centralized "liveness" record to coordinate a notion of availability. This heartbeating is driven by a loop that uses a ticker to fire once every 4.5 seconds. Upon firing, the expiration of the node's liveness record is extended by 9 seconds. Our testing in cockroachdb/cockroach#50865 is showing that once a machine is sufficiently loaded (~85% CPU utilization), this heartbeat loop can stall for long enough to result in a node's liveness lapsing. Specifically, we see cases where the ticker which is configured to fire every 4.5 seconds takes over 10 seconds to fire. This is easily reproducible on Go 1.14 and not at all reproducible on Go 1.13.
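
For context on the shape of that loop, here is a minimal sketch of a ticker-driven heartbeat of this kind; the names, error handling, and callback are illustrative only, not CockroachDB's actual code:

package livenesssketch

import (
    "context"
    "time"
)

// heartbeatLoop fires every 4.5s and extends a 9s liveness record on each
// tick. If a tick is delayed past the record's expiration, liveness lapses,
// which is the failure mode described above.
func heartbeatLoop(ctx context.Context, extend func(ttl time.Duration) error) {
    const (
        tickInterval = 4500 * time.Millisecond
        livenessTTL  = 9 * time.Second
    )
    ticker := time.NewTicker(tickInterval)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            _ = extend(livenessTTL) // the real loop logs and retries failures
        case <-ctx.Done():
            return
        }
    }
}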

I explored patching 9b90491 into the Go 1.14 runtime and observed that it did help, but did not eliminate the problem entirely. Instead of our tests failing reliably every time, they only fail about a third of the time. This corroborates @ChrisHines's findings – the patch results in a clear improvement but is only a partial fix.

These findings may also bring into question whether we consider this to be a performance regression or a bug, and by extension, whether a backport for the partial fix is warranted (#39383). I sympathize with the sentiment that this issue is revealing a performance regression in a construct that provides no hard performance guarantees. And yet, starvation on the order of 10s of seconds seems entirely different than that on the order of 10s of milliseconds.

@ianlancetaylor
Contributor

@nvanbenschoten Thanks for the report. As we are very close to the 1.15 release at this point, it's unfortunately too late to fix this for 1.15.

I think it would be OK to backport https://golang.org/cl/232199 to the 1.14 branch if that makes things better.

Are your goroutines largely CPU bound? Does it change anything if you add an occasional call to runtime.Gosched, or is that too expensive?

@nvanbenschoten
Contributor

> As we are very close to the 1.15 release at this point, it's unfortunately too late to fix this for 1.15.

Understood.

> I think it would be OK to backport https://golang.org/cl/232199 to the 1.14 branch if that makes things better.

We would appreciate that. While it's not a complete fix, it does make a measurable improvement.

> Are your goroutines largely CPU bound? Does it change anything if you add an occasional call to runtime.Gosched, or is that too expensive?

In this test, yes. But I'm surprised they are running for this long without returning to the scheduler. I'm trying to get a handle on whether I can narrow down which goroutine is causing issues. If I can, I'll play around with manually yielding on occasion and see if that clears up the starvation.
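
For reference, this is the kind of manual yield I plan to experiment with; the function and the yield frequency are purely illustrative, not our actual code:

package main

import (
    "runtime"
    "time"
)

// spin simulates a CPU-bound loop that voluntarily yields to the scheduler
// every yieldEvery iterations so timers and other goroutines on this P are
// not starved for the full duration of the work.
func spin(dur time.Duration, yieldEvery int) {
    start := time.Now()
    for i := 0; time.Since(start) < dur; i++ {
        if yieldEvery > 0 && i%yieldEvery == 0 {
            runtime.Gosched()
        }
    }
}

func main() {
    spin(50*time.Millisecond, 1<<16)
}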

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 4, 2020
Reverts most of cockroachdb#50671.

We upgraded from Go1.13.9 to Go1.14.4 on June 29th in cockroachdb#50671. Since then
we've seen a fair amount of fallout in our testing and build system. Most
of this has been reasonable to manage.

However, we recently stumbled upon a concerning regression in the Go runtime
that doesn't look easy to work around. Go1.14 included a full rewrite of the
runtime timer implementation. This was mostly a positive change, but shortly
after the 1.14 release, reports started coming in of multi-millisecond timer
starvation. This is tracked in this upstream issue:
golang/go#38860. A few months later, when we upgraded
Cockroach to use 1.14, some of our tests started hitting the same issue, but
in our tests, the starvation was much more severe. In
cockroachdb#50865, we saw instances of
multi-second timer starvation, which wreaked havoc on the cluster's liveness
mechanisms.

A partial fix to this runtime issue has landed for Go 1.15
(https://go-review.googlesource.com/c/go/+/232199/) and may be backported to
1.14, but there is currently no ETA on the backport and no guarantee that it
will even happen. A more robust fix is scheduled for 1.16 and the PR is still
open in the Go repo (https://go-review.googlesource.com/c/go/+/232298/).

As we approach the CockroachDB 20.2 stability period, it seems prudent to avoid
the risk of downgrading our Go runtime version too late. For that reason, Alex,
Andy K, Peter, and I recommend the following course of actions:

* Downgrade 20.2 to 1.13 immediately, so we go into the stability period with the right version of Go.
* Schedule the subsequent improvements to non-preemptable loops in CockroachDB itself in 21.1.
* At the beginning of the 21.1 cycle upgrade CRDB to 1.15 and effectively skip 1.14 all together.

This PR addresses the first of these steps. While here, it also picks up the
latest patch release of Go 1.13, moving from 1.13.9 to 1.13.14.

Checklist:

* [X] Adjust version in Docker image ([source](./builder/Dockerfile#L199-L200)).
* [X] Rebuild and push the Docker image (following [Basic Process](#basic-process))
* [X] Bump the version in `builder.sh` accordingly ([source](./builder.sh#L6)).
* [X] Bump the version in `go-version-check.sh` ([source](./go-version-check.sh)), unless bumping to a new patch release.
* [X] Bump the go version in `go.mod`. You may also need to rerun `make vendor_rebuild` if vendoring has changed.
* [X] Bump the default installed version of Go in `bootstrap-debian.sh` ([source](./bootstrap/bootstrap-debian.sh#L40-42)).
* [X] Replace other mentions of the older version of go (grep for `golang:<old_version>` and `go<old_version>`).
* [ ] Update the `builder.dockerImage` parameter in the TeamCity [`Cockroach`](https://teamcity.cockroachdb.com/admin/editProject.html?projectId=Cockroach&tab=projectParams) and [`Internal`](https://teamcity.cockroachdb.com/admin/editProject.html?projectId=Internal&tab=projectParams) projects.

(I will do the last step after this has merged)

Release note (general change): Revert the Go version back to 1.13.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 7, 2020
craig bot pushed a commit to cockroachdb/cockroach that referenced this issue Aug 7, 2020
52348: build: downgrade to Go 1.13 r=nvanbenschoten a=nvanbenschoten

Reverts most of #50671.

We upgraded from Go1.13.9 to Go1.14.4 on June 29th in #50671. Since then we've seen a fair amount of fallout in our testing and build system. Most of this has been reasonable to manage.

However, we recently stumbled upon a concerning regression in the Go runtime that doesn't look easy to work around. Go1.14 included a full rewrite of the runtime timer implementation. This was mostly a positive change, but shortly after the 1.14 release, reports started coming in of multi-millisecond timer starvation. This is tracked in this upstream issue: golang/go#38860. A few months later, when we upgraded Cockroach to use 1.14, some of our tests started hitting the same issue, but in our tests, the starvation was much more severe. In #50865, we saw instances of multi-second timer starvation, which wreaked havoc on the cluster's liveness mechanisms.

A partial fix to this runtime issue has landed for Go 1.15 (https://go-review.googlesource.com/c/go/+/232199/) and may be backported to 1.14, but there is currently no ETA on the backport and no guarantee that it will even happen. A more robust fix is scheduled for 1.16 and the PR is still open in the Go repo (https://go-review.googlesource.com/c/go/+/232298/).

As we approach the CockroachDB 20.2 stability period, it seems prudent to avoid the risk of downgrading our Go runtime version too late. For that reason, @lunevalex, @andy-kimball, @petermattis, and I recommend the following course of actions:

* Downgrade 20.2 to 1.13 immediately, so we go into the stability period with the right version of Go.
* Schedule the subsequent improvements to non-preemptable loops in CockroachDB itself in 21.1.
* At the beginning of the 21.1 cycle upgrade CRDB to 1.15 and effectively skip 1.14 all together.

This PR addresses the first of these steps. While here, it also picks up the latest patch release of Go 1.13, moving from 1.13.9 to 1.13.14.

Checklist:

* [X] Adjust version in Docker image ([source](./builder/Dockerfile#L199-L200)).
* [X] Rebuild and push the Docker image (following [Basic Process](#basic-process))
* [X] Bump the version in `builder.sh` accordingly ([source](./builder.sh#L6)).
* [X] Bump the version in `go-version-check.sh` ([source](./go-version-check.sh)), unless bumping to a new patch release.
* [X] Bump the go version in `go.mod`. You may also need to rerun `make vendor_rebuild` if vendoring has changed.
* [X] Bump the default installed version of Go in `bootstrap-debian.sh` ([source](./bootstrap/bootstrap-debian.sh#L40-42)).
* [X] Replace other mentions of the older version of go (grep for `golang:<old_version>` and `go<old_version>`).
* [ ] Update the `builder.dockerImage` parameter in the TeamCity [`Cockroach`](https://teamcity.cockroachdb.com/admin/editProject.html?projectId=Cockroach&tab=projectParams) and [`Internal`](https://teamcity.cockroachdb.com/admin/editProject.html?projectId=Internal&tab=projectParams) projects.

(I will do the last step after this has merged)

Release note (general change): Revert the Go version back to 1.13.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@ChrisHines
Contributor Author

Now that the Go 1.16 tree is open, I would like to return attention to this issue. I have responded to all comments posted to date on https://go-review.googlesource.com/c/go/+/232298. Please review and let me know if you have any additional questions or concerns.

ChrisHines added a commit to ChrisHines/go that referenced this issue Oct 31, 2020
Change the scheduler to consider P's not marked for preemption as
potential targets for timer stealing by spinning P's.

Ignoring timers on P's not marked for preemption, as the scheduler
did previously, has the downside that timers on that P must wait for
its current G to complete or get preempted. But that can take as long
as 10ms.

In addition, this choice is only made when a spinning P is available
and in timer-bound applications it may result in the spinning P
stopping instead of performing available work, reducing parallelism.

In CL 214185 we avoided taking the timer lock of a P with no ready
timers, which reduces the chances of timer lock contention.

Fixes golang#38860

Change-Id: If52680509b0f3b66dbd1d0c13fa574bd2d0bbd57
@golang golang locked and limited conversation to collaborators Oct 27, 2021