Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime/metrics: provide histogram of all STW events #63340

Closed
rhysh opened this issue Oct 2, 2023 · 11 comments
Closed

runtime/metrics: provide histogram of all STW events #63340

rhysh opened this issue Oct 2, 2023 · 11 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. Proposal Proposal-Accepted
Milestone

Comments

@rhysh
Copy link
Contributor

rhysh commented Oct 2, 2023

The /gc/pauses:seconds metric gives a count of GC-related stop-the-world events (and the distribution of their durations), but there are several other ways for an application to trigger a STW. The prevalence and duration of those is also worth tracking: sometimes the work done while the world is stopped is expensive (#33250), and sometimes the stopping itself is expensive (#31222).

In the runtime today, I see runtime.GOMAXPROCS, runtime.ReadMemStats, runtime/pprof.Lookup("goroutine").WriteTo, runtime/trace.{Start,Stop} (runtime.{Start,Stop}Trace), runtime.Stack(..., true), runtime/debug.WriteHeapDump, syscall.AllThreadsSyscall{,6}, plus GC phase transitions, as being able to cause a STW. It's not unusual for latency-sensitive apps to call one or more of those on a regular basis: in my own experience that's usually goroutine profiles, ReadMemStats (via the expvar package), and execution traces.

I propose adding a new histogram metric that reports on the full duration of all STW events, including the time to bring the world to a stop and the time with the world stopped, tentatively named /sched/pauses:seconds. This will make it easier to identify the presence and impact of stop-the-world pauses, even when they're not related to the GC.

Existing:

/gc/pauses:seconds
	Distribution of individual GC-related stop-the-world pause
	latencies. Bucket counts increase monotonically.

Proposed addition:

/sched/pauses:seconds
	Distribution of individual stop-the-world pause latencies.
	Bucket counts increase monotonically.
@rhysh rhysh added the Proposal label Oct 2, 2023
@gopherbot gopherbot added this to the Proposal milestone Oct 2, 2023
@ianlancetaylor
Copy link
Contributor

CC @golang/runtime

@prattmic prattmic added the compiler/runtime Issues related to the Go compiler and/or runtime. label Oct 3, 2023
@prattmic
Copy link
Member

prattmic commented Oct 3, 2023

It could be nice to have a full breakdown of each STW type, but I'm not sure the value is worth littering with so many new metrics.

I also wonder if we should have a metric for the time it takes to stop the world (i.e., the runtime of stopTheWorldWithSema). Sometimes we encounter bugs with stopping the world, though often those are complete deadlocks in which case a metric wouldn't help.

cc @mknyszek

@rhysh
Copy link
Contributor Author

rhysh commented Oct 4, 2023

I'd be happy with a pair of histogram metrics (how long to stop the world, and how long it remains stopped). That's usually the first way I'd need to split an investigation; the raw number of STW events (and how it grows over time) should give a good hint at which kinds of STW are prevalent in the app.

To the extent that #45894 is explained by the spinning M mechanism being slow to restart the world, it could be interesting to also have a third "cost of STW" histogram that would shed light on that. Defining it would be tricky, something like "from the start of the call to procresize to the time when either all Ps are running or when the scheduler fails to find work for a P and so returns it to the scheduler as idle". But as it stands now I think that most of #45894 would be explainable by runtime-internal lock contention, #57071.

As of go1.21.1 the runtime/metrics package has 18 entries under /godebug/non-default-behavior/..., with 2 more in tip. The full breakdown here would be 7 more, or 14 if they split again on stopping/stopped. Although runtime/metrics is built for this, I'm inclined to agree that the full set would be too much.

Execution traces give great detail on STW events, so long as the STW events aren't themselves correlated / anti-correlated with the collection of an execution trace. It's very easy to see (most of) how long the world remains stopped at the beginning of an execution trace. But it's impossible to see how long that STW takes to bring the world to a stop. The mechanism I use for most of my profiling collects goroutine profiles immediately before taking any execution traces, so the work of gathering the goroutine profile (including its STW events) never has a chance to show up in the execution trace. Maybe the flight recording API #63185 can change that, through allowing us to look back in time to see interesting STWs that we otherwise would have missed.

@prattmic
Copy link
Member

Often enough I find myself wondering if struggles to stop-the-world are causing latency issues that I think they are worth measuring. Also, rather than measuring "GC" and "all" pauses, IMO it would be cleaner to measure "GC" and "everything else" separately to make the non-GC effects more obvious.

Concretely, I propose that we add:

/sched/pauses/start/gc:seconds
        Distribution of individual GC-related stop-the-world start
        latencies. This is the time it takes from deciding to stop 
        the world until all Ps are stopped. This is a subset of the
        total stop-the-world time (/sched/pauses/total/gc:seconds).
        During this time, some threads may be executing. Bucket counts
        increase monotonically.

/sched/pauses/start/other:seconds
        Distribution of individual non-GC-related stop-the-world start
        latencies. This is the time it takes from deciding to stop 
        the world until all Ps are stopped. This is a subset of the
        total stop-the-world time (/sched/pauses/total/other:seconds).
        During this time, some threads may be executing. Bucket counts
        increase monotonically.

/sched/pauses/total/gc:seconds
        Distribution of individual GC-related stop-the-world pause
        latencies. This is the time from deciding to stop the world
        until the world is started again. Some of this time is spent
        getting all threads to stop (this is measured directly in
        /sched/pauses/start/gc:seconds), during which some threads may
        still be running. Bucket counts increase monotonically.

/sched/pauses/total/other:seconds
        Distribution of individual non-GC-related stop-the-world
        pause latencies. This is the time from deciding to stop the
        world until the world is started again. Some of this time
        is spent getting all threads to stop (measured directly in
        /sched/pauses/start/other:seconds). Bucket counts increase
        monotonically.

With this scheme, /gc/pauses:seconds is conceptually equivalent to /sched/pauses/total/gc:seconds. I propose that we mark /gc/pauses:seconds as deprecated and make it return the exact same histogram as /sched/pauses/total/gc:seconds. I am also open to deleting /gc/pauses:seconds entirely.

I have written a prototype CL implementing this for reference, which I will push shortly.

@gopherbot
Copy link

Change https://go.dev/cl/534161 mentions this issue: runtime/metrics: add STW start and total time metrics

@mknyszek
Copy link
Contributor

mknyszek commented Nov 2, 2023

FWIW I support this. Non-GC STWs has been a gap in our metrics for a long time. +1 to making the metrics orthogonal (GC and then everything else) and +1 to breaking out "time to stop" (listed as "start") as a metric.

I also think it's fine to not have the same kind of orthogonality for "total" vs. "start" that we have for "gc" vs. "other". Histograms can't be combined in quite the same way. I agree "total" and "start" also makes more sense, as they're the most individually useful.

Bikeshedding just a little, maybe "start" could be a bit less opaque? Maybe "bring-to-stop"? And on that note, perhaps "total" could be "end-to-end" or something? I don't know. "total" is probably fine. I don't feel strongly about this.

@rsc
Copy link
Contributor

rsc commented Nov 2, 2023

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

Add four new runtime/metrics histograms for tracking GC pauses:

/sched/pauses/start/gc:seconds
/sched/pauses/start/other:seconds
/sched/pauses/total/gc:seconds
/sched/pauses/total/other:seconds

Mark /gc/pauses:seconds as deprecated in favor of /sched/pauses/total/gc:seconds (which is exactly the same).

Perhaps a better name than “start” can be found to indicate the time it takes the world to actually stop (“settle”? “quiesce”? “freeze”?).

@rsc
Copy link
Contributor

rsc commented Nov 2, 2023

s/start/stopping/

@felixge
Copy link
Contributor

felixge commented Nov 4, 2023

Maybe preempting?

@rsc
Copy link
Contributor

rsc commented Nov 10, 2023

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

Add four new runtime/metrics histograms for tracking GC pauses:

/sched/pauses/stopping/gc:seconds
/sched/pauses/stopping/other:seconds
/sched/pauses/total/gc:seconds
/sched/pauses/total/other:seconds

Mark /gc/pauses:seconds as deprecated in favor of /sched/pauses/total/gc:seconds (which is exactly the same).

@rsc rsc changed the title proposal: runtime/metrics: provide histogram of all STW events runtime/metrics: provide histogram of all STW events Nov 10, 2023
@rsc rsc modified the milestones: Proposal, Backlog Nov 10, 2023
@gopherbot
Copy link

Change https://go.dev/cl/547095 mentions this issue: doc: add STW metrics to release notes

gopherbot pushed a commit that referenced this issue Dec 6, 2023
For #63340.
For #61422.

Change-Id: Ib74bb54b0450e96b7f4b7eb7ba2ae7ac2d40171a
Reviewed-on: https://go-review.googlesource.com/c/go/+/547095
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
ezz-no pushed a commit to ezz-no/go-ezzno that referenced this issue Feb 18, 2024
For golang#63340.
For golang#61422.

Change-Id: Ib74bb54b0450e96b7f4b7eb7ba2ae7ac2d40171a
Reviewed-on: https://go-review.googlesource.com/c/go/+/547095
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. Proposal Proposal-Accepted
Projects
Status: Accepted
Development

No branches or pull requests

7 participants