
runtime: TestMemStats is flaky #38712

Closed
mknyszek opened this issue Apr 27, 2020 · 8 comments
Labels: FrozenDueToAge, NeedsFix (The path to resolution is known, but the work has not been done.)
Milestone: Go1.15

Comments

@mknyszek
Contributor

TestMemStats is currently flaking on several builders:

2020-04-27T20:20:53-a7e9e84/linux-386-longtest
2020-04-27T20:20:53-a7e9e84/linux-ppc64le-power9osu
2020-04-27T20:06:57-d2f5e4e/linux-amd64-nocgo
2020-04-27T19:49:49-a637ee1/linux-amd64-clang
2020-04-27T19:49:42-d646c03/linux-amd64-clang
2020-04-27T19:35:01-6a4441d/darwin-amd64-10_12

The failure is:

--- FAIL: TestMemStats (0.00s)
    malloc_test.go:98: Bad sys value: ... [blob of text]

Given the timing, the most likely cause is the new mcentral implementation, though it's not yet clear to me how it would be related.

@mknyszek mknyszek added this to the Go1.15 milestone Apr 27, 2020
@mknyszek mknyszek self-assigned this Apr 27, 2020
@mknyszek
Contributor Author

What's particularly strange is the "bad sys value", since the way sys is set is very straightforward. On the darwin-amd64-10_12 builder, Sys is 256 KiB smaller than the sum of the values it's supposed to match. Maybe one of the values that goes into sys now gets updated after sys is calculated but before the rest of the values are copied?
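For reference, a minimal standalone sketch of the consistency check that fails here (the real check lives in runtime/malloc_test.go; the component list below is assumed from the MemStats documentation, and the actual test verifies more than this):

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var st runtime.MemStats
	runtime.ReadMemStats(&st)

	// Sys is documented as the sum of the XSys fields.
	sum := st.HeapSys + st.StackSys + st.MSpanSys + st.MCacheSys +
		st.BuckHashSys + st.GCSys + st.OtherSys
	if st.Sys != sum {
		fmt.Printf("Bad sys value: Sys=%d, sum of parts=%d (off by %d)\n",
			st.Sys, sum, int64(sum)-int64(st.Sys))
	} else {
		fmt.Println("Sys is consistent:", st.Sys)
	}
}
```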

@mknyszek
Contributor Author

I looked at a few more: all of them seem to be off by 256 KiB.

@dmitshur dmitshur added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Apr 28, 2020
@mknyszek
Contributor Author

Still happening, and even 32-bit builders are off by exactly 256 KiB.

@mknyszek
Contributor Author

mknyszek commented Apr 28, 2020

What this means for certain is that one of the values that goes into memstats.sys increases somewhere between the first few lines of updatememstats() (which set memstats.sys) and the memmove of memstats into MemStats. It's unlikely to be multiple increases and decreases, because the difference is so consistent. Assuming exactly one stat increased, the candidates are:

heap_sys
stacks_inuse
buckhash_sys
gc_sys
other_sys

Under this assumption, all the other stats are too small (< 256 KiB) to be culprits.

Given all the evidence (the timing implicating the mcentral work, the consistent value of 256 KiB), I'm starting to think this is persistentalloc's update of other_sys and gc_sys not being read properly by ReadMemStats (i.e. it reads a stale value because the read is non-atomic). Notably, persistentalloc serves allocations < 64 KiB out of 256 KiB chunks. I always figured a STW event would involve so many memory barriers that this couldn't be a possibility, but maybe? This would relate to the mcentral changes because we may now be hitting persistentalloc regularly (span set blocks are < 64 KiB in size) due to span set blocks leaking (fixed in https://golang.org/cl/230497; if that's the cause, landing that CL might also "fix" this, though not the underlying issue).
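To illustrate the chunking behavior being blamed, here is a toy bump allocator (not the runtime's persistentalloc; names are illustrative and the 256 KiB constant is taken from the description above) that charges a full chunk to a sys stat the moment a new chunk is needed:

```go
package main

import "fmt"

const chunkSize = 256 << 10 // 256 KiB, per the description above

type chunk struct {
	buf []byte
	off int
}

var current *chunk

// alloc returns size bytes from the current chunk, grabbing a fresh chunk
// (and charging all 256 KiB to sysStat at once) when it runs out.
func alloc(size int, sysStat *uint64) []byte {
	if current == nil || current.off+size > chunkSize {
		current = &chunk{buf: make([]byte, chunkSize)}
		*sysStat += chunkSize // the whole 256 KiB jump happens here
	}
	p := current.buf[current.off : current.off+size]
	current.off += size
	return p
}

func main() {
	var otherSys uint64
	alloc(4096, &otherSys) // one small allocation, e.g. a span set block
	fmt.Println(otherSys)  // 262144: a single allocation accounts for 256 KiB
}
```

If Sys is computed before an allocation like this lands in gc_sys/other_sys, the snapshot comes out exactly one chunk (256 KiB) short, which matches the observed skew.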

@mknyszek
Contributor Author

Ah but hold up, this is happening on amd64/386 where regular loads are the same as atomic loads (in the runtime) because of TSO (and the increments are atomic).

@mknyszek
Contributor Author

Figured it out.

Consider the following chain of events:

  1. updatememstats sets sys
  2. It purges the mcaches and calls mcentral.uncacheSpan on them.
  3. mcentral.uncacheSpan does a push into a span set, causing a persistentalloc.
  4. The persistentalloc needs a new chunk.
  5. memstats.gc_sys and memstats.other_sys are increased in total by 256 KiB.
  6. We do the rest of the MemStats stuff and copy the memstats struct into the MemStats struct.
  7. We now have a stale Sys value.

The fix is to purge the mcaches before doing anything else in updatememstats.
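A toy model of why that ordering matters, under the assumption above that the only thing that changes during the flush is a 256 KiB persistentalloc charge (this is not the runtime's updatememstats, just an illustration of the fix):

```go
package main

import "fmt"

type stats struct {
	heapSys, gcSys, otherSys, sys uint64
}

// flushCaches stands in for purging the mcaches; uncaching spans may
// allocate a new 256 KiB chunk and charge it to gcSys.
func flushCaches(s *stats) {
	s.gcSys += 256 << 10
}

// snapshot models updatememstats: compute sys either before or after
// the cache flush.
func snapshot(flushFirst bool) stats {
	s := stats{heapSys: 1 << 20}
	if flushFirst {
		flushCaches(&s) // the fix: flush before summing
	}
	s.sys = s.heapSys + s.gcSys + s.otherSys
	if !flushFirst {
		flushCaches(&s) // the bug: sys is already stale by now
	}
	return s
}

func main() {
	for _, flushFirst := range []bool{false, true} {
		s := snapshot(flushFirst)
		fmt.Printf("flushFirst=%v consistent=%v (sys short by %d)\n",
			flushFirst, s.sys == s.heapSys+s.gcSys+s.otherSys,
			s.heapSys+s.gcSys+s.otherSys-s.sys)
	}
}
```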

@mknyszek mknyszek added the NeedsFix The path to resolution is known, but the work has not been done. label Apr 28, 2020
@gopherbot gopherbot removed the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Apr 28, 2020
@gopherbot

Change https://golang.org/cl/230498 mentions this issue: runtime: flush mcaches to mcentral before reading memstats

@josharian
Contributor

@mknyszek when I discussed #24686 with @aclements ages ago he suggested it could let us eliminate mcache. If you encounter other mcache troubles, it might be worth exploring this other route.

xujianhai666 pushed a commit to xujianhai666/go-1 that referenced this issue May 21, 2020
Currently mcaches are flushed to mcentral after a bunch of memstats have
already been read. This is not safe (in the sense that it doesn't ensure
consistent memstats) since memstats may in general change when mcentral
data structures are manipulated.

Note that prior to the new mcentral implementation this was not a
problem because mcentral operations happened to never modify certain
memstats. As of the new mcentral implementation, we might for example
persistentalloc when uncaching a span, which would change memstats. This
can cause a skew between the value of sys (which currently is calculated
before mcaches are flushed) and the value of gc_sys and other_sys.

Fix this by moving mcache flushing to the very top of updatememstats.
Also leave a comment explaining that this must be done first, in
general, because mcentrals make no guarantee that they will not
influence memstats (and doing so would be unnecessarily restrictive).

Fixes golang#38712.

Change-Id: I15bacb313c54a46e380a945a71bb75db67169c1b
Reviewed-on: https://go-review.googlesource.com/c/go/+/230498
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Russ Cox <rsc@golang.org>
@golang golang locked and limited conversation to collaborators Apr 28, 2021