
runtime: shrinkstack during mark termination significantly increases GC STW time #12967

Closed
rhysh opened this issue Oct 17, 2015 · 27 comments

@rhysh
Contributor

rhysh commented Oct 17, 2015

I have a process that experiences significant STW times during the mark termination phase of garbage collection. When it has around 100,000 goroutines and a live heap of around 500MB, its STW pauses take tens of milliseconds. The process runs on linux/amd64 with GOMAXPROCS=18, with a recent version of Go tip (30b9663).

I profiled its CPU usage with perf_events, looking specifically at samples with runtime.gcMark on the stack. 7.5% of samples include runtime.gcMark -> runtime.freeStackSpans (non-parallelized work, discussed in #11465). 90% of samples include runtime.gcMark -> runtime.parfordo -> runtime.markroot, and 75% of samples include runtime.gcMark -> runtime.parfordo -> runtime.markroot -> runtime.shrinkstack.

The work for stack shrinking is split 18 ways in my case, but at 75% of CPU cycles it's still a huge contributor to mark termination pauses, and the pauses my program sees are well beyond the 10ms goal with only a modest heap size.

Does stack shrinking need to take place while the world is stopped?

$ go version
go version devel +30b9663 Tue Oct 13 21:06:58 2015 +0000 linux/amd64

@aclements

@ianlancetaylor
Contributor

CC @RLH @aclements

@randall77
Contributor

We could move stack shrinking to the phase where we stop one goroutine at a time.

@aclements
Member

I agree. I don't think there would be any problem with shrinking each stack right before we do the concurrent scan of that stack. We have to be careful not to use any cross-stack pointers between when we shrink our own stack and when we return from the system stack, but that may actually be easier to ensure if we do the shrink as part of stack scanning.

@rhysh
Contributor Author

rhysh commented Oct 20, 2015

I've written a reproducer, included below. Its perf_events profile, collected on linux/amd64, looks very similar to what I see for my production program in terms of CPU time spent in runtime.freeStackSpans (10%), in runtime.shrinkstack (83%), in runtime.stackalloc (37%), and in runtime.stackfree (37%).

The reproducer starts 10,000 goroutines which each repeatedly grow and shrink their stack by 10kB. The production program has around 200,000 goroutines and has a garbage collection every 60-120 seconds, so any goroutines that grow and shrink their stacks during that time will contribute to the mark termination pause. (Ironic that generating more garbage, which would lead to more frequent collections, could reduce the duration of each pause.)

Here's output from some runs of the program on darwin/amd64. The last of the five clock figures on each line is the mark termination STW pause; note the occasional spikes, e.g. 79 ms in gc 5 and 23 ms in gc 8.

$ go version
go version devel +3bc0601 Tue Oct 20 03:16:09 2015 +0000 darwin/amd64
$ GODEBUG=gctrace=1 ./12967 
gc 1 @0.002s 20%: 22+9.1+0.15+2.4+0.72 ms clock, 45+9.1+0+0/4.5/0+1.4 ms cpu, 10->14->13 MB, 15 MB goal, 8 P
gc 2 @0.126s 7%: 0.24+24+0.001+0.065+0.50 ms clock, 0.48+24+0+1.0/0/6.2+1.0 ms cpu, 15->15->15 MB, 27 MB goal, 8 P
gc 3 @0.856s 1%: 0.31+24+0.001+0.067+0.54 ms clock, 2.4+24+0+1.4/0.001/10+4.3 ms cpu, 22->22->16 MB, 30 MB goal, 8 P
gc 4 @2.118s 0%: 0.25+24+0.001+0.066+0.51 ms clock, 2.0+24+0+0.029/0/13+4.1 ms cpu, 28->28->16 MB, 32 MB goal, 8 P
gc 5 @3.684s 2%: 0.19+20+0.002+0.068+79 ms clock, 1.5+20+0+0.95/0.003/6.7+638 ms cpu, 31->31->16 MB, 32 MB goal, 8 P
gc 6 @5.239s 2%: 0.36+24+0.001+0.10+0.53 ms clock, 2.9+24+0+0.031/0.021/13+4.2 ms cpu, 31->31->16 MB, 32 MB goal, 8 P
gc 7 @6.904s 1%: 0.25+29+0.002+0.069+0.59 ms clock, 2.0+29+0+7.2/0.002/0.039+4.7 ms cpu, 32->32->16 MB, 33 MB goal, 8 P
gc 8 @8.551s 1%: 0.22+20+0.002+0.065+23 ms clock, 1.8+20+0+1.7/0/7.1+185 ms cpu, 32->32->16 MB, 33 MB goal, 8 P
gc 9 @10.175s 1%: 0.24+25+0.001+0.071+0.57 ms clock, 1.9+25+0+0/0.014/10+4.5 ms cpu, 32->32->15 MB, 33 MB goal, 8 P
gc 10 @11.690s 1%: 0.21+26+0.002+0.072+0.60 ms clock, 1.6+26+0+5.6/0.002/0.016+4.8 ms cpu, 30->30->16 MB, 31 MB goal, 8 P
gc 11 @13.316s 1%: 0.25+23+0.001+0.054+13 ms clock, 2.0+23+0+1.8/0.006/12+105 ms cpu, 32->32->16 MB, 33 MB goal, 8 P
gc 12 @14.942s 1%: 0.24+19+0.001+0.059+8.8 ms clock, 1.9+19+0+0.80/0.001/7.1+70 ms cpu, 32->32->16 MB, 33 MB goal, 8 P
gc 13 @16.603s 1%: 0.22+25+0.001+0.085+0.61 ms clock, 1.7+25+0+3.8/0.020/7.4+4.9 ms cpu, 32->32->16 MB, 33 MB goal, 8 P
gc 14 @18.239s 1%: 0.22+20+0.001+0.15+14 ms clock, 1.7+20+0+0.034/0.010/10+112 ms cpu, 32->32->16 MB, 33 MB goal, 8 P
gc 15 @19.875s 1%: 0.23+21+0.001+0.069+9.7 ms clock, 1.8+21+0+0/0.002/8.1+78 ms cpu, 32->32->15 MB, 33 MB goal, 8 P
gc 16 @21.392s 0%: 0.21+23+0.001+0.067+0.59 ms clock, 1.6+23+0+0.26/0.002/10+4.7 ms cpu, 30->30->16 MB, 31 MB goal, 8 P
gc 17 @23.014s 0%: 0.20+21+0.001+0.067+13 ms clock, 1.6+21+0+1.7/0.015/9.0+105 ms cpu, 32->32->16 MB, 33 MB goal, 8 P

Disabling stack shrinking makes the mark termination STW pauses significantly shorter.

$ GODEBUG=gctrace=1,gcshrinkstackoff=1 ./12967 
gc 1 @0.002s 20%: 22+6.9+0.21+2.2+0.64 ms clock, 45+6.9+0+0/4.2/0+1.2 ms cpu, 10->14->13 MB, 15 MB goal, 8 P
gc 2 @0.128s 6%: 0.30+23+0.001+0.069+0.56 ms clock, 0.60+23+0+0.82/0.002/8.5+1.1 ms cpu, 15->15->15 MB, 27 MB goal, 8 P
gc 3 @0.841s 1%: 0.21+24+0.001+0.053+0.47 ms clock, 1.7+24+0+1.1/0.004/7.2+3.7 ms cpu, 22->22->16 MB, 30 MB goal, 8 P
gc 4 @2.059s 0%: 0.29+24+0.001+0.058+0.43 ms clock, 2.0+24+0+0/0.010/8.1+3.0 ms cpu, 28->28->15 MB, 32 MB goal, 8 P
gc 5 @3.475s 0%: 0.28+21+0.002+0.077+0.28 ms clock, 2.2+21+0+0.049/0.001/11+2.3 ms cpu, 29->29->16 MB, 30 MB goal, 8 P
gc 6 @4.995s 0%: 0.20+19+0.002+0.077+0.28 ms clock, 1.6+19+0+0/0.027/9.5+2.3 ms cpu, 31->31->15 MB, 32 MB goal, 8 P
gc 7 @6.509s 0%: 0.20+28+0.001+0.068+0.56 ms clock, 1.6+28+0+6.0/0.001/0.032+4.5 ms cpu, 30->30->16 MB, 31 MB goal, 8 P
gc 8 @8.138s 0%: 0.21+20+0.001+0.077+0.27 ms clock, 1.6+20+0+0/0.018/7.0+2.1 ms cpu, 32->32->15 MB, 33 MB goal, 8 P
gc 9 @9.652s 0%: 0.25+21+0.001+0.070+0.26 ms clock, 2.0+21+0+0.28/0.015/10+2.1 ms cpu, 30->30->16 MB, 31 MB goal, 8 P
gc 10 @11.275s 0%: 0.23+22+0.002+0.078+0.64 ms clock, 1.8+22+0+0/0.018/7.6+5.1 ms cpu, 32->32->15 MB, 33 MB goal, 8 P
gc 11 @12.791s 0%: 0.26+21+0.001+0.070+0.23 ms clock, 1.8+21+0+0/0.014/10+1.6 ms cpu, 30->30->15 MB, 31 MB goal, 8 P
gc 12 @14.302s 0%: 0.20+19+0.002+0.073+0.28 ms clock, 1.6+19+0+0/0.002/7.2+2.2 ms cpu, 30->30->15 MB, 31 MB goal, 8 P
gc 13 @15.815s 0%: 0.26+28+0.001+0.066+0.47 ms clock, 2.1+28+0+6.6/0/0.020+3.7 ms cpu, 30->30->16 MB, 31 MB goal, 8 P
gc 14 @17.441s 0%: 0.26+24+0.001+0.067+0.56 ms clock, 2.0+24+0+1.2/0.001/8.5+4.5 ms cpu, 32->32->16 MB, 33 MB goal, 8 P
gc 15 @19.064s 0%: 0.21+20+0.001+0.077+0.24 ms clock, 1.7+20+0+0/0.017/9.6+1.9 ms cpu, 32->32->15 MB, 33 MB goal, 8 P
gc 16 @20.577s 0%: 0.22+22+0.001+0.079+0.54 ms clock, 1.8+22+0+0/0.002/9.5+4.3 ms cpu, 30->30->15 MB, 31 MB goal, 8 P
gc 17 @22.091s 0%: 0.21+26+0.001+0.070+0.55 ms clock, 1.6+26+0+4.8/0.001/0.014+4.4 ms cpu, 30->30->16 MB, 31 MB goal, 8 P
package main

import (
    "sync"
    "time"
)

const (
    ballastSize   = 10 << 20
    garbageSize   = 1 << 20
    garbagePeriod = 100 * time.Millisecond

    stackCount  = 10000
    stackSize   = 10 << 10
    stackPeriod = 5 * time.Second
)

var (
    ballast []byte
    garbage []byte
)

func churn() {
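    // Keep ballastSize bytes of ballast live and replace a garbageSize
    // allocation every garbagePeriod so that garbage collections happen
    // on a regular cadence.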
    ballast = make([]byte, ballastSize)

    for {
        time.Sleep(garbagePeriod)
        garbage = make([]byte, garbageSize)
    }
}

func stack(a, b *sync.WaitGroup) {
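    // Alternate between growing this goroutine's stack inside grow (while
    // parked on a) and letting it shrink again (while parked on b); main
    // toggles the two WaitGroups to drive the cycle.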
    for {
        grow(a)
        b.Wait()
    }
}

func grow(a *sync.WaitGroup) byte {
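    // The large local array forces roughly stackSize bytes of stack growth;
    // parking on a keeps the enlarged stack in use until main releases it.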
    var s [stackSize]byte
    a.Wait()
    return s[0]
}

func main() {
    go churn()

    var a, b sync.WaitGroup
    a.Add(1)
    for i := 0; i < stackCount; i++ {
        go stack(&a, &b)
    }

    for {
        time.Sleep(stackPeriod / 2)
        b.Add(1)
        a.Add(-1)

        time.Sleep(stackPeriod / 2)
        a.Add(1)
        b.Add(-1)
    }
}

@aclements
Member

I thought of one problem with shrinking stacks during concurrent stack scan. If the goroutine is blocked in a channel operation or select, there are pointers into that G's stack from runtime structures, and updating these pointers during shrinking could race with another G completing that channel operation or select. This isn't an issue for stack growing because that's always done synchronously by the goroutine itself.

I can think of a few solutions:

  1. We could simply not shrink stacks concurrently. This would be unfortunate, since goroutines are often blocked on channels for a long time.

  2. The stack shrinking code could detect this condition and lock the appropriate channels while shrinking to prevent a channel operation from completing during the shrink. This might not be so hard: we could add the *hchan to the sudog and then walk the g.waiting list in shrinkstack. We'd probably want to rearrange copystack a bit so we didn't have to keep these channels locked for the whole copystack. This has the downside of blocking all other operations on these channels even if the G being shrunk is way back in the channel's queue.

  3. Channel operations could detect this condition and wait for the stack shrink to finish, or just requeue the waiting G (which doesn't help if this is the only G waiting on the channel). Channel operations would also have to block a stack shrink from starting during the operation. This has the advantage of only blocking operations that actually deal with the G being shrunk. We could probably do this by synchronizing on the G's status: introduce a new status for "in a channel op" and have the channel completion put the G into this status. If the G is already in _Gcopystack, the channel completion spins until the stack copy is done; conversely, a copystack attempted during the channel op completion spins until it can put the G into _Gcopystack. (A rough sketch of this synchronization follows below.)
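
For illustration only, here's a minimal, self-contained sketch of the shape of that third option. None of these names exist in the runtime (the real version would use the G's status word and states like _Gwaiting and _Gcopystack); it only shows the idea that whichever side wins the CAS proceeds and the other side waits until the status returns to "waiting":

package main

import (
    "fmt"
    "runtime"
    "sync"
    "sync/atomic"
)

// Hypothetical stand-ins for the runtime's G statuses.
const (
    statusWaiting   uint32 = iota // G is parked on a channel
    statusChanOp                  // a channel op is completing against this G
    statusCopystack               // the G's stack is being copied/shrunk
)

// acquire spins until it can move the status from "waiting" to want,
// i.e. until any conflicting operation has finished.
func acquire(status *uint32, want uint32) {
    for !atomic.CompareAndSwapUint32(status, statusWaiting, want) {
        runtime.Gosched()
    }
}

func release(status *uint32) { atomic.StoreUint32(status, statusWaiting) }

func main() {
    var status uint32 // one such word per G in the real runtime
    var wg sync.WaitGroup
    wg.Add(2)
    go func() { // a sender completing a channel op against the blocked G
        defer wg.Done()
        acquire(&status, statusChanOp)
        // ... write the sent value into the receiver's stack slot ...
        release(&status)
    }()
    go func() { // the stack shrinker
        defer wg.Done()
        acquire(&status, statusCopystack)
        // ... copy the stack to a smaller allocation and adjust pointers ...
        release(&status)
    }()
    wg.Wait()
    fmt.Println("channel op and stack shrink never overlapped")
}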

@rhysh
Contributor Author

rhysh commented Oct 27, 2015

Does your first solution mean that the stacks would be shrunk during a STW phase, or that they would not be shrunk at all?

Most of the goroutines in the application I described are blocked on channel operations nearly all the time. The goroutines occasionally get a message on their channel that makes them call a function with a large stack requirement (and which returns quickly).

Locking all goroutines that are attempting to communicate on a channel, as described in the second solution, would make "quit" channels more expensive if shared across many goroutines. Stack shrinking on any one of the participating goroutines would delay all of the others.

The third solution seems easy to reason about, and sounds like it will have predictable performance.

@rsc
Contributor

rsc commented Nov 5, 2015

Per Austin, this is too invasive for Go 1.6. He may look into having it ready for the start of the 1.7 cycle.

@randall77
Contributor

So as I see it there are several options to enable shrinking outside a STW phase.

  1. Have receivers allocate space for the received data on the heap.
  2. Have unbuffered channels allocate one slot which is used to pass the sent value.
  3. Shrinker must grab the channel lock.
  4. Sender & shrinker must grab a lock on the stack (or G, as Austin proposes)
  5. Shrink stacks in place.

1 means allocation per channel receive.
2 means some extra memory + some extra synchronization.
3 might be difficult for select statements, since we'd need to grab all the chan locks.
4 means a second lock per send.
5 means possibly greater fragmentation.

As others have also pointed out, I don't think preventing shrinking while in receive is a viable option.
Any others?

@ianlancetaylor
Contributor

You probably considered this, but everything flows through the sudog, so we could add a lock there. Once the sudog is selected for a channel op, we can lock it. Stack shrinking already knows about the sudogs associated with the channel; it could lock all of them first. If stack shrinking sees that a sudog is already locked, it can punt and wait until next time. If the channel op sees that the sudog is locked, we know there can be at most one channel op pending for a goroutine, so the channel op can sleep on a note until the stack shrinking is complete. So I think the sudog lock can be a single atomic integer managed through cas, plus a note. I think the overhead in the normal case is one additional atomic cas and one atomic store per channel send.

@aclements
Member

@RLH and I were talking about another solution that sort of combines @randall77's first suggestion and @ianlancetaylor's suggestion: reserve a few words in the sudog for receiving small values (which are presumably the majority of channel ops) and otherwise allocate space on the heap for a slot. The receiver would then copy the value into its own stack when it wakes up. This would eliminate cross-stack writes, at the cost of slightly larger sudogs and a heap allocation for passing large values.
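
For illustration only, a minimal sketch of the "small inline buffer or heap slot" idea. slot, inlineBytes, and buf are made-up names, not the runtime's sudog, and a real version would reserve a few words rather than a fixed byte count:

package main

import "fmt"

const inlineBytes = 32 // hypothetical: a few words reserved inside the sudog

// slot stands in for the place a sender deposits a value for a blocked
// receiver, instead of writing directly into the receiver's stack.
type slot struct {
    inline [inlineBytes]byte // small values travel inside the sudog itself
    heap   []byte            // large values get a heap-allocated buffer
}

// buf returns the memory the sender writes the value into. The receiver
// copies it onto its own stack when it wakes up, so no goroutine ever
// writes into another goroutine's stack.
func (s *slot) buf(size int) []byte {
    if size <= len(s.inline) {
        return s.inline[:size]
    }
    if len(s.heap) < size {
        s.heap = make([]byte, size) // the cost of passing a large value
    }
    return s.heap[:size]
}

func main() {
    var s slot
    copy(s.buf(5), "hello")               // small: no allocation
    copy(s.buf(4096), make([]byte, 4096)) // large: heap slot
    fmt.Println(len(s.buf(5)), len(s.buf(4096)))
}

A real version would also need the reserved words to be typed for the GC, which is the concern @randall77 raises below.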

@RLH
Contributor

RLH commented Nov 9, 2015

I suspect that we can also make the sudog slot solution lock-free: just cas in the value, possibly using a 2-word cas so we can get status bits as well.


@aclements
Member

@RLH, I think the sudog solution doesn't introduce any additional locks or even atomic ops over what we have now. Existing locking already ensures the sudog isn't going anywhere and there's no competition for writing to a slot in it since the sudog represents a single (goroutine, channel) pair.

@randall77
Contributor

The trouble with putting some space in the sudog is that it needs to be typed correctly for GC. So if we wanted 4 words of space, we'd need 16 sudog types to encode ptr/nonptr for each of those 4 slots.

This isn't a showstopper, but it complicates things like acquireSudog.

@randall77
Contributor

Of course, if we went the route of different sudog types we could make them exactly the required size.

@aclements
Member

@davecheney's email (https://groups.google.com/d/topic/golang-dev/SJ3LAW5ZXDs/discussion) got me thinking about this again; specifically @randall77's 5th option of shrinking stacks in place. The benefit is of course that this would turn a fairly expensive moving operation into a cheap metadata operation.

The concern with shrinking stacks in place is that it would increase fragmentation. However, I wonder how bad a problem this really is and to what extent we could mitigate it. I think the worst case here is a large number of goroutines starting and growing their stacks to 32K, then all shrinking their stacks to 2K. If we do this all in place, each 2K stack will pin a 32K span for stacks, a 16X fragmentation ratio. However, this is nothing new; you can do the same thing by starting lots of goroutines and selectively exiting them, or even by carefully allocating and releasing objects. But all of these situations naturally remedy themselves as a program continues doing real work. Fragmentation will decrease as other goroutines start (and use the free slots in the existing stack spans) or as the running goroutines grow their stacks (forcing them to move) or exit.

We could also actively mitigate fragmentation. We could move stacks if we detect large fragmentation, but I'd just as soon not. Alternatively, we could combat fragmentation generally by switching from the current size-segregated stack free lists to a buddy allocator. The constraints of stack allocation fit quite nicely with a buddy allocator. This would let one span be used for multiple stack size classes, reducing size calcification. It would efficiently merge neighboring free slots and break up large slots when necessary, while keeping allocation and freeing very efficient. It would only require 16 bits of metadata per stack span, which we could easily steal from several places. We could even do opportunistic in-place stack growth for small stacks. This wouldn't prevent the above worst-case fragmentation, but it would remedy it faster by allowing the free space to be used for other size classes.
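
For illustration only, the index math that makes a binary buddy allocator cheap to split and merge; this is a sketch, not the runtime's stack allocator. Two blocks of the same power-of-two size are buddies when their offsets differ only in that size bit, so finding the neighbor to merge with (or the halves after a split) is a single XOR:

package main

import "fmt"

// buddyOf returns the offset (within a span) of the buddy of the block
// at off with the given power-of-two size.
func buddyOf(off, size uintptr) uintptr {
    return off ^ size
}

func main() {
    // A 32K stack span split down to 2K stacks: the 2K blocks at offsets
    // 0 and 2048 are buddies; if both are free they merge into a 4K block
    // at offset 0, whose own buddy at size 4096 sits at offset 4096, and so on.
    fmt.Println(buddyOf(0, 2048))    // 2048
    fmt.Println(buddyOf(2048, 2048)) // 0
    fmt.Println(buddyOf(0, 4096))    // 4096
}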

@aclements
Member

Well that turned out to be hairier than I expected.

@rhysh, care to give https://go-review.googlesource.com/20044 a try? For me it reduces the STW time on your test program to under 2ms.

@gopherbot

CL https://golang.org/cl/20033 mentions this issue.

@gopherbot

CL https://golang.org/cl/20041 mentions this issue.

@gopherbot

CL https://golang.org/cl/20034 mentions this issue.

@gopherbot

CL https://golang.org/cl/20039 mentions this issue.

@gopherbot

CL https://golang.org/cl/20042 mentions this issue.

@gopherbot

CL https://golang.org/cl/20038 mentions this issue.

@gopherbot

CL https://golang.org/cl/20040 mentions this issue.

@gopherbot

CL https://golang.org/cl/20037 mentions this issue.

@gopherbot

CL https://golang.org/cl/20035 mentions this issue.

@gopherbot

CL https://golang.org/cl/20044 mentions this issue.

@rhysh
Contributor Author

rhysh commented Mar 3, 2016

Thanks @aclements, your changes result in a big improvement in mark termination time for my application, bringing it down to around 20ms with ~120k goroutines and ~500MB live heap (compared with 50ms at 90k goroutines and 330MB heap on go1.6). Some of the extra pause time may be attributable to #14419 or to #14406, which I did not mitigate while testing this change.

I've been testing with CL 20044 PS 2.

go1.6 with framepointer, ~90k goroutines:

gc 463 @23853.946s 0%: 14+69+49 ms clock, 478+716/606/779+1583 ms cpu, 652->653->334 MB, 669 MB goal, 36 P
gc 464 @23920.847s 0%: 5.9+66+48 ms clock, 190+716/573/826+1546 ms cpu, 653->653->334 MB, 669 MB goal, 36 P
gc 465 @23991.665s 0%: 5.5+68+48 ms clock, 177+646/575/780+1567 ms cpu, 651->652->334 MB, 668 MB goal, 36 P
gc 466 @24060.214s 0%: 6.8+75+50 ms clock, 185+939/657/439+1350 ms cpu, 651->651->333 MB, 668 MB goal, 36 P
gc 467 @24132.336s 0%: 5.7+64+53 ms clock, 185+796/559/650+1721 ms cpu, 650->651->333 MB, 667 MB goal, 36 P
gc 468 @24204.829s 0%: 6.4+80+61 ms clock, 205+778/665/805+1972 ms cpu, 650->650->333 MB, 666 MB goal, 36 P
gc 469 @24280.941s 0%: 5.1+69+56 ms clock, 165+645/578/808+1804 ms cpu, 650->652->332 MB, 667 MB goal, 36 P
gc 470 @24347.731s 0%: 5.3+80+53 ms clock, 172+754/689/611+1718 ms cpu, 647->649->332 MB, 664 MB goal, 36 P
gc 471 @24423.086s 0%: 10+74+48 ms clock, 326+595/635/875+1567 ms cpu, 647->649->331 MB, 663 MB goal, 36 P
gc 472 @24493.659s 0%: 5.5+69+50 ms clock, 176+1110/611/452+1616 ms cpu, 644->644->330 MB, 660 MB goal, 36 P

CL 20044 PS 2, devel +f9e73e9 Mon Feb 29 11:13:35 2016 -0500 with framepointer, ~120k goroutines:

gc 659 @42602.290s 0%: 4.6+108+21 ms clock, 74+1353/956/1017+345 ms cpu, 945->945->485 MB, 969 MB goal, 36 P
gc 660 @42655.022s 0%: 5.7+115+22 ms clock, 182+1215/1004/1070+709 ms cpu, 946->946->487 MB, 970 MB goal, 36 P
gc 661 @42701.260s 0%: 5.1+111+22 ms clock, 165+1230/967/1143+718 ms cpu, 951->952->488 MB, 975 MB goal, 36 P
gc 662 @42748.210s 0%: 5.0+106+19 ms clock, 100+1417/929/914+391 ms cpu, 953->953->489 MB, 977 MB goal, 36 P
gc 663 @42801.338s 0%: 8.2+98+17 ms clock, 265+1200/867/1211+569 ms cpu, 954->956->490 MB, 978 MB goal, 36 P
gc 664 @42847.753s 0%: 4.0+123+24 ms clock, 130+1347/948/984+796 ms cpu, 954->956->490 MB, 978 MB goal, 36 P
gc 665 @42891.652s 0%: 5.7+93+18 ms clock, 184+1097/821/1143+582 ms cpu, 957->957->490 MB, 981 MB goal, 36 P
gc 666 @42947.239s 0%: 8.0+109+18 ms clock, 258+1210/941/1224+577 ms cpu, 957->960->492 MB, 981 MB goal, 36 P
gc 667 @42993.313s 0%: 11+105+20 ms clock, 368+1298/908/1167+644 ms cpu, 958->959->492 MB, 982 MB goal, 36 P
gc 668 @43047.532s 0%: 10+102+19 ms clock, 325+1245/896/1163+636 ms cpu, 960->962->493 MB, 984 MB goal, 36 P

mk0x9 pushed a commit to mk0x9/go that referenced this issue Mar 14, 2016
Currently copystack adjusts pointers in the old stack and then copies
the adjusted stack to the new stack. In addition to being generally
confusing, this is going to make concurrent stack shrinking harder.

Switch this around so that we first copy the stack and then adjust
pointers on the new stack (never writing to the old stack).

This reprises CL 15996, but takes a different and simpler approach. CL
15996 still walked the old stack while adjusting pointers on the new
stack. In this CL, we adjust auxiliary structures before walking the
stack, so we can just walk the new stack.

For golang#12967.

Change-Id: I94fa86f823ba9ee478e73b2ba509eed3361c43df
Reviewed-on: https://go-review.googlesource.com/20033
Reviewed-by: Rick Hudson <rlh@golang.org>
gopherbot pushed a commit that referenced this issue Mar 14, 2016
In particular, write down the rules for stack ownership because the
details of this are about to get very important with concurrent stack
shrinking. (Interestingly, the details don't actually change, but
anything that's currently skating on thin ice is likely to fall
through.)

For #12967.

Change-Id: I561e2610e864295e9faba07717a934aabefcaab9
Reviewed-on: https://go-review.googlesource.com/20034
Reviewed-by: Rick Hudson <rlh@golang.org>
gopherbot pushed a commit that referenced this issue Mar 16, 2016
Given a G, there's currently no way to find the channel it's blocking
on. We'll need this information to fix a (probably theoretical) bug in
select and to implement concurrent stack shrinking, so record the
channel in the sudog.

For #12967.

Change-Id: If8fb63a140f1d07175818824d08c0ebeec2bdf66
Reviewed-on: https://go-review.googlesource.com/20035
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
gopherbot pushed a commit that referenced this issue Mar 16, 2016
Currently the g.waiting list created by a select is in poll order.
However, nothing depends on this, and we're going to need access to
the channel lock order in other places shortly, so modify select to
put the waiting list in channel lock order.

For #12967.

Change-Id: If0d38816216ecbb37a36624d9b25dd96e0a775ec
Reviewed-on: https://go-review.googlesource.com/20037
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
gopherbot pushed a commit that referenced this issue Mar 16, 2016
gopark calls the unlock function after setting the G to _Gwaiting.
This means it's generally unsafe to access the G's stack from the
unlock function because the G may start running on another P. Once we
start shrinking stacks concurrently, a stack shrink could also move
the stack the moment after it enters _Gwaiting and before the unlock
function is called.

Document this restriction and fix the two places where we currently
violate it.

This is unlikely to be a problem in practice for these two places
right now, but they're already skating on thin ice. For example, the
following sequence could in principle cause corruption, deadlock, or a
panic in the select code:

On M1/P1:
1. G1 selects on channels A and B.
2. selectgoImpl calls gopark.
3. gopark puts G1 in _Gwaiting.
4. gopark calls selparkcommit.
5. selparkcommit releases the lock on channel A.

On M2/P2:
6. G2 sends to channel A.
7. The send puts G1 in _Grunnable and puts it on P2's run queue.
8. The scheduler runs, selects G1, puts it in _Grunning, and resumes G1.
9. On G1, the sellock immediately following the gopark gets called.
10. sellock grows and moves the stack.

On M1/P1:
11. selparkcommit continues to scan the lock order for the next
channel to unlock, but it's now reading from a freed (and possibly
reused) stack.

This shouldn't happen in practice because step 10 isn't the first call
to sellock, so the stack should already be big enough. However, once
we start shrinking stacks concurrently, this reasoning won't work any
more.

For #12967.

Change-Id: I3660c5be37e5be9f87433cb8141bdfdf37fadc4c
Reviewed-on: https://go-review.googlesource.com/20038
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
gopherbot pushed a commit that referenced this issue Mar 16, 2016
With concurrent stack shrinking, the stack can move the instant after
a G enters _Gwaiting. There are only two places that put a G into
_Gwaiting: gopark and newstack. We fixed uses of gopark. This commit
fixes newstack by simplifying its G transitions and, in particular,
eliminating or narrowing the transient _Gwaiting states it passes
through so it's clear nothing in the G is accessed while in _Gwaiting.

For #12967.

Change-Id: I2440ead411d2bc61beb1e2ab020ebe3cb3481af9
Reviewed-on: https://go-review.googlesource.com/20039
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
gopherbot pushed a commit that referenced this issue Mar 16, 2016
Currently sudog.elem is never accessed concurrently, so in several
cases we drop the channel lock just before reading/writing the
sent/received value from/to sudog.elem. However, concurrent stack
shrinking is going to have to adjust sudog.elem to point to the new
stack, which means it needs a way to synchronize with accesses to
sudog.elem. Hence, add sudog.elem to the fields protected by
hchan.lock and scoot the unlocks down past the uses of sudog.elem.

While we're here, better document the channel synchronization rules.

For #12967.

Change-Id: I3ad0ca71f0a74b0716c261aef21b2f7f13f74917
Reviewed-on: https://go-review.googlesource.com/20040
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
gopherbot pushed a commit that referenced this issue Mar 16, 2016
Currently, locking a G's stack by setting its status to _Gcopystack or
_Gscan is unordered with respect to channel locks. However, when we
make stack shrinking concurrent, stack shrinking will need to lock the
G and then acquire channel locks, which imposes an order on these.

Document this lock ordering and fix closechan to respect it.
Everything else already happens to respect it.

For #12967.

Change-Id: I4dd02675efffb3e7daa5285cf75bf24f987d90d4
Reviewed-on: https://go-review.googlesource.com/20041
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
gopherbot pushed a commit that referenced this issue Mar 16, 2016
Currently shrinkstack is only safe during STW because it adjusts
channel-related stack pointers and moves send/receive stack slots
without synchronizing with the channel code. Make it safe to use when
the world isn't stopped by:

1) Locking all channels the G is blocked on while adjusting the sudogs
   and copying the area of the stack that may contain send/receive
   slots.

2) For any stack frames that may contain send/receive slots, using an
   atomic CAS to adjust pointers to prevent races between adjusting a
   pointer in a receive slot and a concurrent send writing to that
   receive slot.

In principle, the synchronization could be finer-grained. For example,
we considered synchronizing around the sudogs, which would allow
channel operations involving other Gs to continue if the G being
shrunk was far enough down the send/receive queue. However, using the
channel lock means no additional locks are necessary in the channel
code. Furthermore, the stack shrinking code holds the channel lock for
a very short time (much less than the time required to shrink the
stack).

This does not yet make stack shrinking concurrent; it merely makes
doing so safe.

This has negligible effect on the go1 and garbage benchmarks.

For #12967.

Change-Id: Ia49df3a8a7be4b36e365aac4155a2416b94b988c
Reviewed-on: https://go-review.googlesource.com/20042
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
aclements modified the milestones: Go1.7, Unplanned Mar 18, 2016
golang locked and limited conversation to collaborators Mar 19, 2017