Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: tight loops should be preemptible #10958

Closed
aclements opened this issue May 26, 2015 · 157 comments
Closed

runtime: tight loops should be preemptible #10958

aclements opened this issue May 26, 2015 · 157 comments
Labels
early-in-cycle A change that should be done early in the 3 month dev cycle. FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@aclements
Copy link
Member

Currently goroutines are only preemptible at function call points. Hence, it's possible to write a tight loop (e.g., a numerical kernel or a spin on an atomic) with no calls or allocation that arbitrarily delays preemption. This can result in arbitrarily long pause times as the GC waits for all goroutines to stop.

In unusual situations, this can even lead to deadlock when trying to stop the world. For example, the runtime's TestGoroutineParallelism tries to prevent GC during the test to avoid deadlock. It runs several goroutines in tight loops that communicate through a shared atomic variable. If the coordinator that starts these is paused part way through, it will deadlock.

One possible fix is to insert preemption points on control flow loops that otherwise have no preemption points. Depending on the cost of this check, it may also require unrolling such loops to amortize the overhead of the check.

This has been a longstanding issue, so I don't think it's necessary to fix it for 1.5. It can cause the 1.5 GC to miss its 10ms STW goal, but code like numerical kernels is probably not latency-sensitive. And as far as I can tell, causing a deadlock like the runtime test can do requires looking for trouble.

@RLH

@aclements aclements added this to the Go1.6 milestone May 26, 2015
@minux
Copy link
Member

minux commented May 26, 2015 via email

@randall77
Copy link
Contributor

The problem with interrupts is that the goroutine can be at an arbitrary address. The interrupted address is probably at a place where we have no data about where the pointers are. We'd either have to record a lot more frame layout/type/liveness information (including registers), or we'd have to simulate the goroutine forward to a safepoint. Both are challenging.

@minux
Copy link
Member

minux commented May 26, 2015 via email

@ianlancetaylor
Copy link
Contributor

It seems to me that we can preempt at a write barrier. So the only loops we are talking about are those that make no writes to the heap and make no function calls. If we think in terms of adding

    uint8 b
  loop:
    ++b
    if b == 0 {
        preemptCheck()
    }

then the normal path through the loop will have two extra instructions (add/beq) where the add may be to either a register or a memory location, depending on overall register pressure. This will be measurable in tight loops but for most cases shouldn't be too bad.

@randall77
Copy link
Contributor

No pointer writes to the heap.

But maybe this is enough? If we can preempt at the first write barrier, then even if we let the mutator run it can't modify anything that the GC cares about. So maybe we just set the write barrier enabled bit and let goroutines run. Once a goroutine sees the write barrier bit (how do we force that?) it can't modify any memory the GC cares about. So it is safe to start the GC without waiting for every goroutine to stop.

@aclements
Copy link
Member Author

The problem of inserting artificial preemption points is reduced
numerical performance.

True. That's why I suggested unrolling loops, which AFAIK is a standard solution to this problem.

However, I think the check can be done in just two instructions, even without adding Ian's loop counter. Just CMP the preempt flag and branch if it's set. That branch will almost never be hit, so it will be highly predictable, and the preempt flag should be in the L1, so the check may in fact be extremely cheap.

No pointer writes to the heap. But maybe this is enough?

This certainly reduces the set of programs that have this problem, but I don't think it actually helps with either of the examples I gave, since numerical kernels probably won't have heap pointer writes, and the runtime test that can deadlock the GC certainly has no pointer writes.

@RLH
Copy link
Contributor

RLH commented May 27, 2015

This is the reference for doing a GC safepoint at every instruction. It is
not something we want to do for x86 much less the various other
architectures we need to support.
http://dl.acm.org/citation.cfm?id=301652

Most systems I know of do what Austin suggests, unroll the loop and insert
a check. No numbers but 8 unrolls seems to be what I recall. Not only is
the branch highly predictable but the check does not introduce any
dependencies making it close to free on an out of order machine. There have
been other approaches such as code plugging and predicated branches but HW
has moved on. I had not seen Ian's suggestion.

On Tue, May 26, 2015 at 9:24 PM, Austin Clements notifications@github.com
wrote:

The problem of inserting artificial preemption points is reduced
numerical performance.

True. That's why I suggested unrolling loops, which AFAIK is a standard
solution to this problem.

However, I think the check can be done in just two instructions, even
without adding Ian's loop counter. Just CMP the preempt flag and branch if
it's set. That branch will almost never be hit, so it will be highly
predictable, and the preempt flag should be in the L1, so the check may in
fact be extremely cheap.

No pointer writes to the heap. But maybe this is enough?

This certainly reduces the set of programs that have this problem, but I
don't think it actually helps with either of the examples I gave, since
numerical kernels probably won't have heap pointer writes, and the runtime
test that can deadlock the GC certainly has no pointer writes.


Reply to this email directly or view it on GitHub
#10958 (comment).

@kingluo
Copy link

kingluo commented Sep 9, 2015

@aclements,

Currently goroutines are only preemptible at function call points.

To be precise, the preemption check seems only at newstack()?

For example,

package main

import (
    "fmt"
    "runtime"
    "time"
)

func test() {
    a := 100
    for i := 1; i < 1000; i++ {
        a = i*100/i + a
    }
}

func main() {
    runtime.GOMAXPROCS(1)
    go func() {
        for {
            test()
        }
    }()
    time.Sleep(100 * time.Millisecond)
    fmt.Println("hello world")
}

The test() is not inline, and the infinite loop calls test(), but it would not preempt at the calls, because the morestack() --> newstack() not involved, so the "hello world" would never be printed.

@randall77
Copy link
Contributor

test() is not inlined, but since it is a leaf it is promoted to nosplit, so it doesn't have the preemption check any more. Fixing that sounds much easier than the rest of this bug. Maybe we could forbid nosplit promotion for functions called from loops?

@griesemer
Copy link
Contributor

For the reference, the same problem appeared in the HotSpot JVM. I remember two approaches:

  1. Around 1997/1998, Rene Schmidt (http://jaoo.dk/aarhus2007/speaker/Rene+W.+Schmidt) implemented the following mechanism: Threads running a tight loop w/o function calls would receive a signal to temporarily suspend them. The runtime then dynamically generated a partial "copy" of the loop instructions the thread was running, from the current PC to the (unconditional) backward branch, except that the backward branch was replaced with a call into the runtime (leading to proper suspension at a safe point). The thread was then restarted with the pc modified such that it would run the newly generated piece of code. That code would run to the end of the loop body and then suspend itself at which point the code copy was discarded and the pc (return address) adjusted to continue with the original code (after the safe point).

This mechanism was ingenious but also insanely complicated. It was abandoned eventually for:

  1. A simple test and branch (2 additional instructions) at the end of a loop body which didn't contain any calls. As far as I recall, we didn't do any form of loop unrolling and the performance implications were not significant in the overall picture of a larger application.

My vote would be for 2) to start in loops where @ianlancetaylor's suggestion doesn't work. I suspect that for all but the smallest long-running inner loops (where unrolling might make sense independently), the performance penalty is acceptable.

If this is not good enough, and we don't want to pay the code size cost of unrolling a loop multiple times, here's another idea based on 1): Instead of the backward branch check as in 2) plus unrolling, keep the original loop, and generate (at compile time) a 2nd version of the loop body that ends in a runtime call to suspend itself instead of the backward branch. The code size cost is about the cost of having the loop unrolled twice. When the goroutine needs to run to a safe point, use a signal to temporarily suspend the goroutine, switch the pc to the corresponding pc in the copy of the loop body, continue with execution there and have the code suspend itself at a safe point. There's no dynamic code generation involved, generating the extra loop body is trivial at compile-time, the extra amount of code is less than with loop unrolling, and there's only a little bit of runtime work to modify the pc. The regular code would run at full speed if no garbage collection is needed.

@nightlyone
Copy link
Contributor

Another idea, similiar to @ianlancetaylor's idea, is to estimate the cost (in ns) of the loop and only check for required suspension every N loops through that loop body.

Once the compiler can unroll/unpeel, that check-every-N logic can be either inlined after the unrolled body, if unrolling is beneficial, or kept when unrolling makes no sense for that loop body.

Such logic also reads better when using debuggers.

@griesemer
Copy link
Contributor

@nightlyone The check-every-N seems more complex than just having two extra instructions (compare and branch). And if unrolling is done, that compare-and-branch is already needed only every N loop iterations only (where N is the number of times the loop was unrolled).

@nightlyone
Copy link
Contributor

@griesemer not sure how the test will avoid atomic loads, that's why I suggested the check-every-N.

So the pseudo-go-code before unrolling would look like this:

loop:   
        // loop-body        

        if counter > N {          
                counter = 0                       

                if need_stop = atomic.LoadBool(&runtime.getg().need_stop); need_stop {
                        runtime.Gosched()                                    
                }                                                               
        }                                                                                                                       
        counter++                                                                                                           
        goto loop

after unrolling it would look like

loop:   
        // loop-body0
        // loop-body2
        // ...
        // loop-bodyM        

        if counter > N/M {          
                counter = 0                       

                if need_stop = atomic.LoadBool(&runtime.getg().need_stop); need_stop {
                        runtime.Gosched()                                    
                }                                                               
        }                                                                                                                       
        counter++                                                                                                           
        goto loop

So the code inserted would be a constant overhead, but still run every N loops.

@aclements
Copy link
Member Author

The load doesn't have to be atomic. The current preemption check in the function prologue isn't atomic.

One tricky bit with the compare and branch is that the actual preemption point needs to have no live registers. Presumably, for performance reasons, we want the loop to be able to keep things in registers, so the code we branch to on a preempt has to flush any live registers before the preempt and reload them after the preempt. I don't think this will be particularly hard, but it is something to consider, since it might affect which stage of the compiler is responsible for generating this code.

@griesemer
Copy link
Contributor

@aclements Indeed. Which is why perhaps switching the pc to a 2nd version of the loop body that ends in a safe point might not be much more complex and permit the loop to run at full speed in the normal case.

@aclements
Copy link
Member Author

@aclements Indeed. Which is why perhaps switching the pc to a 2nd version of the loop body that ends in a safe point might not be much more complex and permit the loop to run at full speed in the normal case.

From the runtime's perspective, I think this would be more complicated because stealing a signal is a logistic pain and we'd have to deal with tables mapping from fast loop PCs to slow loop PCs. The compiler would have to generate these tables. This seems like a very clever plan B, but I think first we should trying adding a no-op compare and branch and see if it's actually a problem for dense numerical kernels.

@randall77
Copy link
Contributor

The new SSA register allocator handles situations like this well, keeping everything in registers on the common edges and spill/call/restore on the unlikely case.

@RLH
Copy link
Contributor

RLH commented Sep 11, 2015

The load is not dependent on anything in the loop. Assuming the loop is
tight, the value is almost certainly to a location already in the L1 cache,
the branch will be highly predictable. Just to make sure the branch
predictor doesn't even need to be warmed up, we can make it a backward
branch. I would be sort of surprised that if on an out of order machine the
cost would even be noticed. That said build and measure is the only way to
be sure.

On Fri, Sep 11, 2015 at 3:38 PM, Keith Randall notifications@github.com
wrote:

The new SSA register allocator handles situations like this well, keeping
everything in registers on the common edges and spill/call/restore on the
unlikely case.


Reply to this email directly or view it on GitHub
#10958 (comment).

@gopherbot
Copy link

CL https://golang.org/cl/19516 mentions this issue.

gopherbot pushed a commit that referenced this issue Feb 16, 2016
TestCrashDumpsAllThreads carefully sets the number of Ps to one
greater than the number of non-preemptible loops it starts so that the
main goroutine can continue to run (necessary because of #10958).
However, if GC starts, it can take over that one spare P and lock up
the system while waiting for the non-preemptible loops, causing the
test to eventually time out. This deadlock is easily reproducible if
you run the runtime test with GOGC=1.

Fix this by forcing GOGC=off when running this test.

Change-Id: Ifb22da5ce33f9a61700a326ea92fcf4b049721d1
Reviewed-on: https://go-review.googlesource.com/19516
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
gopherbot pushed a commit that referenced this issue Nov 2, 2019
This adds asynchronous preemption function for amd64 and 386. These
functions spill and restore all register state that can be used by
user Go code.

For the moment we stub out the other arches.

For #10958, #24543.

Change-Id: I6f93fabe9875f4834922a5712362e79045c00aca
Reviewed-on: https://go-review.googlesource.com/c/go/+/201759
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
gopherbot pushed a commit that referenced this issue Nov 2, 2019
This adds support for scanning the stack when a goroutine is stopped
at an async safe point. This is not yet lit up because asyncPreempt is
not yet injected, but prepares us for that.

This works by conservatively scanning the registers dumped in the
frame of asyncPreempt and its parent frame, which was stopped at an
asynchronous safe point.

Conservative scanning works by only marking words that are pointers to
valid, allocated heap objects. One complication is pointers to stack
objects. In this case, we can't determine if the stack object is still
"allocated" or if it was freed by an earlier GC. Hence, we need to
propagate the conservative-ness of scanning stack objects: if all
pointers found to a stack object were found via conservative scanning,
then the stack object itself needs to be scanned conservatively, since
its pointers may point to dead objects.

For #10958, #24543.

Change-Id: I7ff84b058c37cde3de8a982da07002eaba126fd6
Reviewed-on: https://go-review.googlesource.com/c/go/+/201761
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
gopherbot pushed a commit that referenced this issue Nov 2, 2019
This adds support for pausing a running G by sending a signal to its
M.

The main complication is that we want to target a G, but can only send
a signal to an M. Hence, the protocol we use is to simply mark the G
for preemption (which we already do) and send the M a "wake up and
look around" signal. The signal checks if it's running a G with a
preemption request and stops it if so in the same way that stack check
preemptions stop Gs. Since the preemption may fail (the G could be
moved or the signal could arrive at an unsafe point), we keep a count
of the number of received preemption signals. This lets stopG detect
if its request failed and should be retried without an explicit
channel back to suspendG.

For #10958, #24543.

Change-Id: I3e1538d5ea5200aeb434374abb5d5fdc56107e53
Reviewed-on: https://go-review.googlesource.com/c/go/+/201760
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
gopherbot pushed a commit that referenced this issue Nov 2, 2019
This adds signal-based preemption to preemptone.

Since STW and forEachP ultimately use preemptone, this also makes
these work with async preemption.

This also makes freezetheworld more robust so tracebacks from fatal
panics should be far less likely to report "goroutine running on other
thread; stack unavailable".

For #10958, #24543. (This doesn't fix it yet because asynchronous
preemption only works on POSIX platforms on 386 and amd64 right now.)

Change-Id: If776181dd5a9b3026a7b89a1b5266521b95a5f61
Reviewed-on: https://go-review.googlesource.com/c/go/+/201762
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
gopherbot pushed a commit that referenced this issue Nov 2, 2019
This adds a test of preempting a loop containing no synchronous safe
points for STW and stack scanning.

We couldn't add this test earlier because it requires scheduler, STW,
and stack scanning preemption to all be working.

For #10958, #24543.

Change-Id: I73292db78ca3d14aab11bdafd26d03986920ef0a
Reviewed-on: https://go-review.googlesource.com/c/go/+/201777
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
@gopherbot
Copy link

Change https://golang.org/cl/204340 mentions this issue: runtime: support preemption on windows/{386,amd64}

@gopherbot
Copy link

Change https://golang.org/cl/207779 mentions this issue: runtime: ensure thread handle is valid in profileloop1

@gopherbot
Copy link

Change https://golang.org/cl/207961 mentions this issue: runtime: support non-cooperative preemption on windows/arm

gopherbot pushed a commit that referenced this issue Nov 20, 2019
On Windows, there is currently a race between unminit closing the
thread's handle and profileloop1 suspending the thread using its
handle. If another handle reuses the same handle value, this can lead
to unpredictable results.

To fix this, we protect the thread handle with a lock and duplicate it
under this lock in profileloop1 before using it.

This is going to become a much bigger problem with non-cooperative
preemption (#10958, #24543), which uses the same basic mechanism as
profileloop1.

Change-Id: I9d62b83051df8c03f3363344438e37781a69ce16
Reviewed-on: https://go-review.googlesource.com/c/go/+/207779
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Alex Brainman <alex.brainman@gmail.com>
gopherbot pushed a commit that referenced this issue Nov 20, 2019
This implements preemptM on Windows using SuspendThead and
ResumeThread.

Unlike on POSIX platforms, preemptM on Windows happens synchronously.
This means we need a make a few other tweaks to suspendG:

1. We need to CAS the G back to _Grunning before doing the preemptM,
   or there's a good chance we'll just catch the G spinning on its
   status in the runtime, which won't be preemptible.

2. We need to rate-limit preemptM attempts. Otherwise, if the first
   attempt catches the G at a non-preemptible point, the busy loop in
   suspendG may hammer it so hard that it never makes it past that
   non-preemptible point.

Updates #10958, #24543.

Change-Id: Ie53b098811096f7e45d864afd292dc9e999ce226
Reviewed-on: https://go-review.googlesource.com/c/go/+/204340
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
@gopherbot
Copy link

Change https://golang.org/cl/208218 mentions this issue: runtime: stress testing for non-cooperative preemption

@aclements
Copy link
Member Author

Go 1.14 will handle loop preemption far better than previous versions, but there are still some loose ends to tie up in 1.15, so I'm bumping this issue to 1.15.

@aclements
Copy link
Member Author

Actually, rather than bump this to 1.15, since tight loops are preemptible now, I've created a new issue to keep track of loose ends (#36365) and will go ahead and close this issue as fixed.

@gopherbot
Copy link

Change https://golang.org/cl/213837 mentions this issue: runtime: protect against external code calling ExitProcess

gopherbot pushed a commit that referenced this issue Jan 9, 2020
On Windows, we implement asynchronous preemption using SuspendThread
to suspend other threads in our process. However, SuspendThread is
itself actually asynchronous (it enqueues a kernel "asynchronous
procedure call" and returns). Unfortunately, Windows' ExitProcess API
kills all threads except the calling one and then runs APCs. As a
result, if SuspendThread and ExitProcess are called simultaneously,
the exiting thread can be suspended and the suspending thread can be
exited, leaving behind a ghost process consisting of a single thread
that's suspended.

We've already protected against the runtime's own calls to
ExitProcess, but if Go code calls external code, there's nothing
stopping that code from calling ExitProcess. For example, in #35775,
our own call to racefini leads to C code calling ExitProcess and
occasionally causing a deadlock.

This CL fixes this by introducing synchronization between calling
external code on Windows and preemption. It adds an atomic field to
the M that participates in a simple CAS-based synchronization protocol
to prevent suspending a thread running external code. We use this to
protect cgocall (which is used for both cgo calls and system calls on
Windows) and racefini.

Tested by running the flag package's TestParse test compiled in race
mode in a loop. Before this change, this would reliably deadlock after
a few minutes.

Fixes #35775.
Updates #10958, #24543.

Change-Id: I50d847abcdc2688b4f71eee6a75eca0f2fee892c
Reviewed-on: https://go-review.googlesource.com/c/go/+/213837
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Reviewed-by: David Chase <drchase@google.com>
@golang golang locked and limited conversation to collaborators Jan 7, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
early-in-cycle A change that should be done early in the 3 month dev cycle. FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests