Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/exec: tests hang on macOS due to Apple libc fork bugs #56784

Closed
rsc opened this issue Nov 17, 2022 · 14 comments
Closed

os/exec: tests hang on macOS due to Apple libc fork bugs #56784

rsc opened this issue Nov 17, 2022 · 14 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. OS-Darwin release-blocker
Milestone

Comments

@rsc
Copy link
Contributor

rsc commented Nov 17, 2022

On my x86 Mac laptop using macOS 12.6.1, all.bash often hangs in the os/exec test. In particular, this never finishes:

% cd go/src/os/exec
% for (i in `{seq 100}) go test -short -count=1

The chance of a hang in any given iteration is something like 50%. It's possible this is related to #33565, but I'm opening a separate bug just in case, and to focus the discussion on the fact that our own os/exec tests don't pass.

If I attach to the hung process in lldb, I was originally seeing backtraces like:

  * frame #0: 0x00007ff801638f85 libsystem_platform.dylib`_os_once_gate_corruption_abort + 23
    frame #1: 0x00007ff8016347c1 libsystem_platform.dylib`_os_once_gate_wait + 212
    frame #2: 0x00007ff8016327e9 libsystem_platform.dylib`_os_alloc_once + 42
    frame #3: 0x00007ff804089144 libsystem_notify.dylib`_notify_fork_child + 349
    frame #4: 0x00007ff80c41ac89 libSystem.B.dylib`libSystem_atfork_child + 58
    frame #5: 0x00007ff80151382d libsystem_c.dylib`fork + 84
    frame #6: 0x000000000106d55f exec.test`runtime.syscall.abi0 + 31
    frame #7: 0x000000000106b3e4 exec.test`runtime.asmcgocall.abi0 + 100
    frame #8: 0x000000000106824b exec.test`syscall.rawSyscall + 139
    frame #9: 0x00000000010771b0 exec.test`syscall.forkAndExecInChild + 240

This specific hang seems to match dart-lang/sdk#29539, and inspection of the Apple libc code shows that the problem is a race with an os_alloc_once that is in progress in the parent when the address space is split, making the same call die in the child. I changed the Go runtime to do an early call to notify_is_valid_token(0) in osinit. That call is a no-op except that it guarantees the os_alloc_once has been done already, so it cannot race with any future forks.

With that fix, I get a different hang:

  * frame #0: 0x00007ff8015e53ea libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007ff80161fa6f libsystem_pthread.dylib`_pthread_cond_wait + 1249
    frame #2: 0x00007ff8014c2aca libobjc.A.dylib`WAITING_FOR_ANOTHER_THREAD_TO_FINISH_CALLING_+initialize + 115
    frame #3: 0x00007ff8014b4194 libobjc.A.dylib`initializeNonMetaClass + 646
    frame #4: 0x00007ff8014b3f6a libobjc.A.dylib`initializeNonMetaClass + 92
    frame #5: 0x00007ff8014b3f6a libobjc.A.dylib`initializeNonMetaClass + 92
    frame #6: 0x00007ff8014b3c18 libobjc.A.dylib`initializeAndMaybeRelock(objc_class*, objc_object*, mutex_tt<false>&, bool) + 232
    frame #7: 0x00007ff8014b3995 libobjc.A.dylib`lookUpImpOrForward + 1087
    frame #8: 0x00007ff8014b2f9b libobjc.A.dylib`_objc_msgSend_uncached + 75
    frame #9: 0x00007ff80137145e libxpc.dylib`xpc_atfork_child + 125
    frame #10: 0x00007ff80c41ac8e libSystem.B.dylib`libSystem_atfork_child + 63
    frame #11: 0x00007ff80151382d libsystem_c.dylib`fork + 84
    frame #12: 0x000000000106d5df exec.test`runtime.syscall.abi0 + 31
    frame #13: 0x000000000106b444 exec.test`runtime.asmcgocall.abi0 + 100
    frame #14: 0x0000000001069589 exec.test`runtime.systemstack.abi0 + 73
    frame #15: 0x00000000010682ab exec.test`syscall.rawSyscall + 139
    frame #16: 0x0000000001077250 exec.test`syscall.forkAndExecInChild + 240

This one seems to match what @jacobvosmaer posted in #33565 (comment).

I can't find the libobjc source code so I'm not sure what a workaround for xpc_atfork_child might be.

It must be that C programs on macOS do not use fork. I looked into posix_spawn but it looks like we don't have any other ports that use that.

We need to figure something out for Go 1.20 though.

@rsc rsc added NeedsFix The path to resolution is known, but the work has not been done. release-blocker labels Nov 17, 2022
@rsc rsc added this to the Go1.20 milestone Nov 17, 2022
@randall77
Copy link
Contributor

My Mac Pro (x86) running 13.0.1 has no trouble, with either Go 1.19.3 or tip.

@rsc
Copy link
Contributor Author

rsc commented Nov 17, 2022

Updating my laptop to 13.0 did not work.
It did work - 100 PASS in a row - to add another osinit-time call, this one to xpc_atfork_child.
That seems to warm up whatever fast paths it will need in the child to avoid deadlocks.
As I write this I've been offered 13.0.1 so I will try that update just to double-check that this is really needed.

@gopherbot
Copy link

Change https://go.dev/cl/451735 mentions this issue: runtime: work around Apple libc bugs to make exec stop hanging

@rsc
Copy link
Contributor Author

rsc commented Nov 17, 2022

13.0.1 didn't help. Same problem on my M1 MacBook Pro with 12.0.1. go.dev/cl/451735 fixes them both.

@rsc
Copy link
Contributor Author

rsc commented Nov 18, 2022

We have since discovered that what is unique about my laptop compared to others is that I was running my tests from under a program written and linked against CoreFoundation, which put a magic __CF_USER_TEXT_ENCODING=0x552D:0x0:0x0 variable in my environment. (The 0x552D is my uid.) The os/exec test is linked against CoreFoundation too. If I clear that variable or mangle it to have the wrong uid, causing it to be recomputed, then different things happen at startup and seem to bump things around a bit so that these two fork hangs are far less likely.

We'll keep the fix, it's just even more mysterious now.

@jacobvosmaer
Copy link

I filed the related issue #33565. I just checked, and I also have this __CF_USER_TEXT_ENCODING environment variable set in my development environment (plan9port Acme).

This does not tell us anything new but it corroborates what Russ wrote above.

@rsc
Copy link
Contributor Author

rsc commented Nov 18, 2022

@gopherbot please backport

@gopherbot
Copy link

Backport issue(s) opened: #56836 (for 1.18), #56837 (for 1.19).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@gopherbot
Copy link

Change https://go.dev/cl/459175 mentions this issue: runtime: revert Apple libc atfork workaround

@gopherbot
Copy link

Change https://go.dev/cl/459176 mentions this issue: runtime: call __fork instead of fork on darwin

gopherbot pushed a commit that referenced this issue Dec 22, 2022
Revert CL 451735 (1f4394a), which fixed #33565 and #56784
but also introduced #57263.

I have a different fix to apply instead. Since the first fix was
never backported, it will be easiest to backport the new fix
if the new fix is done in a separate CL from the revert.

Change-Id: I6c8ea3a46e542ee4702675bbc058e29ccd2723e0
Reviewed-on: https://go-review.googlesource.com/c/go/+/459175
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
gopherbot pushed a commit that referenced this issue Dec 22, 2022
Issues #33565 and #56784 were caused by hangs in the child process
after fork, while it ran atfork handlers that ran into slow paths that
didn't work in the child.

CL 451735 worked around those two issues by calling a couple functions
at startup to try to warm up those child paths. That mostly worked,
but it broke programs using cgo with certain macOS frameworks (#57263).

CL 459175 reverted CL 451735.

This CL introduces a different fix: bypass the atfork child handlers
entirely. For a general fork call where the child and parent are both
meant to keep executing the original program, atfork handlers can be
necessary to fix any state that would otherwise be tied to the parent
process. But Go only uses fork as preparation for exec, and it takes
care to limit what it attempts to do in the child between the fork and
exec. In particular it doesn't use any of the things that the macOS
atfork handlers are trying to fix up (malloc, xpc, others). So we can
use the low-level fork system call (__fork) instead of the
atfork-wrapped one.

The full list of functions that can be called in a child after fork in
exec_libc2.go is:

 - ptrace
 - setsid
 - setpgid
 - getpid
 - ioctl
 - chroot
 - setgroups
 - setgid
 - setuid
 - chdir
 - dup2
 - fcntl
 - close
 - execve
 - write
 - exit

I disassembled all of these while attached to a hung exec.test binary
and confirmed that nearly all of them are making direct kernel calls,
not using anything that the atfork handler needs to fix up.
The exceptions are ioctl, fcntl, and exit.

The ioctl and fcntl implementations do some extra work around the
kernel call but don't call any other functions, so they should still
be OK. (If not, we could use __ioctl and __fcntl instead, but without
a good reason, we should keep using the standard entry points.)

The exit implementation calls atexit handlers. That is almost
certainly inappropriate in a failed fork child, so this CL changes
that call to __exit on darwin. To avoid making unnecessary changes at
this point in the release cycle, this CL leaves OpenBSD calling plain
exit, even though that is probably a bug in the OpenBSD port
(filed #57446).

Fixes #33565.
Fixes #56784.
Fixes #57263.

Change-Id: I26812c26a72bdd7fcf72ec41899ba11cf6b9c4ab
Reviewed-on: https://go-review.googlesource.com/c/go/+/459176
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
@gopherbot
Copy link

Change https://go.dev/cl/459178 mentions this issue: runtime: call __fork instead of fork on darwin

@gopherbot
Copy link

Change https://go.dev/cl/459179 mentions this issue: [release-branch.go1.18] runtime: call __fork instead of fork on darwin

gopherbot pushed a commit that referenced this issue Dec 22, 2022
Issues #33565 and #56784 were caused by hangs in the child process
after fork, while it ran atfork handlers that ran into slow paths that
didn't work in the child.

CL 451735 worked around those two issues by calling a couple functions
at startup to try to warm up those child paths. That mostly worked,
but it broke programs using cgo with certain macOS frameworks (#57263).

CL 459175 reverted CL 451735.

This CL introduces a different fix: bypass the atfork child handlers
entirely. For a general fork call where the child and parent are both
meant to keep executing the original program, atfork handlers can be
necessary to fix any state that would otherwise be tied to the parent
process. But Go only uses fork as preparation for exec, and it takes
care to limit what it attempts to do in the child between the fork and
exec. In particular it doesn't use any of the things that the macOS
atfork handlers are trying to fix up (malloc, xpc, others). So we can
use the low-level fork system call (__fork) instead of the
atfork-wrapped one.

The full list of functions that can be called in a child after fork in
exec_libc2.go is:

 - ptrace
 - setsid
 - setpgid
 - getpid
 - ioctl
 - chroot
 - setgroups
 - setgid
 - setuid
 - chdir
 - dup2
 - fcntl
 - close
 - execve
 - write
 - exit

I disassembled all of these while attached to a hung exec.test binary
and confirmed that nearly all of them are making direct kernel calls,
not using anything that the atfork handler needs to fix up.
The exceptions are ioctl, fcntl, and exit.

The ioctl and fcntl implementations do some extra work around the
kernel call but don't call any other functions, so they should still
be OK. (If not, we could use __ioctl and __fcntl instead, but without
a good reason, we should keep using the standard entry points.)

The exit implementation calls atexit handlers. That is almost
certainly inappropriate in a failed fork child, so this CL changes
that call to __exit on darwin. To avoid making unnecessary changes at
this point in the release cycle, this CL leaves OpenBSD calling plain
exit, even though that is probably a bug in the OpenBSD port
(filed #57446).

Fixes #33565.
Fixes #56784.
Fixes #57263.

Fixes #56837.

Change-Id: I26812c26a72bdd7fcf72ec41899ba11cf6b9c4ab
Reviewed-on: https://go-review.googlesource.com/c/go/+/459176
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-on: https://go-review.googlesource.com/c/go/+/459178
gopherbot pushed a commit that referenced this issue Dec 22, 2022
Issues #33565 and #56784 were caused by hangs in the child process
after fork, while it ran atfork handlers that ran into slow paths that
didn't work in the child.

CL 451735 worked around those two issues by calling a couple functions
at startup to try to warm up those child paths. That mostly worked,
but it broke programs using cgo with certain macOS frameworks (#57263).

CL 459175 reverted CL 451735.

This CL introduces a different fix: bypass the atfork child handlers
entirely. For a general fork call where the child and parent are both
meant to keep executing the original program, atfork handlers can be
necessary to fix any state that would otherwise be tied to the parent
process. But Go only uses fork as preparation for exec, and it takes
care to limit what it attempts to do in the child between the fork and
exec. In particular it doesn't use any of the things that the macOS
atfork handlers are trying to fix up (malloc, xpc, others). So we can
use the low-level fork system call (__fork) instead of the
atfork-wrapped one.

The full list of functions that can be called in a child after fork in
exec_libc2.go is:

 - ptrace
 - setsid
 - setpgid
 - getpid
 - ioctl
 - chroot
 - setgroups
 - setgid
 - setuid
 - chdir
 - dup2
 - fcntl
 - close
 - execve
 - write
 - exit

I disassembled all of these while attached to a hung exec.test binary
and confirmed that nearly all of them are making direct kernel calls,
not using anything that the atfork handler needs to fix up.
The exceptions are ioctl, fcntl, and exit.

The ioctl and fcntl implementations do some extra work around the
kernel call but don't call any other functions, so they should still
be OK. (If not, we could use __ioctl and __fcntl instead, but without
a good reason, we should keep using the standard entry points.)

The exit implementation calls atexit handlers. That is almost
certainly inappropriate in a failed fork child, so this CL changes
that call to __exit on darwin. To avoid making unnecessary changes at
this point in the release cycle, this CL leaves OpenBSD calling plain
exit, even though that is probably a bug in the OpenBSD port
(filed #57446).

Fixes #33565.
Fixes #56784.
Fixes #57263.
Fixes #56836.

Change-Id: I26812c26a72bdd7fcf72ec41899ba11cf6b9c4ab
Reviewed-on: https://go-review.googlesource.com/c/go/+/459176
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-on: https://go-review.googlesource.com/c/go/+/459179
@gopherbot
Copy link

Change https://go.dev/cl/460476 mentions this issue: runtime: Apple libc atfork workaround take 3

gopherbot pushed a commit that referenced this issue Jan 10, 2023
CL 451735 worked around bugs in Apple's atfork handlers by calling
notify_is_valid_token and xpc_atfork_child at startup, so that init
code that wouldn't be safe in the child process would be warmed up in
the parent process instead, but xpc_atfork_child broke use of the xpc
library in Go programs, and xpc is internally used by various macOS
frameworks (#57263).

CL 459175 reverted that change, and then CL 459176 tried a new
approach: use __fork, which doesn't call any of the atfork handlers at all.
That worked, but an Apple engineer reviewing the change in private
email suggests that since __fork is not public API, it should be avoided.
The same engineer (with access to the source code for the xpc library)
suggests that the breakage in #57263 is caused by xpc_atfork_child
marking the library as unusable, expecting an imminent call to exec,
and that calling xpc_date_create_from_current instead would do the
necessary initialization without marking xpc as unusable.

CL 460475 reverted that change, to prepare for this one.

This CL goes back to the original “call functions to warm things up”
approach, replacing xpc_atfork_child with xpc_date_create_from_current.

The CL also updates cmd/link to use OS and SDK version 10.13.0 for
x86 macOS binaries, up from 10.9.0, also suggested by the Apple engineer.
Combined with the two warmup calls, this makes the fork hangs go away.
The minimum macOS version has been 10.13 High Sierra since Go 1.17,
so there should be no problem with writing that in the binaries too.

Fixes #33565.
Fixes #56784.
Fixes #57263.
Fixes #57577.

Change-Id: I20769d9daa1fe9ea930f8009481335f8a14dc21b
Reviewed-on: https://go-review.googlesource.com/c/go/+/460476
Auto-Submit: Russ Cox <rsc@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
@dwlnetnl
Copy link

There is source code available for the Objective-C runtime on opensource.apple.com which redirects to https://github.com/apple-oss-distributions/objc4

@golang golang locked and limited conversation to collaborators Jan 13, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. OS-Darwin release-blocker
Projects
None yet
Development

No branches or pull requests

6 participants