runtime: software floating point for GOARM=6, 7 (not only GOARM=5) #61588

Closed
ludi317 opened this issue Jul 26, 2023 · 34 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FeatureRequest Proposal Proposal-Accepted
Milestone

Comments

@ludi317
Contributor

ludi317 commented Jul 26, 2023

I want to run a Go binary on an ARMv7 target that doesn't have a hardware floating point unit (FPU). (The ARMv7 specification does not require a hardware FPU; it is optional.) Currently, the only way to use software floating point on ARM targets is to set GOARM=5, regardless of the actual ARM version of the target, whether 5, 6, or 7. If the decision of using software or hardware floating point were decoupled from the ARM version, then there would be no need to fall back to the ARMv5 instruction set on ARMv7 chips lacking a hardware FPU.

I request a new Go environment variable (perhaps GOARMFP=soft or hard) that could be used alongside GOARCH=arm and either GOARM=6 or GOARM=7 to specify software ("soft") or hardware ("hard") floating point. GOARM=5 would always imply software floating point.

Because this addresses an immediate business need, I have developed a working prototype for GOARM=7 with software floating point, and could make contributions toward this new setting.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 26, 2023
@cherrymui
Member

You can try using go build -gcflags=all=-d=softfloat, which should make all compiled code use softfloat. There might be some assembly code that uses floating point, which you might need to rewrite.

@mknyszek mknyszek added this to the Backlog milestone Jul 26, 2023
@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 26, 2023
@mknyszek mknyszek changed the title runtime: request for software floating point for GOARM=6, 7 (not only GOARM=5) proposal: runtime: software floating point for GOARM=6, 7 (not only GOARM=5) Jul 26, 2023
@mknyszek
Contributor

In triage, we think this needs to be a proposal. Since this isn't explicitly supported (and we don't have hardware for CI to test this configuration, or a test to make sure there aren't any FP instructions when setting the softfloat configuration) we'd have to make a decision to support it.

@mknyszek mknyszek added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Jul 26, 2023
@gopherbot gopherbot removed NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Jul 26, 2023
@ianlancetaylor ianlancetaylor modified the milestones: Backlog, Proposal Jul 26, 2023
@gopherbot

Change https://go.dev/cl/514907 mentions this issue: all: add GOARMFP env var for ARM floating point mode

@ludi317
Contributor Author

ludi317 commented Aug 1, 2023

In this comment, another user is forced to downgrade to GOARM=5 on an ARMv7 chip just to get soft floating point (#58686 comment).

An ARMv7 chip should execute ARMv7 instructions. Anything less leaves the CPU underutilized, and is a waste of resources.

To support this proposal, I have submitted a CL that can build GOARM=7 and GOARMFP=soft. Even if this proposal is not approved, I would greatly appreciate a review, or any feedback, on the CL. Thanks.

@cherrymui
Member

@ludi317 Have you tried the compiler flag -gcflags=all=-d=softfloat? If there is some assembly code that needs to be adjusted we could introduce a macro like -D softfloat that you can pass as -asmflags.

If we really want an environment variable for the go command, my counter proposal: use an existing variable, either GOARM=7,softfloat (see also #60072), or GOEXPERIMENT=softfloat (our softfloat implementation is largely architecture independent (except a small amount of assembly code), so may as well use an architecture independent flag).

@ludi317
Contributor Author

ludi317 commented Aug 3, 2023

@cherrymui I did try building with the compiler flag -gcflags=all=-d=softfloat (and commenting out this check). Unfortunately, the binary crashed with signal SIGILL. The assembly code does indeed need to be modified in a few places, as seen in my CL.

I am not particular about the API used to specify soft float for ARM, as long as there is one. If I were to choose, I'd suggest that if GOMIPS64 accepts a comma-separated list of options (as proposed in #60072), then it would make sense for GOARM to do the same. Your proposal to use GOARM=7,softfloat seems very reasonable.

@randall77
Contributor

Can you tell us what this chip is that is armv7 but without floating point? I am curious.

Since you have the change prototyped, what performance differences are you seeing between GOARM=5 and GOARM=7,softfloat? In the compiler at least, the differences I see are mostly bit manipulation instructions (find first bit, etc.). There may be some more in the runtime (memmove?).

@MDr164

MDr164 commented Aug 9, 2023

Can you tell us what this chip is that is armv7 but without floating point? I am curious.

The Aspeed AST2500, for example, is a chip that supports the armv6k instruction set but does not have a floating point unit, so we need to fall back to GOARM=5 for that one. Another one is the Broadcom BCM4708A0 armv7 SoC, which lacks floating point hardware. In general, a lot of the cheaper WiFi/AP/network appliances and deeply embedded SoCs come without an FPU, as it is often not really needed for the limited use case of the system.

@ludi317
Contributor Author

ludi317 commented Aug 9, 2023

Can you tell us what this chip is that is armv7 but without floating point? I am curious.

The chip is a BCM56160, and is found in a network switch. sysctl shows that the CPU is an ARM Cortex-A9, without an FPU:

root@martini48t-p2a-sys04:RE:0% sysctl hw.model hw.floatingpoint
hw.model: ARM Cortex-A9 r4p1 (ECO: 0x00000000)
hw.floatingpoint: 0

Since you have the change prototyped, what performance differences are you seeing between GOARM=5 and GOARM=7,softfloat?

I never measured the performance of our Go program when GOARM=5. Since the network switch is already CPU-bound, I was concerned that downgrading would only hurt performance.

In the compiler at least, the differences I see are mostly bit manipulation instructions (find first bit, etc.). There may be some more in the runtime (memmove?).

Yes, the runtime leverages ARMv7 features. One example is that when GOARM=7, the runtime opts for ARM-specific atomic operations (armCas64, armXadd64, armXchg64, armLoad64, armStore64).

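MOVB runtime·goarm(SB), R11 // load the GOARM version the binary was built with
CMP $7, R11
BLT 2(PC) // GOARM < 7: skip the next instruction
JMP armCas64<>(SB) // GOARM >= 7: native 64-bit compare-and-swap
JMP ·goCas64(SB) // otherwise: generic Go fallback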
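
For readers less familiar with Go assembly, the dispatch above can be sketched in plain Go. This is only an illustration of the control flow: the variable goarm and the functions nativeCas64 and lockedCas64 are hypothetical stand-ins, and the mutex-based fallback does not reproduce the runtime's real locking scheme.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// goarm stands in for the runtime's record of the GOARM version the binary
// was built with. The value here is hypothetical.
var goarm uint8 = 7

// fallbackMu is a stand-in for whatever lock a generic fallback would take.
var fallbackMu sync.Mutex

// lockedCas64 emulates a 64-bit compare-and-swap under a lock, as a target
// without native 64-bit atomic instructions would have to.
func lockedCas64(addr *uint64, old, newVal uint64) bool {
	fallbackMu.Lock()
	defer fallbackMu.Unlock()
	if *addr != old {
		return false
	}
	*addr = newVal
	return true
}

// nativeCas64 represents the path backed by hardware atomic instructions.
func nativeCas64(addr *uint64, old, newVal uint64) bool {
	return atomic.CompareAndSwapUint64(addr, old, newVal)
}

// cas64 mirrors the CMP/BLT/JMP sequence: GOARM >= 7 takes the native path,
// anything lower takes the fallback.
func cas64(addr *uint64, old, newVal uint64) bool {
	if goarm >= 7 {
		return nativeCas64(addr, old, newVal)
	}
	return lockedCas64(addr, old, newVal)
}

func main() {
	var v uint64 = 1
	ok := cas64(&v, 1, 2)
	fmt.Println("native path:", ok, v)

	goarm = 5
	ok = cas64(&v, 1, 3) // old value no longer matches, so no swap happens
	fmt.Println("fallback path:", ok, v)
}
```

The point of the sketch is that the choice depends on the build-time GOARM value, which is why a GOARM=5 build pays the fallback cost even on an ARMv7 chip.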

FWIW, the prototype has matured into a feature implementation that takes GOARM=7,softfloat as an argument. Using this new option, we have built binaries that work as expected on the switch. Please see the CL for the implementation.

Finally, I came across a comment from Russ Cox indicating that back in 2011, Go supported software floating point for GOARM > 5, by setting the -F flag.

@cherrymui
Member

Finally, I came across a comment from Russ Cox indicating that back in 2011, Go supported software floating point for GOARM > 5, by setting the -F flag.

The softfloat support in Go has been reworked since then. We used to handle it in the linker (5l at the time), at instruction level, which means it would also handle (Go) assembly code (but not cgo). Now we handle it in the compiler, with -gcflags=-d=softfloat, which means it doesn't handle assembly code. So we need a way to handle assembly code as well.

@randall77
Contributor

I'd really like to see some performance numbers of the difference between GOARM=5 and GOARM=7,softfloat. If there is little or no difference the whole point of this proposal is kind of moot.
It doesn't have to be on these strange chips. Any GOARM=7 capable chip could run some benchmarks in both modes and see. (You'd need to patch in the proposed CL for 7,softfloat support.)

@ludi317
Contributor Author

ludi317 commented Aug 26, 2023

@randall77 Please find the requested benchmarks comparing GOARM=5 and GOARM=7,softfloat below. Full source code here.

The benchmarks show many significant performance improvements, and only a few minor degradations. On the AtomicOperationsInt64 benchmark, GOARM=7,softfloat is more than 3x faster than GOARM=5.

goarch: arm
pkg: github.com/ludi317/arm-wrestle
                                  │ armv5_1cpu_raw.txt │       armv7soft_1cpu_raw.txt       │
                                  │       sec/op       │   sec/op     vs base               │
Float32Arithmetic                          4.944µ ± 1%   4.678µ ± 0%   -5.37% (p=0.002 n=6)
Int32Arithmetic                            15.67n ± 3%   15.65n ± 0%        ~ (p=0.318 n=6)
Float64Arithmetic                          3.905µ ± 0%   3.876µ ± 0%   -0.74% (p=0.002 n=6)
Int64Arithmetic                            29.06n ± 0%   29.07n ± 0%   +0.03% (p=0.015 n=6)
ANDconstBICconst                           52.53n ± 0%   52.55n ± 0%   +0.03% (p=0.035 n=6)
Uint64Move                                 22.35n ± 0%   22.36n ± 0%        ~ (p=1.000 n=6)
ADD                                        1.049µ ± 0%   1.009µ ± 0%   -3.81% (p=0.002 n=6)
ADDBICconst                                20.12n ± 0%   19.00n ± 0%   -5.57% (p=0.002 n=6)
ADDBICconstInt64                           29.07n ± 0%   27.94n ± 0%   -3.87% (p=0.002 n=6)
WithMulDAndMulF                           1029.0n ± 0%   986.2n ± 0%   -4.16% (p=0.002 n=6)
BitwiseInt32                               8.942n ± 0%   8.942n ± 0%        ~ (p=0.773 n=6)
BitwiseInt64                               13.42n ± 0%   13.42n ± 0%        ~ (p=1.000 n=6)
TrailingZeros                              43.59n ± 0%   30.18n ± 0%  -30.76% (p=0.002 n=6)
ProducerConsumerBufferedCh                 3.894µ ± 0%   3.603µ ± 0%   -7.46% (p=0.002 n=6)
ProducerConsumerBufferedChInt64            3.961µ ± 1%   3.631µ ± 0%   -8.33% (p=0.002 n=6)
ProducerConsumerUnBufferedCh               5.099µ ± 0%   4.701µ ± 0%   -7.81% (p=0.002 n=6)
ProducerConsumerUnBufferedChInt64          5.073µ ± 0%   4.634µ ± 0%   -8.65% (p=0.002 n=6)
GetCntxct                                  3.851µ ± 0%   3.578µ ± 0%   -7.10% (p=0.002 n=6)
CASInt32                                   158.9n ± 0%   160.9n ± 0%   +1.26% (p=0.002 n=6)
CASInt64                                   502.1n ± 0%   157.3n ± 3%  -68.66% (p=0.002 n=6)
CASUint64                                  502.1n ± 0%   157.5n ± 0%  -68.64% (p=0.002 n=6)
CASUint32                                  158.9n ± 0%   166.7n ± 0%   +4.91% (p=0.002 n=6)
CASUintptr                                 158.9n ± 0%   167.8n ± 3%   +5.60% (p=0.002 n=6)
AtomicOperationsInt64                      931.1n ± 0%   268.6n ± 0%  -71.15% (p=0.002 n=6)
AtomicOperationsInt32                      306.6n ± 0%   297.6n ± 0%   -2.92% (p=0.002 n=6)
AtomicOperationsUint64                     928.8n ± 0%   270.8n ± 0%  -70.84% (p=0.002 n=6)
AtomicOperationsUint32                     306.6n ± 0%   297.6n ± 0%   -2.92% (p=0.002 n=6)
AtomicOperationsUintptr                    308.8n ± 0%   304.4n ± 0%   -1.42% (p=0.002 n=6)
AtomicOperationsBool                       537.1n ± 0%   494.5n ± 0%   -7.93% (p=0.002 n=6)
geomean                                    300.4n        245.5n       -18.28%
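
To relate the "vs base" percentages in the table to "times faster" figures, the conversion can be sketched as follows. The helper name speedup is hypothetical; the formula is just new/old = 1 + delta/100, inverted.

```go
package main

import "fmt"

// speedup converts a benchstat-style "vs base" delta in percent (negative
// means less time per operation) into a "times faster" factor.
func speedup(deltaPercent float64) float64 {
	return 1 / (1 + deltaPercent/100)
}

func main() {
	// Deltas copied from the table above.
	fmt.Printf("AtomicOperationsInt64: %.2fx\n", speedup(-71.15))
	fmt.Printf("CASInt64:              %.2fx\n", speedup(-68.66))
	fmt.Printf("TrailingZeros:         %.2fx\n", speedup(-30.76))
}
```

For AtomicOperationsInt64 this gives roughly 3.5x, which matches dividing the two timings directly (931.1n / 268.6n).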

@randall77
Contributor

So it looks like math/bits and 64-bit atomics are where GOARM=5 falls behind.

The math/bits one is pretty minor: GOARM=5 is missing the RBIT instruction, so counting trailing zeros takes 2 more instructions. I think ReverseBytes is similar. (Reverse32 should be a lot faster on GOARM=7, but no one has optimized that function to use RBIT.)

The 64-bit atomic costs are more substantial. The arm atomics already do a runtime check, but they just use the GOARM value the binary was built with. If we can detect the presence of the atomic instructions we need (LDREXD/STREXD, maybe also DMB?) at runtime, then we can base the runtime check on the actual hardware we're running on.

@randall77
Contributor

LDREXD/STREXD can be detected using the lpae feature bit. (Particularly, detecting that they will be 64-bit atomic.)
It looks like we also need to make sure the DMB instruction is available. It is only available starting in v7, so we need a way to detect that the chip is v7. Anyone know how to get that from feature bits? Currently we check vfp and vfpv3, but of course that's too strict if we're trying to run on fp-less chips.
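
For anyone who wants to check which of these feature flags a given Linux ARM board reports, a user-space sketch follows. It parses the Features line of /proc/cpuinfo; this is an assumption made for illustration and is not how the runtime detects features (the runtime works from hardware capability bits). The function name hasCPUFeature is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// hasCPUFeature reports whether a "Features" line of a Linux /proc/cpuinfo
// dump lists the given flag (for example "lpae" or "vfp") as a whole word.
func hasCPUFeature(cpuinfo, flag string) bool {
	for _, line := range strings.Split(cpuinfo, "\n") {
		key, val, ok := strings.Cut(line, ":")
		if !ok || strings.TrimSpace(key) != "Features" {
			continue
		}
		for _, f := range strings.Fields(val) {
			if f == flag {
				return true
			}
		}
	}
	return false
}

func main() {
	data, err := os.ReadFile("/proc/cpuinfo")
	if err != nil {
		fmt.Println("cannot read /proc/cpuinfo:", err)
		return
	}
	info := string(data)
	for _, flag := range []string{"lpae", "vfp", "vfpv3"} {
		fmt.Printf("%-6s %v\n", flag, hasCPUFeature(info, flag))
	}
}
```

On an FPU-less ARMv7 chip one would expect vfp and vfpv3 to be absent, which is exactly why those flags are too strict as a proxy for "this is a v7 chip".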

@ludi317
Contributor Author

ludi317 commented Aug 30, 2023

@randall77 I thought the performance deltas in the channel-backed ProducerConsumer benchmarks (-8%) were also interesting, even though they were not as large as those of the math/bits and 64-bit atomic benchmarks.

Based on that finding, I wrote more benchmarks to compare the performance of synchronization primitives between the two builds. Please find the results below. The Mutex benchmarks that acquire a mutex lock, do some work, then release the lock are ~2x faster on GOARM=7,softfloat.

goos: linux
goarch: arm
pkg: github.com/ludi317/arm-wrestle
                                  │ armv5_1cpu_raw.txt │       armv7soft_1cpu_raw.txt       │
                                  │       sec/op       │   sec/op     vs base               │
                                  ...
Mutex                                      44.94µ ± 0%   22.60µ ± 0%  -49.70% (p=0.002 n=6)
RWMutex_Read                               45.00µ ± 0%   22.65µ ± 0%  -49.67% (p=0.002 n=6)
RWMutex_Write                              45.22µ ± 0%   22.87µ ± 0%  -49.42% (p=0.002 n=6)
WaitGroup                                  90.63m ± 4%   77.13m ± 4%  -14.89% (p=0.002 n=6)
Channel                                    8.781m ± 0%   8.383m ± 0%   -4.54% (p=0.002 n=6)
AtomicAdd                                 259.40n ± 0%   73.86n ± 0%  -71.53% (p=0.002 n=6)
Once                                       67.11n ± 0%   64.87n ± 0%   -3.33% (p=0.002 n=6)
Cond                                      11.126µ ± 0%   9.781µ ± 0%  -12.09% (p=0.002 n=6)
Pool                                       774.5n ± 1%   723.2n ± 1%   -6.62% (p=0.002 n=6)

@randall77
Contributor

I suspect that the channel differences are all due to the synchronization primitives that channels use, for which we know there is already a sizable performance difference.

@gopherbot

Change https://go.dev/cl/525637 mentions this issue: runtime: on arm32, detect whether we have sync instructions

@rsc
Contributor

rsc commented Nov 2, 2023

Thanks for the numbers showing that 7,softfloat is still better than 5 with checks.

@rsc
Contributor

rsc commented Nov 2, 2023

Have all remaining concerns about this proposal been addressed?

GOARM changes to have the form [567](,attrs)?.
That is, there is now an optional attribute list.
The only two defined attributes are softfloat and hardfloat, specifying software and hardware floating point (same names as for GOMIPS).
It is an error to specify both softfloat and hardfloat.
The leading number cannot be omitted.
softfloat is the default for GOARM=5 and hardfloat is the default for GOARM=6 and GOARM=7.

When compiled with GOARM=7,softfloat, code will assume ARMv7 non-FP instructions like atomics but will use software floating point.
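
The rules listed above are compact enough to express as a small parser. The sketch below is an illustration of those rules only, not the toolchain's actual code; the type goarmConfig and the function parseGOARM are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// goarmConfig is a hypothetical holder for a parsed GOARM setting.
type goarmConfig struct {
	Version   int
	SoftFloat bool
}

// parseGOARM accepts values of the form [567](,attrs)? where the only
// attributes are softfloat and hardfloat.
func parseGOARM(v string) (goarmConfig, error) {
	num, attrs, hasAttrs := strings.Cut(v, ",")
	var cfg goarmConfig
	switch num {
	case "5":
		cfg = goarmConfig{Version: 5, SoftFloat: true} // softfloat by default
	case "6", "7":
		cfg = goarmConfig{Version: int(num[0] - '0')} // hardfloat by default
	default:
		return goarmConfig{}, fmt.Errorf("invalid GOARM %q: must start with 5, 6, or 7", v)
	}
	if !hasAttrs {
		return cfg, nil
	}
	chosen := ""
	for _, attr := range strings.Split(attrs, ",") {
		if attr != "softfloat" && attr != "hardfloat" {
			return goarmConfig{}, fmt.Errorf("invalid GOARM %q: unknown attribute %q", v, attr)
		}
		if chosen != "" && chosen != attr {
			return goarmConfig{}, fmt.Errorf("invalid GOARM %q: softfloat and hardfloat are mutually exclusive", v)
		}
		chosen = attr
	}
	cfg.SoftFloat = chosen == "softfloat"
	return cfg, nil
}

func main() {
	for _, v := range []string{"5", "7", "7,softfloat", "5,hardfloat", "softfloat", "7,softfloat,hardfloat"} {
		cfg, err := parseGOARM(v)
		fmt.Printf("%-24q %+v err=%v\n", v, cfg, err)
	}
}
```

Note that the leading number is mandatory in this sketch, as in the proposal, so a bare "softfloat" is rejected.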

@MDr164

MDr164 commented Nov 3, 2023

Looks good. I'm looking forward to creating some real-world benchmarks, as this feature might greatly boost performance by finally letting us use the v6 and v7 ISA on non-FP chips 🎉
I'm also in favor of the new optional attribute, as this allows AOT compilation with optimized asm instead of autodetection via CPU feature bits, which aren't always reliable. And it keeps code size down.

@rsc
Contributor

rsc commented Nov 10, 2023

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

GOARM changes to have the form [567](,attrs)?.
That is, there is now an optional attribute list.
The only two defined attributes are softfloat and hardfloat, specifying software and hardware floating point (same names as for GOMIPS).
It is an error to specify both softfloat and hardfloat.
The leading number cannot be omitted.
softfloat is the default for GOARM=5 and hardfloat is the default for GOARM=6 and GOARM=7.

When compiled with GOARM=7,softfloat, code will assume ARMv7 non-FP instructions like atomics but will use software floating point.

@cherrymui
Member

I assume GOARM=5,hardfloat will be an unsupported configuration?

@MDr164

MDr164 commented Nov 12, 2023

I assume GOARM=5,hardfloat will be an unsupported configuration?

To quote Ludi from earlier:

For example, one advantage of the proposed softfloat / hardfloat naming scheme is that it is expressive enough to select GOARM=5,hardfloat and redress another fallback case. This is not to say GOARM=5,hardfloat ought to be implemented, only that the options generalize well enough to permit the possibility.

So I'd say GOARM=5,hardfloat should be generally supported, as VFP is technically supported on ARMv5, but I never came across a chip that actually implements this combination (while the other way around, having a higher ISA but no VFP, is more common than one might think). And to streamline the flags and quote Russ:

GOARM changes to have the form [567](,attrs)?. [...] softfloat is the default for GOARM=5 and hardfloat is the default for GOARM=6 and GOARM=7.

So there should not be a difference of attrs supported for each number imo.

@rsc
Contributor

rsc commented Nov 14, 2023

I think it's fine to support 5,hardfloat and easier to support it than to reject it. Maybe people on chips with broken atomics will want it.

@ludi317
Contributor Author

ludi317 commented Nov 14, 2023

I updated my CL to support GOARM=5,hardfloat. I marked the parts of code that require the eye of a Go compiler team member as todo. I assume it's too late for this change to make it into the upcoming 1.22 release?

@randall77
Contributor

It is not too late yet. The freeze is Nov 21.

@rsc
Contributor

rsc commented Nov 16, 2023

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

GOARM changes to have the form [567](,attrs)?.
That is, there is now an optional attribute list.
The only two defined attributes are softfloat and hardfloat, specifying software and hardware floating point (same names as for GOMIPS).
It is an error to specify both softfloat and hardfloat.
The leading number cannot be omitted.
softfloat is the default for GOARM=5 and hardfloat is the default for GOARM=6 and GOARM=7.

When compiled with GOARM=7,softfloat, code will assume ARMv7 non-FP instructions like atomics but will use software floating point.
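
With the proposal accepted, a cross-build for an FPU-less ARMv7 target could be driven from Go roughly as sketched below. The helper crossBuildCmd is hypothetical, GOOS=linux is an assumption about the target, and a toolchain that includes this change is required for the GOARM=7,softfloat value to be recognized.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// crossBuildCmd returns a "go build" command configured for 32-bit ARM with
// the given GOARM value, for example "7,softfloat". GOOS=linux is assumed.
func crossBuildCmd(goarm, pkg string) *exec.Cmd {
	cmd := exec.Command("go", "build", pkg)
	cmd.Env = append(os.Environ(), "GOOS=linux", "GOARCH=arm", "GOARM="+goarm)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd
}

func main() {
	cmd := crossBuildCmd("7,softfloat", ".")
	fmt.Println("prepared:", cmd.Args, "with GOARCH=arm GOARM=7,softfloat")
	// To actually build, call cmd.Run() and check the returned error.
}
```

The same effect is achieved from a shell by setting GOARCH=arm and GOARM=7,softfloat in the environment of go build.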

@rsc rsc changed the title proposal: runtime: software floating point for GOARM=6, 7 (not only GOARM=5) runtime: software floating point for GOARM=6, 7 (not only GOARM=5) Nov 16, 2023
@rsc rsc modified the milestones: Proposal, Backlog Nov 16, 2023
gopherbot pushed a commit that referenced this issue Nov 20, 2023
This change introduces new options to set the floating point
mode on ARM targets. The GOARM version number can optionally be
followed by ',hardfloat' or ',softfloat' to select whether to
use hardware instructions or software emulation for floating
point computations, respectively. For example,
GOARM=7,softfloat.

Previously, software floating point support was limited to
GOARM=5. With these options, software floating point is now
extended to all ARM versions, including GOARM=6 and 7. This
change also extends hardware floating point to GOARM=5.

GOARM=5 defaults to softfloat and GOARM=6 and 7 default to
hardfloat.

For #61588

Change-Id: I23dc86fbd0733b262004a2ed001e1032cf371e94
Reviewed-on: https://go-review.googlesource.com/c/go/+/514907
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
@MDr164

MDr164 commented Nov 20, 2023

The CL has been merged, I guess this can be marked as resolved then?

@dmitshur dmitshur modified the milestones: Backlog, Go1.22 Nov 20, 2023
@cherrymui
Member

I think this is done. Thank you!

krox2 added a commit to TransFICC/wireguard-go that referenced this issue Jan 19, 2024
Our switches don't have a hardware floating point unit, which is assumed by default when building for ARM v7 or v6. It seems that setting ARM to v5 would also work.
golang/go#61588