Proposal: profile-guided optimization

Authors: Cherry Mui, Austin Clements, Michael Pratt

Last updated: 2022-09-12

Discussion at https://golang.org/issue/55022.
Previous discussion at https://golang.org/issue/28262.

Abstract

We propose adding support for profile-guided optimization (PGO) to the Go gc toolchain. PGO will enable the toolchain to perform application- and workload-specific optimizations based on run-time information. Unlike many compiler optimizations, PGO requires user involvement to collect profiles and feed them back into the build process. Hence, we propose a design that centers user experience and ease of deployment and fits naturally into the broader Go build ecosystem.

Our proposed approach uses low-overhead sample-based profiles collected directly from production deployments. This ensures profile data is representative of an application’s real workload, but requires the Go toolchain to cope with stale profiles and profiles from already-optimized binaries. We propose to use the standard and widely-deployed runtime/pprof profiler, so users can take advantage of robust tooling for profile collection, processing, and visualization. Pprof is also supported on nearly every operating system, architecture, and deployment environment, unlike hardware-based profiling, which is higher fidelity but generally not available in cloud environments. Users will check in profiles to source control alongside source code, where the go tool can transparently supply them to the build and they naturally become part of reproducible builds and the SBOM. Altogether, Go’s vertical integration from build system to toolchain to run-time profiling creates a unique opportunity for a streamlined PGO user experience.

Background

Profile-guided optimization (PGO), also known as feedback-driven optimization (FDO), is a powerful optimization technique that uses profiles of run-time behavior of a program to guide the compiler’s optimizations of future builds of that program. This technique can be applied to other build stages as well, such as source-code generation, link time or post-link time (e.g. LTO, BOLT, Propeller), and even run time.

PGO has several advantages over traditional, heuristic-based optimization. Many compiler optimizations have trade-offs: for example, inlining improves performance by reducing call overhead and enabling other optimizations; but inlining also increases binary size and hence I-cache pressure, so too much inlining can harm overall performance. Optimization heuristics aim to balance these trade-offs, but rarely achieve the right balance for peak performance of any particular application. Using profile data collected at run time, the compiler has information that is impossible to derive statically because it depends on an application’s workload, or is simply too costly to compute within a reasonable compile time budget. Sometimes users turn to source-code level compiler directives such as “inline” directives (issue 21536) to guide optimizations. However, source-code level compiler directives do not work well in all situations either. For example, a library author may want to mark for inlining the functions that are important to that library’s performance. But to a given program, that library may not be performance critical. Inlining all “important” functions in all libraries would result in slow builds, binary size blow-up, and perhaps slower run-time performance. PGO, on the other hand, can use information about the whole program’s behavior, and apply optimizations to only the performance-critical parts of the program.

Related work

Various compilers for C/C++ and other languages support instrumentation-based PGO, for example, GCC’s -fprofile-generate and -fprofile-use options, and LLVM’s -fprofile-instr-generate and -fprofile-instr-use options.

GCC and LLVM also support sample-based PGO, such as GCC’s -fauto-profile option and LLVM’s -fprofile-sample-use option. They expect profiles collected from Linux perf and then converted to the GCC or LLVM format using the AutoFDO tool. For LLVM, LBR profiles are recommended but not strictly required.

Google’s AutoFDO (for C/C++ programs) is built on LLVM’s sample-based PGO, along with other tooling and mechanisms such as Google-wide profiling (GWP). AutoFDO improves the performance of C/C++ programs by 5–15% in Google datacenters.

Profile data is also used in various link-time and post-link-time optimizers, such as GCC’s LIPO, LLVM ThinLTO, BOLT, and Propeller.

Discussion

AutoFDO vs. instrumentation-based FDO

In traditional, instrumentation-based FDO, developers use the following process:

  1. Build the binary with compiler-inserted instrumentation to collect call and branch edge counts
  2. Run a set of benchmarks with instrumentation to collect the profiles
  3. Build the final binary based on the profiles

This process looks simple and straightforward. It generally does not require any special tooling support (besides the instrumentation and the optimizations in the compiler) because the profile is used immediately for the optimized build. With instrumentation, it is relatively easy to collect a broad range of data beyond branches, for example, specific values of variables or function parameters in common cases.

But it has a few key drawbacks. The instrumentation typically has a non-trivial overhead, making it generally infeasible to run the instrumented programs directly in production. Therefore, this approach requires high-quality benchmarks with representative workloads. As the source code evolves, one needs to update the benchmarks, which typically requires manual work. Also, the workload may shift from time to time, making the benchmarks no longer representative of real use cases. This workflow may also require more manual steps for running the benchmarks and building the optimized binary.

AutoFDO is a more recent approach that addresses the issues above. Instead of instrumenting the binary and collecting profiles from special benchmark runs, sample-based profiles are collected directly from real production uses, using regular production builds. The overhead of profile collection is low enough that profiles can be collected from production regularly. A big advantage of this approach is that the profiles represent the actual production use case, with real workloads. It also simplifies the workflow by eliminating the instrumentation and benchmarking steps.

The AutoFDO-style workflow imposes more requirements on tooling. Because the profiles are collected from production binaries, which are already optimized (possibly even with PGO), an optimized binary may have different performance characteristics from a non-PGO binary, and the profiles may be skewed. For example, a profile may indicate a function is “hot”, causing the compiler to optimize that function such that it no longer takes much time in production. When that binary is deployed to production, the new profiles will no longer indicate that function is hot, so it will not be optimized in the next PGO build. If we apply PGO iteratively, the performance of the output binaries may not be stable, resulting in “flapping” [Chen ’16, Section 5.2]. For production binaries it is important to have predictable performance, so we need to maintain iterative stability.

Also, while a binary is running in production and profiles are being collected, development may continue and the source code may change. If profiles collected from the previous version of the binary cannot be used to guide optimizations for a new version of the program, deploying a new version may cause performance degradation. Therefore it is a requirement that profiles be robust to source code changes, with minimal performance degradation. Finally, the compiler itself changes from time to time; similarly, profiles collected from binaries built with a previous version of the compiler should still provide meaningful optimization hints for the new version of the compiler.

It is also possible to run a traditional FDO-style build using AutoFDO tooling: instead of instrumenting the binary, one simply runs the benchmarks with sample-based profiling enabled, then immediately uses the collected profiles as input to the compiler. In this case, one uses the same tooling as for AutoFDO, just with profiles from benchmarks instead of from production.
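Concretely, that flow might look like the following sketch, using go test’s existing -cpuprofile flag together with the -pgo flag proposed later in this document (package paths are illustrative):

      # Run benchmarks with sample-based CPU profiling enabled.
      go test -bench=. -cpuprofile=cpu.pprof ./pkg/hotpath
      # Immediately feed the collected profile into an optimized build.
      go build -pgo=cpu.pprof ./cmd/app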

Because AutoFDO-style tooling is more powerful, handles most traditional FDO-style use cases, and in some circumstances greatly simplifies the user experience, we choose the AutoFDO style as our approach.

Requirements

Reproducible builds

The Go toolchain produces reproducible builds; that is, the output binary is byte-for-byte identical if the inputs to a build (source code, build flags, some environment variables) are identical. This is critical for the build cache to work and for aspects of software supply-chain security, and can be greatly helpful for debugging.

For PGO builds we should maintain this feature. As the compiler output depends on the profiles used for optimization, the content of the profiles will need to be considered as input to the build, and be incorporated into the build cache key calculation.

To make a build easy to reproduce (for example, for debugging), the profiles need to be stored in known, stable locations. We propose that, by default, profiles be stored in the same directory as the main package, alongside other source files, with the option of specifying a profile from another location.

Stability to source code and compiler changes

As discussed above, as the source code evolves the profiles could become stale. It is important that this does not cause significant performance degradation.

For a code change that is local to a small number of functions, most functions are unchanged and their profiles still apply. The absolute source locations of unchanged functions may shift, however (for example, line numbers change when code above them is added or removed, or code moves within a file). We propose using function-relative line numbers so PGO tolerates such shifts. Using only the function name and relative line number also handles source file renames.
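For illustration, a hypothetical source-stable sample key might look like the sketch below; this is not the proposed on-disk encoding:

      package pgo

      // sampleKey sketches how a sample can be keyed so that it survives
      // absolute line shifts: a hot line recorded as (Foo, +5) still matches
      // after edits above Foo move the function from line 120 to line 140.
      type sampleKey struct {
          Function string // fully qualified name, e.g. "example.com/pkg.Foo"
          RelLine  int    // line offset from the start of Foo's declaration
      }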

For functions that actually changed, the simplest approach is to consider the old profile invalid. This causes some performance degradation, but the degradation is limited to the individual changed functions. There are several possible approaches to detecting function changes, such as requiring access to the previous source code, or recording a hash of each function’s AST in the binary and copying it into the profile. With more detailed information, the compiler could even invalidate the profiles for only the sub-trees of the AST that actually changed. Another possibility is to not detect source code changes at all. This can lead to suboptimal optimization decisions, but AutoFDO showed that this simple solution is surprisingly effective, because profiles are typically flat and usually not all hot functions change at the same time.
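As a sketch of the AST-hashing option mentioned above (hypothetical; it assumes that hashing the printed declaration is an acceptable proxy for “the function changed”):

      package pgostale

      import (
          "crypto/sha256"
          "go/ast"
          "go/printer"
          "go/token"
      )

      // funcHash fingerprints a function declaration. The compiler could
      // record this hash in the binary, the profiler could copy it into the
      // profile, and a later build would treat mismatched entries as stale.
      func funcHash(fset *token.FileSet, fn *ast.FuncDecl) [sha256.Size]byte {
          h := sha256.New()
          // Printing from the AST normalizes formatting-only edits while
          // reflecting any change to the code itself. Error ignored in this
          // sketch.
          printer.Fprint(h, fset, fn)
          var sum [sha256.Size]byte
          copy(sum[:], h.Sum(nil))
          return sum
      }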

For large-scale refactoring, much information, such as source code locations and names of many functions, can change at once. To avoid invalidating profiles for all the changed functions, a tool could map the old function names and locations in the profile to a new profile with updated function names and locations.

Profile stability across compiler changes is mostly not a problem if profiles record source-level information.

Iterative stability

Another aspect of stability, especially with the AutoFDO approach, is iterative stability, also discussed above. Because we expect PGO-optimized binaries to be deployed to production and also expect the profiles that drive a PGO build to come from production, it’s important that we support users collecting profiles from PGO-optimized binaries for use in the next PGO build. That is, with an AutoFDO approach, the build/deploy/profile process becomes a closed loop. If we’re not careful in the implementation of PGO, this closed loop can easily result in performance “flapping”, where a profile-driven optimization performed in one build interferes with the same optimization happening in the next build, causing performance to oscillate.

Based on the findings from AutoFDO, we plan to tackle this on a case-by-case basis. For PGO-based inlining, since call stacks will include inlined frames, it’s likely that hot calls will remain hot after inlining. For some optimizations, we may simply have to be careful to consider the effect of the optimization on profiles.

Profile sources and formats

There are multiple ways to acquire a profile. Pprof profiles from the runtime/pprof package are widely used in the Go community, and the underlying implementation mechanism (signal-based sampling on UNIX-like systems) is generally available on a wide range of CPU architectures and OSes.

Hardware performance monitors, such as last branch records (LBR), can provide very accurate information about the program, and can be collected on Linux using the perf command, when it is available. However, hardware performance monitors are not always available: they are missing on most non-x86 architectures and non-Linux OSes, and are usually unavailable on cloud VMs.

Lastly, there may be use cases for customized profiles, especially for profiling programs' memory behavior.

We plan to initially support pprof CPU profiles directly in the compiler and build system. This has significant usability advantages, since many Go users are already familiar with pprof profiles and infrastructure already exists to collect, view, and manipulate them. There are also existing tools to convert other formats, such as Linux perf, to pprof. The format is expressive enough to contain the information we need, notably the function symbols, relative offsets, and line numbers necessary to make checked-in profiles stable across minor source code changes. Finally, the pprof format uses protocol buffers, so it is highly extensible and likely flexible enough to support any custom profile needs we may have in the future.

The format has some downsides: it is a relatively complex format to process in the compiler; as a binary format it is not version-control friendly; and directly producing source-stable profiles may require more runtime metadata that will make binaries larger. It will also require every invocation of the compiler to read the entire profile, even though it will discard most of it, which may scale poorly with the size of the application. If these downsides turn out to outweigh the benefits, we can revisit this decision and create a new intermediate profile format and tools to convert pprof to that format. Some of these downsides can also be addressed transparently by converting the profile to a simpler, indexed format at build time and storing this processed profile in the build cache.
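As an illustration of how accessible the format is, here is a minimal sketch that reads a pprof CPU profile using the existing github.com/google/pprof/profile package; how the compiler actually ingests profiles is an implementation detail, not part of this proposal:

      package main

      import (
          "fmt"
          "log"
          "os"

          "github.com/google/pprof/profile"
      )

      func main() {
          f, err := os.Open("cpu.pprof")
          if err != nil {
              log.Fatal(err)
          }
          defer f.Close()

          p, err := profile.Parse(f) // accepts the (possibly gzipped) pprof proto
          if err != nil {
              log.Fatal(err)
          }
          // Each sample carries a full call stack, including inlined frames,
          // with the function names and line numbers PGO needs.
          for _, s := range p.Sample {
              for _, loc := range s.Location {
                  for _, line := range loc.Line {
                      fmt.Println(line.Function.Name, line.Line, s.Value[0])
                  }
              }
          }
      }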

Proposal

We propose to add profile-guided optimization to Go.

Profile sources and formats

We will initially support pprof CPU profiles. Developers can collect profiles through the usual means, such as the runtime/pprof or net/http/pprof packages. The compiler will directly read pprof CPU profiles.
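For example, a minimal sketch of collecting a CPU profile with runtime/pprof; the file name and workload are illustrative:

      package main

      import (
          "log"
          "os"
          "runtime/pprof"
      )

      func main() {
          f, err := os.Create("cpu.pprof")
          if err != nil {
              log.Fatal(err)
          }
          defer f.Close()

          if err := pprof.StartCPUProfile(f); err != nil {
              log.Fatal(err)
          }
          defer pprof.StopCPUProfile()

          runWorkload() // hypothetical: the application's representative work
      }

      func runWorkload() { /* ... */ }

For long-running servers, importing net/http/pprof and fetching the /debug/pprof/profile endpoint from a production instance collects the same kind of profile without modifying the program.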

In the future we may support more types of profiles (see below).

Changes to the go command

The go command will search for a profile named default.pgo in the source directory of the main package and, if present, will supply it to the compiler for all packages in the transitive dependencies of the main package. The go command will report an error if it finds a default.pgo in any non-main package directory. In the future, we may support automatic lookup for different profiles for different build configurations (e.g., GOOS/GOARCH), but for now we expect profiles to be fairly portable between configurations.
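For example, a repository layout like the following (illustrative) would cause a build of ./cmd/server to use the profile automatically once the default described below is auto:

      cmd/server/
          main.go       (package main)
          default.pgo   (pprof CPU profile checked in alongside the source)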

We will also add a -pgo=<path> command line flag to go build that specifies an explicit profile location to use for a PGO build. A command line flag is useful in cases such as:

  • a program with the same source code has multiple use cases, with different profiles
  • build configuration significantly affects the profile
  • testing with different profiles
  • disabling PGO even if a profile is present

Specifically, -pgo=<path> will select the profile at path, -pgo=auto will select the profile stored in the source directory of the main package if there is one (otherwise no-op), and -pgo=off will turn off PGO entirely, even if there is a profile present in the main package's source directory.
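For example (paths are illustrative):

      go build -pgo=auto ./cmd/server                    # use cmd/server/default.pgo, if present
      go build -pgo=profiles/canary.pprof ./cmd/server   # use an explicit profile
      go build -pgo=off ./cmd/server                     # never apply PGO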

For Go 1.20, -pgo will default to off, so PGO will not be enabled in a default setting. In a future release it will default to auto.

To ensure reproducible builds, the content of the profile will be considered an input to the build, and will be incorporated into the build cache key calculation and BuildInfo.

go test of a main package will use the same profile search rules as go build. For non-main packages, it will not automatically provide a profile even though it’s building a binary. If a user wishes to test (or, more likely, benchmark) a package as it is compiled for a particular binary, they can explicitly supply the path to the main package’s profile, but the go tool has no way of automatically determining this.

Changes to the compiler

We will modify the compiler to support reading pprof profiles passed in from the go command, and modify its optimization heuristics to use this profile information. This does not require a new API. The implementation details are not included in this proposal.

Initially we plan to add PGO-based inlining. More optimizations may be added in the future.

Compatibility

This proposal is Go 1-compatible.

Implementation

We plan to implement a preview of PGO in Go 1.20.

Raj Barik and Jin Lin plan to contribute their work on the compiler implementation.

Future work

Profile collection

Currently, the runtime/pprof API has limitations: it is not easily configurable, nor is it extensible. For example, setting the profile sampling rate is cumbersome (see issue 40094). If we extend the profiles for PGO (e.g. adding customized events), the current API is also insufficient. One option is to add an extensible and configurable API for profile collection (see issue 42502). As PGO profiles may go beyond CPU profiles, we could also add a “collect a PGO profile” API, which enables a (possibly configurable) set of profiles to be collected specifically for PGO.
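For example, the only way to change the sampling rate today is the fragile ordering trick discussed in issue 40094; the sketch below relies on incidental behavior rather than a supported API, which is exactly the problem:

      package main

      import (
          "log"
          "os"
          "runtime"
          "runtime/pprof"
      )

      func main() {
          // The custom rate must be set before StartCPUProfile, which may
          // then print a warning because it attempts to set the rate itself.
          runtime.SetCPUProfileRate(500) // 500 Hz instead of the fixed 100 Hz

          f, err := os.Create("cpu.pprof")
          if err != nil {
              log.Fatal(err)
          }
          defer f.Close()

          if err := pprof.StartCPUProfile(f); err != nil {
              log.Fatal(err)
          }
          defer pprof.StopCPUProfile()
          // ... run the workload ...
      }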

The net/http/pprof package may be updated to include more endpoints and handlers accordingly.

We could consider adding additional command line flags to go test, similar to -cpuprofile and -memprofile. However, go test -bench is mostly for running micro-benchmarks, which may not be a desirable source of PGO profiles, so perhaps it is better to leave such a flag out.

To use Linux perf profiles, the user (or the execution environment) will be responsible for starting or attaching perf. We could also consider collecting a small set of hardware performance counters that are commonly used and generally available in pprof profiles (see issue 36821).

Non-main packages

PGO may be beneficial not only to executables but also to libraries (i.e. non-main packages). If a profile of the executable is present, it will be used for the build. If the main package does not include a profile, however, we could consider using the profiles of the libraries, to optimize the functions from those libraries (and their dependencies).

Details still need to be considered, especially for complex situations such as multiple non-main packages providing profiles. For now, we only support PGO at the executable level.

Optimizations

In the future we may add more PGO-based optimizations, such as devirtualization, stenciling of specific generic functions, basic block ordering, and function layout. We are also considering using PGO to improve memory behavior, such as improvements to escape analysis and allocation decisions.