
proposal: runtime: CPU limit-aware GOMAXPROCS default #73193





Open
prattmic opened this issue Apr 7, 2025 · 35 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. Implementation Issues describing a semantics-preserving change to the Go implementation. Proposal

Comments

@prattmic
Member

prattmic commented Apr 7, 2025

Overview

Change the Go runtime on Linux to use CPU cgroup quota limits to set the default value of GOMAXPROCS.

This is a concrete proposal for the ideas discussed in #33803. I've included a lot of background, but you can jump directly to the proposal.

Background

Go

GOMAXPROCS specifies the maximum number of goroutines that may be run in parallel. In the implementation, this corresponds to the maximum number of system threads that will execute a goroutine at a given time. In other words, GOMAXPROCS specifies the maximum parallelism of a Go program.

Note that GOMAXPROCS does not apply to threads created by (and running in) C, or to “blocking” system calls or cgo calls¹. Thus the actual maximum parallelism of a Go process may exceed GOMAXPROCS.

Linux

The Linux kernel has several different mechanisms that impact thread scheduling.

The most fundamental is the actual number of logical CPUs in the machine. That is, all CPUs, counting hyperthreads, if any. This defines the maximum parallelism of the machine itself, as the Linux kernel can only run one thread on a CPU at a time.

Closely related is the CPU affinity mask of a process, set by sched_setaffinity(2). This specifies the set of logical CPUs that a given process is allowed to run on. The Linux kernel will never schedule any threads on CPUs not in the mask, even if they are idle. This provides a mechanism to limit the maximum parallelism of a process by reducing its available CPU set. Unfortunately, it is fairly rudimentary, as it requires the user to manually allocate CPU resources, which may result in subpar overall utilization.

Both of these mechanisms directly correspond to maximum parallelism, so they are the basis for the GOMAXPROCS default today. GOMAXPROCS defaults to either the total number of logical CPUs on the machine, or the number of logical CPUs available in the CPU affinity mask, whichever is lower.

Linux’s CPU cgroups provide additional scheduling controls commonly used by container runtimes/orchestration systems such as Docker or Kubernetes. Note that Linux has both v1 and v2 cgroups. Both provide the same CPU controls, just with slightly different names.

The first is cpu.cfs_quota_us / cpu.cfs_period_us (v1) or cpu.max (v2). This defines the maximum CPU time the cgroup may use within some period window. For example, a typical period is 100ms. If the quota is set to 800ms, then the cgroup may use 800ms of CPU time every 100ms of wall time. The simple case here would be that 8 threads can run in parallel, as each uses 100ms of CPU time per 100ms of wall time.

However, note that this is not a limit on maximum parallelism. For instance, it is also allowed for 16 threads to run for 50ms and then do nothing for 50ms. This allows bursts of higher parallelism, provided the cgroup uses less CPU later in the period. If a cgroup exceeds its quota, all threads in the cgroup are descheduled until the end of the period.

The bursting behavior makes this not quite a perfect match for GOMAXPROCS, as GOMAXPROCS does not allow bursts; however, it is otherwise conceptually similar to GOMAXPROCS.

Note that cgroups are hierarchical, so the effective quota for a cgroup is the minimum quota of any cgroup up the hierarchy, assuming the periods are all identical.
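As a concrete illustration, here is a minimal Go sketch of computing the average CPU limit from a cgroup v2 cpu.max file. The mount path and helper name are assumptions for illustration; a real implementation would also resolve the process's cgroup from /proc/self/cgroup, walk the hierarchy to take the minimum, and handle the v1 cpu.cfs_quota_us / cpu.cfs_period_us files.

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// cpuLimitV2 returns quota/period from a cgroup v2 cpu.max file, or ok=false
// if no limit is set (the quota field is "max").
func cpuLimitV2(path string) (limit float64, ok bool, err error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, false, err
	}
	fields := strings.Fields(string(data)) // e.g. "800000 100000" or "max 100000"
	if len(fields) != 2 || fields[0] == "max" {
		return 0, false, nil
	}
	quota, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return 0, false, err
	}
	period, err := strconv.ParseFloat(fields[1], 64)
	if err != nil || period <= 0 {
		return 0, false, err
	}
	return quota / period, true, nil
}

func main() {
	// Typical cgroup v2 mount point; inside a container this is usually the
	// leaf cgroup of the container itself.
	if limit, ok, err := cpuLimitV2("/sys/fs/cgroup/cpu.max"); err == nil && ok {
		fmt.Printf("average CPU limit: %.2f\n", limit) // e.g. 8.00 for "800000 100000"
	}
}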

Second, there is cpu.shares (v1) or cpu.weight (v2). These do not set hard limits at all, but are instead relative priorities given to the kernel scheduler; i.e., threads in cgroup A with double the shares of cgroup B are twice as likely to run.

Container runtimes will use this to set relative priorities of different containers. For instance, suppose you have a machine with 8 logical CPUs, and two containers A and B. Both containers want to run 8 threads in parallel, but you would like A to use approximately 1 CPU and B to use approximately 7 CPUs. If you set A with shares of 1 and B with shares of 7, the scheduler will run 1 of A’s threads at a time, and 7 of B’s threads. However, the ratio only matters if the machine is overcommitted. If B is completely idle, A will run all 8 threads, as there is no hard limit on its CPU usage.

So, the relative values of CPU shares make them difficult to interpret without context, and the lack of an actual limit makes them a bad fit for GOMAXPROCS.

Note that both .NET and Java use the CPU quota to determine CPU limits. Java originally also considered CPU shares, but reverted the change after realizing it is a bad fit.

Finally, cpuset.cpus defines the set of CPUs available for scheduling within the cgroup. This is equivalent to using sched_setaffinity(2), except that it applies on the cgroup level. The result of this configuration is visible to applications via sched_getaffinity(2).

Higher level

While the Linux kernel values are the primary things we have to work with, let’s take a look at how users actually configure these values.

Docker

Docker provides --cpu-quota, --cpu-period, and --cpu-shares flags that correspond directly to the cgroup options, as well as the slightly more general --cpus, which simply sets the CPU quota with a predefined period.

Kubernetes

Kubernetes is a bit higher level. Kubernetes containers specify CPU limits and requests. From the summary of how these work:

  • The CPU limit is “a hard ceiling on how much CPU time the container can use”. This corresponds almost directly to the CPU cgroup quota. Kubernetes selects a constant period (100ms, I believe), and scales the CPU limit to determine the quota. e.g., a CPU limit of 8 results in a quota of 800ms.
  • The CPU request is the “minimum required CPU” for a container. The Kubernetes pod scheduler will not overcommit nodes, ensuring that at least the requested CPU is available for each container. In addition, Kubernetes will assign CPU shares to achieve appropriate weighting between different containers (like the shares example above). A container with only a request and no limit has no hard upper bound on CPU usage.

Historically, CPU requests and limits could not be changed after starting the container. However, Kubernetes has alpha support for “in place vertical scaling”, which allows changing the CPU request and limit while the container is running. This is scheduled to be promoted to beta in Kubernetes 1.33.

Similarly, historically CPU requests and limits applied only to individual containers. Kubernetes also has alpha support for “pod-level resource specification”, which allows specifying requests and limits for a pod, which is a group of one or more containers. There is no published target beta or GA schedule for this feature. I believe this results in a multi-level cgroup hierarchy.

A large fraction of Kubernetes users set CPU requests but not limits, with the intention that they want to consume otherwise idle cycles. Doing so should theoretically increase overall fleet CPU utilization, and also help avoid production issues caused by hitting CPU limits when there are sufficient idle cycles to handle the load.

Since CPU limits correspond directly to quotas, they are easy to work with.

CPU requests correspond indirectly to shares. If the Go runtime had access to the specific request value rather than the raw shares value, that would resolve the lack-of-context problem described above. However, the lack of an upper limit is still problematic. The CPU request is the minimum CPU. We could set GOMAXPROCS to this value, but that is overly conservative, preventing any use of parallelism beyond the minimum request. This also conflicts with the users’ intention to use extra idle cycles beyond the request. We could set it to a higher value, but it is unclear how much higher we should go.

In addition to CPU requests and limits, Kubernetes also has an optional CPU Manager, which assigns exclusive CPUs to containers when enabled. These scheduling restrictions are applied via CPU cgroup cpuset.cpus configuration, which is visible to applications via sched_getaffinity(2).

Mesos

I am less familiar with Mesos, but their resource summary is similar to Kubernetes. Tasks can have CPU requests and limits, which translate to CPU cgroup shares and limits, respectively.

I do not know which configurations are most common with Mesos.

GOMAXPROCS default today

Since Go 1.5, GOMAXPROCS has had a default of “the number of logical CPUs available”. Concretely, this means either the total number of logical CPUs on the machine, or the number available in the CPU affinity mask, whichever is lower.

This default works well for single-tenant systems (where the Go program is approximately the only thing running), allowing full use of the machine’s parallelism without requiring any configuration from users.

For multi-tenant systems without isolation (e.g., a laptop running multiple active applications), this default also tends to work fairly well. If the application only receives, say, 50% of the machine’s maximum parallelism on average, it may be better to use a lower value of GOMAXPROCS. On the other hand, the other tenants could reduce their CPU usage at any time, allowing this application to use more. If we used a lower value of GOMAXPROCS, then our application could not utilize this newly available parallelism.

Many of us (@mknyszek, @aclements, @cherrymui, etc) have long brainstormed a mechanism to eliminate the need for GOMAXPROCS entirely and instead dynamically discover the available parallelism and make adjustments on the fly in order to better serve multi-tenant systems in cases like above. Unfortunately, we do not expect to achieve something like this in the near term.

Additionally, the current GOMAXPROCS defaults tend to work OK for multi-tenant systems with no isolation simply because it is uncommon for such systems to run extremely overcommitted, which is when the default performs the worst.

Multi-tenant systems with isolation are typically container orchestration systems like Kubernetes. These will run multiple applications on a single machine with some form of CPU isolation for each application (CPU cgroup quota or shares). For example, a 64 CPU machine may host 8 applications, each of which has a CPU quota of 8 CPUs.

With a default GOMAXPROCS of 64, but only 8 CPUs of (average) parallelism, these applications are quite mismatched from reality. Downsides from this mismatch include:

  • CPU quota throttling. An application with a CPU quota of 8 and GOMAXPROCS=64 can quickly hit its quota and throttle (all threads descheduled) until the end of the period, which causes direct latency impact. Note that if the application has 64 concurrently runnable goroutines then even with GOMAXPROCS=8 there will be latency impact from goroutines waiting for the Go runtime to schedule them. However, this may be preferred as it is smoother than the hard cutoff of CPU quota throttling.
  • Latency impact and CPU quota throttling from GC. Like the rest of the runtime, the GC uses GOMAXPROCS as the source of available parallelism. When running, the GC targets using 25% of GOMAXPROCS to perform GC work. Generally, this means that 25% of GOMAXPROCS is used to run GC worker goroutines, and 75% of GOMAXPROCS is used to run standard application goroutines. Additionally, the GC runs “idle workers” on any remaining portion of GOMAXPROCS that otherwise has nothing to do. This causes two primary issues:
    • Major: For applications that generally remain under their CPU quota despite the high GOMAXPROCS (because they simply do not have too many goroutines running concurrently), the 25% GC worker target plus idle workers on remaining GOMAXPROCS will cause a large spike in work that causes the application to exceed its quota and get throttled when it otherwise would not have.
    • Minor: Actual thread scheduling warps the 25% target for applications above their quota. In the GOMAXPROCS=64 example, the Go runtime will target running 16 GC worker goroutines and 48 application goroutines across 64 threads. The Linux kernel only runs 8 threads at a time. On average, 25% of the running threads will be running GC workers, but because thread scheduling is arbitrary, at any time there may be significantly more than 25% GC workers running, up to 100% GC workers running and no application goroutines running, which would effectively be an unintentional “stop the world”.
  • Scalability costs. Running Go at higher GOMAXPROCS has a variety of scaling costs, from increased memory use due to additional caches, to increased coordination costs between threads. These costs are generally worthwhile to achieve additional parallelism, but when there is a big mismatch between GOMAXPROCS and actual parallelism these costs are paid with no benefit.
  • Minor: Increased context switching costs. The Go runtime will run goroutines on 64 threads, which the Linux kernel will need to round-robin between the 8 CPUs available. This is an added cost when the Go runtime could do this scheduling itself. This can also lead to edge cases like the kernel descheduling a thread that is running a goroutine holding a mutex. No other goroutine will be able to acquire that mutex until the kernel runs the thread again so the holder can unlock the mutex.

Proposal

Given this background, my proposal is that:

  1. At startup, if GOMAXPROCS is not set in the environment, the Go runtime will determine:
    1. The total logical CPU count of the machine
    2. The number of available logical CPUs from sched_getaffinity(2)
    3. If the process is in a cgroup, the “adjusted” CPU limit.
      1. For each level of the cgroup hierarchy, compute the CPU limit as cpu.cfs_quota_us / cpu.cfs_period_us (or cgroup v2 equivalent).
      2. Take the minimum CPU limit in the hierarchy as the effective CPU limit.
      3. Compute the “adjusted” CPU limit as max(2, ceil(effective_cpu_limit)).
  2. The default value of GOMAXPROCS will be the minimum of the three values determined in step 1 (see the sketch following this list).
  3. A new API in package runtime, func SetDefaultGOMAXPROCS() sets GOMAXPROCS based on the default behavior and returns the new value.
  4. The Go runtime will automatically update GOMAXPROCS if the CPU affinity or cgroup CPU limit change. This is done with a low frequency scan of the current environment. Users may call SetDefaultGOMAXPROCS to manually trigger updates.
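A minimal sketch of this default computation, assuming hypothetical helpers totalCPUs, affinityCPUs, and effectiveCPULimit that stand in for the runtime's actual discovery of the three inputs:

package main

import (
	"fmt"
	"math"
	"runtime"
)

// Hypothetical inputs; in the real runtime these would come from the
// machine's logical CPU count, sched_getaffinity(2), and a walk of the
// cgroup hierarchy.
func totalCPUs() int                     { return runtime.NumCPU() }
func affinityCPUs() int                  { return runtime.NumCPU() }
func effectiveCPULimit() (float64, bool) { return 2.5, true } // e.g. quota 250ms / period 100ms

func defaultGOMAXPROCS() int {
	procs := totalCPUs()                // (1) logical CPUs on the machine
	if a := affinityCPUs(); a < procs { // (2) CPUs in the affinity mask
		procs = a
	}
	if limit, ok := effectiveCPULimit(); ok { // (3) adjusted cgroup CPU limit
		adjusted := int(math.Ceil(limit)) // round fractional limits up
		if adjusted < 2 {
			adjusted = 2 // quota-derived values have a floor of 2
		}
		if adjusted < procs {
			procs = adjusted
		}
	}
	return procs
}

func main() {
	fmt.Println("default GOMAXPROCS:", defaultGOMAXPROCS()) // 3 on an 8-CPU machine with a 2.5 CPU limit
}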

The Go runtime will support querying the CPU cgroup quota from either cgroups v1 or v2. Note that mixed v1 and v2 controllers are supported by Linux. Go should support those as well.

This change in behavior is controlled by a compatibility GODEBUG, cgroupgomaxprocs=1. This defaults to cgroupgomaxprocs=0 for older language versions. Thus, behavior changes only when upgrading the language version, not when upgrading the toolchain.

The updated and new documentation:

// GOMAXPROCS sets the maximum number of CPUs that can be executing
// simultaneously and returns the previous setting. If n < 1, it does not change
// the current setting.
//
// If the GOMAXPROCS environment variable is set to a positive whole number,
// GOMAXPROCS defaults to that value.
//
// Otherwise, the Go runtime selects an appropriate default value based on the
// number of logical CPUs on the machine, the process’s CPU affinity mask, and,
// on Linux, the process’s average CPU throughput limit based on cgroup CPU
// quota, if any.
//
// The Go runtime periodically updates the default value based on changes to
// the total logical CPU count, the CPU affinity mask, or cgroup quota. Setting
// a custom value with the GOMAXPROCS environment variable or by calling
// GOMAXPROCS disables automatic updates. The default value and automatic
// updates can be restored by calling [SetDefaultGOMAXPROCS].
//
// If GODEBUG=cgroupgomaxprocs=0 is set, GOMAXPROCS defaults to the value of
// [runtime.NumCPU] and does not perform automatic updating.
//
// The default GOMAXPROCS behavior may change as the scheduler improves. 
func GOMAXPROCS(n int) int

// SetDefaultGOMAXPROCS updates the GOMAXPROCS setting to the runtime
// default, as described by [GOMAXPROCS], ignoring the GOMAXPROCS
// environment variable.
//
// SetDefaultGOMAXPROCS can be used to enable the default automatic updating
// GOMAXPROCS behavior if it has been disabled by the GOMAXPROCS
// environment variable or a prior call to [GOMAXPROCS], or to force an immediate
// update if the caller is aware of a change to the total logical CPU count, CPU
// affinity mask or cgroup quota. 
func SetDefaultGOMAXPROCS()
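For illustration, a sketch of how a program might use the proposed API. Note that SetDefaultGOMAXPROCS does not exist yet; it is the function proposed above.

package main

import "runtime"

func main() {
	// Temporarily pin GOMAXPROCS, e.g. for a CPU-bound batch phase. This
	// disables the automatic updates described above.
	runtime.GOMAXPROCS(4)
	// ... run the batch work ...

	// Restore the runtime-selected default and re-enable automatic updates.
	runtime.SetDefaultGOMAXPROCS()
}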

Discussion

Given the details in the background, cgroup CPU shares are a bad fit for GOMAXPROCS and cannot be used.

cgroup CPU quota is a much better fit, but is not perfect. The CPU quota allows bursting to higher parallelism, which GOMAXPROCS does not. By setting GOMAXPROCS to the quota, we potentially increase latency for bursty workloads. See the example below for a more thorough explanation.

While this is disappointing, I still believe this makes a better default than the current default of total CPU count, as it mitigates the numerous downsides of a GOMAXPROCS mismatch, which in my opinion are more extreme than lost burst ability. I do expect some workloads will want to manually increase GOMAXPROCS to allow bursts.

Additionally, the CPU quota and period may average to a fractional number of CPUs (e.g., 0.5 or 2.5). GOMAXPROCS does not allow fractional parallelism. If the quota is less than 1, we must round up. If the quota is greater than 1, we can either round up or down. Arguments can go either way. If we round down, we are very unlikely to exceed the limit, which avoids potential freezes for the remainder of the period. If we round up, we can get better utilization by fully using the quota, and this theoretically might better indicate to monitoring systems that we are starved of CPU. I’ve selected the latter approach of rounding up for consistency with the GOMAXPROCS=2 rationale below, but I don’t feel strongly. In particular, uber-go/automaxprocs#13 and #33803 (comment) make reasonable arguments for rounding down based on the assumption that fractional requests are intended to support additional small processes outside of the Go application.

The quota-based default has a minimum of GOMAXPROCS=2. That is, with a quota less than or equal to 1, we will round up GOMAXPROCS all the way to 2. GOMAXPROCS=1 disables all parallelism in the Go scheduler, which can cause surprising effects like GC workers temporarily “pausing” the application while the Go runtime switches back and forth between application goroutines and GC worker goroutines. Additionally, I consider a CPU quota less than 1 to be an indication that a workload is bursty, since it must be to avoid hitting the limit. Thus we can take advantage of the bursty nature to allow the runtime itself to burst and avoid GOMAXPROCS=1 pitfalls. If the number of logical or affinity mask CPUs is 1, we will still set GOMAXPROCS=1, as there is definitely no additional parallelism available.

Currently, if you wish to reset GOMAXPROCS to the default value (such as to override the GOMAXPROCS environment variable), you use runtime.GOMAXPROCS(runtime.NumCPU()). Note that NumCPU already takes CPU affinity into account.

All inputs to runtime.GOMAXPROCS are already well defined (>0 sets the passed value, <=0 returns the current value), so I’ve defined a new function (runtime.SetDefaultGOMAXPROCS) which performs the lookup and updates GOMAXPROCS. I’m not particularly attached to this. Some alternatives include:

  1. Change runtime.NumCPU to also consider the CPU quota. This feels like a bad idea to me because creating per-CPU caches would need an actual CPU count, not a quota, but that is extremely niche. Also, this call is currently defined as never updating after process start. It would need to update if users want to use this to discover changes to the quota.
  2. Add a new runtime.CPUQuota API that just returns the quota. This is a bit more flexible, as the result can be used for other purposes. The main downside is that reimplementing the default GOMAXPROCS behavior is complicated. Something like: runtime.GOMAXPROCS(min(runtime.NumCPU(), max(2, int(math.Ceil(runtime.CPUQuota())))))
  3. Since extremely large values of GOMAXPROCS are not useful, we could define some constant runtime.DefaultGOMAXPROCS = math.MaxInt32 that could be passed to runtime.GOMAXPROCS(). This is questionably backwards compatible, but setting a huge GOMAXPROCS is likely so slow that I doubt anyone does so.

The runtime will automatically update GOMAXPROCS if the CPU quota (or affinity) changes to accommodate container runtimes that change limits online. In particular, it would be unfortunate and confusing if CPU limits on a container are increased but the Go application never used the additional resources.

Automatic updates to GOMAXPROCS consider changes to the CPU affinity mask in addition to cgroup limit. Theoretically affinity changes should be reflected in runtime.NumCPU, but that call is defined as never changing after startup, which is unfortunate.

Implementation note: The CPU cgroup configuration files (cpu.cfs_quota_us, etc) do not support an explicit notification mechanism (such as poll(2)) when their values change. The only usable notification mechanisms are vfs file watches like inotify(7) / fanotify(7). sched_getaffinity(2) also has no notification mechanism. As a result, my current thinking is that we will detect changes via low frequency reread of the files in sysmon. We will scan with a minimum period of 30s, up to the maximum sysmon period (1 minute, due to forced GCs).

Note that container runtimes often mount a limited cgroupfs revealing only the leaf cgroup. This would prevent the runtime from walking up the hierarchy to check for more restrictive parent cgroups. I suspect this is a minor issue, as I expect it is rare to have more restrictive parent cgroups, since that configuration doesn't have much utility. It may even be OK to simply ignore parent cgroups at all times.

A major downside of this proposal is that it has no impact on container runtime users that set a CPU request but no limit. This is a very common configuration, and those users will see no change from the status quo, which is unfortunate (note that Uber’s automaxprocs also does nothing for those users). Still, this proposal is better for users that do set a limit, and should not impede future changes for users with only a request.

This proposal is primarily limited to Linux. If other OSes have similar CPU limit mechanisms, I think it would make sense to support those as well in future proposals. The automatic updates of GOMAXPROCS based on changes to CPU scheduling affinity will affect all OSes.

Today’s GOMAXPROCS default is conceptually close to static (“number of CPUs”). With this proposal, I see the runtime moving more towards a dynamic “the runtime selects a good value” approach, which I think is a beneficial move to make if we want to make additional changes in the future, such as eliminating a fixed GOMAXPROCS entirely.

Comparison to go.uber.org/automaxprocs

go.uber.org/automaxprocs is a popular package for automatically setting GOMAXPROCS for container workloads. This proposal is very similar to automaxprocs, effectively an upstream version. For completeness, the major differences between this proposal and automaxprocs are:

  • automaxprocs by default has a minimum GOMAXPROCS of 1. This proposal has a minimum of 2.
  • automaxprocs by default rounds fractional limits down. This proposal rounds up.
  • automaxprocs by default logs changes it makes with log.Printf. This proposal does not log.
  • automaxprocs is more configurable (minimum GOMAXPROCS, rounding, logging). None of these are configurable in this proposal.
  • automaxprocs does not automatically update GOMAXPROCS when the quota changes. This proposal does.
  • If the process is in a CPU cgroup, automaxprocs always uses the CPU quota to set GOMAXPROCS. This proposal will use the number of logical CPUs or CPUs available in sched_getaffinity(2) if those are less than the CPU quota.
  • automaxprocs does not appear to support mixed cgroup v1 and v2 controllers. This proposal does.

Open questions

Implementation

I intend to implement this proposal for Go 1.25 if it is accepted.

Appendix

GOMAXPROCS and CPU quota difference example

To illustrate the potential downside of setting GOMAXPROCS to the CPU limit for bursty applications, consider an idealized example application:

  • The machine has 10 logical CPUs.
  • The application runs inside a CPU cgroup with quota = 200ms, period = 100ms.
  • The application receives 1 request every 100ms.
  • Each request requires 50ms of CPU time to complete, and the work is perfectly parallelizable.

Without this proposal, GOMAXPROCS=10. When a request is received, the 50ms of work is spread across 10 goroutines and the request completes with a latency of 5ms. The application never exceeds the cgroup quota because it only uses 50ms of CPU time in each 100ms period.

With this proposal, GOMAXPROCS=2. When a request is received, the 50ms of work is spread across 2 goroutines and the request completes with a latency of 25ms. The application again never exceeds the cgroup quota.

This is a fairly extreme example, with the application completely idle most of the time, but that isn't necessary to get subpar behavior.

The "CPU limit" that we compute from quota / period is simply the average parallelism available for an entire period. Any subsection of the period where an applications uses less than the average CPU limit provides headroom to use more CPU later in the same period. e.g., with a CPU limit of 4, if an application uses only 2 CPU for the first half of the period, it could use 6 CPU for the second half and still achieve an average of 4.

With GOMAXPROCS set to the average CPU limit, Go prevents using more than the average parallelism at any point, so any headroom gained from using less is simply wasted.

CPU request based GC CPU

This is out of scope for this proposal, but one partially-baked idea to help users with a CPU request but no limit is to leave GOMAXPROCS alone, but to restrict the GC’s CPU target and idle workers to the CPU request.

Today, while the GC is running, the Go runtime will use 25% of GOMAXPROCS as “dedicated” GC workers. i.e., it will run on 25% of Ps even if it needs to deschedule user goroutines to do so. Beyond that, it will run “idle” workers on every available idle P. This can cause big CPU spikes described above.

Instead, we could run dedicated workers on 25% of “CPU request” Ps, and only idle workers on Ps up to a total CPU usage of the CPU request. This would resolve the big spikes caused by the GC.

The primary risk of this change is that if the GC can no longer keep up with the allocation rate of the application, it will force GC assists on user goroutines. This is most likely if the application is using more than its CPU request. One possible mitigation for this would be to adjust up the GC target CPU if the application is using more than the CPU request.

This depends on having a mechanism to determine what the container CPU request value is.

Note that this is an internal implementation detail, so it does not need to be a proposal.

cc @golang/runtime, @sywhang @chabbimilind (for automaxprocs), @thepudds (for previous prototype experience), @thockin (for Kubernetes)

Footnotes

  1. Go intends for system calls and cgo calls that remain on CPU to count towards GOMAXPROCS. However, it is difficult to efficiently determine if these calls are blocked (off CPU), so the runtime uses a heuristic that assumes that calls that take a long wall time are blocked.

@prattmic prattmic added compiler/runtime Issues related to the Go compiler and/or runtime. Proposal labels Apr 7, 2025
@prattmic prattmic changed the title runtime: CPU limit-aware GOMAXPROCS default proposal: runtime: CPU limit-aware GOMAXPROCS default Apr 7, 2025
@gopherbot gopherbot added this to the Proposal milestone Apr 7, 2025
@gabyhelp gabyhelp added the Implementation Issues describing a semantics-preserving change to the Go implementation. label Apr 7, 2025
@chabbimilind

Since there isn't a strong opinion about rounding up or down, https://pkg.go.dev/math#Round makes more sense to me.

@chabbimilind

chabbimilind commented Apr 7, 2025

Although the following topic was considered out of the scope of this proposal, I think it is of critical importance in CPU allocation.

Instead, we could run dedicated workers on 25% of “CPU request” Ps, and only idle workers on Ps up to a total CPU usage of the CPU request. This would resolve the big spikes caused by the GC.

The primary risk of this change is that if the GC can no longer keep up with the allocation rate of the application, it will force GC assists on user goroutines. This is most likely if the application is using more than its CPU request. One possible mitigation for this would be to adjust up the GC target CPU if the application is using more than the CPU request.

When GC bursts consume 100% of the CPU quota, auto/elastic instance scaling (which depends on P99 utilization behavior) ends up allocating more service instances. Instead, if we run dedicated workers on 25% of “CPU request” Ps, and only idle workers on Ps up to a total CPU usage of the CPU request as suggested here, I expect many workloads to benefit by not going to 100% utilization.

One possible mitigation for this would be to adjust up the GC target CPU if the application is using more than the CPU request.

This is a good idea. In network congestion control, AIMD (https://en.wikipedia.org/wiki/Additive_increase/multiplicative_decrease) is often found to be effective. We could additively increase the GC target CPU and decrease it multiplicatively, for those applications where the GC needs to catch up with the allocation rate.
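A toy sketch of that AIMD idea; all names and constants here are hypothetical, not part of the runtime:

package gcsketch

// adjustGCTarget additively raises the GC CPU target while the application
// is running over its CPU request, and multiplicatively decays the target
// back toward the base otherwise.
func adjustGCTarget(target, baseTarget float64, overRequest bool) float64 {
	const (
		additiveStep      = 0.01 // raise the target by 1% of a CPU per adjustment
		multiplicativeCut = 0.5  // halve the extra allowance when backing off
		maxTarget         = 0.5  // never target more than 50% of available CPU
	)
	if overRequest {
		target += additiveStep
	} else {
		target = baseTarget + (target-baseTarget)*multiplicativeCut
	}
	if target > maxTarget {
		target = maxTarget
	}
	if target < baseTarget {
		target = baseTarget
	}
	return target
}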

@cpuguy83

cpuguy83 commented Apr 7, 2025

Note: there's also the cpuset controller in the cgroup that limits which cores the program can be scheduled to.

@prattmic
Member Author

prattmic commented Apr 7, 2025

@cpuguy83 Yes, I have a small note about this deep in the background. I probably should have included this in the Linux section as well.

In addition to CPU requests and limits, Kubernetes also has an optional CPU Manager, which assigns exclusive cores to containers when enabled. These scheduling restrictions are applied via CPU cgroup cpuset.cpus configuration, which is visible to applications via sched_getaffinity(2).

I believe that cpuset limits are visible in sched_getaffinity(2) (though I admit I haven't actually verified this with a test program). That means that the GOMAXPROCS default today already considers these limits (without auto-update), and with this proposal would get auto-updating as well.

@vboulineau

If the process is in a cgroup, the “adjusted” CPU limit.
For each level of the cgroup hierarchy, compute the CPU limit as cpu.cfs_quota_us / cpu.cfs_period_us (or cgroup v2 equivalent).

I am curious if you have a plan to implement this. I think that in most container cases, the process will only see its own leaf cgroup as /sys/fs/cgroup and it's not possible to walk up the hierarchy, but perhaps my view is incomplete?

@cpuguy83

cpuguy83 commented Apr 7, 2025

If the process is in a cgroup, the “adjusted” CPU limit.
For each level of the cgroup hierarchy, compute the CPU limit as cpu.cfs_quota_us / cpu.cfs_period_us (or cgroup v2 equivalent).

I am curious if you have a plan to implement this. I think that in most container cases, the process will only see its own leaf cgroup as /sys/fs/cgroup and it's not possible to walk up the hierarchy, but perhaps my view is incomplete?

Correct, typically one would not be able to walk up the cgroup tree, especially with cgroupv2 where cgroup namespaces are a must.

@thediveo
Contributor

thediveo commented Apr 7, 2025

minor terminology note: Linux uses the word (logical) CPU for a seemingly independent execution unit, while “core” already implies a particular chip topology (see lscpu output, for instance). In Linux kernel speak, a core is an aggregation level that can contain several CPUs. In the end, “core” is more precisely used as a concept of physical topology.

@thediveo
Contributor

thediveo commented Apr 7, 2025

Does it matter that the minimum for GOMAXPROCS is 2 when the cpuset has only a single logical CPU? Does this impact performance? We have real-world setups where dual-CPU machines are divided into single-CPU cpusets.

@rabbbit

rabbbit commented Apr 7, 2025

Is the selected behavior of rounding the fractional quota up the right choice? In particular, uber-go/automaxprocs#13 and #33803 (comment) make reasonable arguments for rounding down based on the assumption that fractional requests are intended to support additional small processes outside of the Go application.

Uber continues to see this in production today. Many, many teams run with N.2, N.4 or N.8 CPU limits - the primary fear is cgroup CPU throttling, which sometimes but not typically results in observable (milliseconds) latency degradations. The fractional headroom used to be for a supervisor process; now it's partially for a health-check process running within the container, but teams occasionally run their own helper sidecars as well.

Selfishly, we'd opt for preserving the floor behavior so that automaxprocs can be deprecated, having floor/ceil configurable, or implementing something like uber-go/automaxprocs#14 as an extra knob.

@ldemailly

ldemailly commented Apr 7, 2025

Side note from my experience: a minimum of 2 is best, so I support that default value as well as rounding up, for similar reasons.

It helps with autoscalers too (to actually go over)

I wish we could get requests information somehow (maybe work with k8s folks) because using limits wastes CPU (and unlike memory, CPU is fairly elastic), so we're not using them in prod.

We did hit the classic "oh I run on a 192 core vm and someone forgot to set GOMAXPROCS to 4 on their 2 vcpu pod and suddenly performance is terrible as a result" (reported initially as an istio pilot issue)

@prattmic
Member Author

prattmic commented Apr 8, 2025

@vboulineau @cpuguy83

Correct, typically one would not be able to walk up the cgroup tree, especially with cgroupv2 where cgroup namespaces are a must.

Thanks, that’s a good point. I’ll look into this more, it would be unfortunate if containers can’t actually see their effective CPU limit.

@prattmic
Member Author

prattmic commented Apr 8, 2025

@thediveo

Does it matter that the minimum for GOMAXPROCS is 2 when the cpuset has only a single logical CPU? Does this impact performance?

Note that the minimum of 2 only applies to GOMAXPROCS based on the CPU cgroup quota. If there is only one logical CPU we will still use GOMAXPROCS=1. See point (2) of the Proposal section.

@cpuguy83

cpuguy83 commented Apr 8, 2025

@vboulineau @cpuguy83

Correct, typically one would not be able to walk up the cgroup tree, especially with cgroupv2 where cgroup namespaces are a must.

Thanks, that’s a good point. I’ll look into this more, it would be unfortunate if containers can’t actually see their effective CPU limit.

Sorry, you'd still be able to see the quota value set on the current cgroup, just not anything higher up in the tree.

@prattmic
Member Author

prattmic commented Apr 8, 2025

Regarding the fractional CPU limit component, it seems to me that there are two primary reasons for setting a fractional component, which have different desired behavior. Let me know if I missed use cases.

For example, consider a 2.5 CPU limit.

Case one is that there are two applications running inside the cgroup. One is the “primary” application, which you expect to use 2 CPU. The second is a “sidecar” process which runs infrequently. The extra 0.5 CPU is intended to provide headroom for the sidecar process on top of the primary application.

In this case you want GOMAXPROCS to round down because the entire point is to leave space for the sidecar, which is exactly what rounding down will achieve.

I would argue that this case would be best served by setting up child cgroups with explicit 2 CPU and 0.5 CPU limits, but I acknowledge that that is a lot more complex than just adjusting the limit to be “good enough”.

Case two is that there is a single application in the cgroup, and this limit has been set (either manually by a human, or via an automated system) as the actual CPU requirement needed for this application to achieve its goals (throughput, latency, whatever the relevant metric is).

This may mean it is expected to “burst” somewhat, with a 50% duty cycle of 3 CPU/2 CPU. Or the latency impact of some fractional amount of throttling is simply not a problem for this application.

In this case, you want GOMAXPROCS to round up to 3. If we round down, then any CPU limit in the range [2, 3) behaves identically, preventing the application from actually fully utilizing all CPU as intended.

It is hard for me to say which case is more common, or more “important”. Whichever we choose, the good news is that it is still possible to implement the other policy with something like automaxprocs.

I don’t think we should use math.Round. I think the effects of either case are subtle and potentially confusing enough with only a single behavior. Having two behaviors for <0.5 and >0.5 would be worse. This form of rounding also doesn’t actually capture which case is the intent of the limit.

@prattmic
Member Author

prattmic commented Apr 8, 2025

@cpuguy83 Right. What I mean is that the leaf cgroup may have a quota of 400ms / 100ms period. Its parent may have a quota of 200ms / 100ms. The effective quota of the leaf is 200ms / 100ms, since the parent’s limit will be hit before the child.

Perhaps I misunderstand though and the kernel already displays the effective limit in the leaf? Also presumably such a setup is quite rare as it doesn’t seem very useful.

@creker

creker commented Apr 8, 2025

Regarding the round-up behaviour: what we do for all of our services is set CPU limits to a whole number of cores + 100m. The reasoning is not to allow bursts (we don’t want them) but to use the 100m as headroom to absolutely avoid any throttling. We do not use cgo or sidecars.

If not changing the default, it would be great to have this behaviour configurable. Doesn’t really matter how.

@rrb3942

rrb3942 commented Apr 8, 2025

The Go runtime will automatically update GOMAXPROCS if the CPU affinity or cgroup CPU limit change. This is done with a low frequency scan of the current environment. Users may call SetDefaultGOMAXPROCS to manually trigger updates

Even though this discussion is about CPU limits, if GOMAXPROCS is going to be dynamic, I wonder if the number of logical CPUs should also be re-checked and adjusted for. Hot-plug CPUs for VMs have been a thing for a while, and automatically scaling to available resources would be beneficial.

@prattmic
Member Author

prattmic commented Apr 8, 2025

@rrb3942 The secret is that the way we count the total number of logical CPUs on Linux is to simply look at how many appear in sched_getaffinity(2). So even though the proposal technically makes a distinction between "total logical CPUs" and "number of logical CPUs available via sched_getaffinity(2)" (which could be lower), nothing in the runtime actually cares about the distinction; we always want the lower number.

All that is to say that when I propose auto-updating via sched_getaffinity(2) in addition to the cgroup, that should also cover CPU hotplug.
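For illustration, counting the CPUs in the affinity mask might look like the following sketch, using the golang.org/x/sys/unix wrapper (the runtime itself uses the raw syscall rather than this package):

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// Pid 0 means "the calling thread". The mask reflects both
	// sched_setaffinity(2) restrictions and cgroup cpuset.cpus limits.
	var set unix.CPUSet
	if err := unix.SchedGetaffinity(0, &set); err != nil {
		panic(err)
	}
	// The number of set bits is the CPU count the runtime cares about.
	fmt.Println("CPUs available to this process:", set.Count())
}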

@vboulineau

@cpuguy83 Right. What I mean is that the leaf cgroup may have a quota of 400ms / 100ms period. Its parent may have a quota of 200ms / 100ms. The effective quota of the leaf is 200ms / 100ms, since the parent’s limit will be hit before the child.

Perhaps I misunderstand though and the kernel already displays the effective limit in the leaf? Also presumably such a setup is quite rare as it doesn’t seem very useful.

That's exactly what happens: you can be aware of the 400 / 100 but not the 200 / 100.
In terms of usage I know of, it's used in AWS ECS Tasks (task-level limits) and I guess (but I did not yet check myself) the new Pod-level resources in Kubernetes.

That being said, all existing tools/languages face the same limitations AFAIK.

@DavidS-ovm

[...] you can be aware of the 400 / 100 but not the 200 / 100. [...]

That being said, all existing tools/languages face the same limitations AFAIK.

So it really becomes a "you're holding it wrong" at that point, when people have their cgroups set up like that?

@sudo-bmitch

Regarding fractional CPU limit component, it seems to me that there are two primary reasons for setting a fractional component, which have different desired behavior. Let me know if I missed use cases.

For example, consider a 2.5 CPU limit.

Case one is that there are two applications running inside the cgroup. One is the “primary” application, which you expect to use 2 CPU. The second is a “sidecar” process which runs infrequently. The extra 0.5 CPU is intended to provide headroom for the sidecar process on top of the primary application.

Outside of CPU limits, we have this behavior today with every Go application. The non-limited assumption is that Go limits itself to the full resources of the host, not a self-imposed limit below that to allow for other applications on the same VM. Go can always use less CPU (either because of scheduling or because the workload is not CPU bound).

For scenarios where users want to control multiple applications independently within the same cgroup, users can set the limits manually per Go application. There could be multiple sidecars using more than 1 CPU collectively, or the Go app could be the sidecar itself, so rounding down would miss a lot of user scenarios.

Case two is that there is a single application in the cgroup, and this limit has been set (either manually by a human, or via an automated system) as the actual CPU requirement needed for this application to achieve its goals (throughput, latency, whatever the relevant metric is).

I suspect this is a fairly common use case with microservices, and if a user intentionally sets the limit to 2.5 CPUs, Go should round that up to 3, and allow the cgroup/scheduler to throttle the half CPU.

@aktau
Contributor

aktau commented Apr 9, 2025

Outside of CPU limits, we have this behavior today with every Go application. The non-limited assumption is that Go limits itself to the full resources of the host, not a self-imposed limit below that to allow for other applications on the same VM. Go can always use less CPU (either because of scheduling or because the workload is not CPU bound).

On Linux, without any cgroup shenanigans, applications are not hard-throttled as they are when they exceed cpu.cfs_quota_us / cpu.cfs_period_us. Instead, if there is too much demand, the kernel will divide CPU time gracefully as per the process' shares (there are also process priorities, I forget the exact interaction, but generally most processes on a Linux system run in the same priority band).

The hard throttling can be a problem: it can yield noticeable latency spikes.

@Deleplace
Contributor

  1. The Go runtime will automatically update GOMAXPROCS if the CPU affinity or cgroup CPU limit change. This is done with a low frequency scan of the current environment. Users may call SetDefaultGOMAXPROCS to manually trigger updates.

Both the automated scan and the manual trigger feature look useful to me. I'm thinking about Cloud Run where the Startup CPU boost provides more cores during the first seconds of the app, and then fewer cores. Because of this boost, the sweet spot may change while the app is running!

@rabbbit

rabbbit commented Apr 10, 2025

Hey @prattmic,

I suspect it's waaaay too far afield for this discussion, but to call it out since you describe CFS in depth in the description: cfs_burst seems to have some promising results. I imagine teaching the runtime about it (for some GC emergencies) would not be an entirely trivial thing? :)

Copying from the patchset

	Without burst:

	Latency percentiles (usec)
	50.0000th: 7
	75.0000th: 8
	90.0000th: 9
	95.0000th: 10
	*99.0000th: 933
	99.5000th: 981
	99.9000th: 3068
	min=0, max=20054
	rps: 498.31 p95 (usec) 10 p99 (usec) 933 p95/cputime 0.10% p99/cputime 9.33%

	With burst:

	Latency percentiles (usec)
	50.0000th: 7
	75.0000th: 8
	90.0000th: 9
	95.0000th: 9
	*99.0000th: 12
	99.5000th: 13
	99.9000th: 19
	min=0, max=406
	rps: 498.36 p95 (usec) 9 p99 (usec) 12 p95/cputime 0.09% p99/cputime 0.12%

@thediveo
Contributor

@rabbbit just for better understanding the figures, what latencies are these?

@prattmic
Member Author

I have made a variety of small edits to the top-level proposal comment:

  • More consistent use of the term "logical CPU".
  • I've added a more thorough example in the appendix covering the downside of this proposal for bursty workloads.
  • Added cpuset.cpus to the Linux background.
  • Added a note about container runtimes hiding parent cgroups.

@prattmic
Member Author

In #57175 (comment), we discussed this proposal. Several new perspectives on the rounding behavior were brought up there that I want to make sure are reflected here:

  • This proposal will reduce GOMAXPROCS vs today. Rounding down will reduce GOMAXPROCS even more. Therefore you could say that rounding up is a more conservative change (which we could further adjust later).
  • The issues applications could encounter from rounding up (latency from exceeding CPU limit) may be more difficult to notice and diagnose than the issues encountered from rounding down (persistent under-utilization of CPU).
  • For sidecars, some users start with a base CPU limit (e.g., 4) and add a small fraction (e.g., 0.25) for their sidecar in the same container (e.g., total CPU limit 4.25). Others start from a base CPU limit (e.g., 4) and subtract away a small fraction (e.g., 0.25) for their sidecar in another container (e.g., main container CPU limit 3.75, sidecar 0.25).
    • The latter case may be disappointed by rounding down because of the large drop (1 CPU less when only 0.25 was intended). One could say this is an argument in favor of using math.Round.
    • Choosing to round down because of the former case may be overfitting to one particular way of doing things.
    • Observation: the former case depends on lower GOMAXPROCS to reserve CPU for their sidecar. They must already be setting a custom GOMAXPROCS today (this may be via automaxprocs) because the default would otherwise be too high. So a change to the default won't have immediate effect.
  • Power users aware of the trade-offs of this choice can already set GOMAXPROCS explicitly. The default should probably aim to do the best thing for the (probably majority of) users that don't know this is something to consider at all.

@dfinkel
Contributor

dfinkel commented Apr 11, 2025

@rabbbit I wonder if the burst mechanism should provide a tweak to the GOMAXPROCS value from the strict quota. It can't provide a fractional value, but it could allow for adding integer values if it's added to the base quota. (maybe it makes sense to ignore that burst "quota" for GC unless the GC assist would kick in?)

@thediveo looks like the latencies are for some arbitrary Java app that Alibaba runs (or ran, since the patch is from 2021).

I think it would be nice, but wildly out of scope, to support some sort of bursting in the Go scheduler that aligns with the CFS quota/period scheme and, by extension, the burst feature.

@thepudds
Contributor

In #57175 (comment), Michael K. made what I thought was an interesting comment:

There are two contrasting views of GOMAXPROCS. The scheduler sees it as parallelism. The GC sees it as available CPU time.

I'm generally aware of some of the GOMAXPROCS tradeoffs & interplay with the GC, but that phrasing was new to me.

I wonder if we could apply that split view to the question here about whether to round up or down. For example, if the cpu limit is specified as 3.25 logical cores, should that end up with GOMAXPROCS=3 or GOMAXPROCS=4?

I wonder if it might make sense to split it. For example, when determining whether to run an idle worker, perhaps it is effectively rounded down, but rounded up for the rest of the scheduler.

In other words, it could be that GOMAXPROCS itself is rounded up (so the work of the app can burst up to 4 active threads in our example, which also might more accurately signal load to some flavors of auto scaling or observability), but idle GC workers behave as if it had been rounded down to 3 (so that the idle workers by themselves won't be responsible for exceeding the 3.25 quota)

Part of the rationale might be that if we were to instead just round up for everything, the rounded up amount is not really "idle" CPU in the same sense that a dedicated core might be "idle". For a fractional quota, the idle workers using all the CPU has a decent chance of triggering the heavy hand of the entire app being throttled by the kernel for the remainder of the measurement period (defaulting to a 100ms measurement period in many cases).


Just now I tried a hack to the runtime to try to emulate that suggestion.

For a quick experiment, I picked the test app from #58106 (comment), which used thepudds/heapbench to roughly emulate a production Uber service in terms of some core characteristics that matter to the GC. (In general, heapbench tries to lightly model a real app as seen by the GC. It has a few different basic knobs to emulate different apps, and it mostly attempts to keep the offered load consistent even if it is paused and resumed by the kernel -- it roughly builds a backlog of work while paused).

 GC end-to-end cycle duration:  ~15 sec (avg)
                    Live heap:  ~2000 MiB (avg)
                     CPU util:  ~1.1 cores (avg)
                   GOMEMLIMIT:   3 GiB 
                         GOGC:   off

In the original example, it ran in production with GOMAXPROCS=3, but for the purposes of this test, I pretended that there was a 3.25 quota (via --cpus=3.25), and I manually rounded up the actual GOMAXPROCS env variable to GOMAXPROCS=4 (as if that's what happened under this proposal). I also did a quick hack to the runtime that attempts to avoid running an idle GC worker on the last idle P (so that an idle GC worker doesn't try to use the last available logical core, even though other goroutines could use it due to GOMAXPROCS=4).

Sample run:

$ go install golang.org/dl/go1.24.1@latest && go1.24.1 download
$ go1.24.1 install github.com/thepudds/heapbench@8aef707
 
$ GOMAXPROCS=4 GOGC=off GOMEMLIMIT=3GiB GODEBUG=gctrace=1 heapbench -baseheap=2000 -garbagerate=42 \
  -leakrate=0 -jobrate=100 -worktime=9ms -stats=10000s

From a ~50 second test of each:

                      go1.24    hacked runtime    delta
                                                   
throttled_usec     2,317,414           250,941   -89.1%
nr_throttled              31                24   -22.6%
CPU usage (cores)       1.13              1.14        ~

The count of throttled periods did go down, but it looks like the impact of each throttle event went down even more.

I'll emphasize again that this was a "quick hack", and I'm not sure it was correct. Other caveats include that it's been a while since I've looked at this closely, so maybe I made some measurement mistake. Also, there's some chance I happened to pick a test app that is in a sweet spot for this change.

But setting aside whether that was a valid experiment, I guess my main point is there might be a case to round down for idle workers while not rounding down for other things.


Finally, there's also the question of rounding for the 25% dedicated workers. I think there might already be a concept of a fractional worker, so if we do split the answer for rounding, maybe there's no rounding at all for the dedicated workers (if that is indeed an option), or maybe it makes sense to effectively round down for dedicated workers too (though it might be that the pros/cons of rounding down for idle workers might have a stronger case compared to dedicated workers).

@prattmic
Member Author

We will scan with a minimum period of 30s, up to the maximum sysmon period (1 minute, due to forced GCs).

When choosing this default of 30s, I was thinking primarily about scaling down, and that getting extra throttling for up to 30s after scaling down is not a big deal.

It was pointed out to me that the opposite case is interesting as well. An autoscaling system may want to quickly ramp up CPU limits in response to a sudden load spike. For example, ramping up in O(seconds) after a load spike. A 30s scan period would add significant latency to such a system, as Go cannot use the additional CPU until it changes GOMAXPROCS.

I find this a strong argument in favor of decreasing the period to something faster, like 1s or 5s. Though I'd like to actually implement the scanning to measure the CPU cost. I expect it to be low.

@aclements
Member

The proposal committee is generally on board with this change. There are clearly questions about policy details that are still being worked out, but one of the advantages is that the API is pretty minimal so we can change policy details in the future (as long as we don't promise too much).

@aclements aclements moved this to Incoming in Proposals Apr 23, 2025
@aclements aclements moved this from Incoming to Active in Proposals Apr 23, 2025
@aclements
Member

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— aclements for the proposal review group

@prattmic
Member Author

There are clearly questions about policy details that are still being worked out, but one of the advantages is that the API is pretty minimal so we can change policy details in the future (as long as we don't promise too much).

Agreed. I have left the actual text in the GOMAXPROCS API docs intentionally vague for this reason. The proposed documentation on policy is basically just "the Go runtime selects an appropriate default value". I really want to leave space for the default to be best we can do for the overall ecosystem over time, rather than freeze at something now and force future users to opt in to better defaults. Of course, Hyrum's law can always come up.

I brought up the policy questions explicitly in this proposal because even if we can change things in the future, I wanted feedback from everyone on the choices, and all of the perspectives have been very helpful.

Regarding rounding behavior, from the feedback I've seen thus far, I am leaning towards the round-up behavior, primarily because it is more amenable to autoscaling systems (brought up in the original proposal), because rounding up is a smaller change, because whether a project adds or subtracts the fractional CPU for a sidecar is a somewhat arbitrary choice, and because projects that prefer round-down behavior can't be broken by this, since they must already be manually setting GOMAXPROCS today (all brought up in #73193 (comment)).

@gopherbot
Contributor

Change https://go.dev/cl/668638 mentions this issue: WIP: internal/runtime/cgroup: CPU cgroup discovery
