proposal: runtime: CPU limit-aware GOMAXPROCS default #73193
Comments
Since there isn't a strong opinion about rounding up or down, https://pkg.go.dev/math#Round makes more sense to me.
Although the following topic was considered out of scope for this proposal, I think it is of critical importance in CPU allocation.
When GC bursts consume 100% of the CPU quota, auto/elastic instance scaling (which depends on P99 utilization behavior) ends up allocating more service instances. Instead, if we run dedicated workers on 25% of “CPU request” Ps, and only idle workers on Ps up to a total CPU usage of the CPU request as suggested here, I expect many workloads to benefit by not going to 100% utilization.
This is a good idea. In network congestion control, AIMD (https://en.wikipedia.org/wiki/Additive_increase/multiplicative_decrease) is often found to be effective. We could additively increase the GC target CPU and decrease it multiplicatively for applications where the GC needs to catch up with the allocation rate.
Note: there's also the cpuset controller in the cgroup that limits which cores the program can be scheduled to. |
@cpuguy83 Yes, I have a small note about this deep in the background. I probably should have included this in the Linux section as well.
I believe that cpuset limits are visible in sched_getaffinity(2).
I am curious if you have a plan to implement this. I think that in most containers cases, the process will only see its own leaf cgroup as |
Correct, typically one would not be able to walk up the cgroup tree, especially with cgroupv2 where cgroup namespaces are a must. |
minor terminology note: Linux uses the word (logical) CPU for a seemingly independent execution unit, while core already implies a particular chip topology. For instance,
Is it of any importance when the minimum for |
Uber continues to see this in production today. Many, many teams run with N.2, N.4 or N.8 CPU limits - the primary fear is cgroups CPU throttling, which sometimes, but not typically, results in observable (milliseconds) latency degradations. This used to be a supervisor process; now it's partially due to a health-check process running within a container, but teams occasionally run their own helper sidecars. Selfishly, we'd opt for preserving the
Side note from my experience: a minimum of 2 is best, so I support that default value, as well as rounding up for similar reasons. It helps with autoscalers too (to actually go over). I wish we could get
We did hit the classic "oh I run on a 192 core VM and someone forgot to set GOMAXPROCS to 4 on their 2 vCPU pod and suddenly performance is terrible as a result" (reported initially as an istio pilot issue).
Thanks, that’s a good point. I’ll look into this more, it would be unfortunate if containers can’t actually see their effective CPU limit. |
Note that the minimum of 2 only applies to GOMAXPROCS based on the cgroup CPU quota. If there is only one logical CPU we will still use GOMAXPROCS=1. See point (2) of the Proposal section.
Sorry, you'd still be able to see the quota value set on the current cgroup, just not anything higher up in the tree. |
Regarding the fractional CPU limit component, it seems to me that there are two primary reasons for setting a fractional component, which have different desired behavior. Let me know if I missed use cases. For example, consider a 2.5 CPU limit.

Case one is that there are two applications running inside the cgroup. One is the “primary” application, which you expect to use 2 CPU. The second is a “sidecar” process which runs infrequently. The extra 0.5 CPU is intended to provide headroom for the sidecar process on top of the primary application. In this case you want GOMAXPROCS to round down because the entire point is to leave space for the sidecar, which is exactly what rounding down will achieve. I would argue that this case would be best served by setting up child cgroups with explicit 2 CPU and 0.5 CPU limits, but I acknowledge that that is a lot more complex than just adjusting the limit to be “good enough”.

Case two is that there is a single application in the cgroup, and this limit has been set (either manually by a human, or via an automated system) as the actual CPU requirement needed for this application to achieve its goals (throughput, latency, whatever the relevant metric is). This may mean it is expected to “burst” somewhat, with a 50% duty cycle of 3 CPU/2 CPU. Or the latency impact of some fractional amount of throttling is simply not a problem for this application. In this case, you want GOMAXPROCS to round up to 3. If we round down, then any CPU limit in the range [2, 3) behaves identically, preventing the application from actually fully utilizing all CPU as intended.

It is hard for me to say which case is more common, or more “important”. Whichever we choose, the good news is that it is still possible to implement the other policy with something like automaxprocs.

I don’t think we should use math.Round. I think the effects of either case are subtle and potentially confusing enough with only a single behavior. Having two behaviors for <0.5 and >0.5 would be worse. This form of rounding also doesn’t actually capture which case is the intent of the limit.
@cpuguy83 Right. What I mean is that the leaf cgroup may have a quota of 400ms / 100ms period. Its parent may have a quota of 200ms / 100ms. The effective quota of the leaf is 200ms / 100ms, since the parent’s limit will be hit before the child’s. Perhaps I misunderstand though and the kernel already displays the effective limit in the leaf? Also, presumably such a setup is quite rare as it doesn’t seem very useful.
Regarding round-up behaviour: what we do for all of our services is to set CPU limits to a whole number of cores + 100m. The reasoning is not to allow bursts (we don’t want them) but to use the 100m as headroom to absolutely avoid any throttling. We do not use cgo or sidecars. If this does not become the default, it would be great to have this behaviour configurable. Doesn’t really matter how.
Even though this discussion is about CPU limits, if GOMAXPROCS is going to be dynamic, I wonder if the number of logical CPUs should also be re-checked and adjusted for. Hot-plug CPUs for VMs have been a thing for a while, and automatically scaling to resources would be beneficial.
@rrb3942 The secret is that the way we count the total number of logical CPUs on Linux is to simply look at how many appear in All that is to say that when I propose auto-updating via |
That's exactly what happens, you can be aware of the That being said, all existing tools/languages face the same limitations AFAIK. |
So it really becomes a "you're holding it wrong" at that point, when people have their cgroups set up like that? |
Outside of CPU limits, we have this behavior today with every Go application. The non-limited assumption is that Go limits itself to the full resources of the host, not a self imposed limit below that to allow for other applications on the same VM. Go can always use less CPU (either because of scheduling or because the workload is not CPU bound). For scenarios where users want to control multiple applications independently within the same cgroup, users can set the limits manually per Go application. There could be multiple sidecars using more than 1 CPU collectively, or the Go app could be the sidecar itself, so rounding down would miss a lot of user scenarios.
I suspect this is a fairly common use case with microservices, and if a user intentionally sets the limit to 2.5 CPUs, Go should round that up to 3, and allow the cgroup/scheduler to throttle the half CPU. |
On Linux, without any cgroup shenanigans, applications are not hard-throttled as they are when they exceed their CPU quota. The hard-throttling can be a problem: it can yield noticeable latency spikes.
Both the automated scan and the manual trigger feature look useful to me. I'm thinking about Cloud Run where the Startup CPU boost provides more cores during the first seconds of the app, and then fewer cores. Because of this boost, the sweet spot may change while the app is running! |
Hey @prattmic, I suspect it's waaaay too far for this discussion, but I'm calling it out since you describe cfs in depth in the description: cfs_burst seems to have some promising results. I imagine teaching the runtime about it (for some GC emergencies) would not be an entirely trivial thing? :) Copying from the patchset:
@rabbbit just to better understand the figures, what latencies are these?
I have made a variety of small edits to the top-level proposal comment:
In #57175 (comment), we discussed this proposal. Several new perspectives on the rounding behavior were brought up there that I want to make sure are reflected here:
@rabbbit I wonder if the burst mechanism should provide a tweak to the GOMAXPROCS value from the strict quota. It can't provide a fractional value, but it could allow for adding integer values if it's added to the base quota. (Maybe it makes sense to ignore that burst "quota" for GC unless GC assist would kick in?)
@thediveo looks like the latencies are for some arbitrary Java app that Alibaba runs (or ran, since the patch is from 2021).
I think it would be nice, but wildly out of scope, to support some sort of bursting in the Go scheduler that would align with the CFS quota/period scheme and, by extension, the burst feature.
In #57175 (comment), Michael K. made what I thought was an interesting comment:
I'm generally aware of some of the
I wonder if we could apply that split view to the question here about whether to round up or down. For example, if the cpu limit is specified as 3.25 logical cores, should that end up with
I wonder if it might make sense to split it. For example, when determining whether to run an idle worker, perhaps it is effectively rounded down, but rounded up for the rest of the scheduler. In other words, it could be that
Part of the rationale might be that if we were to instead just round up for everything, the rounded up amount is not really "idle" CPU in the same sense that a dedicated core might be "idle". For a fractional quota, the idle workers using all the CPU has a decent chance of triggering the heavy hand of the entire app being throttled by the kernel for the remainder of the measurement period (defaulting to a 100ms measurement period in many cases).
Just now I tried a hack to the runtime to try to emulate that suggestion. For a quick experiment, I picked the test app from #58106 (comment), which used thepudds/heapbench to roughly emulate a production Uber service in terms of some core characteristics that matter to the GC. (In general, heapbench tries to lightly model a real app as seen by the GC. It has a few different basic knobs to emulate different apps, and it mostly attempts to keep the offered load consistent even if it is paused and resumed by the kernel -- it roughly builds a backlog of work while paused.)
In the original example, it ran in production with Sample run:
From a ~50 second test of each:
The count of throttled periods did go down, but it looks like the impact of each throttle event went down even more. I'll emphasize again "quick hack", including not sure it was correct. Other caveats include it's been a bit since I've looked at this closely, so maybe I made some measurement mistake. Also, some chance I happened to pick a test app that is in a sweet spot for this change. But setting aside whether that was a valid experiment, I guess my main point is there might be a case to round down for idle workers while not rounding down for other things. Finally, there's also the question of rounding for the 25% dedicated workers. I think there might already be a concept of a fractional worker, so if we do split the answer for rounding, maybe there's no rounding at all for the dedicated workers (if that is indeed an option), or maybe it makes sense to effectively round down for dedicated workers too (though it might be that the pros/cons of rounding down for idle workers might have a stronger case compared to dedicated workers). |
When choosing this default of 30s, I was thinking primarily about scaling down, and that getting extra throttling for up to 30s after scaling down is not a big deal. It was pointed out to me that the opposite case is interesting as well. An autoscaling system may want to quickly ramp up CPU limits in response to a sudden load spike. For example, ramping up in O(seconds) after a load spike. A 30s scan period would add significant latency to such a system, as Go cannot use the additional CPU until it changes GOMAXPROCS. I find this a strong argument in favor of decreasing the period to something faster, like 1s or 5s. Though I'd like to actually implement the scanning to measure the CPU cost. I expect it to be low. |
The proposal committee is generally on board with this change. There are clearly questions about policy details that are still being worked out, but one of the advantages is that the API is pretty minimal, so we can change policy details in the future (as long as we don't promise too much).
This proposal has been added to the active column of the proposals project |
Agreed. I have left the actual text in the GOMAXPROCS API docs intentionally vague for this reason. The proposed documentation on policy is basically just "the Go runtime selects an appropriate default value". I really want to leave space for the default to be the best we can do for the overall ecosystem over time, rather than freeze at something now and force future users to opt in to better defaults. Of course, Hyrum's law can always come up. I brought up the policy questions explicitly in this proposal because even if we can change things in the future, I wanted feedback from everyone on the choices, and all of the perspectives have been very helpful.
Regarding rounding behavior, from the feedback I've seen thus far, I am leaning towards the round-up behavior, primarily because it is more amenable to autoscaling systems (brought up in the original proposal), because rounding up is a smaller change, because adding/removing a fractional CPU is a somewhat arbitrary project choice, and because projects that prefer round-down behavior can't be broken by this, since they must already be manually setting GOMAXPROCS today (all brought up in #73193 (comment)).
Change https://go.dev/cl/668638 mentions this issue: |
Overview
Change the Go runtime on Linux to use CPU cgroup quota limits to set the default value of GOMAXPROCS.
This is a concrete proposal for the ideas discussed in #33803. I've included a lot of background, but you can jump directly to the proposal.
Background
Go
GOMAXPROCS specifies the maximum number of goroutines that may be run in parallel. In the implementation, this corresponds to the maximum number of system threads that will execute a goroutine at a given time. In other words, GOMAXPROCS specifies the maximum parallelism of a Go program.
Note that GOMAXPROCS does not apply to threads created by (and running in) C, or to “blocking” system calls or cgo calls [1]. Thus the actual maximum parallelism of a Go process may exceed GOMAXPROCS.
Linux
The Linux kernel has several different mechanisms that impact thread scheduling.
The most fundamental is the actual number of logical CPUs in the machine. That is, all CPUs, counting hyperthreads, if any. This defines the maximum parallelism of the machine itself, as the Linux kernel can only run one thread on a CPU at a time.
Closely related is the CPU affinity mask of a process, set by sched_setaffinity(2). This specifies the set of logical CPUs that a given process is allowed to run on. The Linux kernel will never schedule any threads on CPUs not in the mask, even if they are idle. This provides a mechanism to limit the maximum parallelism of a process by reducing its available CPU set. Unfortunately, this is fairly rudimentary as it requires the user to manually allocate CPU resources, which may result in subpar overall utilization.

Both of these mechanisms directly correspond to maximum parallelism, so they are the basis for the GOMAXPROCS default today. GOMAXPROCS defaults to either the total number of logical CPUs on the machine, or the number of logical CPUs available in the CPU affinity mask, whichever is lower.
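As an aside, the affinity mask is easy to inspect from Go. The following is a minimal sketch (not the runtime's own implementation) that reads the mask via golang.org/x/sys/unix and compares it to runtime.NumCPU; it assumes Linux and the x/sys dependency.

```go
package main

import (
	"fmt"
	"runtime"

	"golang.org/x/sys/unix"
)

func main() {
	var set unix.CPUSet
	// Pid 0 means "the calling process".
	if err := unix.SchedGetaffinity(0, &set); err != nil {
		panic(err)
	}
	// runtime.NumCPU already takes the affinity mask into account at startup,
	// so these two numbers normally agree.
	fmt.Println("CPUs in affinity mask:", set.Count())
	fmt.Println("runtime.NumCPU():     ", runtime.NumCPU())
}
```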
Linux’s CPU cgroups provide additional scheduling controls commonly used by container runtimes/orchestration systems such as Docker or Kubernetes. Note that Linux has both v1 and v2 cgroups. Both provide the same CPU controls, just with slightly different names.
The first is cpu.cfs_quota_us / cpu.cfs_period_us (v1) or cpu.max (v2). This defines the maximum CPU time the cgroup may use within some period window. For example, a typical period is 100ms. If the quota is set to 800ms, then the cgroup may use 800ms of CPU time every 100ms of wall time. The simple case here would be that 8 threads can run in parallel, as each uses 100ms of CPU time per 100ms of wall time.

However, note that this is not a limit on maximum parallelism. For instance, it is also allowed for 16 threads to run for 50ms and then do nothing for 50ms. This allows bursts of higher parallelism, provided the cgroup uses less CPU later in the period. If a cgroup exceeds its quota, all threads in the cgroup are descheduled until the end of the period.

The bursting behavior makes this not quite a perfect match for GOMAXPROCS, as GOMAXPROCS does not allow bursts; however, this is otherwise conceptually similar to GOMAXPROCS.
Note that cgroups are hierarchical, so the effective quota for a cgroup is the minimum quota of any cgroup up the hierarchy, assuming the periods are all identical.
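To make the quota concrete, here is a rough sketch of how a process could read its own average CPU limit from the cgroup v2 interface. The path and the single-file cpu.max format ("$QUOTA $PERIOD" or "max $PERIOD", in microseconds) are assumptions about a typical v2 setup; a real implementation would also handle the v1 files and the hierarchy described above.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// cpuLimit returns the average CPU limit from a cgroup v2 cpu.max file,
// or ok=false if there is no limit ("max").
func cpuLimit(path string) (limit float64, ok bool, err error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, false, err
	}
	fields := strings.Fields(string(data)) // e.g. "200000 100000" or "max 100000"
	if len(fields) != 2 || fields[0] == "max" {
		return 0, false, nil
	}
	quota, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return 0, false, err
	}
	period, err := strconv.ParseFloat(fields[1], 64)
	if err != nil {
		return 0, false, err
	}
	return quota / period, true, nil
}

func main() {
	// Assumes the process's own cgroup is mounted at the usual location.
	limit, ok, err := cpuLimit("/sys/fs/cgroup/cpu.max")
	fmt.Println(limit, ok, err) // "200000 100000" yields 2 true <nil>
}
```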
Second, there is cpu.shares (v1) or cpu.weight (v2). These do not set hard limits at all, but are instead relative priorities given to the kernel scheduler. That is, threads in cgroup A with double the shares of cgroup B are twice as likely to run.

Container runtimes will use this to set relative priorities of different containers. For instance, suppose you have a machine with 8 logical CPUs, and two containers A and B. Both containers want to run 8 threads in parallel, but you would like A to use approximately 1 CPU and B to use approximately 7 CPUs. If you set A with shares of 1 and B with shares of 7, the scheduler will run 1 of A’s threads at a time, and 7 of B’s threads. However, the ratio only matters if the machine is overcommitted. If B is completely idle, A will run all 8 threads, as there is no hard limit on its CPU usage.
So, the relative values of CPU shares make them difficult to interpret without context, and the lack of an actual limit makes them a bad fit for GOMAXPROCS.
Note that both .NET and Java use the CPU quota to determine CPU limits. Java originally also considered CPU shares, but reverted the change after realizing it is a bad fit.
Finally, cpuset.cpus defines the set of CPUs available for scheduling within the cgroup. This is equivalent to using sched_setaffinity(2), except that it applies at the cgroup level. The result of this configuration is visible to applications via sched_getaffinity(2).

Higher level
While the Linux kernel values are the primary things we have to work with, let’s take a look at how users actually configure these values.
Docker
Docker directly provides --cpu-quota, --cpu-period, and --cpu-shares flags which directly correspond to the cgroup options, as well as a slightly more general --cpus, which simply uses the CPU quota with a predefined period.

Kubernetes
Kubernetes is a bit higher level. Kubernetes containers specify CPU limits and requests. From the summary of how these work:
Historically, CPU requests and limits could not be changed after starting the container. However, Kubernetes has alpha support for “in place vertical scaling”, which allows changing the CPU request and limit while the container is running. This is scheduled to be promoted to beta in Kubernetes 1.33.
Similarly, historically CPU requests and limits applied only to individual containers. Kubernetes also has alpha support for “pod-level resource specification”, which allows specifying requests and limits for a pod, which is a group of one or more containers. There is no published target beta or GA schedule for this feature. I believe this results in a multi-level cgroup hierarchy.
A large fraction of Kubernetes users set CPU requests but not limits, with the intention that they want to consume otherwise idle cycles. Doing so should theoretically increase overall fleet CPU utilization, and also help avoid production issues caused by hitting CPU limits when there are sufficient idle cycles to handle the load.
Since CPU limits correspond directly to quotas, that makes them easy to work with.
CPU requests correspond indirectly to shares. If the Go runtime had access to the specific request value rather than the raw shares value, that resolves the problem of lack of context described above. However, the lack of an upper limit is still problematic. The CPU request is the minimum CPU. We could set GOMAXPROCS to this value, but that is overly conservative, preventing any use of parallelism beyond the minimum request. This is also in conflict with the users’ intention to use extra idle cycles over the request. We could set it to a higher value, but it is unclear how much higher we should go.
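As a concrete illustration of that indirection, here is a sketch of the commonly documented cgroup v1 mapping Kubernetes applies to these values. The exact constants and minimums are Kubernetes implementation details, so treat this as an approximation rather than a reference.

```go
package main

import "fmt"

const (
	quotaPeriodUs = 100000 // 100ms CFS period
	minShares     = 2
)

// requestToShares converts a CPU request in millicores to cpu.shares.
func requestToShares(milliCPU int64) int64 {
	shares := milliCPU * 1024 / 1000
	if shares < minShares {
		return minShares
	}
	return shares
}

// limitToQuota converts a CPU limit in millicores to a cfs quota in microseconds.
func limitToQuota(milliCPU int64) int64 {
	return milliCPU * quotaPeriodUs / 1000
}

func main() {
	fmt.Println(requestToShares(500)) // a 500m request -> 512 shares
	fmt.Println(limitToQuota(2500))   // a 2500m limit -> 250000us quota per 100000us period
}
```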
In addition to CPU requests and limits, Kubernetes also has an optional CPU Manager, which assigns exclusive CPUs to containers when enabled. These scheduling restrictions are applied via the CPU cgroup cpuset.cpus configuration, which is visible to applications via sched_getaffinity(2).

Mesos
I am less familiar with Mesos, but their resource summary is similar to Kubernetes. Tasks can have CPU requests and limits, which translate to CPU cgroup shares and limits, respectively.
I do not know which configurations are most common with Mesos.
GOMAXPROCS default today
Since Go 1.5, GOMAXPROCS has had a default of “the number of logical CPUs available”. Concretely, this means either the total number of logical CPUs on the machine, or the number available in the CPU affinity mask, whichever is lower.
This default works well for single-tenant systems (where the Go program is approximately the only thing running), allowing full use of the machine’s parallelism without requiring any configuration from users.
For multi-tenant systems without isolation (e.g., a laptop running multiple active applications), this default also tends to work fairly well. If the application only receives, say, 50% of the machine’s maximum parallelism on average, it may be better to use a lower value of GOMAXPROCS. On the other hand, the other tenants could reduce their CPU usage at any time, allowing this application to use more. If we used a lower value of GOMAXPROCS, then our application could not utilize this newly available parallelism.
Many of us (@mknyszek, @aclements, @cherrymui, etc) have long brainstormed a mechanism to eliminate the need for GOMAXPROCS entirely and instead dynamically discover the available parallelism and make adjustments on the fly in order to better serve multi-tenant systems in cases like above. Unfortunately, we do not expect to achieve something like this in the near term.
Additionally, the current GOMAXPROCS defaults tend to work OK for multi-tenant systems with no isolation simply because it is uncommon for such systems to run extremely overcommitted, which is when the default performs the worst.
Multi-tenant systems with isolation are typically container orchestration systems like Kubernetes. These will run multiple applications on a single machine with some form of CPU isolation for each application (CPU cgroup quota or shares). For example, a 64 CPU machine may host 8 applications, each of which have a CPU quota of 8 CPUs.
With a default GOMAXPROCS of 64, but only 8 CPUs of (average) parallelism, these applications are quite mismatched from reality. Downsides from this mismatch include:
Proposal
Given this background, my proposal is that:

1. At startup, if GOMAXPROCS is not set in the environment, the Go runtime will determine:
   - the number of logical CPUs available to the process, taking the CPU affinity mask from sched_getaffinity(2) into account, and
   - the effective CPU limit of the process's CPU cgroup, computed as cpu.cfs_quota_us / cpu.cfs_period_us (or cgroup v2 equivalent).
2. GOMAXPROCS defaults to the minimum of the logical CPU count and max(2, ceil(effective_cpu_limit)).
3. A new API in package runtime, func SetDefaultGOMAXPROCS(), sets GOMAXPROCS based on the default behavior and returns the new value. Applications can call SetDefaultGOMAXPROCS to manually trigger updates.

The Go runtime will support querying the CPU cgroup quota from either cgroups v1 or v2. Note that mixed v1 and v2 controllers are supported by Linux. Go should support those as well.
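To make the interaction of these pieces concrete, here is a small sketch of the proposed default computation. The helper name and its exact shape are mine, not the runtime's; it just applies the min/max described in the list above.

```go
package main

import (
	"fmt"
	"math"
	"runtime"
)

// defaultGOMAXPROCS applies the proposed policy: the logical CPU count,
// further limited by the cgroup CPU limit (if any) rounded up, with a
// minimum of 2 for the limit-based component.
func defaultGOMAXPROCS(numCPU int, cpuLimit float64, haveLimit bool) int {
	if !haveLimit {
		return numCPU // no quota configured: fall back to the CPU count
	}
	fromLimit := int(math.Max(2, math.Ceil(cpuLimit)))
	if numCPU < fromLimit {
		return numCPU // never exceed the CPUs actually available
	}
	return fromLimit
}

func main() {
	fmt.Println(defaultGOMAXPROCS(runtime.NumCPU(), 2.5, true)) // a 2.5 CPU limit rounds up to 3 (capped by NumCPU)
	fmt.Println(defaultGOMAXPROCS(1, 0.5, true))                // a single logical CPU still yields 1
}
```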
This change in behavior is controlled by a compatibility GODEBUG, cgroupgomaxprocs=1. This defaults to cgroupgomaxprocs=0 for older language versions. Thus, behavior changes only when upgrading the language version, not when upgrading the toolchain.

The updated and new documentation:
Discussion
Given the details in the background, cgroup CPU shares are a bad fit for GOMAXPROCS and cannot be used.
cgroup CPU quota is a much better fit, but is not perfect. The CPU quota allows bursting to higher parallelism, which GOMAXPROCS does not. By setting GOMAXPROCS to the quota, we potentially increase latency for bursty workloads. See the example below for a more thorough explanation.
While this is disappointing, I still believe this makes a better default than the current default of total CPU count, as it mitigates the numerous downsides of a GOMAXPROCS mismatch, which in my opinion are more extreme than lost burst ability. I do expect some workloads will want to manually increase GOMAXPROCS to allow bursts.
Additionally, the CPU quota and period may average to a fractional number of CPUs (e.g., 0.5 or 2.5). GOMAXPROCS does not allow fractional parallelism. If the quota is less than 1, we must round up. If the quota is greater than 1, we can either round up or down. Arguments can go either way. If we round down, we are very unlikely to exceed the limit, which avoids potential freezes for the remainder of the period. If we round up, we can get better utilization by fully using the quota, and this theoretically might better indicate to monitoring systems that we are starved of CPU. I’ve selected the latter approach of rounding up for consistency with the GOMAXPROCS=2 rationale below, but I don’t feel strongly. In particular, uber-go/automaxprocs#13 and #33803 (comment) make reasonable arguments for rounding down based on the assumption that fractional requests are intended to support additional small processes outside of the Go application.
The CPU quota limit specifies a minimum of GOMAXPROCS=2. That is, with a quota less than or equal to 1, we will round up GOMAXPROCS all the way to 2. GOMAXPROCS=1 disables all parallelism in the Go scheduler, which can cause surprising effects like GC workers temporarily “pausing” the application while the Go runtime switches back and forth between application goroutines and GC worker goroutines. Additionally, I consider a CPU quota less than 1 to be an indication that a workload is bursty, since it must be to avoid hitting the limit. Thus we can take advantage of the bursty nature to allow the runtime itself to burst and avoid GOMAXPROCS=1 pitfalls. If the number of logical or affinity mask CPUs is 1, we will still set GOMAXPROCS=1, as there is definitely no additional parallelism available.
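For a quick illustration of the difference between the two rounding policies discussed above (round up with a minimum of 2 as proposed, versus automaxprocs-style round down with a minimum of 1, described later in the comparison section):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	for _, limit := range []float64{0.5, 2.5, 3.9} {
		up := math.Max(2, math.Ceil(limit))    // proposed: round up, minimum 2
		down := math.Max(1, math.Floor(limit)) // automaxprocs default: round down, minimum 1
		fmt.Printf("limit %.1f -> round up: %v, round down: %v\n", limit, up, down)
	}
}
```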
Currently, if you wish to reset GOMAXPROCS to the default value (such as to override the GOMAXPROCS environment variable), you use runtime.GOMAXPROCS(runtime.NumCPU()). Note that NumCPU already takes CPU affinity into account.

All inputs to runtime.GOMAXPROCS are already well defined (>0 sets the passed value, <=0 returns the current value), so I’ve defined a new function (runtime.SetDefaultGOMAXPROCS) which performs the lookup and updates GOMAXPROCS. I’m not particularly attached to this. Some alternatives include:

- Changing runtime.NumCPU to also consider the CPU quota. This feels like a bad idea to me because creating per-CPU caches would need an actual CPU count, not a quota, but that is extremely niche. Also, this call is currently defined as never updating after process start. It would need to update if users want to use this to discover changes to the quota.
- A runtime.CPUQuota API that just returns the quota. This is a bit more flexible, as the result can be used for other purposes. The main downside is that reimplementing the default GOMAXPROCS behavior is complicated. Something like: runtime.GOMAXPROCS(min(runtime.NumCPU(), max(2, math.Ceil(runtime.CPUQuota()))))
- A sentinel runtime.DefaultGOMAXPROCS = math.Int32Max that could be passed to runtime.GOMAXPROCS(). This is questionably backwards compatible, but setting a huge GOMAXPROCS is likely so slow that I doubt anyone does so.

The runtime will automatically update GOMAXPROCS if the CPU quota (or affinity) changes, to accommodate container runtimes that change limits online. In particular, it would be unfortunate and confusing if CPU limits on a container are increased but the Go application never used the additional resources.
Automatic updates to GOMAXPROCS consider changes to the CPU affinity mask in addition to the cgroup limit. Theoretically affinity changes should be reflected in runtime.NumCPU, but that call is defined as never changing after startup, which is unfortunate.

Implementation note: The CPU cgroup configuration files (cpu.cfs_quota_us, etc.) do not support an explicit notification mechanism (such as poll(2)) when their values change. The only usable notification mechanisms are vfs file watches like inotify(7)/fanotify(7). sched_getaffinity(2) also has no notification mechanism. As a result, my current thinking is that we will detect changes via a low-frequency reread of the files in sysmon. We will scan with a minimum period of 30s, up to the maximum sysmon period (1 minute, due to forced GCs).

Note that container runtimes often mount a limited cgroupfs revealing only the leaf cgroup. This would prevent the runtime from walking up the hierarchy to check for more restrictive parent cgroups. I suspect this is a minor issue, as I expect it is rare to have more restrictive parent cgroups, since that doesn't have much utility. It may even be OK to simply ignore parent cgroups at all times.
A major downside of this proposal is that it has no impact on container runtime users that set a CPU request but no limit. This is a very common configuration, and will see no change from the status quo, which is unfortunate (note that Uber’s automaxprocs also does nothing for those users). Still, this proposal is better for users that do set a limit, and should not impede future changes for users with only a request.
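Returning to the periodic scan in the implementation note above, a userspace emulation could look like the following sketch. This is not the proposed sysmon implementation; the cgroup v2 path, the 30s period, and the rounding details are assumptions carried over from earlier in this proposal. It recomputes the default from cpu.max and applies it only when it changes.

```go
package main

import (
	"math"
	"os"
	"runtime"
	"strconv"
	"strings"
	"time"
)

// fromCPUMax computes a GOMAXPROCS value from a cgroup v2 cpu.max file,
// falling back to numCPU when there is no limit or the file can't be read.
func fromCPUMax(path string, numCPU int) int {
	data, err := os.ReadFile(path)
	if err != nil {
		return numCPU
	}
	f := strings.Fields(string(data))
	if len(f) != 2 || f[0] == "max" {
		return numCPU
	}
	quota, err1 := strconv.ParseFloat(f[0], 64)
	period, err2 := strconv.ParseFloat(f[1], 64)
	if err1 != nil || err2 != nil || period <= 0 {
		return numCPU
	}
	n := int(math.Max(2, math.Ceil(quota/period)))
	if numCPU < n {
		return numCPU
	}
	return n
}

func main() {
	const path = "/sys/fs/cgroup/cpu.max"
	for range time.Tick(30 * time.Second) {
		// Only touch GOMAXPROCS when the computed default actually changed.
		if want := fromCPUMax(path, runtime.NumCPU()); runtime.GOMAXPROCS(0) != want {
			runtime.GOMAXPROCS(want)
		}
	}
}
```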
This proposal is primarily limited to Linux. If other OSes have similar CPU limit mechanisms, I think it would make sense to support those as well in future proposals. The automatic updates of GOMAXPROCS based on changes to CPU scheduling affinity will affect all OSes.
Today’s GOMAXPROCS default is conceptually close to static (“number of CPUs”). With this proposal, I see the runtime moving more towards a dynamic “the runtime selects a good value” approach, which I think is a beneficial move to make if we want to make additional changes in the future, such as eliminating a fixed GOMAXPROCS entirely.
Comparison to go.uber.org/automaxprocs

go.uber.org/automaxprocs is a popular package for automatically setting GOMAXPROCS for container workloads. This proposal is very similar to automaxprocs, effectively an upstream version. For completeness, the major differences between this proposal and automaxprocs are:

- automaxprocs by default has a minimum GOMAXPROCS of 1. This proposal has a minimum of 2.
- automaxprocs by default rounds fractional limits down. This proposal rounds up.
- automaxprocs by default logs changes it makes with log.Printf. This proposal does not log.
- automaxprocs is more configurable (minimum GOMAXPROCS, rounding, logging). None of these are configurable in this proposal.
- automaxprocs does not automatically update GOMAXPROCS when the quota changes. This proposal does.
- automaxprocs always uses the CPU quota to set GOMAXPROCS. This proposal will use the number of logical CPUs or the CPUs available in sched_getaffinity(2) if those are less than the CPU quota.
- automaxprocs does not appear to support mixed cgroup v1 and v2 controllers. This proposal does.

Open questions
GOMAXPROCS cfs-aware on GOOS=linux #33803 (comment) makes reasonable arguments for rounding down based on the assumption that fractional requests are intended to support additional small processes outside of the Go application.

Implementation
I intend to implement this proposal for Go 1.25 if it is accepted.
Appendix
GOMAXPROCS and CPU quota difference example
To illustrate the potential downside of setting GOMAXPROCS to the CPU limit for bursty applications, consider an idealized example application:

- The machine has 10 logical CPUs.
- The cgroup CPU quota = 200ms and period = 100ms (an average limit of 2 CPUs).
- Each request requires 50ms of CPU time, parallelizable across any number of goroutines, and at most one request arrives per 100ms period.

Without this proposal, GOMAXPROCS=10. When a request is received, the 50ms of work is spread across 10 goroutines and the request completes with a latency of 5ms. The application never exceeds the cgroup quota because it only uses 50ms of CPU time in each 100ms period.

With this proposal, GOMAXPROCS=2. When a request is received, the 50ms of work is spread across 2 goroutines and the request completes with a latency of 25ms. The application again never exceeds the cgroup quota.

This is a fairly extreme example, with the application completely idle most of the time, but that isn't necessary to get subpar behavior.
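A trivial arithmetic check of the latencies in the example, under the stated assumption that the 50ms of CPU work parallelizes perfectly across GOMAXPROCS goroutines:

```go
package main

import "fmt"

func main() {
	const work = 50.0 // ms of CPU time per request
	for _, gomaxprocs := range []float64{10, 2} {
		fmt.Printf("GOMAXPROCS=%v -> request latency %vms\n", gomaxprocs, work/gomaxprocs)
	}
}
```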
The "CPU limit" that we compute from quota / period is simply the average parallelism available for an entire period. Any subsection of the period where an application uses less than the average CPU limit provides headroom to use more CPU later in the same period. e.g., with a CPU limit of 4, if an application uses only 2 CPUs for the first half of the period, it could use 6 CPUs for the second half and still achieve an average of 4.

With GOMAXPROCS set to the average CPU limit, Go prevents using more than the average parallelism at any point, so any headroom gained from using less is simply wasted.

CPU request based GC CPU
This is out of scope for this proposal, but one partially-baked idea to help users with a CPU request but no limit is to leave GOMAXPROCS alone, but to restrict the GC’s CPU target and idle workers to the CPU request.
Today, while the GC is running, the Go runtime will use 25% of GOMAXPROCS as “dedicated” GC workers. i.e., it will run on 25% of Ps even if it needs to deschedule user goroutines to do so. Beyond that, it will run “idle” workers on every available idle P. This can cause big CPU spikes described above.
Instead, we could run dedicated workers on 25% of “CPU request” Ps, and only idle workers on Ps up to a total CPU usage of the CPU request. This would resolve the big spikes caused by the GC.
The primary risk of this change is that if the GC can no longer keep up with the allocation rate of the application, then the GC will force GC assists on user goroutines. This is most likely if the application is using more than its CPU request. One possible mitigation would be to adjust the GC target CPU up if the application is using more than the CPU request.
This depends on having a mechanism to determine what the container CPU request value is.
Note that this is an internal implementation detail, so it does not need to be a proposal.
cc @golang/runtime, @sywhang @chabbimilind (for automaxprocs), @thepudds (for previous prototype experience), @thockin (for Kubernetes)

Footnotes
[1] Go intends for system calls and cgo calls that remain on CPU to count towards GOMAXPROCS. However, it is difficult to efficiently determine if these calls are blocked (off CPU), so the runtime uses a heuristic that assumes that calls that take a long wall time are blocked.