Commit 6a2813f

Merge pull request #3923 from cclerget/issue-3922
Set temporary single CPU affinity before cgroup cpuset transition.
2 parents d0f803e + afc23e3 commit 6a2813f

14 files changed: 954 additions, 2 deletions
Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
## Isolated CPU affinity transition

Kernel commit 46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76, introduced in Linux 5.7,
changed a previously deterministic scheduling behavior: tasks are now distributed
across the CPU cores of a cgroup cpuset. As a result, `runc exec` might be
impacted under some circumstances, for example when a container has been
created within a cgroup cpuset entirely composed of isolated CPU cores,
usually set with the `nohz_full` and/or `isolcpus` kernel boot parameters.

Some containerized real-time applications rely on this deterministic
behavior and use the first CPU core to run a slow thread while the other CPU
cores are fully used by real-time threads with the SCHED_FIFO policy.
Such applications can prevent the runc process from joining a container when the
runc process is randomly scheduled on a CPU core owned by a real-time thread.

Runc introduces a way to restore this behavior by adding the following
annotation to the container runtime spec (`config.json`):

`org.opencontainers.runc.exec.isolated-cpu-affinity-transition`

This annotation can take one of these values:

* `temporary`: temporarily set the runc process CPU affinity to the first
isolated CPU core of the container cgroup cpuset.
* `definitive`: definitively set the runc process CPU affinity to the first
isolated CPU core of the container cgroup cpuset.

For example:

```json
"annotations": {
    "org.opencontainers.runc.exec.isolated-cpu-affinity-transition": "temporary"
}
```

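As an illustration of how a consumer of `config.json` could read and validate this annotation, here is a minimal Go sketch. The `spec` type and the `transitionMode` helper are hypothetical names introduced for this example and are not runc's actual API; only the annotation key and its two values come from the text above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// annotationKey is the annotation name documented above.
const annotationKey = "org.opencontainers.runc.exec.isolated-cpu-affinity-transition"

// spec models only the part of config.json this sketch needs (hypothetical type).
type spec struct {
	Annotations map[string]string `json:"annotations"`
}

// transitionMode is a hypothetical helper: it returns "temporary",
// "definitive", or "" when the annotation is absent, and an error for any
// other value.
func transitionMode(configJSON []byte) (string, error) {
	var s spec
	if err := json.Unmarshal(configJSON, &s); err != nil {
		return "", err
	}
	mode, ok := s.Annotations[annotationKey]
	if !ok {
		return "", nil
	}
	switch mode {
	case "temporary", "definitive":
		return mode, nil
	default:
		return "", fmt.Errorf("unsupported value %q for %s", mode, annotationKey)
	}
}

func main() {
	cfg := []byte(`{"annotations": {"org.opencontainers.runc.exec.isolated-cpu-affinity-transition": "temporary"}}`)
	mode, err := transitionMode(cfg)
	fmt.Println(mode, err)
}
```

Rejecting unknown values is a design choice of this sketch; the text above does not say how runc treats them.
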
__WARNING:__ `definitive` requires a kernel >= 6.2; it also works with RHEL 9 and
above.

### How it works?

When enabled, during `runc exec`, runc looks for the `nohz_full` kernel
boot parameter and considers the CPUs in its list as isolated. It doesn't
look for the `isolcpus` boot parameter; it just assumes that the `isolcpus` value,
when specified, is identical to `nohz_full`. If the `nohz_full` parameter is not found,
runc also attempts to read the list from `/sys/devices/system/cpu/nohz_full`.

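The lookup order described above can be sketched in Go as follows. This is an illustration under stated assumptions, not runc's implementation: `nohzFullFromCmdline` and `isolatedCPUList` are hypothetical helpers, and the sysfs read is passed in as a function so the logic can be exercised without a real kernel.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// nohzFullFromCmdline (hypothetical helper) extracts the value of the
// nohz_full= parameter from a kernel command line, if present.
func nohzFullFromCmdline(cmdline string) (string, bool) {
	const prefix = "nohz_full="
	for _, field := range strings.Fields(cmdline) {
		if strings.HasPrefix(field, prefix) {
			return strings.TrimPrefix(field, prefix), true
		}
	}
	return "", false
}

// isolatedCPUList (hypothetical helper) prefers the boot parameter and falls
// back to the sysfs reader; isolcpus is deliberately not consulted.
func isolatedCPUList(cmdline string, readSysfs func() (string, error)) string {
	if v, ok := nohzFullFromCmdline(cmdline); ok {
		return v
	}
	v, err := readSysfs()
	if err != nil {
		return ""
	}
	return strings.TrimSpace(v)
}

func main() {
	// Sample command line for illustration; a real caller would read /proc/cmdline.
	cmdline := "root=/dev/sda1 nohz_full=4-7 isolcpus=4-7"
	sysfs := func() (string, error) {
		b, err := os.ReadFile("/sys/devices/system/cpu/nohz_full")
		return string(b), err
	}
	fmt.Println(isolatedCPUList(cmdline, sysfs))
}
```
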
Once it has the isolated CPU list, it returns an eligible CPU core within the
container cgroup cpuset based on these heuristics:

* when there are no cpuset cores: no eligible CPU
* when there are no isolated cores: no eligible CPU
* when no cpuset core is in the isolated core list: no eligible CPU
* when cpuset cores are all isolated cores: return the first CPU of the cpuset
* when cpuset cores are mixed between housekeeping/isolated cores: return the
  first housekeeping CPU not in isolated CPUs.

The returned CPU core is then used to set the `runc init` CPU affinity before
the container cgroup cpuset transition.

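The heuristics above can be sketched in Go as follows. This is a minimal illustration, assuming CPU lists use the usual `4-7,9` range syntax; `parseCPUList` and `eligibleCPU` are hypothetical names for this example and do not necessarily match runc's internal functions.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parseCPUList (hypothetical helper) expands a list such as "0-3,8" into
// sorted CPU numbers. An empty string yields an empty list.
func parseCPUList(list string) ([]int, error) {
	list = strings.TrimSpace(list)
	if list == "" {
		return nil, nil
	}
	var cpus []int
	for _, part := range strings.Split(list, ",") {
		bounds := strings.SplitN(strings.TrimSpace(part), "-", 2)
		first, err := strconv.Atoi(bounds[0])
		if err != nil {
			return nil, err
		}
		last := first
		if len(bounds) == 2 {
			if last, err = strconv.Atoi(bounds[1]); err != nil {
				return nil, err
			}
		}
		for cpu := first; cpu <= last; cpu++ {
			cpus = append(cpus, cpu)
		}
	}
	sort.Ints(cpus)
	return cpus, nil
}

// eligibleCPU (hypothetical helper) applies the heuristics listed above.
// cpuset must be sorted; the boolean is false when no CPU is eligible.
func eligibleCPU(cpuset, isolated []int) (int, bool) {
	if len(cpuset) == 0 || len(isolated) == 0 {
		return 0, false
	}
	isIsolated := make(map[int]bool, len(isolated))
	for _, cpu := range isolated {
		isIsolated[cpu] = true
	}
	firstHousekeeping, overlap := -1, false
	for _, cpu := range cpuset {
		if isIsolated[cpu] {
			overlap = true
		} else if firstHousekeeping < 0 {
			firstHousekeeping = cpu
		}
	}
	switch {
	case !overlap: // no cpuset core is isolated
		return 0, false
	case firstHousekeeping < 0: // every cpuset core is isolated
		return cpuset[0], true
	default: // mixed housekeeping/isolated cpuset
		return firstHousekeeping, true
	}
}

func main() {
	cpuset, _ := parseCPUList("4-7")
	isolated, _ := parseCPUList("4-7")
	cpu, ok := eligibleCPU(cpuset, isolated)
	fmt.Println(cpu, ok)
}
```
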
#### Transition example

`nohz_full` has the isolated cores `4-7`. A container has been created with
the cgroup cpuset `4-7` to only run on the isolated CPU cores 4 to 7.
`runc exec` is called by a process with CPU affinity set to `0-3`.

* with `temporary` transition:

    runc exec (affinity 0-3) -> runc init (affinity 4) -> container process (affinity 4-7)

* with `definitive` transition:

    runc exec (affinity 0-3) -> runc init (affinity 4) -> container process (affinity 4)

The difference between `temporary` and `definitive` is the container process
affinity: `definitive` constrains the container process to run on the
first isolated CPU core of the cgroup cpuset, while `temporary` restores the
CPU affinity to match the container cgroup cpuset.

The `definitive` transition might be helpful when `nohz_full` is used without
`isolcpus`, to prevent runc and the container process from being noisy neighbours for
real-time applications.

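The two transitions in this example can be modeled as a small pure function. The sketch below does not change any real CPU affinity; the `stages` type and the `transition` function are hypothetical names used only to make the difference between the two modes explicit.

```go
package main

import "fmt"

// stages records the CPU affinity, as a CPU list string, at each step of
// the transition (hypothetical type for illustration).
type stages struct {
	Exec      string // affinity of the process calling runc exec
	Init      string // affinity of runc init before joining the cpuset
	Container string // affinity of the final container process
}

// transition (hypothetical helper) models both modes: eligible is the CPU
// picked by the heuristics, cpuset is the container cgroup cpuset.
func transition(mode, caller, cpuset, eligible string) (stages, error) {
	switch mode {
	case "temporary":
		// Affinity is restored to the whole cpuset after the transition.
		return stages{Exec: caller, Init: eligible, Container: cpuset}, nil
	case "definitive":
		// The container process stays pinned to the eligible CPU.
		return stages{Exec: caller, Init: eligible, Container: eligible}, nil
	default:
		return stages{}, fmt.Errorf("unknown mode %q", mode)
	}
}

func main() {
	for _, mode := range []string{"temporary", "definitive"} {
		s, err := transition(mode, "0-3", "4-7", "4")
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: runc exec (%s) -> runc init (%s) -> container process (%s)\n",
			mode, s.Exec, s.Init, s.Container)
	}
}
```
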
### How to use it with Kubernetes?

Kubernetes doesn't manage containers directly; instead it uses the Container Runtime
Interface (CRI) to communicate with software that implements this interface and is responsible
for managing the lifecycle of containers. Popular CRI implementations include Containerd
and CRI-O. These implementations allow pod annotations to be passed to the container runtime
via the container runtime spec. Currently runc is the default runtime for both.

#### Containerd configuration

Containerd CRI uses runc by default but requires an extra step to pass the annotation to runc.
You have to list `org.opencontainers.runc.exec.isolated-cpu-affinity-transition` as a pod
annotation allowed to be passed to the container runtime in `/etc/containerd/config.toml`:

```toml
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "runc"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
      runtime_type = "io.containerd.runc.v2"
      base_runtime_spec = "/etc/containerd/cri-base.json"
      pod_annotations = ["org.opencontainers.runc.exec.isolated-cpu-affinity-transition"]
```

#### CRI-O configuration

CRI-O doesn't require any extra step; however, some annotations could be excluded by
configuration.

#### Pod deployment example

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
  annotations:
    org.opencontainers.runc.exec.isolated-cpu-affinity-transition: "temporary"
spec:
  containers:
  - name: demo
    image: registry.com/demo:latest
```

features.go

Lines changed: 1 addition & 0 deletions
@@ -68,6 +68,7 @@ var featuresCommand = cli.Command{
 				"bundle",
 				"org.systemd.property.", // prefix form
 				"org.criu.config",
+				"org.opencontainers.runc.exec.isolated-cpu-affinity-transition",
 			},
 		}


libcontainer/cgroups/cgroups.go

Lines changed: 4 additions & 0 deletions
@@ -71,4 +71,8 @@ type Manager interface {

 	// OOMKillCount reports OOM kill count for the cgroup.
 	OOMKillCount() (uint64, error)
+
+	// GetEffectiveCPUs returns the effective CPUs of the cgroup, an empty
+	// value means that the cgroups cpuset subsystem/controller is not enabled.
+	GetEffectiveCPUs() string
 }

libcontainer/cgroups/fs/fs.go

Lines changed: 27 additions & 0 deletions
@@ -4,6 +4,8 @@ import (
 	"errors"
 	"fmt"
 	"os"
+	"path/filepath"
+	"strings"
 	"sync"

 	"golang.org/x/sys/unix"
@@ -263,3 +265,28 @@ func (m *Manager) OOMKillCount() (uint64, error) {

 	return c, err
 }
+
+func (m *Manager) GetEffectiveCPUs() string {
+	return GetEffectiveCPUs(m.Path("cpuset"), m.cgroups)
+}
+
+func GetEffectiveCPUs(cpusetPath string, cgroups *configs.Cgroup) string {
+	// Fast path.
+	if cgroups.CpusetCpus != "" {
+		return cgroups.CpusetCpus
+	} else if !strings.HasPrefix(cpusetPath, defaultCgroupRoot) {
+		return ""
+	}
+
+	// Iterates until it goes to the cgroup root path.
+	// It's required for containers in which cpuset controller
+	// is not enabled, in this case a parent cgroup is used.
+	for path := cpusetPath; path != defaultCgroupRoot; path = filepath.Dir(path) {
+		cpus, err := fscommon.GetCgroupParamString(path, "cpuset.effective_cpus")
+		if err == nil {
+			return cpus
+		}
+	}
+
+	return ""
+}

libcontainer/cgroups/fs2/fs2.go

Lines changed: 28 additions & 0 deletions
@@ -4,11 +4,13 @@ import (
 	"errors"
 	"fmt"
 	"os"
+	"path/filepath"
 	"strings"

 	"github.com/opencontainers/runc/libcontainer/cgroups"
 	"github.com/opencontainers/runc/libcontainer/cgroups/fscommon"
 	"github.com/opencontainers/runc/libcontainer/configs"
+	"github.com/opencontainers/runc/libcontainer/utils"
 )

 type parseError = fscommon.ParseError
@@ -32,6 +34,9 @@ func NewManager(config *configs.Cgroup, dirPath string) (*Manager, error) {
 		if err != nil {
 			return nil, err
 		}
+	} else {
+		// Clean path for safety.
+		dirPath = utils.CleanPath(dirPath)
 	}

 	m := &Manager{
@@ -316,3 +321,26 @@ func CheckMemoryUsage(dirPath string, r *configs.Resources) error {

 	return nil
 }
+
+func (m *Manager) GetEffectiveCPUs() string {
+	// Fast path.
+	if m.config.CpusetCpus != "" {
+		return m.config.CpusetCpus
+	} else if !strings.HasPrefix(m.dirPath, UnifiedMountpoint) {
+		return ""
+	}
+
+	// Iterates until it goes outside of the cgroup root path.
+	// It's required for containers in which cpuset controller
+	// is not enabled, in this case a parent cgroup is used.
+	outsidePath := filepath.Dir(UnifiedMountpoint)
+
+	for path := m.dirPath; path != outsidePath; path = filepath.Dir(path) {
+		cpus, err := fscommon.GetCgroupParamString(path, "cpuset.cpus.effective")
+		if err == nil {
+			return cpus
+		}
+	}
+
+	return ""
+}
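
The parent walk used by `GetEffectiveCPUs` can be illustrated in isolation. The sketch below is not the code from this commit: `firstReadable` and `errNotFound` are hypothetical names, the file read is injected as a function so no real cgroup filesystem is needed, and the `path` package is used instead of `path/filepath` to keep the example platform independent.

```go
package main

import (
	"errors"
	"fmt"
	"path"
	"strings"
)

// errNotFound is what the fake reader returns for directories that have no
// cpuset file (hypothetical, for illustration).
var errNotFound = errors.New("no such file")

// firstReadable (hypothetical helper) walks from dir up to root, inclusive,
// and returns the first value the reader yields. It returns "" when dir is
// not under root or when nothing is readable.
func firstReadable(dir, root string, read func(dir string) (string, error)) string {
	if !strings.HasPrefix(dir, root) {
		return ""
	}
	stop := path.Dir(root) // first directory outside the cgroup root
	for p := path.Clean(dir); p != stop; {
		if v, err := read(p); err == nil {
			return v
		}
		parent := path.Dir(p)
		if parent == p { // reached the filesystem root, stop defensively
			break
		}
		p = parent
	}
	return ""
}

func main() {
	// Only the parent cgroup exposes a cpuset value in this fake hierarchy.
	files := map[string]string{"/sys/fs/cgroup/parent": "0-3"}
	read := func(dir string) (string, error) {
		if v, ok := files[dir]; ok {
			return v, nil
		}
		return "", errNotFound
	}
	fmt.Println(firstReadable("/sys/fs/cgroup/parent/child", "/sys/fs/cgroup", read))
}
```
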

libcontainer/cgroups/systemd/v1.go

Lines changed: 4 additions & 0 deletions
@@ -411,3 +411,7 @@ func (m *LegacyManager) Exists() bool {
 func (m *LegacyManager) OOMKillCount() (uint64, error) {
 	return fs.OOMKillCount(m.Path("memory"))
 }
+
+func (m *LegacyManager) GetEffectiveCPUs() string {
+	return fs.GetEffectiveCPUs(m.Path("cpuset"), m.cgroups)
+}

libcontainer/cgroups/systemd/v2.go

Lines changed: 4 additions & 0 deletions
@@ -514,3 +514,7 @@ func (m *UnifiedManager) Exists() bool {
 func (m *UnifiedManager) OOMKillCount() (uint64, error) {
 	return m.fsMgr.OOMKillCount()
 }
+
+func (m *UnifiedManager) GetEffectiveCPUs() string {
+	return m.fsMgr.GetEffectiveCPUs()
+}

libcontainer/container_linux_test.go

Lines changed: 4 additions & 0 deletions
@@ -69,6 +69,10 @@ func (m *mockCgroupManager) GetFreezerState() (configs.FreezerState, error) {
 	return configs.Thawed, nil
 }

+func (m *mockCgroupManager) GetEffectiveCPUs() string {
+	return ""
+}
+
 type mockProcess struct {
 	_pid    int
 	started uint64
