|
| 1 | +## Isolated CPU affinity transition |
| 2 | + |
| 3 | +Kernel commit 46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76, introduced in 5.7, |
| 4 | +changed a previously deterministic scheduling behavior: tasks are now |
| 5 | +distributed across the CPU cores of a cgroup cpuset. As a result, `runc exec` |
| 6 | +might be impacted under some circumstances, for example when a container has |
| 7 | +been created within a cgroup cpuset composed entirely of isolated CPU cores, |
| 8 | +usually set with the `nohz_full` and/or `isolcpus` kernel boot parameters. |
| 9 | + |
| 10 | +Some containerized real-time applications rely on this deterministic |
| 11 | +behavior and use the first CPU core to run a slow thread while the other CPU |
| 12 | +cores are fully used by real-time threads with the SCHED_FIFO policy. |
| 13 | +Such applications can prevent the runc process from joining a container when the |
| 14 | +runc process is randomly scheduled on a CPU core owned by a real-time thread. |
| 15 | + |
| 16 | +Runc introduces a way to restore this behavior by adding the following |
| 17 | +annotation to the container runtime spec (`config.json`): |
| 18 | + |
| 19 | +`org.opencontainers.runc.exec.isolated-cpu-affinity-transition` |
| 20 | + |
| 21 | +This annotation can take one of the following values: |
| 22 | + |
| 23 | +* `temporary`: temporarily set the runc process CPU affinity to the first |
| 24 | +isolated CPU core of the container cgroup cpuset. |
| 25 | +* `definitive`: permanently set the runc process CPU affinity to the first |
| 26 | +isolated CPU core of the container cgroup cpuset. |
| 27 | + |
| 28 | +For example: |
| 29 | + |
| 30 | +```json |
| 31 | + "annotations": { |
| 32 | + "org.opencontainers.runc.exec.isolated-cpu-affinity-transition": "temporary" |
| 33 | + } |
| 34 | +``` |
| 35 | + |
| 36 | +__WARNING:__ `definitive` requires a kernel >= 6.2; it also works with RHEL 9 and |
| 37 | +above. |
| 38 | + |
| 39 | +### How does it work? |
| 40 | + |
| 41 | +When enabled, during `runc exec`, runc looks for the `nohz_full` kernel |
| 42 | +boot parameter value and considers the CPUs in the list as isolated. It doesn't |
| 43 | +look for the `isolcpus` boot parameter; it assumes that the `isolcpus` value is |
| 44 | +identical to `nohz_full` when specified. If the `nohz_full` parameter is not found, |
| 45 | +runc also attempts to read the list from `/sys/devices/system/cpu/nohz_full`. |
| 46 | + |
| 47 | +Once it gets the isolated CPU list, it returns an eligible CPU core within the |
| 48 | +container cgroup cpuset based on the following heuristics: |
| 49 | + |
| 50 | +* when there are no cpuset cores: no eligible CPU |
| 51 | +* when there are no isolated cores: no eligible CPU |
| 52 | +* when no cpuset core is in the isolated core list: no eligible CPU |
| 53 | +* when cpuset cores are all isolated cores: return the first CPU of the cpuset |
| 54 | +* when the cpuset mixes housekeeping and isolated cores: return the |
| 55 | + first housekeeping CPU (the first CPU not in the isolated CPUs). |
| 56 | + |
| 57 | +The returned CPU core is then used to set the `runc init` CPU affinity before |
| 58 | +the container cgroup cpuset transition. |
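
The heuristics above can be sketched as a small pure function. This is only an illustrative sketch, written in Python for brevity; it is not runc's actual implementation. The names `parse_cpu_list` and `eligible_cpu` are hypothetical, and the parser assumes plain CPU lists such as `4-7` or `0,2-3` (other kernel list syntaxes are not handled).

```python
def parse_cpu_list(text):
    """Parse a plain CPU list such as "0-3,8" into a sorted list of ints.

    Hypothetical helper: only "a", "a-b" and comma-separated combinations
    are supported in this sketch.
    """
    cpus = set()
    for part in text.strip().split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def eligible_cpu(cpuset, isolated):
    """Return the CPU used for the affinity transition, or None if no CPU is eligible."""
    cpuset_cpus = parse_cpu_list(cpuset)
    isolated_cpus = set(parse_cpu_list(isolated))

    # No cpuset cores, or no isolated cores: no eligible CPU.
    if not cpuset_cpus or not isolated_cpus:
        return None

    # No cpuset core is in the isolated list: no eligible CPU.
    if not any(cpu in isolated_cpus for cpu in cpuset_cpus):
        return None

    # Mixed cpuset: return the first housekeeping (non-isolated) CPU.
    housekeeping = [cpu for cpu in cpuset_cpus if cpu not in isolated_cpus]
    if housekeeping:
        return housekeeping[0]

    # All cpuset cores are isolated: return the first CPU of the cpuset.
    return cpuset_cpus[0]
```

For instance, with `nohz_full=4-7`, `eligible_cpu("4-7", "4-7")` selects CPU 4, a mixed cpuset `2-5` selects housekeeping CPU 2, and a cpuset `0-3` yields no eligible CPU.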
| 59 | + |
| 60 | +#### Transition example |
| 61 | + |
| 62 | +`nohz_full` has the isolated cores `4-7`. A container has been created with |
| 63 | +the cgroup cpuset `4-7` to only run on the isolated CPU cores 4 to 7. |
| 64 | +`runc exec` is called by a process with CPU affinity set to `0-3`. |
| 65 | + |
| 66 | +* with `temporary` transition: |
| 67 | + |
| 68 | + runc exec (affinity 0-3) -> runc init (affinity 4) -> container process (affinity 4-7) |
| 69 | + |
| 70 | +* with `definitive` transition: |
| 71 | + |
| 72 | + runc exec (affinity 0-3) -> runc init (affinity 4) -> container process (affinity 4) |
| 73 | + |
| 74 | +The difference between `temporary` and `definitive` is the container process |
| 75 | +affinity: `definitive` constrains the container process to run on the |
| 76 | +first isolated CPU core of the cgroup cpuset, while `temporary` restores the |
| 77 | +CPU affinity to match the container cgroup cpuset. |
| 78 | + |
| 79 | +The `definitive` transition might be helpful when `nohz_full` is used without |
| 80 | +`isolcpus`, to prevent runc and the container process from being noisy neighbours for |
| 81 | +real-time applications. |
| 82 | + |
| 83 | +### How to use it with Kubernetes? |
| 84 | + |
| 85 | +Kubernetes doesn't manage containers directly. Instead, it uses the Container Runtime |
| 86 | +Interface (CRI) to communicate with software that implements this interface and is responsible |
| 87 | +for managing the lifecycle of containers. Popular CRI implementations include Containerd |
| 88 | +and CRI-O. These implementations allow pod annotations to be passed to the container runtime |
| 89 | +via the container runtime spec. Currently, runc is the default runtime for both. |
| 90 | + |
| 91 | +#### Containerd configuration |
| 92 | + |
| 93 | +Containerd CRI uses runc by default but requires an extra step to pass the annotation to runc. |
| 94 | +You have to whitelist `org.opencontainers.runc.exec.isolated-cpu-affinity-transition` as a pod |
| 95 | +annotation allowed to be passed to the container runtime in `/etc/containerd/config.toml`: |
| 96 | + |
| 97 | +```toml |
| 98 | +[plugins."io.containerd.grpc.v1.cri".containerd] |
| 99 | + default_runtime_name = "runc" |
| 100 | + [plugins."io.containerd.grpc.v1.cri".containerd.runtimes] |
| 101 | + [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc] |
| 102 | + runtime_type = "io.containerd.runc.v2" |
| 103 | + base_runtime_spec = "/etc/containerd/cri-base.json" |
| 104 | + pod_annotations = ["org.opencontainers.runc.exec.isolated-cpu-affinity-transition"] |
| 105 | +``` |
| 106 | + |
| 107 | +#### CRI-O configuration |
| 108 | + |
| 109 | +CRI-O doesn't require any extra step; however, some annotations could be excluded by |
| 110 | +configuration. |
| 111 | + |
| 112 | +#### Pod deployment example |
| 113 | + |
| 114 | +```yaml |
| 115 | +apiVersion: v1 |
| 116 | +kind: Pod |
| 117 | +metadata: |
| 118 | + name: demo-pod |
| 119 | + annotations: |
| 120 | + org.opencontainers.runc.exec.isolated-cpu-affinity-transition: "temporary" |
| 121 | +spec: |
| 122 | + containers: |
| 123 | + - name: demo |
| 124 | + image: registry.com/demo:latest |
| 125 | +``` |