| 1 | +--- |
| 2 | +title: Pod Resources Proxy |
| 3 | +authors: |
| 4 | +- "@ZiMengSheng" |
| 5 | +- "@ferris-cx" |
| 6 | +reviewers: |
| 7 | +- "@hormes" |
| 8 | +- "@saintube" |
| 9 | +creation-date: 2024-12-10 |
| 10 | +last-updated: 2025-02-17 |
| 11 | +--- |
| 12 | +# Pod Resources Proxy |
| 13 | + |
| 14 | +## Motivation |
| 15 | +In Kubernetes, the kubelet allocates devices, but monitoring those devices and making them visible inside containers is left to device vendors. |
| 16 | +For example, GPU monitoring is handled by components such as NVIDIA's DCGM Exporter, while inserting network interfaces into a Pod's network namespace is handled by components such as Multus-CNI. |
| 17 | + |
| 18 | +The question is: how can components such as Multus-CNI or DCGM Exporter learn which devices the kubelet has allocated to Pods? The device allocator must expose this information through an interface. |
| 19 | + |
| 20 | +Fortunately, the kubelet already provides the pod-resources endpoint, which allows third-party consumers to inspect the mapping between devices and pods. |
| 21 | +This interface has been adopted by Multus-CNI and DCGM Exporter to obtain the network cards and GPUs allocated to pods. |
| 22 | + |
| 23 | +However, Koordinator allocates devices in a centralized scheduler, so the kubelet holds no allocation information for those devices and its pod-resources endpoint cannot report them. An adaptation layer is therefore required. |
| 24 | + |
| 25 | +### Goals |
| 26 | + |
| 27 | +1. Provide an interface identical to the Kubelet PodResources interface, served at a different socket address. |
| 28 | +This allows third-party consumers to query the device allocation for Pods in Koordinator without modifying any code. |
| 29 | +2. Document the changes that Multus and DCGM need in their deployment YAML files to obtain device allocation information for Pods in Koordinator. |
| 30 | + |
| 31 | +## Proposal |
| 32 | + |
| 33 | +### User Stories |
| 34 | + |
| 35 | +#### Story 1 |
| 36 | + |
| 37 | +As a user, when using Koordinator to allocate RDMA NICs for Pods, I only need to point the `host-var-lib-kubelet` hostPath volume in the Multus-CNI deployment file at Koordlet's pod-resources directory: |
| 38 | + |
| 39 | +```yaml |
| 40 | +apiVersion: apps/v1 |
| 41 | +kind: DaemonSet |
| 42 | +metadata: |
| 43 | +  name: kube-multus-ds |
| 44 | +  namespace: kube-system |
| 45 | +  labels: |
| 46 | +    tier: node |
| 47 | +    app: multus |
| 48 | +    name: multus |
| 49 | +spec: |
| 50 | +  selector: |
| 51 | +    matchLabels: |
| 52 | +      name: multus |
| 53 | +  updateStrategy: |
| 54 | +    type: RollingUpdate |
| 55 | +  template: |
| 56 | +    metadata: |
| 57 | +      labels: |
| 58 | +        tier: node |
| 59 | +        app: multus |
| 60 | +        name: multus |
| 61 | +    spec: |
| 62 | +      containers: |
| 63 | +        - name: kube-multus |
| 64 | +          volumeMounts: |
| 65 | +            ... |
| 66 | +            - name: host-var-lib-kubelet |
| 67 | +              mountPath: /var/lib/kubelet/pod-resources |
| 68 | +              mountPropagation: HostToContainer |
| 69 | +            ... |
| 70 | +      volumes: |
| 71 | +        ... |
| 72 | +        - name: host-var-lib-kubelet |
| 73 | +          hostPath: |
| 74 | +            path: /var/run/koordlet/pod-resources |
| 75 | +        ... |
| 76 | +``` |
| 77 | + |
| 78 | +#### Story 2 |
| 79 | +As a user, I have adopted Koordinator to allocate GPUs for Pods. When I want to monitor the GPUs using DCGM, I can simply modify the DCGM deployment YAML file as follows: |
| 80 | + |
| 81 | +```yaml |
| 82 | +apiVersion: apps/v1 |
| 83 | +kind: DaemonSet |
| 84 | +metadata: |
| 85 | +  name: dcgm-exporter |
| 86 | +  namespace: dcgm-namespace |
| 87 | +  labels: |
| 88 | +    app.kubernetes.io/component: "dcgm-exporter" |
| 89 | +spec: |
| 90 | +  updateStrategy: |
| 91 | +    type: RollingUpdate |
| 92 | +  selector: |
| 93 | +    matchLabels: |
| 94 | +      app.kubernetes.io/component: "dcgm-exporter" |
| 95 | +  template: |
| 96 | +    metadata: |
| 97 | +      labels: |
| 98 | +        app.kubernetes.io/component: "dcgm-exporter" |
| 99 | +    spec: |
| 100 | +      volumes: |
| 101 | +        - name: "pod-gpu-resources" |
| 102 | +          hostPath: |
| 103 | +            path: "/var/run/koordlet/pod-resources" |
| 104 | +      containers: |
| 105 | +        - name: exporter |
| 106 | +          volumeMounts: |
| 107 | +            - name: "pod-gpu-resources" |
| 108 | +              readOnly: true |
| 109 | +              mountPath: "/var/lib/kubelet/pod-resources" |
| 110 | +``` |
| 111 | + |
| 112 | +### Design |
| 113 | + |
| 114 | +Koordinator records the devices allocated to a Pod in that Pod's annotations. We can therefore create a node-level proxy that will: |
| 115 | + |
| 116 | +1. Access the Kubelet's PodResources interface to obtain the raw results. |
| 117 | +2. Access the Kubelet's /pods interface to retrieve the annotation information for all Pods on the node. |
| 118 | +3. Parse the annotations from step 2 to extract the devices allocated to each Pod, fill them into the results obtained in step 1, and return the merged result to the caller. |
| 119 | + |
| 120 | +We implement this proxy in the statesInformer module of Koordlet. |
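
The three steps above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not Koordlet's actual implementation: the types are simplified stand-ins for the PodResources response, and the annotation key `example.com/device-allocated`, its JSON layout, the device-type to resource-name mapping, and the choice to attribute all devices to the Pod's first container are all hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// Simplified stand-ins for the kubelet PodResources response (step 1).
type ContainerDevices struct {
	ResourceName string
	DeviceIDs    []string
}

type ContainerResources struct {
	Name    string
	Devices []ContainerDevices
}

type PodResources struct {
	Name, Namespace string
	Containers      []ContainerResources
}

// Pod is the subset of the kubelet /pods output that the proxy needs (step 2).
type Pod struct {
	Name, Namespace string
	Annotations     map[string]string
}

// Hypothetical annotation key and device-type to resource-name mapping.
const allocationAnnotation = "example.com/device-allocated"

var resourceNames = map[string]string{"gpu": "nvidia.com/gpu", "rdma": "example.com/rdma"}

// devicesFromAnnotations parses a hypothetical JSON payload such as
// {"gpu":[{"id":"GPU-0"}]} into PodResources-style device entries.
func devicesFromAnnotations(annotations map[string]string) ([]ContainerDevices, error) {
	raw, ok := annotations[allocationAnnotation]
	if !ok {
		return nil, nil // no devices recorded for this Pod
	}
	var byType map[string][]struct {
		ID string `json:"id"`
	}
	if err := json.Unmarshal([]byte(raw), &byType); err != nil {
		return nil, err
	}
	types := make([]string, 0, len(byType))
	for t := range byType {
		types = append(types, t)
	}
	sort.Strings(types) // keep the output order deterministic
	var out []ContainerDevices
	for _, t := range types {
		name, known := resourceNames[t]
		if !known {
			continue // skip device types without a mapped resource name
		}
		ids := make([]string, 0, len(byType[t]))
		for _, a := range byType[t] {
			ids = append(ids, a.ID)
		}
		out = append(out, ContainerDevices{ResourceName: name, DeviceIDs: ids})
	}
	return out, nil
}

// fillDevices merges annotation-derived devices into the kubelet result (step 3).
// Simplification: all devices are attributed to the Pod's first container.
func fillDevices(kubeletResult []PodResources, pods []Pod) ([]PodResources, error) {
	annotationsByPod := make(map[string]map[string]string, len(pods))
	for _, p := range pods {
		annotationsByPod[p.Namespace+"/"+p.Name] = p.Annotations
	}
	out := make([]PodResources, 0, len(kubeletResult))
	for _, pr := range kubeletResult {
		devices, err := devicesFromAnnotations(annotationsByPod[pr.Namespace+"/"+pr.Name])
		if err != nil {
			return nil, fmt.Errorf("pod %s/%s: %w", pr.Namespace, pr.Name, err)
		}
		if len(devices) > 0 && len(pr.Containers) > 0 {
			// Copy the slices so the caller's input is not mutated.
			containers := append([]ContainerResources(nil), pr.Containers...)
			existing := append([]ContainerDevices(nil), containers[0].Devices...)
			containers[0].Devices = append(existing, devices...)
			pr.Containers = containers
		}
		out = append(out, pr)
	}
	return out, nil
}

func main() {
	kubeletResult := []PodResources{{Name: "train-0", Namespace: "default",
		Containers: []ContainerResources{{Name: "main"}}}}
	pods := []Pod{{Name: "train-0", Namespace: "default", Annotations: map[string]string{
		allocationAnnotation: `{"gpu":[{"id":"GPU-0"},{"id":"GPU-1"}]}`}}}
	merged, err := fillDevices(kubeletResult, pods)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", merged)
}
```

Because the merged result keeps the shape of the kubelet response, a consumer that already understands the PodResources interface would not need to change how it reads the data, only which socket it connects to.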
| 121 | + |
| 122 | +## Alternatives |
| 123 | + |
| 124 | +### Modifying Multus-CNI or DCGM Code |
| 125 | +This approach is invasive for Multus and DCGM: Koordinator would have to fork and maintain the corresponding repositories, which carries a high maintenance cost. |
| 126 | +### Obtaining Device Information Allocated to Pods Through the NRI Interface |
| 127 | +NRI's RunPodSandbox hook is invoked after the CNI call. With this method, Multus-CNI therefore cannot obtain device allocation information at the time CNI ADD is called. |
| 128 | +### Using DRA Allocation Logic |
| 129 | +Currently, DRA does not support requesting devices through plain extended resources. In addition, on the node side, Multus-CNI and DCGM would still need to understand the allocation information carried by ResourceClaims. |