Commit af357ca

ferris-cx and wangjianyu.wjy authored
docs: pod resources proxy (#2292)
Signed-off-by: wangjianyu.wjy <[email protected]> Co-authored-by: wangjianyu.wjy <[email protected]>
1 parent a12423b commit af357ca

1 file changed: 129 additions & 0 deletions
---
title: Pod Resources Proxy
authors:
  - "@ZiMengSheng"
  - "@ferris-cx"
reviewers:
  - "@hormes"
  - "@saintube"
creation-date: 2024-12-10
last-updated: 2025-02-17
---
# Pod Resources Proxy

## Motivation
In Kubernetes, devices are allocated by the kubelet, but monitoring those devices and making them visible inside containers is left to device vendors.
For example, GPU monitoring is handled by components such as NVIDIA's DCGM Exporter, while attaching network interfaces to a Pod's network namespace is handled by components such as Multus-CNI.

This raises a question: how do components like Multus-CNI or DCGM Exporter know which devices the kubelet has allocated to Pods? The device allocator must expose this information through an interface.

Fortunately, the kubelet already provides the pod-resources endpoint, which lets third-party consumers inspect the mapping between devices and Pods.
Multus-CNI and DCGM Exporter already use this interface to discover the network cards and GPUs allocated to Pods.

However, Koordinator allocates devices in a centralized scheduler, so the kubelet has no device allocation information and its pod-resources endpoint cannot report the devices that Koordinator allocated. An adaptation is required.

### Goals

1. Provide an interface identical to the kubelet PodResources interface, but served at a different socket address.
   This allows third-party consumers to query the device allocation of Pods in Koordinator without modifying any code.
2. Document the changes that Multus and DCGM need in their deployment YAML files to obtain device allocation information for Pods in Koordinator.

## Proposal

### User Stories

#### Story 1

As a user who allocates RDMA NICs to Pods with Koordinator, I only need to modify the Multus-CNI deployment file as follows:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-multus-ds
  namespace: kube-system
  labels:
    tier: node
    app: multus
    name: multus
spec:
  selector:
    matchLabels:
      name: multus
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        tier: node
        app: multus
        name: multus
    spec:
      containers:
        - name: kube-multus
          volumeMounts:
            ...
            # Multus keeps reading the kubelet's default pod-resources path.
            - name: host-var-lib-kubelet
              mountPath: /var/lib/kubelet/pod-resources
              mountPropagation: HostToContainer
            ...
      volumes:
        ...
        # The host directory behind that path is the Koordlet proxy's.
        - name: host-var-lib-kubelet
          hostPath:
            path: /var/run/koordlet/pod-resources
        ...
```

#### Story 2
As a user who has adopted Koordinator to allocate GPUs to Pods, when I want to monitor those GPUs with DCGM, I only need to modify the DCGM deployment YAML file as follows:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: dcgm-namespace
  labels:
    app.kubernetes.io/component: "dcgm-exporter"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/component: "dcgm-exporter"
  template:
    metadata:
      labels:
        app.kubernetes.io/component: "dcgm-exporter"
    spec:
      volumes:
        - name: "pod-gpu-resources"
          hostPath:
            # Koordlet proxy directory (same host path as in Story 1)
            # instead of the kubelet default /var/lib/kubelet/pod-resources.
            path: "/var/run/koordlet/pod-resources"
      containers:
        - name: exporter
          volumeMounts:
            - name: "pod-gpu-resources"
              readOnly: true
              mountPath: "/var/lib/kubelet/pod-resources"
```

### Design

In Koordinator, the devices allocated to a Pod are recorded in the Pod's annotations. We can therefore run a node-level proxy that:

1. Calls the kubelet's PodResources interface to obtain the raw results.
2. Calls the kubelet's /pods interface to retrieve the annotations of all Pods on the node.
3. Parses the annotations from step 2 to extract the devices allocated to each Pod, fills them into the results from step 1, and returns the merged result to the caller.

We implement this proxy in the statesInformer module of Koordlet.
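
Steps 2 and 3 can be sketched in Go roughly as follows. This is an illustration under stated assumptions, not the Koordlet implementation: the annotation key `example.com/device-allocated` and its JSON layout (resource name mapped to a list of device IDs) are hypothetical, `PodDevices` is a simplified stand-in for a pod entry in the PodResources response, and `annotationsByPod` and `fillDevices` are names invented for this example. Only the `items[].metadata` shape of the pod list follows the standard Kubernetes PodList JSON.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// deviceAllocatedAnnotation is a hypothetical annotation key for this sketch.
const deviceAllocatedAnnotation = "example.com/device-allocated"

// podList mirrors only the PodList fields this sketch needs.
type podList struct {
	Items []struct {
		Metadata struct {
			Name        string            `json:"name"`
			Namespace   string            `json:"namespace"`
			Annotations map[string]string `json:"annotations"`
		} `json:"metadata"`
	} `json:"items"`
}

// PodDevices is a simplified stand-in for one pod entry in the raw
// PodResources result: resource name -> allocated device IDs.
type PodDevices struct {
	Name      string
	Namespace string
	Devices   map[string][]string
}

// annotationsByPod decodes a pod list (step 2) and indexes each pod's
// annotations by "namespace/name".
func annotationsByPod(podsJSON []byte) (map[string]map[string]string, error) {
	var list podList
	if err := json.Unmarshal(podsJSON, &list); err != nil {
		return nil, fmt.Errorf("decode pod list: %w", err)
	}
	out := make(map[string]map[string]string, len(list.Items))
	for _, item := range list.Items {
		out[item.Metadata.Namespace+"/"+item.Metadata.Name] = item.Metadata.Annotations
	}
	return out, nil
}

// fillDevices overlays the devices found in annotations onto the raw
// results (step 3). Pods without the annotation are returned unchanged.
func fillDevices(raw []PodDevices, annotations map[string]map[string]string) ([]PodDevices, error) {
	merged := make([]PodDevices, 0, len(raw))
	for _, pod := range raw {
		key := pod.Namespace + "/" + pod.Name
		if value, ok := annotations[key][deviceAllocatedAnnotation]; ok {
			// Assumed layout: {"<resource name>": ["<device id>", ...]}
			var allocated map[string][]string
			if err := json.Unmarshal([]byte(value), &allocated); err != nil {
				return nil, fmt.Errorf("decode annotation of %s: %w", key, err)
			}
			// Build a new map so the caller's input is not mutated.
			devices := make(map[string][]string, len(pod.Devices)+len(allocated))
			for name, ids := range pod.Devices {
				devices[name] = ids
			}
			for name, ids := range allocated {
				devices[name] = ids
			}
			pod.Devices = devices
		}
		merged = append(merged, pod)
	}
	return merged, nil
}

func main() {
	podsJSON := []byte(`{"items":[{"metadata":{"name":"train-0","namespace":"default",
		"annotations":{"example.com/device-allocated":"{\"example.com/gpu\":[\"gpu-0\",\"gpu-1\"]}"}}}]}`)
	annotations, err := annotationsByPod(podsJSON)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The raw result has no devices because the kubelet did not allocate them.
	raw := []PodDevices{{Name: "train-0", Namespace: "default"}}
	merged, err := fillDevices(raw, annotations)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", merged)
}
```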

## Alternatives

### Modifying Multus-CNI or DCGM Code
This approach is invasive for Multus and DCGM: Koordinator would have to fork the corresponding repositories, which results in high maintenance costs.
### Obtaining Device Information Allocated to Pods Through the NRI Interface
The NRI RunPodSandbox call occurs after the CNI call. With this approach, Multus-CNI therefore cannot obtain device allocation information at the time CNI ADD is invoked.
### Using DRA Allocation Logic
Currently, DRA does not support requesting resources through extended resources alone. In addition, Multus-CNI or DCGM on the node side would need to be aware of the allocation information in ResourceClaims.
