
Commit 04f0736

Authored and committed by Eric Ernst

keps: sig-node: initial pod overhead proposal

Signed-off-by: Eric Ernst <[email protected]>

1 parent fc820ae · commit 04f0736

1 file changed: +300 −0 lines
---
title: Pod Overhead
authors:
  - "@egernst"
owning-sig: sig-node
participating-sigs:
reviewers:
  - "@tallclair"
  - "@derekwaynecarr"
  - "@dchen1107"
approvers:
  - TBD
editor: TBD
creation-date: 2019-02-26
last-updated: 2019-02-26
status: provisional
---

# pod overhead

This includes the Summary and Motivation sections.

## Table of Contents

Tools for generating: https://github.com/ekalinin/github-markdown-toc

## Release Signoff Checklist

- [ ] kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR)
- [ ] KEP approvers have set the KEP status to `implementable`
- [ ] Design details are appropriately documented
- [ ] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [ ] Graduation criteria is in place
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

**Note:** Any PRs to move a KEP to `implementable` or significant changes once it is marked `implementable` should be approved by each of the KEP approvers. If any of those
approvers is no longer appropriate, then changes to that list should be approved by the remaining approvers and/or the owning SIG (or SIG-arch for cross-cutting KEPs).

**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://github.com/kubernetes/enhancements/issues
[kubernetes/kubernetes]: https://github.com/kubernetes/kubernetes
[kubernetes/website]: https://github.com/kubernetes/website

## Summary

Sandbox runtimes introduce a non-negligible overhead at the pod level which must be accounted for
to enable effective scheduling, resource quota management, and constraining.

## Motivation

Pods have some resource overhead. In our traditional Linux container (Docker) approach, the
accounted overhead is limited to the infra (pause) container, but running a pod also incurs
overhead that is accounted to various system components, including the Kubelet (control loops),
Docker, the kernel (various resources), and fluentd (logs). The current approach is to reserve a
chunk of resources for the system components (system-reserved, kube-reserved, fluentd resource
request padding) and to ignore the (relatively small) overhead from the pause container, but this
approach is heuristic at best and doesn't scale well.

With sandbox pods, the pod overhead potentially becomes much larger, perhaps on the order of 100 MB.
For example, Kata Containers must run a guest kernel, the kata agent, an init system, and so on.
Since this overhead is too big to ignore, we need a way to account for it, starting with quota
enforcement and scheduling.

### Goals

* Provide a mechanism for accounting pod overheads which are specific to a given runtime solution

### Non-Goals

* Making runtimeClass selections

## Proposal

Augment the RuntimeClass custom resource definition and the `PodSpec` to introduce the field
`Overhead *ResourceRequirements`. This field represents the overhead associated with running a pod
for a given runtimeClass. A mutating admission controller is introduced which will update the `Overhead`
field in the workload's `PodSpec` to match what is provided for the selected RuntimeClass, if one is specified.

The pod cgroup created by the Kubelet will be sized as the sum of the containers'
`ResourceRequirements` fields, plus the Overhead associated with the pod.

The scheduler, resource quota handling, and the Kubelet's pod cgroup creation will take Overhead into
account, as well as the sum of the pod's container requests.

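To make the accounting concrete, the sketch below shows one way the effective (schedulable) request
could be computed from the container requests plus the pod Overhead. It is illustrative only: the
`computeEffectiveRequests` and `addResourceList` helpers are hypothetical, and the overhead is passed
in explicitly since the proposed `PodSpec` field does not exist yet.

```
// Sketch (not the actual scheduler code): how the effective, schedulable
// request of a pod could be computed once Overhead is populated.
package example

import (
	v1 "k8s.io/api/core/v1"
)

// addResourceList adds src into dst in place (dst[name] += src[name]).
func addResourceList(dst, src v1.ResourceList) {
	for name, qty := range src {
		if existing, ok := dst[name]; ok {
			existing.Add(qty)
			dst[name] = existing
		} else {
			dst[name] = qty.DeepCopy()
		}
	}
}

// computeEffectiveRequests sums the container requests and then adds the
// pod-level overhead requests; this is what scheduling and quota would see.
func computeEffectiveRequests(pod *v1.Pod, overhead *v1.ResourceRequirements) v1.ResourceList {
	total := v1.ResourceList{}
	for _, c := range pod.Spec.Containers {
		addResourceList(total, c.Resources.Requests)
	}
	if overhead != nil {
		addResourceList(total, overhead.Requests)
	}
	return total
}
```
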
### API Design

#### Pod overhead

Introduce a Pod.Spec.Overhead field on the pod to specify the pod's overhead.

```
Pod {
  Spec PodSpec {
    // Overhead is the resource overhead consumed by the Pod, not including
    // container resource usage. Users should leave this field unset.
    // +optional
    Overhead *ResourceRequirements
  }
}
```

For scheduling, the pod overhead resource requests are added to the container resource requests.

We don't currently enforce resource limits on the pod cgroup, but this becomes feasible once
pod overhead is accountable. If the pod specifies a resource limit, and all containers in the
pod specify a limit, then the sum of those limits becomes a pod-level limit, enforced through the
pod cgroup.
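
Reusing the hypothetical `addResourceList` helper from the earlier sketch (same assumptions), the
pod-level limit would only be derived when every container declares a limit:

```
// Sketch: a pod-level limit is only derived when every container declares a
// limit; otherwise no limit is applied to the pod cgroup.
func computePodLimit(pod *v1.Pod, overhead *v1.ResourceRequirements) (v1.ResourceList, bool) {
	total := v1.ResourceList{}
	for _, c := range pod.Spec.Containers {
		if len(c.Resources.Limits) == 0 {
			return nil, false // at least one container has no limit; skip enforcement
		}
		addResourceList(total, c.Resources.Limits)
	}
	if overhead != nil {
		addResourceList(total, overhead.Limits)
	}
	return total, true
}
```
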

Users are not expected to manually set the pod resources; if a runtimeClass is being utilized,
the manually set value will be discarded. See the ContainerRuntime controller section below for
the proposal for setting these resources.

### RuntimeClass CRD changes

Expand the runtimeClass CRD to include sandbox overheads:

```
  openAPIV3Schema:
    properties:
      spec:
        properties:
          runtimeHandler:
            type: string
            pattern: '^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)?$'
+         runtimeCpuReqOverhead:
+           type: string
+           pattern: '^([0-9]+([.][0-9])?)|[0-9]+(m)$'
+         runtimeCpuLimitOverhead:
+           type: string
+           pattern: '^([0-9]+([.][0-9])?)|[0-9]+(m)$'
+         runtimeMemoryReqOverhead:
+           type: string
+           pattern: '^[0-9]+([.][0-9]+)+(Mi|Gi|M|G)$'
+         runtimeMemoryLimitOverhead:
+           type: string
+           pattern: '^[0-9]+([.][0-9]+)+(Mi|Gi|M|G)$'
```

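For illustration only, the string-typed overhead fields above could be converted into a
`ResourceRequirements` value roughly as follows; the `RuntimeClassOverhead` struct and its method are
hypothetical helpers, not part of the RuntimeClass API.

```
// Sketch: parsing the proposed string-typed CRD overhead fields (e.g. "250m",
// "120Mi") into resource quantities that an admission controller could apply.
package runtimeclass

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// RuntimeClassOverhead mirrors the four proposed CRD fields (hypothetical type).
type RuntimeClassOverhead struct {
	RuntimeCpuReqOverhead      string
	RuntimeCpuLimitOverhead    string
	RuntimeMemoryReqOverhead   string
	RuntimeMemoryLimitOverhead string
}

// ToResourceRequirements parses the CRD strings; in practice, malformed values
// would already be rejected by the CRD validation patterns shown above.
func (o RuntimeClassOverhead) ToResourceRequirements() (*v1.ResourceRequirements, error) {
	cpuReq, err := resource.ParseQuantity(o.RuntimeCpuReqOverhead)
	if err != nil {
		return nil, err
	}
	cpuLim, err := resource.ParseQuantity(o.RuntimeCpuLimitOverhead)
	if err != nil {
		return nil, err
	}
	memReq, err := resource.ParseQuantity(o.RuntimeMemoryReqOverhead)
	if err != nil {
		return nil, err
	}
	memLim, err := resource.ParseQuantity(o.RuntimeMemoryLimitOverhead)
	if err != nil {
		return nil, err
	}
	return &v1.ResourceRequirements{
		Requests: v1.ResourceList{v1.ResourceCPU: cpuReq, v1.ResourceMemory: memReq},
		Limits:   v1.ResourceList{v1.ResourceCPU: cpuLim, v1.ResourceMemory: memLim},
	}, nil
}
```
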
### ContainerRuntime controller

The pod resource overhead must be defined prior to scheduling, and we shouldn't make the user
do it. To that end, I'm proposing a new mutating admission controller: ContainerRuntime.

The ContainerRuntime controller will have a single job: set the pod overhead field in the workload's
PodSpec according to the runtimeClass specified.

It is expected that only the ContainerRuntime controller will set Pod.Spec.Overhead. If a prior value exists,
the final pod spec will take the larger of what is defined in the runtimeClass and the original value (see
the sketch below).

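A minimal sketch of that merge rule, assuming the proposed `Overhead *ResourceRequirements` field;
`mergeOverhead` and `maxResourceList` are illustrative names rather than existing Kubernetes functions.

```
// Sketch of the ContainerRuntime admission mutation's merge rule: keep the
// per-resource maximum of any pre-existing overhead and the RuntimeClass value.
package admission

import (
	v1 "k8s.io/api/core/v1"
)

// maxResourceList returns, per resource name, the larger of the two quantities.
func maxResourceList(a, b v1.ResourceList) v1.ResourceList {
	out := v1.ResourceList{}
	for name, qty := range a {
		out[name] = qty.DeepCopy()
	}
	for name, qty := range b {
		if existing, ok := out[name]; !ok || qty.Cmp(existing) > 0 {
			out[name] = qty.DeepCopy()
		}
	}
	return out
}

// mergeOverhead returns what the controller would write into the proposed
// Pod.Spec.Overhead field, given any value already present on the pod.
func mergeOverhead(existing, fromRuntimeClass *v1.ResourceRequirements) *v1.ResourceRequirements {
	if fromRuntimeClass == nil {
		return existing
	}
	if existing == nil {
		return fromRuntimeClass.DeepCopy()
	}
	return &v1.ResourceRequirements{
		Requests: maxResourceList(existing.Requests, fromRuntimeClass.Requests),
		Limits:   maxResourceList(existing.Limits, fromRuntimeClass.Limits),
	}
}
```
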

Going forward, I foresee additional controller scope around runtimeClass:

* Validating the runtimeClass selection: this would require applying some kind of pod-characteristic labels (runtimeClass selectors?) which would then be consumed by an admission controller and checked against known capabilities on a per-runtimeClass basis. This is beyond the scope of this proposal.
* Automatic runtimeClass selection: a controller could exist which would attempt to automatically select the most appropriate runtimeClass for the given pod. This, again, is beyond the scope of this proposal.

### User Stories [optional]

Detail the things that people will be able to do if this KEP is implemented.
Include as much detail as possible so that people can understand the "how" of the system.
The goal here is to make this feel real for users without getting bogged down.

#### Story 2

### Implementation Details/Notes/Constraints [optional]

What are the caveats to the implementation?
What are some important details that didn't come across above?
Go into as much detail as necessary here.
This might be a good place to talk about core concepts and how they relate.

### Risks and Mitigations

What are the risks of this proposal and how do we mitigate them?
Think broadly.
For example, consider both security and how this will impact the larger kubernetes ecosystem.

How will security be reviewed and by whom?
How will UX be reviewed and by whom?

Consider including folks that also work outside the SIG or subproject.

## Design Details

### Test Plan

**Note:** *Section not required until targeted at a release.*

Consider the following in developing a test plan for this enhancement:
- Will there be e2e and integration tests, in addition to unit tests?
- How will it be tested in isolation vs with other components?

No need to outline all of the test cases, just the general strategy.
Anything that would count as tricky in the implementation and anything particularly challenging to test should be called out.

All code is expected to have adequate tests (eventually with coverage expectations).
Please adhere to the [Kubernetes testing guidelines][testing-guidelines] when drafting this test plan.

[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md

### Graduation Criteria

**Note:** *Section not required until targeted at a release.*

Define graduation milestones.

These may be defined in terms of API maturity, or as something else. Initial KEP should keep
this high-level with a focus on what signals will be looked at to determine graduation.

Consider the following in developing the graduation criteria for this enhancement:
- [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
- [Deprecation policy][deprecation-policy]

Clearly define what graduation means by either linking to the [API doc definition](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-versioning),
or by redefining what graduation means.

In general, we try to use the same stages (alpha, beta, GA), regardless of how the functionality is accessed.

[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/

#### Examples

These are generalized examples to consider, in addition to the aforementioned [maturity levels][maturity-levels].

##### Alpha -> Beta Graduation

- Gather feedback from developers and surveys
- Complete features A, B, C
- Tests are in Testgrid and linked in KEP

##### Beta -> GA Graduation

- N examples of real world usage
- N installs
- More rigorous forms of testing e.g., downgrade tests and scalability tests
- Allowing time for feedback

**Note:** Generally we also wait at least 2 releases between beta and GA/stable, since there's no opportunity for user feedback, or even bug reports, in back-to-back releases.

##### Removing a deprecated flag

- Announce deprecation and support policy of the existing flag
- Two versions passed since introducing the functionality which deprecates the flag (to address version skew)
- Address feedback on usage/changed behavior, provided on GitHub issues
- Deprecate the flag

**For non-optional features moving to GA, the graduation criteria must include [conformance tests].**

[conformance tests]: https://github.com/kubernetes/community/blob/master/contributors/devel/conformance-tests.md

### Upgrade / Downgrade Strategy

If applicable, how will the component be upgraded and downgraded? Make sure this is in the test plan.

Consider the following in developing an upgrade/downgrade strategy for this enhancement:
- What changes (in invocations, configurations, API use, etc.) is an existing cluster required to make on upgrade in order to keep previous behavior?
- What changes (in invocations, configurations, API use, etc.) is an existing cluster required to make on upgrade in order to make use of the enhancement?

### Version Skew Strategy

Set the overhead to the max of the two versions until the rollout is complete. This may be more problematic
if a new version increases (rather than decreases) the required resources.

## Implementation History

Major milestones in the life cycle of a KEP should be tracked in `Implementation History`.
Major milestones might include

- the `Summary` and `Motivation` sections being merged signaling SIG acceptance
- the `Proposal` section being merged signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded

## Drawbacks [optional]

This KEP introduces further complexity, and adds a field to the PodSpec which users aren't expected to modify.

## Alternatives [optional]

In order to achieve proper handling of sandbox runtimes, the scheduler/resourceQuota handling needs to take
into account the overheads associated with running a particular runtimeClass.

### Leaving the PodSpec unchanged

Instead of tracking the overhead associated with running a workload with a given runtimeClass in the PodSpec,
the Kubelet (for pod cgroup creation), the scheduler (for honoring requests overhead for the pod), and the resource
quota handling (for optionally taking requests overhead of a workload into account) will need to be augmented
to add a sandbox overhead when applicable.

Pros:
* no changes to the pod spec
* no need for a mutating admission controller

Cons:
* handling of the pod overhead is spread out across a few components
* not user perceptible from a workload perspective