---
title: Pod Overhead
authors:
  - "@egernst"
owning-sig: sig-node
participating-sigs:
reviewers:
  - "@tallclair"
  - "@derekwaynecarr"
  - "@dchen1107"
approvers:
  - TBD
editor: TBD
creation-date: 2019-02-26
last-updated: 2019-02-26
status: provisional
---

# Pod Overhead

This includes the Summary and Motivation sections.

## Table of Contents

Tools for generating: https://github.com/ekalinin/github-markdown-toc

## Release Signoff Checklist

- [ ] kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR)
- [ ] KEP approvers have set the KEP status to `implementable`
- [ ] Design details are appropriately documented
- [ ] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [ ] Graduation criteria is in place
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

**Note:** Any PRs to move a KEP to `implementable` or significant changes once it is marked `implementable` should be approved by each of the KEP approvers. If any of those
approvers is no longer appropriate, then changes to that list should be approved by the remaining approvers and/or the owning SIG (or SIG-arch for cross-cutting KEPs).

**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://github.com/kubernetes/enhancements/issues
[kubernetes/kubernetes]: https://github.com/kubernetes/kubernetes
[kubernetes/website]: https://github.com/kubernetes/website

## Summary

Sandbox runtimes introduce a non-negligible overhead at the pod level which must be accounted
for to enable effective scheduling, resource quota management, and constraint enforcement.

## Motivation

Pods have some resource overhead. In our traditional Linux container (Docker) approach, the
overhead accounted to the pod itself is limited to the infra (pause) container, but the pod also
incurs overhead attributed to various system components, including the Kubelet (control loops),
Docker, the kernel (various resources), and fluentd (logs). The current approach is to reserve
a chunk of resources for the system components (system-reserved, kube-reserved, fluentd resource
request padding) and to ignore the (relatively small) overhead from the pause container, but this
approach is heuristic at best and doesn't scale well.

With sandbox pods, the pod overhead potentially becomes much larger, perhaps on the order of
100MB. For example, Kata Containers must run a guest kernel, the kata agent, an init system,
etc. Since this overhead is too big to ignore, we need a way to account for it, starting with
quota enforcement and scheduling.

### Goals

* Provide a mechanism for accounting for pod overhead that is specific to a given runtime solution

### Non-Goals

* Making runtimeClass selections

## Proposal

Augment the RuntimeClass custom resource definition and the `PodSpec` to introduce the field
`Overhead *ResourceRequirements`. This field represents the overhead associated with running a pod
for a given runtimeClass. A mutating admission controller is introduced which will update the `Overhead`
field in the workload's `PodSpec` to match what is provided for the selected RuntimeClass, if one is specified.

The pod cgroup created by the Kubelet will be sized as the sum of the containers' `ResourceRequirements`
fields plus the `Overhead` associated with the pod.

The scheduler and resource quota handling will likewise take `Overhead` into account, in addition to
the sum of the pod's container requests.
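
To make the accounting concrete, here is a minimal sketch of how an effective pod request could be
computed once `Overhead` is populated. It is illustrative only: `effectiveRequest` is a hypothetical
helper (not actual scheduler or Kubelet code), the resource strings are made up, and quantity
arithmetic is done with `k8s.io/apimachinery/pkg/api/resource`.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// effectiveRequest sums the per-container requests for a single resource
// (e.g. CPU) and adds the pod overhead for that resource on top. The
// scheduler and the Kubelet's pod cgroup sizing would both consume a value
// computed along these lines.
func effectiveRequest(containerRequests []string, overhead string) resource.Quantity {
	total := resource.MustParse(overhead)
	for _, req := range containerRequests {
		q := resource.MustParse(req)
		total.Add(q)
	}
	return total
}

func main() {
	// Two containers requesting 250m and 500m CPU plus an assumed 100m
	// sandbox overhead yield an effective CPU request of 850m.
	cpu := effectiveRequest([]string{"250m", "500m"}, "100m")
	fmt.Println(cpu.String())
}
```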

### API Design

#### Pod overhead

Introduce a `Pod.Spec.Overhead` field on the pod to specify the pod's overhead.

```
Pod {
    Spec PodSpec {
        // Overhead is the resource overhead consumed by the Pod, not including
        // container resource usage. Users should leave this field unset.
        // +optional
        Overhead *ResourceRequirements
    }
}
```

For scheduling, the pod's `Overhead` resource requests are added to the sum of the container resource requests.

We don't currently enforce resource limits on the pod cgroup, but this becomes feasible once
pod overhead is accountable. If the pod's overhead specifies a resource limit, and all containers
in the pod specify a limit, then the sum of those limits becomes a pod-level limit, enforced
through the pod cgroup.
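
Continuing the sketch above, a pod-level limit would only be derived when every container sets a
limit for the resource in question. The helper below is hypothetical and uses plain resource
strings; an empty string stands in for a container with no limit.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// podLevelLimit returns the pod cgroup limit for a single resource as the sum
// of all container limits plus the overhead limit, but only when every
// container declares a limit. The boolean reports whether a pod-level limit
// can be enforced at all.
func podLevelLimit(containerLimits []string, overheadLimit string) (resource.Quantity, bool) {
	total := resource.MustParse(overheadLimit)
	for _, lim := range containerLimits {
		if lim == "" {
			// At least one container has no limit, so no pod-level
			// limit is enforced through the pod cgroup.
			return resource.Quantity{}, false
		}
		q := resource.MustParse(lim)
		total.Add(q)
	}
	return total, true
}

func main() {
	// Both containers set a memory limit, so with an assumed 120Mi overhead
	// the pod cgroup memory limit would be 1144Mi (512Mi + 512Mi + 120Mi).
	if limit, ok := podLevelLimit([]string{"512Mi", "512Mi"}, "120Mi"); ok {
		fmt.Println(limit.String())
	}
}
```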

Users are not expected to manually set the pod overhead; if a runtimeClass is being utilized,
any manually-set value will be discarded. See the ContainerRuntime controller section below for
how these resources are set.

### RuntimeClass CRD changes

Expand the runtimeClass CRD to include sandbox overheads:

```
openAPIV3Schema:
  properties:
    spec:
      properties:
        runtimeHandler:
          type: string
          pattern: '^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)?$'
+       runtimeCpuReqOverhead:
+         type: string
+         pattern: '^([0-9]+([.][0-9])?)|[0-9]+(m)$'
+       runtimeCpuLimitOverhead:
+         type: string
+         pattern: '^([0-9]+([.][0-9])?)|[0-9]+(m)$'
+       runtimeMemoryReqOverhead:
+         type: string
+         pattern: '^[0-9]+([.][0-9]+)+(Mi|Gi|M|G)$'
+       runtimeMemoryLimitOverhead:
+         type: string
+         pattern: '^[0-9]+([.][0-9]+)+(Mi|Gi|M|G)$'
```
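
For illustration, the proposed fields could map onto a spec struct along the following lines. The
type and field names below are assumptions sketched from the schema above, not the actual
RuntimeClass Go types, and the handler name and overhead values are made up.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// runtimeClassOverheadSpec is a hypothetical Go view of the proposed CRD
// fields. Each overhead is written as a resource string (e.g. "250m",
// "160Mi") and parsed into a resource.Quantity before being copied into a
// pod's Overhead by the admission controller.
type runtimeClassOverheadSpec struct {
	RuntimeHandler             string
	RuntimeCpuReqOverhead      string
	RuntimeCpuLimitOverhead    string
	RuntimeMemoryReqOverhead   string
	RuntimeMemoryLimitOverhead string
}

func main() {
	// Example values for a sandboxed runtime; the numbers are illustrative.
	rc := runtimeClassOverheadSpec{
		RuntimeHandler:             "kata",
		RuntimeCpuReqOverhead:      "250m",
		RuntimeCpuLimitOverhead:    "250m",
		RuntimeMemoryReqOverhead:   "160Mi",
		RuntimeMemoryLimitOverhead: "160Mi",
	}

	for name, value := range map[string]string{
		"cpu request":    rc.RuntimeCpuReqOverhead,
		"cpu limit":      rc.RuntimeCpuLimitOverhead,
		"memory request": rc.RuntimeMemoryReqOverhead,
		"memory limit":   rc.RuntimeMemoryLimitOverhead,
	} {
		// ParseQuantity rejects malformed strings; the schema patterns
		// above aim to constrain the same formats at the API level.
		q, err := resource.ParseQuantity(value)
		if err != nil {
			fmt.Printf("invalid %s overhead %q: %v\n", name, value, err)
			continue
		}
		fmt.Printf("%s overhead: %s\n", name, q.String())
	}
}
```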

### ContainerRuntime controller

The pod resource overhead must be defined prior to scheduling, and we shouldn't make the user
do it. To that end, I'm proposing a new mutating admission controller: ContainerRuntime.

The ContainerRuntime controller will have a single job: set the pod overhead field in the workload's
PodSpec according to the runtimeClass specified.

It is expected that only the ContainerRuntime controller will set Pod.Spec.Overhead. If a prior value exists,
the final pod spec will take the larger of what is defined in the runtimeClass and the original value.
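
A minimal sketch of that merge rule follows. It assumes hypothetical helper types: the real
controller would operate on `v1.Pod` objects and a `ResourceList`, whereas this sketch uses a
plain map of resource names to quantities.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// mergeOverhead sketches the ContainerRuntime controller's core mutation: for
// each resource, the pod keeps the larger of any overhead already present in
// its spec and the overhead declared by the selected RuntimeClass.
func mergeOverhead(existing, fromRuntimeClass map[string]resource.Quantity) map[string]resource.Quantity {
	merged := map[string]resource.Quantity{}
	for name, q := range existing {
		merged[name] = q
	}
	for name, rcQuantity := range fromRuntimeClass {
		if current, ok := merged[name]; !ok || rcQuantity.Cmp(current) > 0 {
			merged[name] = rcQuantity
		}
	}
	return merged
}

func main() {
	existing := map[string]resource.Quantity{
		"cpu": resource.MustParse("100m"),
	}
	fromRC := map[string]resource.Quantity{
		"cpu":    resource.MustParse("250m"),
		"memory": resource.MustParse("160Mi"),
	}
	// The RuntimeClass CPU overhead (250m) wins over the pre-existing 100m,
	// and the memory overhead is added since no prior value exists.
	for name, q := range mergeOverhead(existing, fromRC) {
		fmt.Printf("%s: %s\n", name, q.String())
	}
}
```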

Going forward, I foresee additional controller scope around runtimeClass:

* Validating the runtimeClass selection: this would require applying some kind of pod-characteristic
  labels (runtimeClass selectors?) which would then be consumed by an admission controller and checked
  against known capabilities on a per-runtimeClass basis. This is beyond the scope of this proposal.
* Automatic runtimeClass selection: a controller could exist which would attempt to automatically
  select the most appropriate runtimeClass for the given pod. This, again, is beyond the scope of
  this proposal.

### User Stories [optional]

Detail the things that people will be able to do if this KEP is implemented.
Include as much detail as possible so that people can understand the "how" of the system.
The goal here is to make this feel real for users without getting bogged down.

#### Story 2

### Implementation Details/Notes/Constraints [optional]

What are the caveats to the implementation?
What are some important details that didn't come across above?
Go into as much detail as necessary here.
This might be a good place to talk about core concepts and how they relate.

### Risks and Mitigations

What are the risks of this proposal and how do we mitigate them?
Think broadly.
For example, consider both security and how this will impact the larger kubernetes ecosystem.

How will security be reviewed and by whom?
How will UX be reviewed and by whom?

Consider including folks that also work outside the SIG or subproject.

## Design Details

### Test Plan

**Note:** *Section not required until targeted at a release.*

Consider the following in developing a test plan for this enhancement:
- Will there be e2e and integration tests, in addition to unit tests?
- How will it be tested in isolation vs with other components?

No need to outline all of the test cases, just the general strategy.
Anything that would count as tricky in the implementation and anything particularly challenging to test should be called out.

All code is expected to have adequate tests (eventually with coverage expectations).
Please adhere to the [Kubernetes testing guidelines][testing-guidelines] when drafting this test plan.

[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md

### Graduation Criteria

**Note:** *Section not required until targeted at a release.*

Define graduation milestones.

These may be defined in terms of API maturity, or as something else. Initial KEP should keep
this high-level with a focus on what signals will be looked at to determine graduation.

Consider the following in developing the graduation criteria for this enhancement:
- [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
- [Deprecation policy][deprecation-policy]

Clearly define what graduation means by either linking to the [API doc definition](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-versioning),
or by redefining what graduation means.

In general, we try to use the same stages (alpha, beta, GA), regardless of how the functionality is accessed.

[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/

#### Examples

These are generalized examples to consider, in addition to the aforementioned [maturity levels][maturity-levels].

##### Alpha -> Beta Graduation

- Gather feedback from developers and surveys
- Complete features A, B, C
- Tests are in Testgrid and linked in KEP

##### Beta -> GA Graduation

- N examples of real world usage
- N installs
- More rigorous forms of testing e.g., downgrade tests and scalability tests
- Allowing time for feedback

**Note:** Generally we also wait at least 2 releases between beta and GA/stable, since there's no opportunity for user feedback, or even bug reports, in back-to-back releases.

##### Removing a deprecated flag

- Announce deprecation and support policy of the existing flag
- Two versions passed since introducing the functionality which deprecates the flag (to address version skew)
- Address feedback on usage/changed behavior, provided on GitHub issues
- Deprecate the flag

**For non-optional features moving to GA, the graduation criteria must include [conformance tests].**

[conformance tests]: https://github.com/kubernetes/community/blob/master/contributors/devel/conformance-tests.md

### Upgrade / Downgrade Strategy

If applicable, how will the component be upgraded and downgraded? Make sure this is in the test plan.

Consider the following in developing an upgrade/downgrade strategy for this enhancement:
- What changes (in invocations, configurations, API use, etc.) is an existing cluster required to make on upgrade in order to keep previous behavior?
- What changes (in invocations, configurations, API use, etc.) is an existing cluster required to make on upgrade in order to make use of the enhancement?

### Version Skew Strategy

During a rollout, set the overhead to the max of the two versions until the rollout is complete. This may be
more problematic if a new version increases (rather than decreases) the required resources.

## Implementation History

Major milestones in the life cycle of a KEP should be tracked in `Implementation History`.
Major milestones might include

- the `Summary` and `Motivation` sections being merged signaling SIG acceptance
- the `Proposal` section being merged signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded

## Drawbacks [optional]

This KEP introduces further complexity, and adds a field to the PodSpec which users aren't expected to modify.

## Alternatives [optional]

In order to achieve proper handling of sandbox runtimes, the scheduler/resourceQuota handling needs to take
into account the overheads associated with running a particular runtimeClass.

### Leaving the PodSpec unchanged

Instead of tracking the overhead associated with running a workload with a given runtimeClass in the PodSpec,
the Kubelet (for pod cgroup creation), the scheduler (for honoring requests overhead for the pod) and the resource
quota handling (for optionally taking requests overhead of a workload into account) would need to be augmented
to add a sandbox overhead when applicable.

Pros:
 * no changes to the pod spec
 * no need for a mutating admission controller

Cons:
 * handling of the pod overhead is spread out across a few components
 * not user-perceptible from a workload perspective