Skip to content

Commit 811260c

Browse files
authored
Merge pull request #3862 from everpeace/3610-kep-freeze
KEP-3619: update Test Plan and Graduation Criteria for KEP freeze
2 parents bce62fc + 3ccb75d commit 811260c

File tree

1 file changed

+153
-37
lines changed
  • keps/sig-node/3619-supplemental-groups-policy

1 file changed

+153
-37
lines changed

keps/sig-node/3619-supplemental-groups-policy/README.md

Lines changed: 153 additions & 37 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,35 @@
1-
# KEP-3619: Fine-grained SupplementalGroups control
2-
31
<!--
2+
**Note:** When your KEP is complete, all of these comment blocks should be removed.
3+
4+
To get started with this template:
5+
6+
- [ ] **Pick a hosting SIG.**
7+
Make sure that the problem space is something the SIG is interested in taking
8+
up. KEPs should not be checked in without a sponsoring SIG.
9+
- [ ] **Create an issue in kubernetes/enhancements**
10+
When filing an enhancement tracking issue, please make sure to complete all
11+
fields in that template. One of the fields asks for a link to the KEP. You
12+
can leave that blank until this KEP is filed, and then go back to the
13+
enhancement and add the link.
14+
- [ ] **Make a copy of this template directory.**
15+
Copy this template into the owning SIG's directory and name it
16+
`NNNN-short-descriptive-title`, where `NNNN` is the issue number (with no
17+
leading-zero padding) assigned to your enhancement above.
18+
- [ ] **Fill out as much of the kep.yaml file as you can.**
19+
At minimum, you should fill in the "Title", "Authors", "Owning-sig",
20+
"Status", and date-related fields.
21+
- [ ] **Fill out this file as best you can.**
22+
At minimum, you should fill in the "Summary" and "Motivation" sections.
23+
These should be easy if you've preflighted the idea of the KEP with the
24+
appropriate SIG(s).
25+
- [ ] **Create a PR for this KEP.**
26+
Assign it to people in the SIG who are sponsoring this process.
27+
- [ ] **Merge early and iterate.**
28+
Avoid getting hung up on specific details and instead aim to get the goals of
29+
the KEP clarified and merged quickly. The best way to do this is to just
30+
start with the high-level sections and fill out details incrementally in
31+
subsequent PRs.
32+
433
Just because a KEP is merged does not mean it is complete or approved. Any KEP
534
marked as `provisional` is a working document and subject to change. You can
635
denote sections that are under active debate as follows:
@@ -10,6 +39,41 @@ denote sections that are under active debate as follows:
1039
Stuff that is being argued.
1140
<<[/UNRESOLVED]>>
1241
```
42+
43+
When editing KEPS, aim for tightly-scoped, single-topic PRs to keep discussions
44+
focused. If you disagree with what is already in a document, open a new PR
45+
with suggested changes.
46+
47+
One KEP corresponds to one "feature" or "enhancement" for its whole lifecycle.
48+
You do not need a new KEP to move from beta to GA, for example. If
49+
new details emerge that belong in the KEP, edit the KEP. Once a feature has become
50+
"implemented", major changes should get new KEPs.
51+
52+
The canonical place for the latest set of instructions (and the likely source
53+
of this file) is [here](/keps/NNNN-kep-template/README.md).
54+
55+
**Note:** Any PRs to move a KEP to `implementable`, or significant changes once
56+
it is marked `implementable`, must be approved by each of the KEP approvers.
57+
If none of those approvers are still appropriate, then changes to that list
58+
should be approved by the remaining approvers and/or the owning SIG (or
59+
SIG Architecture for cross-cutting KEPs).
60+
-->
61+
# KEP-3619: Fine-grained SupplementalGroups control
62+
63+
<!--
64+
This is the title of your KEP. Keep it short, simple, and descriptive. A good
65+
title can help communicate what the KEP is and should be considered as part of
66+
any review.
67+
-->
68+
69+
<!--
70+
A table of contents is helpful for quickly jumping to sections of a KEP and for
71+
highlighting any additional information provided beyond the standard KEP
72+
template.
73+
74+
Ensure the TOC is wrapped with
75+
<code>&lt;!-- toc --&rt;&lt;!-- /toc --&rt;</code>
76+
tags, and then generate with `hack/update-toc.sh`.
1377
-->
1478

1579
<!-- toc -->
@@ -27,7 +91,7 @@ Stuff that is being argued.
2791
- [CRI](#cri)
2892
- [SupplementalGroupsPolicy in SecurityContext](#supplementalgroupspolicy-in-securitycontext)
2993
- [user in ContainerStatus](#user-in-containerstatus-1)
30-
- [User Stories](#user-stories)
94+
- [User Stories (Optional)](#user-stories-optional)
3195
- [Story 1: Deploy a Security Policy to enforce <code>SupplementalGroupsPolicy</code> field](#story-1-deploy-a-security-policy-to-enforce--field)
3296
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
3397
- [Risks and Mitigations](#risks-and-mitigations)
@@ -44,6 +108,9 @@ Stuff that is being argued.
44108
- [Integration tests](#integration-tests)
45109
- [e2e tests](#e2e-tests)
46110
- [Graduation Criteria](#graduation-criteria)
111+
- [Alpha](#alpha)
112+
- [Beta](#beta)
113+
- [GA](#ga)
47114
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
48115
- [Version Skew Strategy](#version-skew-strategy)
49116
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
@@ -106,6 +173,25 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
106173

107174
## Summary
108175

176+
<!--
177+
This section is incredibly important for producing high-quality, user-focused
178+
documentation such as release notes or a development roadmap. It should be
179+
possible to collect this information before implementation begins, in order to
180+
avoid requiring implementors to split their attention between writing release
181+
notes and implementing the feature itself. KEP editors and SIG Docs
182+
should help to ensure that the tone and content of the `Summary` section is
183+
useful for a wide audience.
184+
185+
A good summary is probably at least a paragraph in length.
186+
187+
Both in this section and below, follow the guidelines of the [documentation
188+
style guide]. In particular, wrap lines to a reasonable length, to make it
189+
easier for reviewers to cite specific portions, and to minimize diff churn on
190+
updates.
191+
192+
[documentation style guide]: https://github.com/kubernetes/community/blob/master/contributors/guide/style-guide.md
193+
-->
194+
109195
The KEP seeks to provide a way to choose correct behavior with how Container Runtimes (Containerd and CRI-O) are applying `SupplementalGroups` to the first container processes. The KEP describes the work needed to be done in Kubernetes or connected projects to make sure customers have a clear migration path - including detection and safe upgrade - if any of their workflows took a dependency on this arguably erroneous behavior.
110196

111197
### The issue
@@ -161,6 +247,15 @@ uid=1000(alice) gid=1000(alice) groups=1000(alice),50000(group-in-image),60000
161247

162248
## Motivation
163249

250+
<!--
251+
This section is for explicitly listing the motivation, goals, and non-goals of
252+
this KEP. Describe why the change is important and the benefits to users. The
253+
motivation section can optionally provide links to [experience reports] to
254+
demonstrate the interest in a KEP within the wider Kubernetes community.
255+
256+
[experience reports]: https://github.com/golang/go/wiki/ExperienceReports
257+
-->
258+
164259
As described above, how supplemental groups attached to the first container process is complicated and not OCI image spec compliant.
165260

166261
Moreover, this causes security considerations as follows. When a cluster enforces some security policy for pods that protects the value of `RunAsGroup` and `SupplementalGroups`, the effect of its enforcement is limited, i.e., cluster users can easily bypass the policy enforcement just by using a custom image. If such a bypass happened, it would be unexpected behavior for most cluster administrators because the enforcement is almost useless. Moreover, the bypass will cause unexpected file access permission. In some use cases, the unexpected file access permission will be a security concern. For example, using `hostPath` volumes could be a severe problem because UID/GIDs matter in accessing files/directories in the volumes.
@@ -254,7 +349,7 @@ message ContainerUser {
254349
}
255350
```
256351

257-
### User Stories
352+
### User Stories (Optional)
258353

259354
<!--
260355
Detail the things that people will be able to do if this KEP is implemented.
@@ -263,6 +358,7 @@ the system. The goal here is to make this feel real for users without getting
263358
bogged down.
264359
-->
265360

361+
266362
#### Story 1: Deploy a Security Policy to enforce `SupplementalGroupsPolicy` field
267363

268364
Assume a multi-tenant kubernetes cluster with `hostPath` volumes below situations:
@@ -294,6 +390,8 @@ As described in [Summary](#summary) section, `alice` can bypass the restriction
294390

295391
Please note that a security policy without `supplementalGroupsPolicy` would lead to unexpected groups for the first process in the containers.
296392

393+
<!-- #### Story 2 -->
394+
297395
### Notes/Constraints/Caveats (Optional)
298396

299397
<!--
@@ -325,6 +423,13 @@ Consider including folks who also work outside the SIG or subproject.
325423

326424
## Design Details
327425

426+
<!--
427+
This section should contain enough information that the specifics of your
428+
change are understandable. This may include API specs (though not always
429+
required) or even code snippets. If there's any ambiguity about HOW your
430+
proposal will be implemented, this is the place to discuss them.
431+
-->
432+
328433
### Kubernetes API
329434

330435
#### SupplementalGroupsPolicy in PodSecurityContext
@@ -473,7 +578,7 @@ when drafting this test plan.
473578
[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
474579
-->
475580

476-
[ ] I/we understand the owners of the involved components may require updates to
581+
[x] I/we understand the owners of the involved components may require updates to
477582
existing tests to make this code solid enough prior to committing the changes necessary
478583
to implement this enhancement.
479584

@@ -505,7 +610,8 @@ This can inform certain test coverage improvements that we want to do before
505610
extending the production code to implement this enhancement.
506611
-->
507612

508-
- `<package>`: `<date>` - `<test coverage>`
613+
- `k8s.io/kubernetes/pkg/apis/core/validation`: `<date>(t.b.d.)` - `<test coverage>(t.b.d.)`
614+
- validation tests for `PodSecurityContext.SupplementalGroups`, `ContainerStatus.User`
509615

510616
##### Integration tests
511617

@@ -517,7 +623,12 @@ For Beta and GA, add links to added tests together with links to k8s-triage for
517623
https://storage.googleapis.com/k8s-triage/index.html
518624
-->
519625

520-
- <test>: <link to test coverage>
626+
- Kubernetes API
627+
- When `SupplementalGroupsPolicy=Strict`, groups of the container process must be ones specified by API: <link to test coverage(t.b.d.)>
628+
- When `SupplementalGroupsPolicy=Merge`, groups of the container process contains both groups specified by API and groups of the primary user from the image: <link to test coverage(t.b.d.)>
629+
- For running pods, `ContainerStatus.User` contains the correct identities of the containers: <link to test coverage(t.b.d.)>
630+
- CRI
631+
- I will also add symmetrical integration tests to https://github.com/kubernetes-sigs/cri-tools
521632

522633
##### e2e tests
523634

@@ -531,7 +642,9 @@ https://storage.googleapis.com/k8s-triage/index.html
531642
We expect no non-infra related flakes in the last month as a GA graduation criteria.
532643
-->
533644

534-
- <test>: <link to test coverage>
645+
- When creating a Pod with `SupplementalGroupsPolicy=Strict`, the pods will run with only groups specified by API: <link to test coverage(t.b.d.)>
646+
- When creating a Pod with `SupplementalGroupsPolicy=Merge`, the pods will run with groups specified by API and groups from the image: <link to test coverage(t.b.d.)>
647+
- When creating a Pod and it starts, each `ContainerStatus.User` contain the correct identities of the containers: <link to test coverage(t.b.d.)>
535648

536649
### Graduation Criteria
537650

@@ -561,42 +674,30 @@ functionality is accessed.
561674
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
562675
563676
Below are some examples to consider, in addition to the aforementioned [maturity levels][maturity-levels].
677+
-->
678+
679+
Because this KEP's core implementation(i.e. `SupplementalGroupsPolicy` handling) lies inside of CRI implementations(e.g. containerd, cri-o), the graduation criteria contains the support statuses of the updated CRI by container runtimes.
564680

565681
#### Alpha
566682

567-
- Feature implemented behind a feature flag
568-
- Initial e2e tests completed and enabled
683+
- At least one of the most popular Container Runtimes(e.g. containerd) implements the updated CRI and released
684+
- Feature implemented behind a feature flag based on the Container Runtime
685+
- Unit tests and initial e2e tests completed and enabled
569686

570687
#### Beta
571688

572-
- Gather feedback from developers and surveys
573-
- Complete features A, B, C
574-
- Additional tests are in Testgrid and linked in KEP
689+
- Several popular Container Runtimes(e.g. containerd and cri-o) support the updated CRI and released
690+
- Fixed reported bugs from the community
691+
- Additional integration tests and e2e tests are in Testgrid and linked in KEP
575692

576693
#### GA
577694

578-
- N examples of real-world usage
579-
- N installs
580-
- More rigorous forms of testing—e.g., downgrade tests and scalability tests
581-
- Allowing time for feedback
582-
583-
**Note:** Generally we also wait at least two releases between beta and
584-
GA/stable, because there's no opportunity for user feedback, or even bug reports,
585-
in back-to-back releases.
586-
587-
**For non-optional features moving to GA, the graduation criteria must include
588-
[conformance tests].**
695+
- At least one of Container Runtimes which is not based on the classic container, gVisor for example, supports the updated CRI and released
696+
- Assuming no negative user feedback based on production experience, promote after 2 releases in beta.
697+
- [conformance tests] are added for `SupplementalGroupsPolicy` and `ContainerStatus.User` APIs
589698

590699
[conformance tests]: https://git.k8s.io/community/contributors/devel/sig-architecture/conformance-tests.md
591700

592-
#### Deprecation
593-
594-
- Announce deprecation and support policy of the existing flag
595-
- Two versions passed since introducing the functionality that deprecates the flag (to address version skew)
596-
- Address feedback on usage/changed behavior, provided on GitHub issues
597-
- Deprecate the flag
598-
-->
599-
600701
### Upgrade / Downgrade Strategy
601702

602703
<!--
@@ -626,7 +727,7 @@ enhancement:
626727
CRI or CNI may require updating that component before the kubelet.
627728
-->
628729

629-
- CRI must support this feature, especially when using `SupplementalGroupsPolicy=IgnoreGroupsInImage`.
730+
- CRI must support this feature, especially when using `SupplementalGroupsPolicy=Strict`.
630731
- kubelet must be at least the version of control-plane components.
631732

632733
## Production Readiness Review Questionnaire
@@ -687,6 +788,7 @@ well as the [existing list] of feature gates.
687788
Any change of default behavior may be surprising to users or break existing
688789
automations, so be extremely careful here.
689790
-->
791+
690792
No. Just introducing new API fields in Pod spec and CRI which does NOT change the default behavior.
691793

692794
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
@@ -702,11 +804,11 @@ feature.
702804
NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
703805
-->
704806

705-
Yes. It can be disabled after enabled. However, users should pay attention that gids of container processes in pods with `IgnoreGroupsInImage` policy would change. It means the action might break the application in permission. We plan to provide a way for users to detect which pods are affected.
807+
Yes. It can be disabled after enabled. However, users should pay attention that gids of container processes in pods with `Strict` policy would change. It means the action might break the application in permission. We plan to provide a way for users to detect which pods are affected.
706808

707809
###### What happens if we reenable the feature if it was previously rolled back?
708810

709-
Just the policy `IgnoreGroupsInImage` is reenabled. Users should pay attention that gids of containers in pods with `IgnoreGroupsInImage` policy would change. It means that the action might break the application in permission. We plan to provide a way for users to detect which pods are affected.
811+
Just the policy `Stcict` is reenabled. Users should pay attention that gids of containers in pods with `Stcict` policy would change. It means that the action might break the application in permission. We plan to provide a way for users to detect which pods are affected.
710812

711813
###### Are there any tests for feature enablement/disablement?
712814

@@ -919,7 +1021,7 @@ Describe them, providing:
9191021
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
9201022
-->
9211023

922-
No.
1024+
Precisely, yes because the kep introduces new API fields in Pods. But the increasing size can be negligible.
9231025

9241026
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
9251027

@@ -948,6 +1050,18 @@ This through this both in small and large cases, again with respect to the
9481050

9491051
No.
9501052

1053+
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
1054+
1055+
<!--
1056+
Focus not just on happy cases, but primarily on more pathological cases
1057+
(e.g. probes taking a minute instead of milliseconds, failed pods consuming resources, etc.).
1058+
If any of the resources can be exhausted, how this is mitigated with the existing limits
1059+
(e.g. pods per node) or new limits added by this KEP?
1060+
1061+
Are there any tests that were run/should be run to understand performance characteristics better
1062+
and validate the declared limits?
1063+
-->
1064+
9511065
### Troubleshooting
9521066

9531067
<!--
@@ -999,6 +1113,8 @@ Major milestones might include:
9991113
Why should this KEP _not_ be implemented?
10001114
-->
10011115

1116+
N/A
1117+
10021118
## Alternatives
10031119

10041120
<!--
@@ -1027,4 +1143,4 @@ new subproject, repos requested, or GitHub details. Listing these here allows a
10271143
SIG to get the process for these resources started right away.
10281144
-->
10291145

1030-
N/A
1146+
N/A

0 commit comments

Comments
 (0)