Add a section about the risks of priority and preemption #201
Conversation
/cc @davidopp
is enabled by default.
Note that **it will be possible for users of the cluster to create pods that block some system daemons from running, and/or evict system daemons that are already running, by creating pods at the `system-cluster-critical` and `system-node-critical` priority classes, which are present in all clusters by default.** Please read the following information to understand the details. This is particularly important for those who have untrusted users in their Kubernetes clusters.

There are two kinds of critical system daemons in Kubernetes -- ones that run per-node as DaemonSets (e.g. fluentd, XXX list the rest of them here) and ones that run per-cluster (possibly more than one instance per cluster, but not one per node) (e.g. DNS, heapster, XXX list the rest of them here).
I guess this should say "There are two kinds of critical system pods" (not daemons)
Done.
Fix the first XXX to list the other node-level critical system pods and the second XXX to list the other cluster-level critical system pods.
Done.
In Kubernetes 1.11, priority/preemption is enabled by default and
* per-node daemons continue to be scheduled directly by the DaemonSet controller, bypassing the default scheduler. As in Kubernetes versions before 1.11, the DaemonSet controller does not preempt pods, so we continue to rely on the ["rescheduler"](https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/) to guarantee that per-node daemons are able to schedule in a cluster that is full of regular user pods, by evicting regular user pods to make room for them. Per-node daemons are given a priority class of `system-node-critical`.
* cluster-level system pods continue to be scheduled by the default scheduler. The cluster-level daemons are given a priority class of `system-cluster-critical`. Because the default scheduler can preempt pods, the rescheduler in Kubernetes 1.11 is modified to *not* preempt pods to ensure the cluster-level system pods can schedule; instead we rely on the scheduler preemption mechanism to do this.
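To illustrate the exposure described above, a pod manifest along the following lines (all names and values here are hypothetical, chosen for illustration) would, by default, be admitted in any namespace and could trigger preemption of lower-priority pods:

```yaml
# Hypothetical example: by default, any user who can create pods can request
# a critical priority class, and the scheduler may preempt lower-priority
# pods to make room for this one.
apiVersion: v1
kind: Pod
metadata:
  name: greedy-pod        # illustrative name
  namespace: default      # note: not kube-system
spec:
  priorityClassName: system-cluster-critical
  containers:
  - name: app
    image: nginx          # illustrative image
    resources:
      requests:
        cpu: "4"          # a large request makes preemption more likely
```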
s/cluster-level daemons/cluster-level system pods/
done
The only way to prevent this vulnerability is:
* Step 1: Configure the ResourceQuota admission controller (via a config file) to use the ["limitedResources"](https://kubernetes.io/docs/concepts/policy/resource-quotas/) feature to require quota for pods in PriorityClass `system-node-critical` and `system-cluster-critical`.
* Step 2: Enable the [`ResourceQuotaScopeSelectors`](https://kubernetes.io/docs/concepts/policy/resource-quotas/) feature gate (this is in alpha feature in Kubernetes 1.11)
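Step 1 could look roughly like the following admission configuration file passed to the API server; the schema shown is the v1alpha1 form used around Kubernetes 1.11 and should be treated as a sketch, verified against the cluster's actual API server version:

```yaml
# Sketch of an admission configuration that requires quota to exist before
# pods in the critical priority classes can be created in a namespace.
# Field names assume the 1.11-era alpha/beta schemas.
apiVersion: apiserver.k8s.io/v1alpha1
kind: AdmissionConfiguration
plugins:
- name: ResourceQuota
  configuration:
    apiVersion: resourcequota.admission.k8s.io/v1beta1
    kind: Configuration
    limitedResources:
    - resource: pods
      matchScopes:
      - scopeName: PriorityClass
        operator: In
        values: ["system-node-critical", "system-cluster-critical"]
```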
s/is in/is an/
done
* Step 3: Create infinite ResourceQuota in the `kube-system` namespace at PriorityClass `system-node-critical` and `system-cluster-critical` using the [scopeSelector feature of ResourceQuota](https://kubernetes.io/docs/concepts/policy/resource-quotas/)
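Step 3 might be sketched as follows; the `scopeSelector` field names follow the alpha API shape in 1.11, and the quota value and object name are illustrative, so check them against your cluster version:

```yaml
# Sketch of a ResourceQuota in kube-system that places no practical limit
# on critical-priority pods there, satisfying the "limitedResources"
# requirement only in that namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: critical-pods          # illustrative name
  namespace: kube-system
spec:
  hard:
    pods: "1000000000"         # effectively unlimited
  scopeSelector:
    matchExpressions:
    - scopeName: PriorityClass
      operator: In
      values: ["system-node-critical", "system-cluster-critical"]
```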
I guess this should say "infinite ResourceQuota for pods"
done
This will prevent anyone who does not have access to the `kube-system` namespace from creating pods with the `system-node-critical` or `system-cluster-critical` priority class, by only allowing pods with those priority classes to be created in the `kube-system` namespace.
the "by only allowing..." part could be a bit clearer: "by restricting pods with those priority classes to only be allowed in the kube-system namespace."
done
Thanks, @davidopp! PTAL.
/lgtm
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: bsalamat, davidopp. Assign the PR to them by writing The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/cc @nickchase
ping @nickchase @calebamiles for approval
An abbreviated version of this has been added to the current doc, here: https://docs.google.com/document/d/1MoHdmqSpWT4dJ3AcONwPwquNa2NIBa1dhpb0g8xyyoI/edit with a link to this PR for the full story. If someone's got a better idea, I'm all ears.
@davidopp FYI
The part you extracted seems fine, but please link to this PR rather than the one you are currently linking to.
We'll need a new release note in 1.11.1 that explains the new admission controller that eliminates (for all practical purposes) the vulnerability.
@davidopp Sure. I will take care of that.
Add a section about the risks of priority and preemption.
/sig scheduling