Add a section about the risks of priority and preemption #201
Conversation
/cc @davidopp
is enabled by default.
Note that **it will be possible for users of the cluster to create pods that block some system daemons from running, and/or evict system daemons that are already running, by creating pods at the `system-cluster-critical` and `system-node-critical` priority classes, which are present in all clusters by default.** Please read the following information to understand the details. This is particularly important for those who have untrusted users in their Kubernetes clusters.

There are two kinds of critical system daemons in Kubernetes -- ones that run per-node as DaemonSets (e.g. fluentd, XXX list the rest of them here) and ones that run per-cluster (possibly more than one instance per cluster, but not one per node) (e.g. DNS, heapster, XXX list the rest of them here).
I guess this should say "There are two kinds of critical system pods" (not daemons)
Done.
Fix the first XXX to list the other node-level critical system pods and the second XXX to list the other cluster-level critical system pods.
Done.
In Kubernetes 1.11, priority/preemption is enabled by default and
* per-node daemons continue to be scheduled directly by the DaemonSet controller, bypassing the default scheduler. As in Kubernetes versions before 1.11, the DaemonSet controller does not preempt pods, so we continue to rely on the ["rescheduler"](https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/) to guarantee that per-node daemons are able to schedule in a cluster that is full of regular user pods, by evicting regular user pods to make room for them. Per-node daemons are given a priority class of `system-node-critical`.
* cluster-level system pods continue to be scheduled by the default scheduler. The cluster-level daemons are given a priority class of `system-cluster-critical`. Because the default scheduler can preempt pods, the rescheduler in Kubernetes 1.11 is modified to *not* preempt pods to ensure the cluster-level system pods can schedule; instead we rely on the scheduler preemption mechanism to do this.
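To illustrate the exposure described above, a pod manifest along the following lines (all names and values here are hypothetical, chosen for illustration) would, by default, be admitted in any namespace and could trigger preemption of lower-priority pods:

```yaml
# Hypothetical example: by default, any user who can create pods can request
# a critical priority class, and the scheduler may preempt lower-priority
# pods to make room for this one.
apiVersion: v1
kind: Pod
metadata:
  name: greedy-pod        # illustrative name
  namespace: default      # note: not kube-system
spec:
  priorityClassName: system-cluster-critical
  containers:
  - name: app
    image: nginx          # illustrative image
    resources:
      requests:
        cpu: "4"          # a large request makes preemption more likely
```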
s/cluster-level daemons/cluster-level system pods/
done
The only way to prevent this vulnerability is:
* Step 1: Configure the ResourceQuota admission controller (via a config file) to use the ["limitedResources"](https://kubernetes.io/docs/concepts/policy/resource-quotas/) feature to require quota for pods in PriorityClass `system-node-critical` and `system-cluster-critical`.
* Step 2: Enable the [`ResourceQuotaScopeSelectors`](https://kubernetes.io/docs/concepts/policy/resource-quotas/) feature gate (this is in alpha feature in Kubernetes 1.11)
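Step 1 could look roughly like the following admission configuration file passed to the API server; the schema shown is the v1alpha1 form used around Kubernetes 1.11 and should be treated as a sketch, verified against the cluster's actual API server version:

```yaml
# Sketch of an admission configuration that requires quota to exist before
# pods in the critical priority classes can be created in a namespace.
# Field names assume the 1.11-era alpha/beta schemas.
apiVersion: apiserver.k8s.io/v1alpha1
kind: AdmissionConfiguration
plugins:
- name: ResourceQuota
  configuration:
    apiVersion: resourcequota.admission.k8s.io/v1beta1
    kind: Configuration
    limitedResources:
    - resource: pods
      matchScopes:
      - scopeName: PriorityClass
        operator: In
        values: ["system-node-critical", "system-cluster-critical"]
```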
s/is in/is an/
done
* Step 3: Create infinite ResourceQuota in the `kube-system` namespace at PriorityClass `system-node-critical` and `system-cluster-critical` using the [scopeSelector feature of ResourceQuota](https://kubernetes.io/docs/concepts/policy/resource-quotas/)
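Step 3 might be sketched as follows; the `scopeSelector` field names follow the alpha API shape in 1.11, and the quota value and object name are illustrative, so check them against your cluster version:

```yaml
# Sketch of a ResourceQuota in kube-system that places no practical limit
# on critical-priority pods there, satisfying the "limitedResources"
# requirement only in that namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: critical-pods          # illustrative name
  namespace: kube-system
spec:
  hard:
    pods: "1000000000"         # effectively unlimited
  scopeSelector:
    matchExpressions:
    - scopeName: PriorityClass
      operator: In
      values: ["system-node-critical", "system-cluster-critical"]
```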
I guess this should say "infinite ResourceQuota for pods"
done
This will prevent anyone who does not have access to the `kube-system` namespace from creating pods with the `system-node-critical` or `system-cluster-critical` priority class, by only allowing pods with those priority classes to be created in the `kube-system` namespace.
the "by only allowing..." part could be a bit clearer: "by restricting pods with those priority classes to only be allowed in the kube-system namespace."
done
Thanks, @davidopp! PTAL.
/lgtm
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: bsalamat, davidopp. Assign the PR to them by writing The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/cc @nickchase
ping @nickchase @calebamiles for approval
An abbreviated version of this has been added to the current doc, here: https://docs.google.com/document/d/1MoHdmqSpWT4dJ3AcONwPwquNa2NIBa1dhpb0g8xyyoI/edit with a link to this PR for the full story. If someone's got a better idea, I'm all ears.
@davidopp FYI
The part you extracted seems fine, but please link to this PR rather than the one you are currently linking to.
We'll need a new release note in 1.11.1 that explains the new admission controller that eliminates (for all practical purposes) the vulnerability.
@davidopp Sure. I will take care of that.
Add a section about the risks of priority and preemption.
/sig scheduling