- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Kubernetes currently does not support the use of swap memory on Linux, as it is difficult to provide guarantees and account for pod memory utilization when swap is involved. As part of Kubernetes’ earlier design, swap support was considered out of scope.
However, there are a number of use cases that would benefit from Kubernetes nodes supporting swap. Hence, this proposal aims to add swap support to nodes in a controlled, predictable manner so that Kubernetes users can perform testing and provide data to continue building cluster capabilities on top of swap.
This KEP aims to introduce basic swap enablement and leave further extensions to follow-up KEPs. This way, Kubernetes users and vendors can start using swap in a basic manner quickly, while extensions are discussed in dedicated KEPs that progress in the meantime.
For example, to achieve this goal, this KEP does not introduce any APIs that allow customizing how the feature behaves; it only determines whether the feature is enabled or disabled. From an API perspective, this is done by introducing the kubelet `swapBehavior` configuration field.
Within the scope of this KEP we will support only two basic behaviors: `NoSwap` and `LimitedSwap`. Neither provides any customizability: `NoSwap` disables swap for workloads, while `LimitedSwap`'s behaviour is automatic and implicit, requiring minimal user intervention (see the proposal below for more details).
As mentioned above, in the very near future, follow-up KEPs will bring API extensions, customizability, zswap support, and other extensions up for discussion. These customization capabilities will probably be introduced as additional "swap behaviors", which will likely bring some API changes, perhaps at the pod level, that extend this feature and make it more suitable for advanced usage.
While this KEP lays the groundwork for extending the API in follow-ups through "swap behaviors", changing APIs, especially at the pod level, is highly complex and controversial. Therefore, it is out of scope for this KEP.
There are two distinct types of user for swap, who may overlap:
- node administrators, who may want swap available for node-level performance tuning and stability/reducing noisy neighbour issues
- application developers, who have written applications that would benefit from using swap memory
There are hence a number of possible ways that one could envision swap use on a node.
- Swap is enabled on a node's host system, but the kubelet does not permit Kubernetes workloads to use swap. (This scenario is a prerequisite for the following use cases.)
- Swap is enabled at the node level. The kubelet can permit Kubernetes workloads scheduled on the node to use some quantity of swap, depending on the configuration.
- Swap is set on a per-workload basis. The kubelet sets swap limits for each individual workload.
This KEP will be limited in scope to the first two scenarios. The third can be addressed in a follow-up KEP. The enablement work that is in scope for this KEP will be necessary to implement the third scenario.
- On Linux systems, when swap is provisioned and available, Kubelet can start up with swap on.
- Configuration is available for kubelet to set swap utilization available to Kubernetes workloads, defaulting to 0 swap.
- Cluster administrators can enable and configure kubelet swap utilization on a per-node basis.
- Use of swap memory for cgroupsv2.
- Swap-aware eviction manager.
- Addressing non-Linux operating systems. Swap support will only be available for Linux.
- Provisioning swap. Swap must already be available on the system.
- Setting swappiness. This can already be set on a system-wide level outside of Kubernetes.
- Allocating swap on a per-workload basis with accounting (e.g. pod-level specification of swap), and/or APIs to customize and control the way kubelet calculates swap limits, grants swap access, etc. If desired, this should be designed and implemented as part of a follow-up KEP. This KEP is a prerequisite for that work. Hence, swap will be an overcommitted resource in the context of this KEP.
- Supporting zram, zswap, or other memory types like SGX EPC. These could be addressed in a follow-up KEP, and are out of scope.
- Use of swap for cgroupsv1.
We propose that, when swap is provisioned and available on a node, cluster administrators can configure the kubelet such that:
- It can start with swap on.
- It will direct the CRI to allocate Kubernetes workloads 0 swap by default.
- It will have configuration options to configure swap utilization for the entire node.
This proposal enables scenarios 1 and 2 above, but not 3.
Before enabling swap support through the pod API, it is crucial to build confidence in this feature by carefully assessing its impact on workloads and Kubernetes. As an initial step, we propose enabling swap support for Burstable QoS Pods by automatically calculating the appropriate swap values, rather than allowing users to input these values manually.
Swap access is granted only to pods of Burstable QoS. Guaranteed QoS pods are usually higher-priority pods, so we want to spare them swap's performance penalty. Best-Effort pods, by contrast, are low-priority pods that are the first to be killed during node pressure. In addition, they are unpredictable, so it is hard to assess how much swap memory is reasonable to allocate for them.
By doing so, we can ensure a thorough understanding of the feature's performance and stability before considering the manual input of swap values in a subsequent beta release. This cautious approach will ensure the efficient allocation of resources and the smooth integration of swap support into Kubernetes.
Allocate the swap limit equal to the requested memory for each container and adjust the proportion of swap based on the total swap memory available.
Note: in Beta2, we found that letting system-critical daemons swap memory could cause degradation of services. Therefore, the kubelet will not automatically configure this, although an admin can still configure it manually. In the near future, when a follow-up KEP regarding customizability is presented, automatically configuring this under a dedicated configuration will be considered.
System critical daemons (such as the kubelet) are essential for node health. Usually, an appropriate portion of system resources (e.g., memory, CPU) is reserved as system reserved. However, swap doesn't inherently support reserving a portion out of the total available. For instance, in the case of memory, we set `memory.min` on the node-level cgroup to ensure an adequate amount of memory is set aside, away from the pods, for system critical daemons. But there is no equivalent for swap; i.e., no `memory.swap.min` is supported in the kernel.
Since this proposal advocates enabling swap only for Burstable QoS pods, this can be done by setting `memory.swap.max` on the cgroups used by the Burstable QoS pods. The value of this `memory.swap.max` can be calculated by:
`memory.swap.max = <total swap memory available on the system> - <system reserved memory>`
This is the total amount of swap available to all Burstable QoS pods; let's call it `TotalPodsSwapAvailable`. This ensures that system critical daemons have access to swap at least equal to the system reserved memory, which indirectly acts as swap support in system reserved.
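For illustration, here is a minimal sketch of this node-level calculation in Go; the names are illustrative, not the kubelet's actual implementation.

```go
// A minimal sketch of the node-level calculation above; names are
// illustrative, not the kubelet's actual implementation.
package main

import "fmt"

// totalPodsSwapAvailable is the value written to memory.swap.max on the
// Burstable QoS node-level cgroup, in bytes.
func totalPodsSwapAvailable(totalSwapBytes, systemReservedMemoryBytes int64) int64 {
	if avail := totalSwapBytes - systemReservedMemoryBytes; avail > 0 {
		return avail
	}
	return 0
}

func main() {
	// 40 GiB of swap with 2 GiB of system-reserved memory leaves 38 GiB for pods.
	fmt.Println(totalPodsSwapAvailable(40<<30, 2<<30) >> 30) // 38
}
```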
This section is a recommendation for how to set up your nodes with swap if using this feature.
As we were testing this feature, we found degradation of services when system critical daemons are allowed to swap. This could mean that the kubelet performs slower than normal, so if you experience this, we recommend setting the cgroup for the system slice to avoid swap (i.e., `memory.swap.max=0`).
While doing this, we found that it is still possible for workloads to impact critical daemons. Even with swap disabled for the system slice, we saw cases where system.slice would still be impacted by workloads swapping. The workloads need to have lower I/O priority than the system slice. We found that setting `io.latency` for system.slice fixes these issues. See io-control for more details.
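As a hedged sketch of this tuning, assuming cgroup v2 is mounted at /sys/fs/cgroup and the script runs as root, the two knobs can be written as follows. The device number ("8:0") and latency target are illustrative placeholders; consult the kernel io-control documentation for the exact format and tuning guidance.

```go
// Hedged node-tuning sketch: disable swap for system.slice and give it
// I/O priority over swapping workloads. Values are illustrative.
package main

import "os"

func main() {
	// Keep system-critical daemons from swapping.
	if err := os.WriteFile("/sys/fs/cgroup/system.slice/memory.swap.max", []byte("0"), 0o644); err != nil {
		panic(err)
	}
	// Give system.slice I/O priority; device and target are placeholders.
	if err := os.WriteFile("/sys/fs/cgroup/system.slice/io.latency", []byte("8:0 target=75"), 0o644); err != nil {
		panic(err)
	}
}
```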
We only recommend enabling swap for the worker nodes. The control plane contains mostly Guaranteed QoS Pods, so swap may be disabled for the most part. The main concern would be swapping in the critical services on the control plane which can cause a performance impact.
We recommend using a separate disk for your swap partition, and we recommend that it be encrypted. If swap is on a shared partition or the root filesystem, workloads can interfere with system processes that need to write to disk. If they occupy the same disk, processes can overwhelm swap and throw off the I/O of the kubelet, container runtime, and systemd, which would affect other workloads. See [#protect-system-critical-daemons-for-iolatency] for more details on that. Swap space is located on a disk, so it is imperative to make sure your disk is fast enough for your use cases.
We will turn the feature on for Beta 2, but the default setting will be `NoSwap`.
Enabling swap on nodes is an advanced feature that requires tuning and knowledge of the kernel.
We do not recommend swap on all nodes, so we still suggest `--fail-swap-on=true` for most Kubernetes deployments. If there is interest in trying out this feature, we suggest provisioning swap space on the worker node, setting `--fail-swap-on=false`, and restarting the kubelet.
- Calculate the container's memory proportionate to the node's memory:
  - Divide the container's memory request by the node's total physical memory. Let's call this value `ContainerMemoryProportion`.
  - If a container is defined with memory requests == memory limits, its `ContainerMemoryProportion` is defined as 0. Therefore, as can be seen below, its overall swap limit is also 0.
- Multiply the container memory proportion by the available swap memory for pods: `ContainerMemoryProportion * TotalPodsSwapAvailable`.
Suppose we have a Burstable QoS pod with two containers:
- Container A: Memory request 20 GB
- Container B: Memory request 10 GB
Let's assume the total physical memory is 40 GB and the total swap memory available is also 40 GB. Also assume that the system reserved memory is configured at 2 GB.
Step 1: Determine the containers' memory proportion:
- Container A: `20G/40G = 0.5`.
- Container B: `10G/40G = 0.25`.
Step 2: Determine the swap limitation for the containers:
- Container A: `ContainerMemoryProportion * TotalPodsSwapAvailable = 0.5 * 38G = 19G`.
- Container B: `ContainerMemoryProportion * TotalPodsSwapAvailable = 0.25 * 38G = 9.5G`.
In this example, Container A would have a swap limit of 19 GB, and Container B would have a swap limit of 9.5 GB.
This approach allocates swap limits based on each container's memory request and adjusts the proportion based on the total swap memory available in the system. It ensures that each container gets a fair share of the swap space and helps maintain resource allocation efficiency.
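A sketch of this calculation in Go, reproducing the example numbers above; function and variable names are illustrative, not the kubelet's real code.

```go
// Sketch of the per-container swap limit calculation:
//   limit = (memoryRequest / nodeMemoryCapacity) * totalPodsSwapAvailable
// Containers with request == limit get a proportion of 0, hence no swap.
package main

import "fmt"

func containerSwapLimit(memRequest, memLimit, nodeMemCapacity, totalPodsSwapAvailable int64) int64 {
	if memLimit > 0 && memRequest == memLimit {
		return 0 // request == limit: no swap access
	}
	proportion := float64(memRequest) / float64(nodeMemCapacity)
	return int64(proportion * float64(totalPodsSwapAvailable))
}

func main() {
	const gib = int64(1) << 30
	// 40 GiB node memory; 38 GiB of swap available to pods (40 GiB - 2 GiB reserve).
	fmt.Println(containerSwapLimit(20*gib, 0, 40*gib, 38*gib) / gib)                      // Container A: 19
	fmt.Println(float64(containerSwapLimit(10*gib, 0, 40*gib, 38*gib)) / float64(gib)) // Container B: 9.5
}
```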
As part of this KEP, the eviction manager will be enhanced to be swap-aware. This update will enable the eviction manager to account for swap usage in its decision-making process. By doing so, it will help prevent the system from exhausting swap space, thereby maintaining system stability and responsiveness.
Before this KEP, kubelet's eviction manager completely overlooked swap memory, leading to several issues:
- Inaccessible Swap: The memory eviction threshold is configured in such a way that swap is never triggered during node-level pressure, as eviction occurs before the node starts swapping memory.
- Unfairness & Instability: The eviction manager may evict the "wrong" or innocent pods, failing to address the actual memory pressure.
- Unexpected Behavior: Pods that exceed their memory limits (with regular and swap memory) are not evicted first, even though they would immediately get killed if swap were not used.
Here we present an extension to the eviction manager that will address these issues by becoming swap-aware. The proposed logic is fully backward compatible and requires no additional configuration, making it completely transparent to the user.
To achieve this, we recommend enhancing the eviction manager's memory pressure handling to account for swap memory, rather than adding a distinct swap signal. Memory and swap are inherently connected and should be addressed as a single issue. By integrating swap memory into the eviction manager's logic, we ensure more accurate and efficient handling of system resources. For example, treating memory and swap separately is problematic because swap is not used until memory is full. With the approach suggested in this KEP, however, memory is not considered full until the accessible swap is also full.
Let *accessible swap* be the amount of swap that is accessible by pods according to the `LimitedSwap` swap behavior. Note that the amount of accessible swap changes over time according to the pods running on the node. In addition, note that since only some of the Burstable QoS pods will have access to swap, the swap space will almost never be used in its entirety by workloads. In other words, this approach effectively leaves some of the swap space inaccessible to pods and reserved for system daemons and other system processes.
When dealing with evictions, there are two main questions to answer: how to identify when the node is under pressure, and how to rank pods for eviction.
The eviction manager will become swap aware by making the following changes to its memory pressure handling:
- How to identify pressure: The eviction manager will consider the total sum of all running pods' accessible swap as additional memory capacity.
- How to rank pods for eviction: In the context of ranking pods for eviction, swap usage is considered additional "regular" memory usage and accessible swap is considered additional memory request. This is relevant for checking whether memory requests are exceeded [1] and for identifying which pods use more memory [2].
In other words, the eviction order documented in [3] will change to the following:
The kubelet uses the following parameters to determine the pod eviction order:
1. Whether the pod's resource usage with swap (memory usage + swap usage) exceeds its requests with swap (memory requests + swap requests).
2. [Pod Priority](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/).
3. The pod's resource usage (memory usage + swap usage) relative to requests (memory requests + swap requests).
That is, in (1) and (3) swap is considered as an additional resource usage and memory request. Step (2) is unchanged.
On nodes with swap disabled, the accessible swap will equal zero and pods won't be able to use swap; hence the eviction manager will behave the same as before.
[1] https://github.com/kubernetes/kubernetes/blob/d8093cc40394b8e25a864576fe6a38306730d3cb/pkg/kubelet/eviction/helpers.go#L684
[2] https://github.com/kubernetes/kubernetes/blob/d8093cc40394b8e25a864576fe6a38306730d3cb/pkg/kubelet/eviction/helpers.go#L703
[3] https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#pod-selection-for-kubelet-eviction
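A hedged sketch of the swap-aware ranking described above; the types and the ranking helper are illustrative, not kubelet internals.

```go
// Illustrative swap-aware eviction ranking per steps (1)-(3) above.
package main

import (
	"fmt"
	"sort"
)

type podStats struct {
	name           string
	memoryUsage    int64 // working set, bytes
	swapUsage      int64 // bytes
	memoryRequest  int64 // bytes
	accessibleSwap int64 // swap granted under LimitedSwap, bytes
	priority       int32
}

// exceedsRequests implements step (1): usage-with-swap vs requests-with-swap.
func exceedsRequests(p podStats) bool {
	return p.memoryUsage+p.swapUsage > p.memoryRequest+p.accessibleSwap
}

// rank sorts candidates into eviction order.
func rank(pods []podStats) {
	sort.SliceStable(pods, func(i, j int) bool {
		a, b := pods[i], pods[j]
		if ea, eb := exceedsRequests(a), exceedsRequests(b); ea != eb {
			return ea // pods exceeding their requests are evicted first
		}
		if a.priority != b.priority {
			return a.priority < b.priority // lower priority evicted first
		}
		// Step (3): usage relative to requests, swap included.
		return (a.memoryUsage+a.swapUsage)-(a.memoryRequest+a.accessibleSwap) >
			(b.memoryUsage+b.swapUsage)-(b.memoryRequest+b.accessibleSwap)
	})
}

func main() {
	pods := []podStats{
		{name: "a", memoryUsage: 5 << 30, swapUsage: 2 << 30, memoryRequest: 4 << 30, accessibleSwap: 1 << 30},
		{name: "b", memoryUsage: 3 << 30, memoryRequest: 4 << 30},
	}
	rank(pods)
	fmt.Println(pods[0].name) // "a" is evicted first: it exceeds requests+swap
}
```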
cgroup v2's improved memory management algorithms, such as oomd, strongly recommend the use of swap. Hence, having a small amount of swap available on nodes could improve resource pressure handling and recovery.
- https://man7.org/linux/man-pages/man8/systemd-oomd.service.8.html
- https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#id1
- https://chrisdown.name/2018/01/02/in-defence-of-swap.html
- https://media.ccc.de/v/ASG2018-175-oomd
- https://github.com/facebookincubator/oomd/blob/master/docs/production_setup.md#swap
This user story is addressed by scenario 1 and 2, and could benefit from 3.
Note: critical / high-priority pods would not be able to access swap by default, but they can still be configured otherwise to gain swap access. In the future, APIs / more swap behaviors will make it possible to control swap in a more customized way.
- Applications such as the Java and Node runtimes rely on swap for optimal performance kubernetes/kubernetes#53533 (comment)
- Initialization logic of applications can be safely swapped out without affecting long-running application resource usage kubernetes/kubernetes#53533 (comment)
This user story is addressed by scenario 2, and could benefit from 3.
This user story addresses cases in which cost of additional memory is prohibitive, or elastic scaling is impossible (e.g. on-premise/bare metal deployments).
- Occasional cron job with high memory usage and lack of swap support means cloud nodes must always be allocated for maximum possible memory utilization, leading to overprovisioning/high costs kubernetes/kubernetes#53533 (comment)
- Lack of swap support would require provisioning 3x the amount of memory as required with swap kubernetes/kubernetes#53533 (comment)
- On-premise deployment can’t horizontally scale available memory based on load kubernetes/kubernetes#53533 (comment)
- Scaling resources is technically feasible but cost-prohibitive, swap provides flexibility at lower cost kubernetes/kubernetes#53533 (comment)
This user story is addressed by scenario 2, and could benefit from 3.
Local development or single-node clusters and systems with fast storage may benefit from using available swap (e.g. NVMe swap partitions, one-node clusters).
- Single node, local Kubernetes deployment on laptop kubernetes/kubernetes#53533 (comment)
- Linux has optimizations for swap on SSD, allowing for performance boosts kubernetes/kubernetes#53533 (comment)
This user story is addressed by scenarios 1 and 2, and could benefit from 3.
For example, edge devices with limited memory.
- Edge compute systems/devices with small memory footprints (<2Gi) kubernetes/kubernetes#53533 (comment) k0sproject/k0s#3830
- Clusters with nodes <4Gi memory kubernetes/kubernetes#53533 (comment)
This user story is addressed by scenario 2, and could benefit from 3.
This would apply to virtualized Kubernetes workloads such as VMs launched by kubevirt.
Every VM comes with a management related overhead which can sporadically be pretty significant (memory streaming, SRIOV attachment, gpu attachment, virtio-fs, …). Swap helps to not request much more memory to deal with short term worst-case scenarios.
With virtualization, clusters are typically provisioned based on the workloads’ memory consumption, and any infrastructure container overhead is overcommitted. This overhead could be safely swapped out.
- Required for live migration of VMs kubernetes/kubernetes#53533 (comment)
This user story is addressed by scenario 2, and could benefit from 3.
In updating the CRI, we must ensure that container runtime downstreams are able to support the new configurations.
We considered adding parameters for both per-workload `memory-swap` and `swappiness`. These are documented as part of the Open Containers runtime specification for Linux memory configuration. Since `memory-swap` is a per-workload parameter, and `swappiness` is optional and can be set globally, we are choosing to only expose `memory-swap`, which will adjust the swap available to workloads.
Since we are not currently setting `memory-swap` in the CRI, the current default behaviour when `--fail-swap-on=false` is set is to allocate a workload the same amount of swap as the memory it requested. We will update the default to not permit the use of swap by setting `memory-swap` equal to `limit`.
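A minimal sketch of this updated default; the helper name is illustrative, while the actual wiring lives in the kubelet's kuberuntime package.

```go
// Illustrative NoSwap default: since memory-swap is a combined memory+swap
// limit, setting it equal to the memory limit leaves zero swap.
package main

import (
	"fmt"

	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func disableSwap(r *runtimeapi.LinuxContainerResources) {
	r.MemorySwapLimitInBytes = r.MemoryLimitInBytes
}

func main() {
	res := &runtimeapi.LinuxContainerResources{MemoryLimitInBytes: 1 << 30}
	disableSwap(res)
	fmt.Println(res.MemorySwapLimitInBytes) // 1073741824: memory+swap == memory
}
```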
This feature was created so that we can iterate on adding swap to Kubernetes. As a result, pods will not be able to request swap memory directly or explicitly in the current implementation. To make swap more useful for workloads, we acknowledge the need for proper swap APIs to make it customizable and flexible.
For example, we're considering the following features for future KEPs:
- Swap should be opt-in and opt-out at the workload level.
- Customization of swap limit calculation for workloads.
- Eviction Manager to be more flexible with regard to swap limits.
- Eviction Manager should look at more advanced ways of determining swap pressure (PSI for example).
Having swap available on a system reduces predictability. Swap's performance is worse than regular memory, sometimes by many orders of magnitude, which can cause unexpected performance regressions. Furthermore, swap changes a system's behaviour under memory pressure, and applications cannot directly control what portions of their memory usage are swapped out. Since enabling swap permits greater memory usage for workloads in Kubernetes that cannot be predictably accounted for, it also increases the risk of noisy neighbours and unexpected packing configurations, as the scheduler cannot account for swap memory usage.
This risk is mitigated by preventing any workloads from using swap by default, even if swap is enabled and available on a system. This will allow a cluster administrator to test swap utilization just at the system level without introducing unpredictability to workload resource utilization.
Additionally, we will mitigate this risk by determining a set of metrics to quantify system stability and then gathering test and production data to determine if system stability changes when swap is available to the system and/or workloads in a number of different scenarios.
Since swap provisioning is out of scope of this proposal, this enhancement poses low risk to Kubernetes clusters that will not enable swap.
As beta2 was being worked on, we discovered use cases where `--fail-swap-on=false` is used but Kubernetes is not utilizing swap. Kind e2e tests run the kubelet with `--fail-swap-on=false`, and the default developer configuration for `hack/local-up-cluster` allows for running developer clusters with swap enabled. Now, `--fail-swap-on=false` is supported for both cgroup v1 and cgroup v2, although KEP-2400 does not support cgroup v1.
This is achieved by the newly introduced `MemorySwap` behavior called `NoSwap`, which serves as the default swap behavior and disables swap usage on the node while keeping the feature active. In addition, nodes that support only cgroup v1 will only be able to use `NoSwap`; i.e., in such environments containers will be restricted from having access to swap. This addresses existing use cases where `--fail-swap-on=false` is set with cgroup v1, and still allows us to turn this feature on.
In previous releases of swap support, we had an `UnlimitedSwap` option for workloads. This can cause problems where workloads use up all swap. If all swap is used up on a node, the node can become unhealthy. To avoid exhausting swap on a node, `UnlimitedSwap` was dropped from the API in beta2.
It was determined that the eviction manager should still be able to protect the node in case of swap memory pressure; accordingly, we will teach the eviction manager to be aware of swap as a resource, to avoid exhausting it.
Enabling swap on a system without encryption poses a security risk, as critical information, such as Kubernetes secrets, may be swapped out to the disk. If an unauthorized individual gains access to the disk, they could potentially obtain these secrets. To mitigate this risk, it is recommended to use encrypted swap. However, handling encrypted swap is not within the scope of kubelet; rather, it is a general OS configuration concern and should be addressed at that level. Nevertheless, it is essential to provide documentation that warns users of this potential issue, ensuring they are aware of the security implications and can take appropriate steps to safeguard their systems. Documentation updates are required; there is already a blog article that mentions the security implications.
To guarantee that system daemons are not swapped, the kubelet must set `memory.swap.max` to `0` within the system reserved cgroup. Moreover, to make sure that burstable pods are able to utilize swap space, the kubelet should verify that the cgroup associated with burstable pods is not nested under the cgroup designated for system reserved.
Additionally, an end user may decide to disable swap completely for a Pod or a container in beta 1 by making the Pod guaranteed or setting request == limit for a container. This way, swap will not be enabled for the corresponding containers and there will be no information exposure risks.
In the early releases of this feature, there was a goal to support cgroup v1. As the feature progressed, SIG Node realized that supporting swap with cgroup v1 would be very difficult. Therefore, this feature is limited to cgroup v2 only. The main goal is to eventually deprecate cgroup v1, so this should not be a major inconvenience.
Kubernetes guarantees that some volumes' memory would never reside on disk, e.g. Secrets, memory-backed emptyDirs, etc. Behind the scenes, Kubelet mounts such volumes as tmpfs volumes on the host.
To address this risk, if `--fail-swap-on=false`, the tmpfs `noswap` option will be used in order to prevent the volumes' pages from swapping to disk. Bear in mind that the tmpfs `noswap` option is fairly new and is supported in kernel versions >= 6.4. However, different Linux distributions can decide to backport this option to older kernel versions. Therefore, when `--fail-swap-on=false` is provided on a node:
- If the kernel version is 6.4 or above, the tmpfs `noswap` option is used when necessary.
- Otherwise, the kubelet tries to mount a dummy volume with the tmpfs `noswap` option to determine whether the option has been backported. If the mount succeeds, the tmpfs `noswap` option is used when necessary.
- Otherwise, the kubelet raises a warning about the option not being supported and the possible risk.
In the longer term, when this option is very widely supported, this will no longer be a concern, and this logic can be dropped.
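A hedged sketch of the dummy-mount probe described above: try a throwaway tmpfs mount with the `noswap` option and fall back with a warning. The paths are illustrative and the snippet is Linux-only.

```go
// Probe whether the running kernel accepts the tmpfs noswap mount option.
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func tmpfsNoswapSupported() bool {
	dir, err := os.MkdirTemp("", "noswap-probe")
	if err != nil {
		return false
	}
	defer os.RemoveAll(dir)
	if err := unix.Mount("tmpfs", dir, "tmpfs", 0, "noswap"); err != nil {
		return false // option rejected: kernel too old and option not backported
	}
	_ = unix.Unmount(dir, 0)
	return true
}

func main() {
	if !tmpfsNoswapSupported() {
		log.Println("tmpfs noswap not supported; memory-backed volumes may swap to disk")
	}
}
```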
We summarize the implementation plan as follows:
- Add a feature gate `NodeSwap` to enable swap support.
- Leave the default value of the kubelet flag `--fail-swap-on` as `true`, to avoid changing default behaviour.
- Introduce a new kubelet config parameter, `MemorySwap`, which configures how much swap Kubernetes workloads can use on the node.
- Introduce a new CRI parameter, `memory_swap_limit_in_bytes`.
- Ensure container runtimes are updated so they can make use of the new CRI configuration.
- Based on the behaviour set in the kubelet config, the kubelet will instruct the CRI on the amount of swap to allocate to each container. The container runtime will then write the swap settings to the container-level cgroup.
- Add node stats to report swap usage.
- Enhance the eviction manager to protect against swap memory running out.
Swap can be enabled as follows:
- Provision swap on the target worker nodes,
- Enable the `NodeSwap` feature flag on the kubelet,
- Set the `--fail-swap-on` flag to `false`, and
- (Optional) Allow Kubernetes workloads to use swap by setting `MemorySwap.SwapBehavior` to `LimitedSwap` in the kubelet config (see the configuration sketch below).
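A hedged illustration of these steps using the public v1beta1 KubeletConfiguration types; in practice these fields are set in the kubelet's config file (plus `--fail-swap-on=false` on the command line).

```go
// Illustrative kubelet configuration for enabling swap support.
package main

import (
	"fmt"

	kubeletconfig "k8s.io/kubelet/config/v1beta1"
)

func main() {
	failSwapOn := false
	cfg := kubeletconfig.KubeletConfiguration{
		FeatureGates: map[string]bool{"NodeSwap": true}, // step 2
		FailSwapOn:   &failSwapOn,                       // step 3
	}
	cfg.MemorySwap.SwapBehavior = "LimitedSwap" // step 4 (optional)
	fmt.Printf("swapBehavior=%s\n", cfg.MemorySwap.SwapBehavior)
}
```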
We will add an optional `MemorySwap` value to the `KubeletConfiguration` struct in pkg/kubelet/apis/config/types.go as follows:
// KubeletConfiguration contains the configuration for the Kubelet
type KubeletConfiguration struct {
metav1.TypeMeta
...
// Configure swap memory available to container workloads.
// +featureGate=NodeSwap
// +optional
MemorySwap MemorySwapConfiguration
}
type MemorySwapConfiguration struct {
// swapBehavior configures swap memory available to container workloads. May be one of
// "", "NoSwap": workloads can not use swap, default option.
// "LimitedSwap": workload swap usage is limited. The swap limit is proportionate to the container's memory request.
// +featureGate=NodeSwap
// +optional
SwapBehavior string `json:"swapBehavior,omitempty"`
}
We want to expose common swap configurations based on the Docker and Open Containers specification for the `--memory-swap` flag. Thus, the `MemorySwapConfiguration.SwapBehavior` setting will have the following effects:
- If `SwapBehavior` is set to `"LimitedSwap"`, containers have limited (or no) swap access. See "Steps to Calculate Swap Limit" above.
- If `SwapBehavior` is set to `""` or `"NoSwap"`, no workloads will utilize swap. A validation sketch follows below.
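A hedged sketch of the validation implied by these rules; the function name is illustrative, not the actual code in k8s.io/kubernetes/pkg/kubelet/apis/config/validation.

```go
// Illustrative validation of the swapBehavior values supported by this KEP.
package validation

import "fmt"

func validateSwapBehavior(behavior string) error {
	switch behavior {
	case "", "NoSwap", "LimitedSwap":
		return nil // the only values this KEP supports
	default:
		return fmt.Errorf("invalid swapBehavior %q: must be one of %q, %q, %q", behavior, "", "NoSwap", "LimitedSwap")
	}
}
```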
The CRI requires a corresponding change in order to allow the kubelet to set swap usage in container runtimes. We will introduce a parameter `memory_swap_limit_in_bytes` to the CRI API (found in k8s.io/cri-api/pkg/apis/runtime/v1/api.proto):
// LinuxContainerResources specifies Linux specific configuration for
// resources.
message LinuxContainerResources {
...
// Memory + swap limit in bytes. Default: 0 (not specified).
int64 memory_swap_limit_in_bytes = 9;
...
}
We added metrics to the summary stats for the node to report `SwapAvailableBytes` and `SwapUsageBytes`.
type NodeStats struct {
...
// Stats pertaining to swap resources. This is reported to non-windows systems only.
// +optional
Swap *SwapStats `json:"swap,omitempty"`
}
// SwapStats contains data about memory usage
type SwapStats struct {
// The time at which these stats were updated.
Time metav1.Time `json:"time"`
// Available swap memory for use. This is defined as the <swap-limit> - <current-swap-usage>.
// If swap limit is undefined, this value is omitted.
// +optional
SwapAvailableBytes *uint64 `json:"swapAvailableBytes,omitempty"`
// Total swap memory in use.
// +optional
SwapUsageBytes *uint64 `json:"swapUsageBytes,omitempty"`
}
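A sketch of how `SwapAvailableBytes` can be derived per the comment above (`<swap-limit> - <current-swap-usage>`). Reading cgroup files directly here is illustrative; the kubelet sources these values via its stats providers, and the cgroup path is a placeholder.

```go
// Illustrative derivation of swapAvailableBytes from cgroup v2 files.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func readUint(path string) (uint64, bool) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, false
	}
	s := strings.TrimSpace(string(b))
	if s == "max" { // no limit configured: swapAvailableBytes is omitted
		return 0, false
	}
	v, err := strconv.ParseUint(s, 10, 64)
	return v, err == nil
}

func main() {
	const cg = "/sys/fs/cgroup/kubepods.slice" // illustrative cgroup path
	limit, ok1 := readUint(cg + "/memory.swap.max")
	usage, ok2 := readUint(cg + "/memory.swap.current")
	if ok1 && ok2 && limit >= usage {
		fmt.Println("swapAvailableBytes:", limit-usage)
	}
}
```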
In addition, we've added swap stats to the Summary API and Prometheus endpoints (`/stats/summary` and `/metrics/resource`):
> kubectl get --raw "/api/v1/nodes/<NODE-NAME>/proxy/metrics/resource" | grep swap
# HELP container_swap_usage_bytes [ALPHA] Current container amount of swap usage in bytes
# TYPE container_swap_usage_bytes gauge
container_swap_usage_bytes{container="c1",namespace="default",pod="test-pod"} 3.4400333824e+10 1687950863878
container_swap_usage_bytes{container="coredns",namespace="kube-system",pod="coredns-8f5847b64-t9gmr"} 0 1687950855483
# HELP node_swap_usage_bytes [ALPHA] Current node swap usage in bytes
# TYPE node_swap_usage_bytes gauge
node_swap_usage_bytes 1.8446743709127774e+19 1687950863599
# HELP pod_swap_usage_bytes [ALPHA] Current pod amount of swap usage in bytes
# TYPE pod_swap_usage_bytes gauge
pod_swap_usage_bytes{namespace="default",pod="test-pod"} 3.4379333632e+10 1687950858784
pod_swap_usage_bytes{namespace="kube-system",pod="coredns-8f5847b64-t9gmr"} 0 1687950863144
> kubectl get --raw "/api/v1/nodes/localhost/proxy/stats/summary"
node:
nodeName: localhost
swap:
swapAvailableBytes: 407531442176
swapUsageBytes: 18446743709127774000
pods:
- name: test-pod
swapUsageBytes: 34379333632
containers:
- name: c1
swapUsageBytes: 34400333824
- name: coredns-8f5847b64-t9gmr
swapUsageBytes: 0
containers:
- name: coredns
swapUsageBytes: 0
(This output is simplified, for full examples please look at the description of: kubernetes/kubernetes#118865)
Although not directly related to API changes since NFD is out-of-tree, bringing swap support to NFD is extremely valuable and relevant. It allows deferring the API changes discussion to the follow-up KEP that will focus on this subject specifically (for more info on the scope of this KEP and the follow-up KEPs look at the summary section above).
With NFD, the end-user would be able to easily understand which nodes have swap enabled. In the following example, only worker nodes 1 and 3 have swap enabled. The user can easily check which nodes have swap enabled by performing the following:
> kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-dev-control-plane Ready control-plane 78s v1.30.0
k8s-dev-worker1 Ready <none> 66s v1.30.0
k8s-dev-worker2 Ready <none> 66s v1.30.0
k8s-dev-worker3 Ready <none> 66s v1.30.0
> kubectl get nodes -o custom-columns=NAME:.metadata.name,LABELS:.metadata.labels | grep memory-swap:true | cut -d" " -f1
k8s-dev-worker1
k8s-dev-worker3
[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
All existing tests need to pass with and without swap enabled.
This KEP introduces minor additions of memory swap controlling configuration parameters.
- Kubelet configuration parameters are tested in the package `k8s.io/kubernetes/pkg/kubelet/apis/config/validation`
- Passing parameters to the runtime is tested in `k8s.io/kubernetes/pkg/kubelet/kuberuntime`
Both packages have near 100% coverage, and the new functionality is covered.
In alpha2, tests will be extended in these packages to support kube-reserved swap settings.
NA.
These tasks require e2e test setup so we did not add any integration tests for this.
For alpha:
- Swap scenarios are enabled in test-infra for at least two Linux distributions. e2e suites will be run against them.
- Container runtimes must be bumped in CI to use the new CRI.
- Data should be gathered from a number of use cases to guide beta graduation and further development efforts.
- Focus should be on supported user stories as listed above.
Test grid tabs enabled:
- kubelet-gce-e2e-swap-ubuntu: Green
- kubelet-gce-e2e-swap-ubuntu-serial: Green
- kubelet-gce-e2e-swap-fedora: Green
- kubelet-gce-e2e-swap-fedora-serial: Green
No new e2e tests introduced.
For alpha2:
- Add e2e tests that exercise all available swap configurations via the CRI.
- Verify MemoryPressure behavior with swap enabled and document any changes for configuring eviction.
- Verify new system-reserved settings for swap memory.
For beta 1:
- Add e2e tests that verify pod-level control of swap utilization.
- Add e2e tests that verify swap performance with pods using a tmpfs.
For beta 2:
- Add Node-conformance tests for basic swap validation. To avoid disrupting node conformance lanes, only the cgroup knobs are validated to be defined as expected with no real memory stress or swap use.
For beta 3:
- We want e2e tests that can confirm that eviction will take swap usage into account.
- Add a lane dedicated to swap testing, including stress tests and other tests that might be disruptive and intensive. These lanes are called "swap-conformance" and are (and should remain) consistently green.
For GA:
- Ensure that all e2e tests, especially swap-conformance tests, are consistently green.
- Kubelet can be started with swap enabled and will support two configurations for Kubernetes workloads: `LimitedSwap` and `NoSwap`.
- Kubelet can configure the CRI to allocate swap to Kubernetes workloads. By default, workloads will not be allocated any swap.
- e2e test jobs are configured for Linux systems with swap enabled.
In alpha2 the focus will be on making sure that the feature can be used in a subset of production scenarios to collect more feedback before entering beta. Specifically, security and test coverage will be increased, and a new setting that splits swap between the kubelet and workloads will be introduced.
Once the functionality is settled during alpha, beta will focus more on performance and feedback across a wider range of scenarios.
This will allow collecting feedback from the following scenarios reasonably safely:
- on cgroupv2: allow host system processes to use swap to increase system reliability under memory pressure.
- enable swap for the workload in "single large pod per node" scenarios.
Here are specific improvements to be made:
- Address swap impact on memory-backed volumes: kubernetes/kubernetes#105978.
- Investigate swap security when enabling on system processes on the node.
- Improve coverage for appropriate scenarios in testgrid.
- Add the ability to set a system-reserved quantity of swap from what kubelet detects on the host.
- Consider introducing new configuration modes for swap, such as a node-wide swap limit for workloads.
- Investigate eviction behavior with swap enabled.
- Enable Swap Support using Burstable QoS Pods only.
- Enable Swap Support for Cgroup v2 Only.
- Add swap memory to the Kubelet stats api.
- Determine a set of metrics for node QoS in order to evaluate the performance of nodes with and without swap enabled.
- Make sure node e2e jobs that use swap are healthy.
- Improve coverage for appropriate scenarios in testgrid.
- Publish a Kubernetes doc page encouraging users to use encrypted swap if they wish to enable this feature.
- Add swap-specific tests, such as handling the usage of swap during container restart boundaries for writes to tmpfs (which may require pod cgroup changes beyond what the container runtime will do at the container cgroup boundary).
- Fix flaking/failing swap node e2e jobs.
- Address eviction related issue in swap implementation.
- Add `NoSwap` as the default setting.
- Remove `UnlimitedSwap` as a supported option.
- Add e2e test confirming that `NoSwap` will actually not swap.
- Add e2e test confirming that swap is used for `LimitedSwap`.
- Document best practices for setting up Kubernetes with swap.
- Enhance website documentation
- Docs should be close to GA quality before promoting to GA.
- Make eviction manager swap aware.
- Test a wide variety of scenarios that may be affected by swap support, including tests with aggressive memory stress.
- Address memory-backed volumes, which should not have access to swap.
- Add documentation regarding encrypted swap.
- Test a wide variety of scenarios in which the node is being memory-stressed with different eviction and swap configurations. Ensure the behavior is consistent and reliable.
- Exclude high-priority, static and mirrored pods from gaining access to swap.
- Remove feature gate.
- Ensure swap-conformance tests are continuously green.
No changes are required on upgrade to maintain previous behaviour.
It is possible to downgrade a kubelet on a node that was using swap, but this would require disabling the use of swap and setting `swapoff` on the node.
Feature flag will apply to kubelet only, so version skew strategy is N/A.
- Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: NodeSwap
  - Components depending on the feature gate: API Server, Kubelet
- Other
  - Describe the mechanism: the `--fail-swap-on=false` flag for the kubelet must also be set at kubelet start.
  - Will enabling / disabling the feature require downtime of the control plane? Yes. The flag must be set on kubelet start. To disable, the kubelet must be restarted. Hence, there would be brief control component downtime on a given node.
  - Will enabling / disabling the feature require downtime or reprovisioning of a node? Yes. See above; disabling would require brief node downtime.
No. If the feature flag is enabled, the user must still set `--fail-swap-on=false` to adjust the default behaviour. In addition, since the default swap behavior is `NoSwap`, containers will not be able to access swap by default; the administrator must set a non-default behavior in order for swap to be accessible.
A node must have swap provisioned and available for this feature to work. If there is no swap available, but the feature flag is set to true, there will still be no change in existing behaviour.
To turn this off, the kubelet would need to be restarted. If a cluster admin wants to disable swap on the node without repartitioning the node, they could stop the kubelet, set `swapoff` on the node, and restart the kubelet with `--fail-swap-on=true`. The setting of the feature flag will be ignored in this case.
In Beta2, we realized that we cannot rely on `--fail-swap-on=false` as a flag for this feature; the flag predates this feature and has been used over time. We propose a configuration in `MemorySwap` called `NoSwap`. Users can also set `NoSwap` in `MemorySwap` to prevent all workloads from using swap, without requiring them to disable swap on the node if that is needed.
In Beta releases of this feature, one can turn off the `NodeSwap` feature toggle, but once this feature is GA, users can instead use the `NoSwap` behavior to disable swap for workloads.
As described above, swap can be turned on and off, although kubelet would need to be restarted.
There are extensive tests to ensure that the swap feature works as expected.
Unit tests are in place to test that this feature operates as expected with cgroup v1/v2, the feature gate being on/off, and different swap behaviors defined.
In addition, node e2e tests are added and run as part of the node-conformance suite. These tests ensure that the underlying cgroup knobs are being configured as expected.
Furthermore, "swap-conformance" periodic lanes have been introduced for the purpose of testing swap in a stressed environment. These tests ensure that swap kicks in when expected, exercising stress both at the node level and at the container level.
If a new node with swap memory fails to come online, it will not impact any running components.
It is possible that if a cluster administrator adds swap memory to an already running node, and then performs an in-place upgrade, the new kubelet could fail to start unless the configuration was modified to tolerate swap. However, we would expect that if a cluster admin is adding swap to the node, they will also update the kubelet's configuration to not fail with swap present.
Generally, it is considered best practice to add a swap memory partition at node image/boot time and not provision it dynamically after a kubelet is already running and reporting Ready on a node.
Workload churn or performance degradations on nodes. The metrics will be application/use-case specific, but we can provide some suggestions, based on the stability metrics identified earlier.
N/A because swap support lacks a runtime upgrade/downgrade path; kubelet must be restarted with or without swap support.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
See #swap-metrics: available by both Summary API (/stats/summary) and Prometheus (/metrics/resource) which provide how and if swap is utilized in the node, pod and container level.
The KubeletConfiguration has `failOnSwap: false` set.
The Prometheus `node_exporter` will also export stats on swap memory utilization.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
See #swap-metrics: available by both Summary API (/stats/summary) and Prometheus (/metrics/resource) which provide how and if swap is utilized in the node, pod and container level.
- Metrics
  - Metric names: `container_swap_usage_bytes`, `pod_swap_usage_bytes`, `node_swap_usage_bytes`
    - Components exposing the metric: the `/metrics/resource` endpoint
  - Metric names: `node.swap.swapUsageBytes`, `node.swap.swapAvailableBytes`, `node.systemContainers.swap.swapUsageBytes`, `pods[i].swap.swapUsageBytes`, `pods[i].containers[i].swap.swapUsageBytes`
    - Components exposing the metric: the `/stats/summary` endpoint
- Other (treat as last resort)
- Details:
Swap is managed by the kernel and depends on many factors and configurations outside of the kubelet's reach, such as the nature of the workloads running on the node, swap capacity, memory capacity, and other distro-specific configurations. However, generally:
- Nodes with swap enabled -> `node.swap.swapAvailableBytes` should be non-zero.
- Nodes with memory pressure -> `node.swap.swapUsageBytes` should be non-zero.
- Containers that reach their memory limit threshold -> `pods[i].containers[i].swap.swapUsageBytes` should be non-zero.
- Pods with containers that reach their memory limit threshold -> `pods[i].swap.swapUsageBytes` should be non-zero.
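As a hedged operator-side cross-check of the node-level values above, one can compare them against SwapTotal/SwapFree from /proc/meminfo; this is an illustrative script, not part of the kubelet.

```go
// Cross-check node swap metrics against /proc/meminfo.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		// A non-zero SwapTotal means swap is enabled on the node.
		if strings.HasPrefix(line, "SwapTotal:") || strings.HasPrefix(line, "SwapFree:") {
			fmt.Println(line)
		}
	}
}
```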
Are there any missing metrics that would be useful to have to improve observability of this feature?
We added metrics to the node stats to report how much swap is used and the capacity of swap.
No.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
No.
No.
No.
The KubeletConfig API object may slightly increase in size due to new config fields.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Yes, enabling swap can affect the performance of other critical daemons on the system. Any scenario where swap memory gets utilized is a result of the system running out of physical RAM, or of a container reaching its memory limit threshold. Hence, to maintain the SLIs/SLOs of critical daemons on the node, we highly recommend disabling swap for the system.slice, reserving adequate system reserved memory, giving I/O latency precedence to the system.slice, and more. See #best practices for more info.
The SLI that could potentially be impacted is pod startup latency. If the container runtime or kubelet are performing slower than expected, pod startup latency would be impacted. In addition to this SLI, general areas around pod lifecycle (image pulls, sandbox creation, storage) could become slow.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
Yes. It will permit the utilization of swap memory (i.e. disk) on nodes. This is expected, as this enhancement is enabling cluster administrators to access this resource.
No change. Feature is specific to individual nodes.
Individual nodes with swap memory enabled may experience performance degradations under load. This could potentially cause a cascading failure on nodes without swap: if nodes with swap fail Ready checks, workloads may be rescheduled en masse.
Thus, cluster administrators should be careful while enabling swap. To minimize disruption, you may want to taint nodes with swap available to protect against this problem. Taints will ensure that workloads which tolerate swap will not spill onto nodes without swap under load.
It is suggested that if nodes with swap memory enabled cause performance or stability degradations, those nodes are cordoned, drained, and replaced with nodes that do not use swap memory.
This is a partial list of everything that was done, but contains the most significant implementations.
- 2015-04-24: Discussed in #7294.
- 2017-10-06: Discussed in #53533.
- 2021-01-05: Initial design discussion document for swap support and use cases.
- 2021-04-05: Alpha KEP drafted for initial node-level swap support and implementation (KEP-2400).
- 2021-08-09: New in Kubernetes v1.22: alpha support for using swap memory: https://kubernetes.io/blog/2021/08/09/run-nodes-with-swap-alpha/.
- 2023-04-17: KEP update for beta1 #3957.
- 2023-07-18: Add full cgroup v2 swap support with automatically calculated swap limit for LimitedSwap #118764.
- 2023-07-18: Add swap to stats to Summary API and Prometheus endpoints (/stats/summary and /metrics/resource) #118865.
- 2023-08-15: Beta1 released in kubernetes 1.28.
- 2024-01-12: Updates to Beta2 KEP.
- 2024-01-08: Beta2 released in kubernetes 1.30.
- 2024-03-06: Add no swap as the default option for swap #122745.
- 2024-03-14: Add swap-specific (a.k.a. swap conformance) test lanes #32263.
- 2024-05-21: Add swap serial stress tests, improve NodeConformance tests and adapt NoSwap behavior #123557.
- 2024-05-23: Mount tmpfs memory-backed volumes with a noswap option if supported #124060.
- 2024-07-22: Restrict access to swap for containers in high priority Pods #125277.
- 2024-08-28: Updates to KEP, GA requirements and intention to release in version 1.32.
When swap is enabled, particularly for workloads, the kubelet’s resource accounting may become much less accurate. This may make cluster administration more difficult and less predictable.
In general, swap is less predictable and might cause performance degradation. It also might be hard in certain scenarios to understand why certain workloads are the chosen candidates for swapping, which could occur for reasons external to the workload.
In addition, containers with memory limits would be killed less frequently since with swap enabled the kernel can usually reclaim a lot more memory. While this can help to avoid crashes, it could also "hide a problem" of a container reaching its memory limits.
When `--fail-swap-on=false` is provided to the kubelet but swap is not configured otherwise, it is guaranteed that, by default, no Kubernetes workloads will be able to utilize swap. However, everything outside of the kubelet's reach (e.g. system daemons, the kubelet itself, etc.) will be able to use swap.
Added the "swap-conformance" lane for extensive swap testing under node pressure: kubelet-swap-conformance-fedora-serial, [kubelet-swap-conformance-ubuntu-serial](https://testgrid.k8s.io/sig-node-kubelet#kubelet-swap-conformance-ubuntu-serial).
See #e2e tests above for more information