# HugePages support in Kubernetes

**Authors**
* Derek Carr (@derekwaynecarr)
* Seth Jennings (@sjenning)
* Piotr Prokop (@PiotrProkop)

**Status**: In progress

## Abstract

A proposal to enable applications running in a Kubernetes cluster to use huge
pages.

A pod may request a number of huge pages. The `scheduler` is able to place the
pod on a node that can satisfy that request. The `kubelet` advertises an
allocatable number of huge pages to support scheduling decisions. A pod may
consume hugepages via `hugetlbfs` or `shmget`. Huge pages are not
overcommitted.

## Motivation

Memory is managed in blocks known as pages. On most systems, a page is 4Ki. 1Mi
of memory is equal to 256 pages; 1Gi of memory is 262,144 pages, etc. CPUs have
a built-in memory management unit that manages a list of these pages in
hardware. The Translation Lookaside Buffer (TLB) is a small hardware cache of
virtual-to-physical page mappings. If the virtual address passed in a hardware
instruction can be found in the TLB, the mapping can be determined quickly. If
not, a TLB miss occurs, and the system falls back to slower, software-based
address translation. This results in performance issues. Since the size of the
TLB is fixed, the only way to reduce the chance of a TLB miss is to increase
the page size.

A huge page is a memory page that is larger than 4Ki. On x86_64 architectures,
there are two common huge page sizes: 2Mi and 1Gi. Sizes vary on other
architectures, but the idea is the same. In order to use huge pages,
applications must write code that is aware of them. Transparent Huge Pages
(THP) attempt to automate the management of huge pages without application
knowledge, but they have limitations. In particular, they are limited to 2Mi
page sizes. THP can also degrade performance on nodes with high memory
utilization or fragmentation, because the defragmentation efforts of THP can
lock memory pages. For this reason, some applications are designed to use (or
recommend using) pre-allocated huge pages instead of THP.

Managing memory is hard, and unfortunately, there is no one-size-fits-all
solution for all applications.

## Scope

This proposal only includes pre-allocated huge pages configured on the node by
the administrator at boot time or by manual dynamic allocation. It does not
discuss how the cluster could dynamically allocate huge pages in order to fit a
pod pending scheduling. It is anticipated that operators may use a variety of
strategies to allocate huge pages, but we do not anticipate the kubelet itself
doing the allocation. Allocation of huge pages ideally happens soon after boot
time.
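
For illustration only, the kinds of commands an operator might use to
pre-allocate huge pages look like the following (the page counts are
placeholders):

```
# At boot time, via kernel command line parameters:
#   default_hugepagesz=2M hugepagesz=2M hugepages=512

# Dynamically after boot (may fail if memory is already fragmented):
echo 512 > /proc/sys/vm/nr_hugepages

# Per page size, via sysfs:
echo 512 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
```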

This proposal defers issues relating to NUMA.

## Use Cases

The class of applications that benefit from huge pages typically have
- A large memory working set
- A sensitivity to memory access latency

Example applications include:
- database management systems (MySQL, PostgreSQL, MongoDB, Oracle, etc.)
- Java applications can back the heap with huge pages using the
  `-XX:+UseLargePages` and `-XX:LargePageSizeInBytes` options.
- packet processing systems (DPDK)

Applications can generally use huge pages by calling
- `mmap()` with `MAP_ANONYMOUS | MAP_HUGETLB` and use it as anonymous memory
- `mmap()` a file backed by `hugetlbfs`
- `shmget()` with `SHM_HUGETLB` and use it as a shared memory segment (see Known
  Issues).

1. A pod can use huge pages with any of the prior described methods.
1. A pod can request huge pages.
1. A scheduler can bind pods to nodes that have available huge pages.
1. A quota may limit usage of huge pages.
1. A limit range may constrain min and max huge page requests.

## Feature Gate

The proposal introduces huge pages as an Alpha feature.

It must be enabled via the `--feature-gates=HugePages=true` flag on pertinent
components pending graduation to Beta.
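
As a sketch (the exact set of components that need the gate is an assumption
here), the flag would be passed to each pertinent component:

```
kube-apiserver --feature-gates=HugePages=true ...
kube-scheduler --feature-gates=HugePages=true ...
kubelet --feature-gates=HugePages=true ...
```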

## Node Specification

Huge pages cannot be overcommitted on a node.

A system may support multiple huge page sizes. It is assumed that most nodes
will be configured to primarily use the default huge page size as returned via
`grep Hugepagesize /proc/meminfo`. This defaults to 2Mi on most Linux systems
unless overridden by `default_hugepagesz=1g` in kernel boot parameters.

For each supported huge page size, the node will advertise a resource of the
form `hugepages-<hugepagesize>`. On Linux, supported huge page sizes are
determined by parsing the `/sys/kernel/mm/hugepages/hugepages-{size}kB`
directories on the host. Kubernetes will expose the `hugepages-<hugepagesize>`
resource using binary notation, converting `<hugepagesize>` into the most
compact binary suffix with an integer value. For example, if a node supports
`hugepages-2048kB`, a resource `hugepages-2Mi` will be shown in node capacity
and allocatable values. Operators may set aside pre-allocated huge pages that
are not available for user pods, similar to normal memory, via the
`--system-reserved` flag.
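
For reference, this information is already visible on the host. As a sketch
(the output shown is only an example from a node with 2Mi huge pages):

```
# Default huge page size
grep Hugepagesize /proc/meminfo
# Hugepagesize:       2048 kB

# Supported sizes and the number of pre-allocated pages of each size
ls /sys/kernel/mm/hugepages/
# hugepages-1048576kB  hugepages-2048kB
cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# 512
```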

There are a variety of huge page sizes supported across different hardware
architectures. It is preferred to have a resource per size in order to better
support quota. For example, 1 huge page with size 2Mi is orders of magnitude
different than 1 huge page with size 1Gi. We assume gigantic pages are even
more precious resources than huge pages.

Pre-allocated huge pages reduce the amount of allocatable memory on a node. The
node will treat pre-allocated huge pages similar to other system reservations
and reduce the amount of `memory` it reports using the following formula:

```
[Allocatable] = [Node Capacity] -
                [Kube-Reserved] -
                [System-Reserved] -
                [Pre-Allocated-HugePages * HugePageSize] -
                [Hard-Eviction-Threshold]
```

The following represents a machine with 10Gi of memory. 1Gi of memory has been
reserved as 512 pre-allocated huge pages sized 2Mi. As you can see, the
allocatable memory has been reduced to account for the amount of huge pages
reserved.

```
apiVersion: v1
kind: Node
metadata:
  name: node1
...
status:
  capacity:
    memory: 10Gi
    hugepages-2Mi: 1Gi
  allocatable:
    memory: 9Gi
    hugepages-2Mi: 1Gi
...
```

## Pod Specification

A pod must make a request to consume pre-allocated huge pages using the resource
`hugepages-<hugepagesize>` whose quantity is a positive amount of memory in
bytes. The specified amount must align with the `<hugepagesize>`; otherwise,
the pod will fail validation. For example, it would be valid to request
`hugepages-2Mi: 4Mi`, but invalid to request `hugepages-2Mi: 3Mi`.

The request and limit for `hugepages-<hugepagesize>` must match. Similar to
memory, an application that requests the `hugepages-<hugepagesize>` resource is
at minimum in the `Burstable` QoS class.

If a pod consumes huge pages via `shmget`, it must run with a supplemental group
that matches `/proc/sys/vm/hugetlb_shm_group` on the node. Configuration of
this group is outside the scope of this specification.
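
For example, if `/proc/sys/vm/hugetlb_shm_group` on the node were set to gid
`3000` (a value chosen purely for illustration), the pod could run with a
matching supplemental group:

```
apiVersion: v1
kind: Pod
metadata:
  name: shmget-example
spec:
  securityContext:
    # must match the gid configured in /proc/sys/vm/hugetlb_shm_group
    supplementalGroups: [3000]
  containers:
...
```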

Initially, a pod may not consume multiple huge page sizes in a single pod spec.
Attempting to use `hugepages-2Mi` and `hugepages-1Gi` in the same pod spec will
fail validation. We believe it is rare for applications to attempt to use
multiple huge page sizes. This restriction may be lifted in the future with
community presented use cases. Introducing the feature with this restriction
limits the exposure of API changes needed when consuming huge pages via volumes.

In order to consume huge pages backed by the `hugetlbfs` filesystem inside the
specified container in the pod, it is helpful to understand the set of mount
options used with `hugetlbfs`. For more details, see "Using Huge Pages" here:
https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt

```
mount -t hugetlbfs \
  -o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\
  min_size=<value>,nr_inodes=<value> none /mnt/huge
```

The proposal recommends extending the existing `EmptyDirVolumeSource` to satisfy
this use case. A new `medium=HugePages` option would be supported. To write
into this volume, the pod must make a request for huge pages. The `pagesize`
argument is inferred from the `hugepages-<hugepagesize>` resource request. If,
in the future, multiple huge page sizes are supported in a single pod spec, we
may modify the `EmptyDirVolumeSource` to provide an optional page size. The
existing `sizeLimit` option for `emptyDir` would restrict usage to the minimum
value specified between `sizeLimit` and the sum of huge page limits of all
containers in a pod. This keeps the behavior consistent with memory backed
`emptyDir` volumes whose usage is ultimately constrained by the pod cgroup
sandbox memory settings. The `min_size` option is omitted as it is not
necessary. The `nr_inodes` mount option is omitted at this time in the same
manner it is omitted with `medium=Memory` when using `tmpfs`.
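
For illustration, a volume definition could optionally bound usage with
`sizeLimit` (a sketch; the value shown is arbitrary, and the effective limit is
the minimum of this value and the sum of the pod's huge page limits):

```
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
      sizeLimit: 512Mi
```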

The following is a sample pod that is limited to 1Gi huge pages of size 2Mi. It
can consume those pages using `shmget()` or via `mmap()` with the specified
volume.

```
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
...
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage
    resources:
      requests:
        hugepages-2Mi: 1Gi
      limits:
        hugepages-2Mi: 1Gi
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
```

## CRI Updates

The `LinuxContainerResources` message should be extended to support specifying
huge page limits per size. The specification for huge pages should align with
opencontainers/runtime-spec.

see:
https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#huge-page-limits
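
For context, the runtime-spec linked above represents huge page limits in the
Linux container configuration roughly as follows (an illustrative fragment;
consult the link for the authoritative schema):

```
"hugepageLimits": [
    {
        "pageSize": "2MB",
        "limit": 209715200
    }
]
```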

The CRI changes are required before promoting this feature to Beta.

## Cgroup Enforcement

To use this feature, the `--cgroups-per-qos` flag must be enabled. In addition,
the `hugetlb` cgroup must be mounted.

The `kubepods` cgroup is bounded by the `Allocatable` value.

The QoS level cgroups are left unbounded across all huge page pool sizes.

The pod level cgroup sandbox is configured as follows, where `hugepagesize` is
the system supported huge page size(s). If no request is made for huge pages of
a particular size, the limit is set to 0 for all supported types on the node.

```
pod<UID>/hugetlb.<hugepagesize>.limit_in_bytes = sum(pod.spec.containers.resources.limits[hugepages-<hugepagesize>])
```
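
As a sketch, assuming the `hugetlb` controller is mounted at the conventional
`/sys/fs/cgroup/hugetlb` path and the cgroupfs driver is in use, the resulting
setting for a pod limited to 1Gi of 2Mi pages could be inspected like this (the
QoS path component and pod UID are placeholders):

```
# Verify the hugetlb cgroup controller is available on the node
grep hugetlb /proc/cgroups

# Pod level limit written by the kubelet
cat /sys/fs/cgroup/hugetlb/kubepods/<qos>/pod<UID>/hugetlb.2MB.limit_in_bytes
# 1073741824
```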

If the container runtime supports specification of huge page limits, the
container cgroup sandbox will be configured with the specified limit.

The `kubelet` will verify that the `hugetlb` cgroup has no usage charged to the
pod level cgroup sandbox prior to deleting the pod, ensuring all resources are
reclaimed.

## Limits and Quota

The `ResourceQuota` resource will be extended to support accounting for
`hugepages-<hugepagesize>` similar to `cpu` and `memory`. The `LimitRange`
resource will be extended to define min and max constraints for `hugepages`
similar to `cpu` and `memory`.
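
A sketch of what the extended objects might look like (the huge page entries
below are illustrative; exact key names are subject to the implementation):

```
apiVersion: v1
kind: ResourceQuota
metadata:
  name: hugepages-quota
spec:
  hard:
    hugepages-2Mi: 2Gi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: hugepages-limits
spec:
  limits:
  - type: Container
    min:
      hugepages-2Mi: 2Mi
    max:
      hugepages-2Mi: 1Gi
```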

## Scheduler changes

The scheduler will need to ensure any huge page request defined in the pod spec
can be fulfilled by a candidate node.

## cAdvisor changes

cAdvisor will need to be modified to return the number of pre-allocated huge
pages per page size on the node. It will be used to determine capacity and
calculate allocatable values on the node.

## Roadmap

### Version 1.8

Initial alpha support for huge pages usage by pods.

### Version 1.9

Resource Quota support. Limit Range support. Beta support for huge pages
(pending community feedback).

## Known Issues

### Huge pages as shared memory

For the Java use case, the JVM maps the huge pages as a shared memory segment
and memlocks them to prevent the system from moving or swapping them out.

There are several issues here:
- The user running the Java app must be a member of the gid set in the
  `vm.hugetlb_shm_group` sysctl
- sysctl `kernel.shmmax` must allow the size of the shared memory segment
- The user's memlock ulimits must allow the size of the shared memory segment
- `vm.hugetlb_shm_group` is not namespaced.
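
As an illustrative sketch of these node-side prerequisites (the gid and sizes
below are placeholders):

```
# Allow members of gid 3000 to create SHM_HUGETLB segments
sysctl -w vm.hugetlb_shm_group=3000

# Allow shared memory segments up to 1Gi
sysctl -w kernel.shmmax=1073741824

# Raise the memlock ulimit (in KiB) for the user running the JVM
ulimit -l 1048576
```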

### NUMA

NUMA is complicated. To support NUMA, the node must support CPU pinning,
devices, and memory locality. Extending that requirement to huge pages is not
much different. It is anticipated that the `kubelet` will provide future NUMA
locality guarantees as a feature of QoS. In particular, pods in the
`Guaranteed` QoS class are expected to have NUMA locality preferences.
