
Commit 5b2d021

Kubernetes Submit Queue, cjwagner
authored and committed

Merge pull request #837 from derekwaynecarr/hugepages

Automatic merge from submit-queue

HugePages proposal

A pod may request a number of huge pages. The `scheduler` is able to place the pod on a node that can satisfy that request. The `kubelet` advertises an allocatable number of huge pages to support scheduling decisions. A pod may consume hugepages via `hugetlbfs` or `shmget`. Planned as Alpha for Kubernetes 1.8 release.

See feature: kubernetes/enhancements#275

@kubernetes/sig-scheduling-feature-requests @kubernetes/sig-scheduling-misc @kubernetes/sig-node-proposals @kubernetes/api-approvers @kubernetes/api-reviewers

2 parents 1398a55 + d4c6d1f commit 5b2d021

File tree: 1 file changed (+308, -0 lines)

hugepages.md (+308)

@@ -0,0 +1,308 @@
# HugePages support in Kubernetes

**Authors**
* Derek Carr (@derekwaynecarr)
* Seth Jennings (@sjenning)
* Piotr Prokop (@PiotrProkop)

**Status**: In progress

## Abstract

A proposal to enable applications running in a Kubernetes cluster to use huge pages.

A pod may request a number of huge pages. The `scheduler` is able to place the pod on a node that can satisfy that request. The `kubelet` advertises an allocatable number of huge pages to support scheduling decisions. A pod may consume huge pages via `hugetlbfs` or `shmget`. Huge pages are not overcommitted.

## Motivation

Memory is managed in blocks known as pages. On most systems, a page is 4Ki. 1Mi of memory is equal to 256 pages; 1Gi of memory is 262,144 pages, and so on. CPUs have a built-in memory management unit that manages a list of these pages in hardware. The Translation Lookaside Buffer (TLB) is a small hardware cache of virtual-to-physical page mappings. If the virtual address passed in a hardware instruction can be found in the TLB, the mapping can be determined quickly. If not, a TLB miss occurs, and the system falls back to slower, software-based address translation. This results in performance issues. Since the size of the TLB is fixed, the only way to reduce the chance of a TLB miss is to increase the page size.

A huge page is a memory page that is larger than 4Ki. On x86_64 architectures, there are two common huge page sizes: 2Mi and 1Gi. Sizes vary on other architectures, but the idea is the same. In order to use huge pages, an application must write code that is aware of them. Transparent huge pages (THP) attempt to automate the management of huge pages without application knowledge, but they have limitations. In particular, they are limited to the 2Mi page size. THP might lead to performance degradation on nodes with high memory utilization or fragmentation, because the defragmentation efforts of THP can lock memory pages. For this reason, some applications may be designed to use (or may recommend using) pre-allocated huge pages instead of THP.

Managing memory is hard, and unfortunately, there is no one-size-fits-all solution for all applications.

## Scope

This proposal only includes pre-allocated huge pages configured on the node by the administrator at boot time or by manual dynamic allocation. It does not discuss how the cluster could dynamically allocate huge pages to find a fit for a pending pod. It is anticipated that operators may use a variety of strategies to allocate huge pages, but we do not anticipate the kubelet itself doing the allocation. Allocation of huge pages ideally happens soon after boot time.

This proposal defers issues relating to NUMA.

## Use Cases

The class of applications that benefit from huge pages typically have
- A large memory working set
- A sensitivity to memory access latency

Example applications include:
- database management systems (MySQL, PostgreSQL, MongoDB, Oracle, etc.)
- Java applications, which can back the heap with huge pages using the `-XX:+UseLargePages` and `-XX:LargePageSizeInBytes` options.
- packet processing systems (DPDK)

Applications can generally use huge pages by calling
- `mmap()` with `MAP_ANONYMOUS | MAP_HUGETLB` and using it as anonymous memory
- `mmap()` a file backed by `hugetlbfs`
- `shmget()` with `SHM_HUGETLB` and using it as a shared memory segment (see Known Issues).

1. A pod can use huge pages with any of the previously described methods.
1. A pod can request huge pages.
1. A scheduler can bind pods to nodes that have available huge pages.
1. A quota may limit usage of huge pages.
1. A limit range may constrain min and max huge page requests.

## Feature Gate

The proposal introduces huge pages as an Alpha feature.

It must be enabled via the `--feature-gates=HugePages=true` flag on pertinent components pending graduation to Beta.

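For example, assuming the pertinent components are the `kube-apiserver`, `kube-scheduler`, and `kubelet` (an assumption; the exact set depends on the deployment), the gate might be enabled as follows:

```
# Enable the alpha HugePages feature gate on each relevant component
# (other flags elided).
kube-apiserver --feature-gates=HugePages=true ...
kube-scheduler --feature-gates=HugePages=true ...
kubelet --feature-gates=HugePages=true ...
```
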
## Node Specification

Huge pages cannot be overcommitted on a node.

A system may support multiple huge page sizes. It is assumed that most nodes will be configured to primarily use the default huge page size as returned via `grep Hugepagesize /proc/meminfo`. This defaults to 2Mi on most Linux systems unless overridden by `default_hugepagesz=1g` in the kernel boot parameters.

For each supported huge page size, the node will advertise a resource of the form `hugepages-<hugepagesize>`. On Linux, supported huge page sizes are determined by parsing the `/sys/kernel/mm/hugepages/hugepages-{size}kB` directory on the host. Kubernetes will expose the `hugepages-<hugepagesize>` resource using binary notation, converting `<hugepagesize>` into the most compact binary notation with an integer value. For example, if a node supports `hugepages-2048kB`, a resource `hugepages-2Mi` will be shown in node capacity and allocatable values. Operators may set aside pre-allocated huge pages that are not available to user pods, similar to normal memory, via the `--system-reserved` flag.

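For illustration, and not as behavior of the `kubelet` itself, an operator might verify and pre-allocate 2Mi huge pages on a node roughly as follows; the page counts are example values:

```
# Inspect the default huge page size and the current pool (standard Linux interfaces).
grep Huge /proc/meminfo

# Manual dynamic allocation: reserve 512 huge pages of size 2Mi (1Gi total).
echo 512 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

# Alternatively, allocate at boot via kernel parameters, e.g.:
#   default_hugepagesz=2M hugepagesz=2M hugepages=512
```
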
There are a variety of huge page sizes supported across different hardware architectures. It is preferred to have a resource per size in order to better support quota. For example, 1 huge page of size 2Mi is orders of magnitude different from 1 huge page of size 1Gi. We assume gigantic pages are even more precious resources than huge pages.

Pre-allocated huge pages reduce the amount of allocatable memory on a node. The node will treat pre-allocated huge pages similarly to other system reservations and reduce the amount of `memory` it reports using the following formula:

```
[Allocatable] = [Node Capacity] -
                [Kube-Reserved] -
                [System-Reserved] -
                [Pre-Allocated-HugePages * HugePageSize] -
                [Hard-Eviction-Threshold]
```

The following represents a machine with 10Gi of memory, where 1Gi of memory has been reserved as 512 pre-allocated huge pages of size 2Mi. As you can see, the allocatable memory has been reduced to account for the memory reserved as huge pages.

```
apiVersion: v1
kind: Node
metadata:
  name: node1
...
status:
  capacity:
    memory: 10Gi
    hugepages-2Mi: 1Gi
  allocatable:
    memory: 9Gi
    hugepages-2Mi: 1Gi
...
```

## Pod Specification

A pod must make a request to consume pre-allocated huge pages using the resource `hugepages-<hugepagesize>` whose quantity is a positive amount of memory in bytes. The specified amount must be a multiple of `<hugepagesize>`; otherwise, the pod will fail validation. For example, it would be valid to request `hugepages-2Mi: 4Mi`, but invalid to request `hugepages-2Mi: 3Mi`.

The request and limit for `hugepages-<hugepagesize>` must match. Similar to memory, a pod that requests a `hugepages-<hugepagesize>` resource is at minimum in the `Burstable` QoS class.

If a pod consumes huge pages via `shmget`, it must run with a supplemental group that matches `/proc/sys/vm/hugetlb_shm_group` on the node. Configuration of this group is outside the scope of this specification.

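As an illustration (the gid `3000` is an example only and must match whatever the node administrator configured), such a pod might declare the supplemental group like this:

```
apiVersion: v1
kind: Pod
metadata:
  name: shmget-example
spec:
  securityContext:
    # Must match the gid configured in /proc/sys/vm/hugetlb_shm_group on the node.
    supplementalGroups: [3000]
  containers:
  ...
```
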
Initially, a pod may not consume multiple huge page sizes in a single pod spec. Attempting to use `hugepages-2Mi` and `hugepages-1Gi` in the same pod spec will fail validation. We believe it is rare for applications to attempt to use multiple huge page sizes. This restriction may be lifted in the future if the community presents use cases. Introducing the feature with this restriction limits the exposure of API changes needed when consuming huge pages via volumes.

In order to consume huge pages backed by the `hugetlbfs` filesystem inside the specified container in the pod, it is helpful to understand the set of mount options used with `hugetlbfs`. For more details, see "Using Huge Pages" here:
https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt

```
mount -t hugetlbfs \
  -o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\
  min_size=<value>,nr_inodes=<value> none /mnt/huge
```

The proposal recommends extending the existing `EmptyDirVolumeSource` to satisfy this use case. A new `medium=HugePages` option would be supported. To write into this volume, the pod must make a request for huge pages. The `pagesize` argument is inferred from the `hugepages-<hugepagesize>` resource request. If, in the future, multiple huge page sizes are supported in a single pod spec, we may modify the `EmptyDirVolumeSource` to provide an optional page size. The existing `sizeLimit` option for `emptyDir` would restrict usage to the minimum of `sizeLimit` and the sum of the huge page limits of all containers in the pod. This keeps the behavior consistent with memory-backed `emptyDir` volumes, whose usage is ultimately constrained by the pod cgroup sandbox memory settings. The `min_size` mount option is omitted as it is not necessary. The `nr_inodes` mount option is omitted at this time, in the same manner it is omitted with `medium=Memory` when using `tmpfs`.

The following is a sample pod that is limited to 1Gi of huge pages of size 2Mi. It can consume those pages using `shmget()` or via `mmap()` with the specified volume.

```
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - ...
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage
    resources:
      requests:
        hugepages-2Mi: 1Gi
      limits:
        hugepages-2Mi: 1Gi
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
```

## CRI Updates

The `LinuxContainerResources` message should be extended to support specifying huge page limits per size. The specification for huge pages should align with the opencontainers/runtime-spec.

See:
https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#huge-page-limits

The CRI changes are required before promoting this feature to Beta.

## Cgroup Enforcement

To use this feature, the `--cgroups-per-qos` flag must be enabled. In addition, the `hugetlb` cgroup must be mounted.

The `kubepods` cgroup is bounded by the `Allocatable` value.

The QoS-level cgroups are left unbounded across all huge page pool sizes.

The pod-level cgroup sandbox is configured as follows, where `<hugepagesize>` is each huge page size supported by the system. If no request is made for huge pages of a particular size, the limit is set to 0 for all supported types on the node.

```
pod<UID>/hugetlb.<hugepagesize>.limit_in_bytes = sum(pod.spec.containers.resources.limits[hugepages-<hugepagesize>])
```

If the container runtime supports specification of huge page limits, the container cgroup sandbox will be configured with the specified limit.

The `kubelet` will ensure the `hugetlb` controller has no usage charged to the pod-level cgroup sandbox prior to deleting the pod, to ensure all resources are reclaimed.

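As a concrete sketch of how this might look on a node using the cgroup v1 `hugetlb` controller (the hierarchy shown is illustrative, not mandated by this proposal), a pod limited to `hugepages-2Mi: 1Gi` would map to roughly:

```
# Illustrative cgroup v1 paths; the exact hierarchy is kubelet/runtime dependent.
POD_CGROUP=/sys/fs/cgroup/hugetlb/kubepods/pod<UID>

# The 1Gi limit for 2Mi pages (1073741824 bytes).
cat $POD_CGROUP/hugetlb.2MB.limit_in_bytes

# Before deleting the pod, the kubelet verifies nothing is still charged here.
cat $POD_CGROUP/hugetlb.2MB.usage_in_bytes
```
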
## Limits and Quota

The `ResourceQuota` resource will be extended to support accounting for `hugepages-<hugepagesize>` similar to `cpu` and `memory`. The `LimitRange` resource will be extended to define min and max constraints for `hugepages` similar to `cpu` and `memory`.

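For illustration only, and assuming the quota and limit range keys follow the same `hugepages-<hugepagesize>` naming used elsewhere in this proposal (the exact keys are not fixed here), a namespace policy for 2Mi huge pages might look like:

```
apiVersion: v1
kind: ResourceQuota
metadata:
  name: hugepages-quota
spec:
  hard:
    # Assumed key; caps the total 2Mi huge page requests in the namespace.
    hugepages-2Mi: 2Gi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: hugepages-limits
spec:
  limits:
  - type: Container
    min:
      hugepages-2Mi: 2Mi
    max:
      hugepages-2Mi: 1Gi
```
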
## Scheduler changes

The scheduler will need to ensure any huge page request defined in the pod spec can be fulfilled by a candidate node.

## cAdvisor changes

cAdvisor will need to be modified to return the number of pre-allocated huge pages per page size on the node. This information will be used to determine capacity and to calculate allocatable values on the node.

## Roadmap

### Version 1.8

Initial alpha support for huge page usage by pods.

### Version 1.9

Resource Quota support. Limit Range support. Beta support for huge pages (pending community feedback).

## Known Issues

### Huge pages as shared memory

For the Java use case, the JVM maps the huge pages as a shared memory segment and memlocks them to prevent the system from moving or swapping them out.

There are several issues here:
- The user running the Java app must be a member of the gid set in the `vm.hugetlb_shm_group` sysctl
- The sysctl `kernel.shmmax` must allow the size of the shared memory segment
- The user's memlock ulimits must allow the size of the shared memory segment
- `vm.hugetlb_shm_group` is not namespaced.

### NUMA

NUMA is complicated. To support NUMA, the node must support CPU pinning, devices, and memory locality. Extending that requirement to huge pages is not much different. It is anticipated that the `kubelet` will provide future NUMA locality guarantees as a feature of QoS. In particular, pods in the `Guaranteed` QoS class are expected to have NUMA locality preferences.