
[BUG] kube-system pods reserve 35 % of allocatable memory on a 4 GB node #3525

Closed
nemobis opened this issue Mar 9, 2023 · 16 comments
Assignees
Labels
addon/scaling Handling req/limit settings for AKS managed addon pods bug known-issue

Comments

@nemobis

nemobis commented Mar 9, 2023

Describe the bug
On AKS with Kubernetes 1.24, a node with 4 GB of RAM capacity only has 2157 MiB allocatable; yet kube-system alone requests some 750 MiB (of which 550 MiB for azure-cns and azure-npm), leaving roughly 1400 MiB available for requests by everything else.

To Reproduce
Steps to reproduce the behavior:

  1. Create a node pool with nodes having 4 GB memory
  2. Check kube-capacity or kubectl describe node on a recently created node
  3. Optionally inspect actual resource usage over time with the node-exporter metrics on Prometheus and something like the Kubernetes Monitor Grafana dashboard

Example node:

Addresses:
  InternalIP:  10.<redacted>
  Hostname:    aks-userpool2-11<redacted>
Capacity:
  cpu:                2
  ephemeral-storage:  259966896Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             4025836Ki
  pods:               100
Allocatable:
  cpu:                1900m
  ephemeral-storage:  239585490957
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             2209260Ki
  pods:               100
System Info:
  Machine ID:                 e619<redacted>
  System UUID:                95e<redacted>
  Boot ID:                    5627<redacted>
  Kernel Version:             5.4.0-1098-azure
  OS Image:                   Ubuntu 18.04.6 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.4+azure-4
  Kubelet Version:            v1.24.6
  Kube-Proxy Version:         v1.24.6
ProviderID:                   azure:///<redacted>
Non-terminated Pods:          (11 in total)
  Namespace                   Name                                                  CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                                  ------------  ----------  ---------------  -------------  ---
  datadog-agent               datadog-agent-<redacted>                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         37m
  kube-system                 azure-cns-<redacted>                                   40m (2%)      40m (2%)    250Mi (11%)      250Mi (11%)    38m
  kube-system                 azure-npm-<redacted>                                   250m (13%)    251m (13%)  300Mi (13%)      400Mi (18%)    38m
  kube-system                 cloud-node-manager-<redacted>                          50m (2%)      0 (0%)      50Mi (2%)        512Mi (23%)    38m
  kube-system                 csi-azuredisk-node-<redacted>                          30m (1%)      0 (0%)      60Mi (2%)        400Mi (18%)    38m
  kube-system                 csi-azurefile-node-<redacted>                         30m (1%)      0 (0%)      60Mi (2%)        600Mi (27%)    38m
  kube-system                 kube-proxy-<redacted>                                 100m (5%)     0 (0%)      0 (0%)           0 (0%)         38m
  kube-system                 node-local-dns-<redacted>                              25m (1%)      0 (0%)      5Mi (0%)         0 (0%)         38m
...
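
For reference, a minimal sketch of how the numbers above can be gathered (the node name is a placeholder; kube-capacity is the optional plugin mentioned in the reproduction steps):

  kubectl describe node <node-name> | grep -A 7 -E 'Capacity|Allocatable'
  kubectl -n kube-system get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources.requests.memory}{"\n"}{end}'
  kube-capacity --pods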

Expected behavior
A node with 4 GB of RAM should be able to host a pod that requests 1600 MB of RAM (e.g. for Prometheus). (I'm talking about requests, not limits.)

Screenshots
[Attached image: Screenshot_20230309_120410]

Environment (please complete the following information):

  • Kubernetes version: 1.24

Additional context

There's been a lot of discussion about what the requests and limits should be for various components, but in this case the issue is only the value of the allocatable memory, so I believe it's orthogonal. If everything in kube-system already requests far more memory than it needs most of the time, there's no need for such a huge extra buffer on top. At the very least the reservation should be configurable, or the actually available memory should be made clearer, so that people can size their workloads and node pools accordingly without tinkering with eviction thresholds.

#1339
#2125
#3348
#3496

I think it's unrelated to #3443

@nemobis nemobis added the bug label Mar 9, 2023
@nemobis
Author

nemobis commented Mar 23, 2023

Some of the underlying settings here are supposed to be configurable in Kubernetes, see e.g. https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#enforcing-node-allocatable, but according to what I've heard so far from Azure Support they are not configurable on AKS.
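
For reference, the settings that page describes are ordinary kubelet flags; a sketch of what they look like on a cluster where they are exposed (the values below are illustrative examples, not AKS defaults):

  --kube-reserved=cpu=100m,memory=1Gi
  --system-reserved=cpu=100m,memory=256Mi
  --eviction-hard=memory.available<500Mi
  --enforce-node-allocatable=pods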

@FlorentATo

FlorentATo commented Apr 24, 2023

@nemobis you can check the kubelet configuration yourself by running a debug pod on the node and looking at the process list:

➜  ~ kubectl debug node/aks-systempool-21850828-vmss000000 -it --image=mcr.microsoft.com/dotnet/runtime-deps:6.0
Creating debugging pod node-debugger-aks-systempool-21850828-vmss000000-chs4s with container debugger on node aks-systempool-21850828-vmss000000.
If you don't see a command prompt, try pressing enter.
root@aks-systempool-21850828-vmss000000:/# chroot /host
# bash
root@aks-systempool-21850828-vmss000000:/# ps fauxww | grep '/usr/local/bin/kubelet'

I ran into the same "issue"; a VM with only 4 GiB of memory (Standard_F2s_v2) shows the following:

➜  ~ k describe node aks-systempool-21850828-vmss000000
(...)
Capacity:
  cpu:                2
  ephemeral-storage:  129886128Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             4025836Ki
  pods:               110
Allocatable:
  cpu:                1900m
  ephemeral-storage:  119703055367
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             2209260Ki
  pods:               110

According to the documentation, kubelet will reserve 25% of memory on a node of this size (i.e. 1 GiB).

Indeed, using the method described above, you can see kubelet runs with the following flags:

  • --kube-reserved=cpu=100m,memory=1024Mi,pid=1000
  • --eviction-hard=memory.available<750Mi,nodefs.available<10%,nodefs.inodesFree<5%,pid.available<2000

So in total 1816576 KiB of memory is reserved, and thus 4025836 - 1816576 = 2209260 KiB, i.e. the amount reported by AKS.
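
Spelling that sum out (all values in KiB):

  kube-reserved memory:      1024 Mi = 1048576 KiB
  eviction-hard threshold:    750 Mi =  768000 KiB
  total reserved:                      1816576 KiB
  allocatable: 4025836 - 1816576     = 2209260 KiB  (~2157 MiB)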

@ghost ghost added the action-required label May 19, 2023
@ghost

ghost commented May 24, 2023

Action required from @Azure/aks-pm

@ghost ghost added the Needs Attention 👋 Issues needs attention/assignee/owner label May 24, 2023
@ghost

ghost commented Jun 9, 2023

Issue needing attention of @Azure/aks-leads

4 similar comments
@ghost

ghost commented Jun 24, 2023

Issue needing attention of @Azure/aks-leads

@ghost

ghost commented Jul 9, 2023

Issue needing attention of @Azure/aks-leads

@ghost

ghost commented Jul 24, 2023

Issue needing attention of @Azure/aks-leads

@ghost

ghost commented Aug 8, 2023

Issue needing attention of @Azure/aks-leads

Contributor

Issue needing attention of @Azure/aks-leads

6 similar comments
@stl327 stl327 self-assigned this May 17, 2024
@microsoft-github-policy-service microsoft-github-policy-service bot removed action-required Needs Attention 👋 Issues needs attention/assignee/owner labels May 17, 2024
@stl327
Contributor

stl327 commented May 20, 2024

Hello, beginning with the AKS 1.29 preview and beyond, we shipped changes to the eviction threshold and the kube-reserved memory reservation. The new memory reservation is set to the lesser of: 20 MB × the maximum pods supported on the node + 50 MB, or 25% of the total system memory. The new eviction threshold is 100Mi. See more information here. These changes help reduce resource consumption by AKS and can deliver up to 20% more allocatable space, depending on your pod configuration. Thanks!
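
As a rough back-of-the-envelope check against the nodes shown earlier in this thread (about 4 GiB of memory, max pods 100 or 110), treating MB and MiB loosely:

  new kube-reserved  = min(20 MB × max pods + 50 MB, 25% of ~4 GiB)
                     = min(~2050-2250 MB, ~983 MiB)  ≈ 983 MiB   (the 25% term wins here)
  new eviction-hard  = 100 Mi                         (previously 750 Mi)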

@stl327 stl327 closed this as completed May 20, 2024