Provide more flexible resource reservations for User Node Pools #1339

Closed
ondrejhlavacek opened this issue Nov 27, 2019 · 12 comments

Assignees
palma21

Labels
addon/scaling (Handling req/limit settings for AKS managed addon pods), feature-request (Requested Features), nodepools/mode, nodepools

Comments

@ondrejhlavacek

Running a staging AKS cluster with 3 Standard B2s nodes.

kubectl get nodes -o wide
NAME                       STATUS   ROLES   AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
aks-agentpool-22372688-0   Ready    agent   62d   v1.13.9   10.240.0.4    <none>        Ubuntu 16.04.6 LTS   4.15.0-1063-azure   docker://3.0.6
aks-agentpool-22372688-1   Ready    agent   62d   v1.13.9   10.240.0.6    <none>        Ubuntu 16.04.6 LTS   4.15.0-1063-azure   docker://3.0.6
aks-agentpool-22372688-2   Ready    agent   62d   v1.13.9   10.240.0.5    <none>        Ubuntu 16.04.6 LTS   4.15.0-1063-azure   docker://3.0.6

Between 22/11/19 and 26/11/19, deployments stopped working. New pods stay in the Pending state with the message 0/3 nodes are available: 3 Insufficient memory.

I'd swear nothing else changed on our side, but I have no hard proof apart from a successful deploy pipeline on 22/11/19. The number of pods hasn't changed, and I was previously able to run 2-3 times more pods on the same cluster. I vaguely remember that the nodes were waiting for a restart after a security/kernel update.

The current values of Capacity and Allocatable on a node:

Capacity:
 attachable-volumes-azure-disk:  4
 cpu:                            2
 ephemeral-storage:              101584140Ki
 hugepages-1Gi:                  0
 hugepages-2Mi:                  0
 memory:                         4017572Ki
 pods:                           110
Allocatable:
 attachable-volumes-azure-disk:  4
 cpu:                            1931m
 ephemeral-storage:              93619943269
 hugepages-1Gi:                  0
 hugepages-2Mi:                  0
 memory:                         2200996Ki
 pods:                           110

Is there any chance the allocatable memory dropped after the kernel update (which may have triggered, e.g., an aks-engine update)?

Thanks!

@welcome

welcome bot commented Nov 27, 2019

👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.

@ritazh
Member

ritazh commented Dec 2, 2019

Looks like this is an AKS cluster. Transferring this issue there in case other users have seen similar behavior.

@ritazh ritazh transferred this issue from Azure/aks-engine Dec 2, 2019
@ghost ghost added the triage label Dec 2, 2019
@ondrejhlavacek
Author

@ritazh Oh, I'm sorry, thanks!

@ondrejhlavacek
Author

Possibly related to #1216, and it probably didn't happen during the period I indicated.

@neoGeneva

Hey, I'm also having problems due to reduced memory on nodes.

I have two single-node clusters on different versions of Kubernetes. Both have nodes with 4017088Ki capacity, but the v1.10.3 cluster's node has 3092416Ki allocatable while the v1.14.8 cluster's node has only 2200480Ki.

It looks like both the kube-reserved value and the hard eviction limit have increased: the older cluster has --eviction-hard=memory.available<100Mi --kube-reserved=memory=803Mi, while the newer one has --eviction-hard=memory.available<750Mi --kube-reserved=memory=1024Mi.

As mentioned in #1216, having ~45% of memory reserved is pretty restrictive. Is there any chance of having these values tweaked for low-memory nodes?
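
Those flags account for the difference: allocatable is roughly capacity minus kube-reserved minus the hard eviction threshold (assuming no separate system-reserved, which doesn't appear to be set here). A quick check of the arithmetic with the numbers reported above:

# older cluster: --kube-reserved=memory=803Mi, --eviction-hard=memory.available<100Mi
echo $(( 4017088 - 803*1024 - 100*1024 ))   # 3092416 -> matches the v1.10.3 node exactly

# newer cluster: --kube-reserved=memory=1024Mi, --eviction-hard=memory.available<750Mi
echo $(( 4017088 - 1024*1024 - 750*1024 ))  # 2200512 -> within ~32Ki of the v1.14.8 node's 2200480Ki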

@worldspawn

worldspawn commented Jun 11, 2020

It's outlined here, I believe: https://docs.microsoft.com/en-us/azure/aks/concepts-clusters-workloads#resource-reservations

Memory - memory utilized by AKS includes the sum of two values.
The kubelet daemon is installed on all Kubernetes agent nodes to manage container creation and termination. By default on AKS, this daemon has the following eviction rule: memory.available<750Mi, which means a node must always have at least 750 Mi allocatable at all times. When a host is below that threshold of available memory, the kubelet will terminate one of the running pods to free memory on the host machine and protect it. This is a reactive action once available memory decreases beyond the 750Mi threshold.

The second value is a regressive rate of memory reservations for the kubelet daemon to properly function (kube-reserved).

25% of the first 4 GB of memory
20% of the next 4 GB of memory (up to 8 GB)
10% of the next 8 GB of memory (up to 16 GB)
6% of the next 112 GB of memory (up to 128 GB)
2% of any memory above 128 GB

Seems a bit obscene really. Certainly something to factor in when weighing the real comparative costs between node sizes.
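
For the 4 GiB B2s nodes in this thread, that schedule works out to 25% of 4 GiB = 1 GiB, which matches the --kube-reserved=memory=1024Mi flag reported earlier; small nodes sit entirely in the most expensive bracket. A rough sketch of the schedule as a shell function (bracket sizes copied from the quoted docs; the helper name kube_reserved_mib and the whole-MiB rounding are just for illustration):

# kube-reserved memory (MiB) for a node with $1 GiB of RAM, per the schedule above
kube_reserved_mib() {
  local remaining=$1 reserved=0 bracket size rate take
  for bracket in 4:25 4:20 8:10 112:6 999999:2; do   # GiB:percent; last bracket effectively unbounded
    size=${bracket%:*}; rate=${bracket#*:}
    take=$(( remaining < size ? remaining : size ))
    reserved=$(( reserved + take * 1024 * rate / 100 ))
    remaining=$(( remaining - take ))
  done
  echo "$reserved"
}

kube_reserved_mib 4    # 1024 -> matches --kube-reserved=memory=1024Mi on a 4 GiB node
kube_reserved_mib 16   # 2662 (~2.6 GiB)
kube_reserved_mib 128  # 9543 (~9.3 GiB)

Add the 750 Mi hard eviction threshold on top and roughly 45% of a 4 GiB node ends up unavailable to workloads, which matches the numbers reported above.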

@github-actions

Action required from @Azure/aks-pm

@ghost ghost removed the triage label Jul 21, 2020
@ghost

ghost commented Jul 26, 2020

Action required from @Azure/aks-pm

@ghost ghost added the Needs Attention 👋 Issues needs attention/assignee/owner label Jul 26, 2020
@palma21
Member

palma21 commented Jul 27, 2020

Dropping the same info from the linked issue. I will leave this issue open as the feature request for less aggressive reservations on User pools.

As for why the difference: I can't speak for the other cloud providers, as I don't have visibility into their workloads and customers. AKS is fairly conservative when it comes to protecting the cluster against "rogue" or misbehaved workloads, which have caused a lot of issues in the past; workloads can race for resources faster than even cgroups and slices can account for, so we needed a larger buffer.

This, we acknowledge, can penalize well-behaved workloads and users that would otherwise benefit from more lenient default reservations. We didn't take this decision lightly; it was made on account of hundreds of cases where we saw these issues, and so far it has been a trade-off of running in this managed-service scenario.

Nonetheless, we're working on providing:

  1. The ability to have lower reservations on User Pools vs. System Pools. https://docs.microsoft.com/en-us/azure/aks/use-system-pools
  2. The possibility of previewing kubelet customizations, as asked in the item above; this is still under consideration, and there will always be a support trade-off in these cases.

Until then, if you're using multiple node pools, you can already work around this by applying a DaemonSet similar to the one below to your User Pools.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    component: ds-reserve
  name: ds-reserve
  namespace: kube-system
spec:
  selector:
    matchLabels:
      component: ds-reserve
      tier: node
  template:
    metadata:
      labels:
        component: ds-reserve
        tier: node
    spec:
      containers:
      - command:
        - nsenter
        - --target
        - "1"
        - --mount
        - --uts
        - --ipc
        - --net
        - --pid
        - --
        - sh
        - -c
        - |
          # lower kube-reserved and the hard eviction threshold in the kubelet's host config
          sed -i 's/--kube-reserved=\S*/--kube-reserved=cpu=100m,memory=897Mi/' /etc/default/kubelet
          sed -i 's/--eviction-hard=\S*/--eviction-hard=memory.available<100Mi/' /etc/default/kubelet
          # restart kubelet so the new flags take effect on this node
          systemctl daemon-reload
          systemctl restart kubelet
          # keep the container alive so the DaemonSet pod isn't restarted
          while true; do sleep 100000; done
        image: alpine
        imagePullPolicy: IfNotPresent
        name: ds-reserve
        resources:
          requests:
            cpu: 10m
        securityContext:
          privileged: true
      dnsPolicy: ClusterFirst
      hostPID: true
      tolerations:
      - effect: NoSchedule
        operator: Exists
      restartPolicy: Always
      nodeSelector:
        kubernetes.azure.com/mode: user
  updateStrategy:
    type: RollingUpdate
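
Once the DaemonSet has rolled out and kubelet has restarted on the user-pool nodes, kubectl describe node on one of those nodes should show the larger Allocatable memory.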

@palma21 palma21 changed the title Allocatable memory probably suddenly decreased Provide more flexible resource reservations for User Pools Jul 27, 2020
@palma21 palma21 changed the title Provide more flexible resource reservations for User Pools Provide more flexible resource reservations for User Node Pools Jul 27, 2020
@palma21 palma21 added feature-request Requested Features nodepools nodepools/mode and removed Needs Attention 👋 Issues needs attention/assignee/owner action-required labels Jul 27, 2020
@palma21 palma21 self-assigned this Jul 27, 2020
@ghost ghost added the action-required label Jan 23, 2021
@ghost ghost added the action-required label Feb 28, 2022
@nwmcsween

@palma21 will the DaemonSet change result in an unsupported cluster, as per the shared responsibilities doc? The limits really need to be revised or priced into the VMs, as a 34% loss of RAM is severe.

@kaarthis
Contributor

@stl327 to comment on this and own it.

@RooMaiku RooMaiku added the addon/scaling Handling req/limit settings for AKS managed addon pods label Apr 6, 2023
@stl327
Contributor

stl327 commented Nov 6, 2023

AKS has released updated logic for our memory reservations, covering both kube-reserved and the eviction threshold. These optimizations will increase the allocatable space for application workloads by up to 20%. Currently this applies to AKS 1.28. For more information, please see: https://learn.microsoft.com/en-us/azure/aks/concepts-clusters-workloads#resource-reservations
