Provide more flexible resource reservations for User Node Pools #1339

Closed
ondrejhlavacek opened this issue Nov 27, 2019 · 12 comments

Assignees
palma21

Labels
addon/scaling (Handling req/limit settings for AKS managed addon pods), feature-request (Requested Features), nodepools/mode, nodepools

Comments

@ondrejhlavacek

Running a staging AKS cluster with 3 Standard B2s nodes.

kubectl get nodes -o wide
NAME                       STATUS   ROLES   AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
aks-agentpool-22372688-0   Ready    agent   62d   v1.13.9   10.240.0.4    <none>        Ubuntu 16.04.6 LTS   4.15.0-1063-azure   docker://3.0.6
aks-agentpool-22372688-1   Ready    agent   62d   v1.13.9   10.240.0.6    <none>        Ubuntu 16.04.6 LTS   4.15.0-1063-azure   docker://3.0.6
aks-agentpool-22372688-2   Ready    agent   62d   v1.13.9   10.240.0.5    <none>        Ubuntu 16.04.6 LTS   4.15.0-1063-azure   docker://3.0.6

Between 22/11/19 and 26/11/19, deployments stopped working. New pods stay in the Pending state with the message 0/3 nodes are available: 3 Insufficient memory.

I'd swear nothing else changed on our side, but I have no hard proof apart from a successful deploy pipeline on 22/11/19. The number of pods hasn't changed, and I was previously able to run 2-3 times more pods on the same cluster. I vaguely remember that the nodes were waiting for a restart after a security/kernel update.

The current values of Capacity and Allocatable on a node:

Capacity:
 attachable-volumes-azure-disk:  4
 cpu:                            2
 ephemeral-storage:              101584140Ki
 hugepages-1Gi:                  0
 hugepages-2Mi:                  0
 memory:                         4017572Ki
 pods:                           110
Allocatable:
 attachable-volumes-azure-disk:  4
 cpu:                            1931m
 ephemeral-storage:              93619943269
 hugepages-1Gi:                  0
 hugepages-2Mi:                  0
 memory:                         2200996Ki
 pods:                           110

Is there any chance the allocatable memory dropped after the kernel update (which may have triggered, e.g., an aks-engine update)?

Thanks!

@welcome

welcome bot commented Nov 27, 2019

👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.

@ritazh
Member

ritazh commented Dec 2, 2019

Looks like this is an AKS cluster. Transferring this issue there in case other users have seen similar behavior.

@ritazh ritazh transferred this issue from Azure/aks-engine Dec 2, 2019
@ghost ghost added the triage label Dec 2, 2019
@ondrejhlavacek
Author

@ritazh Oh, I'm sorry, thanks!

@ondrejhlavacek
Author

Possibly related to #1216, and it probably didn't happen during the period I indicated.

@neoGeneva

Hey, I'm also having problems due to reduced memory on nodes.

I have two single-node clusters on different versions of Kubernetes. Both have nodes with 4017088Ki capacity, but the v1.10.3 cluster's node has 3092416Ki allocatable while the v1.14.8 cluster's node has only 2200480Ki.

It looks like both the kube-reserved value and the hard eviction limit have increased: the older cluster has --eviction-hard=memory.available<100Mi --kube-reserved=memory=803Mi, while the newer one has --eviction-hard=memory.available<750Mi --kube-reserved=memory=1024Mi.

As mentioned in #1216, having ~45% of memory reserved is pretty restrictive. Is there any chance of having these values tweaked for low-memory nodes?
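
Those flags account for the difference: allocatable is roughly capacity minus kube-reserved minus the hard eviction threshold (assuming no separate system-reserved, which doesn't appear to be set here). A quick check of the arithmetic with the numbers reported above:

# older cluster: --kube-reserved=memory=803Mi, --eviction-hard=memory.available<100Mi
echo $(( 4017088 - 803*1024 - 100*1024 ))   # 3092416 -> matches the v1.10.3 node exactly

# newer cluster: --kube-reserved=memory=1024Mi, --eviction-hard=memory.available<750Mi
echo $(( 4017088 - 1024*1024 - 750*1024 ))  # 2200512 -> within ~32Ki of the v1.14.8 node's 2200480Ki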

@worldspawn

worldspawn commented Jun 11, 2020

It's outlined here, I believe: https://docs.microsoft.com/en-us/azure/aks/concepts-clusters-workloads#resource-reservations

Memory - memory utilized by AKS includes the sum of two values.
The kubelet daemon is installed on all Kubernetes agent nodes to manage container creation and termination. By default on AKS, this daemon has the following eviction rule: memory.available<750Mi, which means a node must always have at least 750 Mi allocatable at all times. When a host is below that threshold of available memory, the kubelet will terminate one of the running pods to free memory on the host machine and protect it. This is a reactive action once available memory decreases beyond the 750Mi threshold.

The second value is a regressive rate of memory reservations for the kubelet daemon to properly function (kube-reserved).

25% of the first 4 GB of memory
20% of the next 4 GB of memory (up to 8 GB)
10% of the next 8 GB of memory (up to 16 GB)
6% of the next 112 GB of memory (up to 128 GB)
2% of any memory above 128 GB

Seems a bit obscene really. Certainly something to factor in when weighing the real comparative costs between node sizes.
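
For the 4 GiB B2s nodes in this thread, that schedule works out to 25% of 4 GiB = 1 GiB, which matches the --kube-reserved=memory=1024Mi flag reported earlier; small nodes sit entirely in the most expensive bracket. A rough sketch of the schedule as a shell function (bracket sizes copied from the quoted docs; the helper name kube_reserved_mib and the whole-MiB rounding are just for illustration):

# kube-reserved memory (MiB) for a node with $1 GiB of RAM, per the schedule above
kube_reserved_mib() {
  local remaining=$1 reserved=0 bracket size rate take
  for bracket in 4:25 4:20 8:10 112:6 999999:2; do   # GiB:percent; last bracket effectively unbounded
    size=${bracket%:*}; rate=${bracket#*:}
    take=$(( remaining < size ? remaining : size ))
    reserved=$(( reserved + take * 1024 * rate / 100 ))
    remaining=$(( remaining - take ))
  done
  echo "$reserved"
}

kube_reserved_mib 4    # 1024 -> matches --kube-reserved=memory=1024Mi on a 4 GiB node
kube_reserved_mib 16   # 2662 (~2.6 GiB)
kube_reserved_mib 128  # 9543 (~9.3 GiB)

Add the 750 Mi hard eviction threshold on top and roughly 45% of a 4 GiB node ends up unavailable to workloads, which matches the numbers reported above.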

@github-actions

Action required from @Azure/aks-pm

@ghost ghost removed the triage label Jul 21, 2020
@ghost

ghost commented Jul 26, 2020

Action required from @Azure/aks-pm

@ghost ghost added the Needs Attention 👋 Issues needs attention/assignee/owner label Jul 26, 2020
@palma21
Member

palma21 commented Jul 27, 2020

Dropping the same info from the linked issue. I will leave this issue open as the feature request for less aggressive reservations on User pools.

As for why the difference: I can't speak for the other cloud providers, as I don't have visibility into their workloads and customers. AKS is fairly conservative when it comes to protecting the cluster against "rogue" or misbehaved workloads, which have caused a lot of issues in the past; workloads can race for resources faster than even cgroups and slices can account for, so we needed a larger buffer.

This, we acknowledge, can penalize well-behaved workloads and users that would otherwise benefit from more lenient default reservations. We didn't take this decision lightly; it was made on account of hundreds of cases where we saw these issues, and so far it has been a trade-off of running in this managed-service scenario.

Nonetheless, we're working on providing:

  1. The ability to have lower reservations on User Pools vs. System Pools. https://docs.microsoft.com/en-us/azure/aks/use-system-pools
  2. The possibility of previewing kubelet customizations, as asked in the item above; this is still under consideration, and there will always be a support trade-off in these cases.

Until then, if you're using multiple node pools, you can already work around this by applying a DaemonSet similar to the one below to your User Pools.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    component: ds-reserve
  name: ds-reserve
  namespace: kube-system
spec:
  selector:
    matchLabels:
      component: ds-reserve
      tier: node
  template:
    metadata:
      labels:
        component: ds-reserve
        tier: node
    spec:
      containers:
      - command:
        - nsenter
        - --target
        - "1"
        - --mount
        - --uts
        - --ipc
        - --net
        - --pid
        - --
        - sh
        - -c
        - |
          # lower kube-reserved and the hard eviction threshold in the kubelet's host config
          sed -i 's/--kube-reserved=\S*/--kube-reserved=cpu=100m,memory=897Mi/' /etc/default/kubelet
          sed -i 's/--eviction-hard=\S*/--eviction-hard=memory.available<100Mi/' /etc/default/kubelet
          # restart kubelet so the new flags take effect on this node
          systemctl daemon-reload
          systemctl restart kubelet
          # keep the container alive so the DaemonSet pod isn't restarted
          while true; do sleep 100000; done
        image: alpine
        imagePullPolicy: IfNotPresent
        name: ds-reserve
        resources:
          requests:
            cpu: 10m
        securityContext:
          privileged: true
      dnsPolicy: ClusterFirst
      hostPID: true
      tolerations:
      - effect: NoSchedule
        operator: Exists
      restartPolicy: Always
      nodeSelector:
        kubernetes.azure.com/mode: user
  updateStrategy:
    type: RollingUpdate
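
Once the DaemonSet has rolled out and kubelet has restarted on the user-pool nodes, kubectl describe node on one of those nodes should show the larger Allocatable memory.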

@palma21 palma21 changed the title Allocatable memory probably suddenly decreased Provide more flexible resource reservations for User Pools Jul 27, 2020
@palma21 palma21 changed the title Provide more flexible resource reservations for User Pools Provide more flexible resource reservations for User Node Pools Jul 27, 2020
@palma21 palma21 added feature-request Requested Features nodepools nodepools/mode and removed Needs Attention 👋 Issues needs attention/assignee/owner action-required labels Jul 27, 2020
@palma21 palma21 self-assigned this Jul 27, 2020
@ghost ghost added the action-required label Jan 23, 2021
@ghost ghost added the action-required label Feb 28, 2022
@nwmcsween

@palma21 will the DaemonSet change result in an unsupported cluster, as per the shared responsibilities doc? The limits really need to be revised or priced into the VMs, as a 34% loss of RAM is severe.

@kaarthis
Contributor

@stl327 to comment on this and own it.

@RooMaiku RooMaiku added the addon/scaling Handling req/limit settings for AKS managed addon pods label Apr 6, 2023
@stl327
Contributor

stl327 commented Nov 6, 2023

AKS has released updated logic for our memory reservations, covering both kube-reserved and the eviction threshold. These optimizations will increase the allocatable space for application workloads by up to 20%. Currently this applies to AKS 1.28. For more information, please see: https://learn.microsoft.com/en-us/azure/aks/concepts-clusters-workloads#resource-reservations
