
Missing allocatable memory explanation #1216


Closed
cubed-it opened this issue Sep 18, 2019 · 12 comments

@cubed-it

What happened:
kubectl describe node prints a capacity of 4017088Ki, but allocatable is only 2200512Ki.
So about 45% is reserved.

What you expected to happen:
To find an answer to how allocatable memory is determined, like the explanation here: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-architecture?hl=de#memory_cpu

Anything else we need to know?:
I would also like to know why AKS provides nearly 25% less memory on my 4 GB VM than GKE does.

Environment:

  • Kubernetes version (use kubectl version): 1.14.6
  • Size of cluster (how many worker nodes are in the cluster?): 3x B2s
@ghost ghost added the triage label Sep 18, 2019
@jpoizat

jpoizat commented Sep 27, 2019

I don't know where this is coming from, but EKS is much less hungry on memory:
Capacity:
...
memory: 16038616Ki
...
Allocatable:
...
memory: 15219416Ki

versus AKS:
Capacity:
...
memory: 16403296Ki
...
Allocatable:
...
memory: 12909408Ki

@onybo

onybo commented Oct 24, 2019

We see similar behaviour to @cubed-it.

From my very limited numbers, it looks like using small nodes in AKS results in a huge waste of memory resources (45%).
It also looks like it gets better with node size.
All my numbers are from kubectl describe node <node name>.

Cluster 1:
capacity/allocatable: 4016988Ki/2200412Ki
Reserved memory?: 1816576Ki or 45%
version: 1.15.3

Cluster 2:
capacity/allocatable: 8145760Ki/5490528Ki
Reserved memory?: 2655232Ki or 33%
version: 1.15.3

Including @jpoizat's numbers for comparison:
capacity/allocatable: 16403296Ki/12909408Ki
Reserved memory?: 3493888Ki or 21%
version: ?

@jpoizat

jpoizat commented Oct 24, 2019

There is a document explaining how it is done:
https://docs.microsoft.com/en-us/azure/aks/concepts-clusters-workloads#resource-reservations

but agreed, the memory % reserved on smaller nodes is high...

@mimckitt

Adding @jluk @sauryadas @palma21

@MarcosMMartinez

@palma21 - any input here? I have an internal inquiry coming your way soon - it relates to this.

@jluk
Contributor

jluk commented Dec 2, 2019

There has since been a v2 addition to the linked document describing the memory reservations.
I believe this should close out this open issue, but I defer to @MicahMcKittrick-MSFT.

Memory - reserved memory is the sum of two values.

The kubelet daemon is installed on all Kubernetes agent nodes to manage container creation and termination. By default on AKS, this daemon has the eviction rule memory.available<750Mi, which means a node must always keep at least 750 Mi of memory available. When a host drops below that threshold of available memory, the kubelet terminates one of the running pods to free memory on the host machine and protect it.

The second value is a progressive rate of memory reserved for the kubelet daemon to function properly (kube-reserved):

25% of the first 4 GB of memory
20% of the next 4 GB of memory (up to 8 GB)
10% of the next 8 GB of memory (up to 16 GB)
6% of the next 112 GB of memory (up to 128 GB)
2% of any memory above 128 GB
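
Putting the two pieces together: allocatable = capacity - kube-reserved - the 750 Mi eviction threshold. Below is a minimal sketch (mine, not an official AKS implementation) that reproduces the numbers quoted earlier in this thread; the details that the progressive rate is applied to the VM's nominal memory in GiB and that the result is rounded down to whole MiB are assumptions inferred from those numbers, not something the doc states.

EVICTION_THRESHOLD_MIB = 750  # AKS hard eviction rule: memory.available<750Mi

# (upper bound of each tier in GiB, fraction of that tier that is kube-reserved)
KUBE_RESERVED_TIERS = [(4, 0.25), (8, 0.20), (16, 0.10), (128, 0.06), (float("inf"), 0.02)]

def kube_reserved_mib(nominal_gib):
    """Progressive kube-reserved memory in MiB, rounded down (assumed rounding)."""
    reserved_gib, prev_bound = 0.0, 0.0
    for bound, rate in KUBE_RESERVED_TIERS:
        tier_gib = min(nominal_gib, bound) - prev_bound
        if tier_gib <= 0:
            break
        reserved_gib += tier_gib * rate
        prev_bound = bound
    return int(reserved_gib * 1024)

def allocatable_kib(capacity_kib, nominal_gib):
    """Allocatable = capacity - kube-reserved - hard eviction threshold."""
    return capacity_kib - (kube_reserved_mib(nominal_gib) + EVICTION_THRESHOLD_MIB) * 1024

# Numbers reported in this thread:
print(allocatable_kib(4017088, 4))    # 2200512 Ki -- matches the original B2s node
print(allocatable_kib(8145760, 8))    # 5490528 Ki -- matches Cluster 2 above
print(allocatable_kib(16403296, 16))  # 12909408 Ki -- matches jpoizat's AKS node

On the 4 GiB node that is 1024 Mi of kube-reserved plus 750 Mi of eviction threshold, which is where the ~45% figure at the top of this issue comes from.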

@MarcosMMartinez

@jluk, it seems we're asserting the 750Mi isn't allocatable, though? My calculations also show the same.

[image]

In other words, it seems like we're specifying this value against --kube-reserved:

[image]

Ref: https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#node-allocatable
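
For reference, the Kubernetes doc linked above defines Node Allocatable as capacity minus kube-reserved, minus system-reserved, minus the hard eviction thresholds, so whether the 750Mi shows up in --eviction-hard or gets folded into --kube-reserved, it is subtracted from what pods can use either way.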

@Timvissers

Where does the big difference with other cloud providers come from?

@Pamir

Pamir commented Apr 12, 2020

Hi @Timvissers

If you can shell into the Kubernetes worker nodes, you will see the actual reservation in MiB.

You can use the krew node-shell plugin to get a shell on a worker VM:

k node-shell xxxx
ps -ef | grep kubelet

and look for:

--kube-reserved=cpu=100m,memory=1638Mi

The value depends on the SKU; every cloud vendor should document it.
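
For what it's worth, 1638Mi is exactly what the progressive rule above yields for a node with 7 GiB of nominal memory (0.25 × 4 GiB + 0.20 × 3 GiB = 1.6 GiB ≈ 1638 Mi), assuming that's the VM size in this example, so the flag on any given node should be predictable from its SKU.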

@neoGeneva

It looks like the ~25% difference on 4Gi nodes between AKS and GKE is the eviction threshold: it's 750Mi on AKS and 100Mi on GKE (according to the docs, anyhow).

I couldn't find the exact numbers for EKS, but it looks like they allow customization of the kubelet config, so it's effectively whatever you like. I see there's a feature request for that here too (#323), and I think that'd be a good way to let people tune the settings to what's appropriate for their low-memory setups.
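
As a rough sanity check with numbers already in this thread: 750Mi vs 100Mi is a 650Mi gap, and adding 650Mi back onto the 2200512Ki (~2149Mi) allocatable reported for the original 4 GiB node gives roughly 2800Mi, so the AKS figure is indeed about 23% lower than that, assuming kube-reserved is otherwise comparable between the two providers.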

@github-actions

Action required from @Azure/aks-pm

@ghost ghost removed the triage label Jul 21, 2020
@palma21
Member

palma21 commented Jul 21, 2020

I'm closing this issue as the clarification was added to the doc.

As for explaining the difference: I can't speak for the other cloud providers, as I don't have visibility into the workloads and customers there. AKS is fairly conservative when it comes to protecting the cluster against "rogue" or misbehaved workloads, which have caused a lot of issues in the past; workloads can race for resources faster than even cgroups and slices can account for, so we needed a larger buffer.

This, we acknowledge, can penalize well-behaved workloads and users that would otherwise benefit from more lenient default reservations. We didn't take this decision lightly, but on account of hundreds of cases where we saw these issues; so far this has been a trade-off of running in this managed service scenario.

Nonetheless, we're working on providing:

  1. The ability to have lower reservations on User pools vs. System pools: https://docs.microsoft.com/en-us/azure/aks/use-system-pools
  2. The possibility of previewing kubelet customization, as asked in the item above. There will always be a support trade-off in these cases.

Until then, and if you're using multiple node pools, you can already work around this by applying a DaemonSet similar to the one below to your User pools.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    component: ds-reserve
  name: ds-reserve
  namespace: kube-system
spec:
  selector:
    matchLabels:
      component: ds-reserve
      tier: node
  template:
    metadata:
      labels:
        component: ds-reserve
        tier: node
    spec:
      containers:
      - command:
        - nsenter
        - --target
        - "1"
        - --mount
        - --uts
        - --ipc
        - --net
        - --pid
        - --
        - sh
        - -c
        - |
          sed -i 's/--kube-reserved=\S*/--kube-reserved=cpu=100m,memory=897Mi/' /etc/default/kubelet
          sed -i 's/--eviction-hard=\S*/--eviction-hard=memory.available<100Mi/' /etc/default/kubelet
          systemctl daemon-reload
          systemctl restart kubelet
          while true; do sleep 100000; done
        image: alpine
        imagePullPolicy: IfNotPresent
        name: ds-reserve
        resources:
          requests:
            cpu: 10m
        securityContext:
          privileged: true
      dnsPolicy: ClusterFirst
      hostPID: true
      tolerations:
      - effect: NoSchedule
        operator: Exists
      restartPolicy: Always
      nodeSelector:
        kubernetes.azure.com/mode: user
  updateStrategy:
    type: RollingUpdate
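
In short, the pod above uses nsenter (hostPID plus privileged) to run on the host, rewrites --kube-reserved and --eviction-hard in /etc/default/kubelet, and restarts the kubelet; the nodeSelector restricts it to nodes labelled kubernetes.azure.com/mode: user and the toleration lets it land on tainted pools. After applying it with kubectl apply, the lower reservation should show up under Allocatable in kubectl describe node for those nodes. Note that 897Mi and memory.available<100Mi are just the example values from this comment; adjust them to what is appropriate for your workloads, keeping in mind the protection trade-off described above.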

@palma21 palma21 closed this as completed Jul 21, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Aug 20, 2020