Karpenter OutOfMemory frequency spikes when many NodeClaims are in Unknown status #2358

@kunhwiko

Description

Observed Behavior:
We noticed that during burst scheduling, Karpenter's memory usage spikes abnormally at first but settles down as instances get provisioned. In our case, we are bursting from 20 --> 1500 nodes and 800 --> 18000 pods. Our Karpenter controller is provisioned with 12Gi of memory requests/limits.
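
For reference, the 12Gi corresponds to Helm chart values along these lines (a sketch; this assumes the standard Karpenter chart's controller.resources block):

controller:
  resources:
    requests:
      memory: 12Gi   # bumped up specifically to survive burst scheduling
    limits:
      memory: 12Gi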

With memory profiling turned on, the spike traced back to the following allocation flow:

[memory profile image]

During the initial phase of scheduling, hundreds of NodeClaim objects remain in an Unknown state. From my understanding, Karpenter continuously looks for NodeClaims that aren't fully initialized and caches the full NodeClaim object for each of them for up to 1 minute (recently bumped further to 1 hour).

Since the full object is cached for each NodeClaim, the cache can grow quite quickly and cause Karpenter to crash unless it is given more memory than usual. When Karpenter recovers, the controller's cache is rebuilt: it will likely loop through NodeClaims that were already processed, re-cache them, and the cycle repeats once Karpenter crashes again. Some NodeClaims may not get processed for quite some time, although eventually they will as other NodeClaims start to initialize.

As entries expire out of the cache, we then see Karpenter's memory start to stabilize.


Expected Behavior:
I might be misunderstanding the code, and will be happy to be corrected! Some observations:

  • I am wondering if it is feasible to cache only the parts of the NodeClaim that are needed rather than the full object, since the full objects fill up memory quite quickly.
  • I am also wondering if it is feasible to evict fully initialized NodeClaims from the cache, now that upcoming Karpenter versions will store cached objects for up to 1 hour.

I understand this might not be feasible if it brings unnecessary complexity. Alternatively, I am wondering:

  • For us, burst scheduling is awkward: memory needs to be bumped up at scheduling time because of the cache size, but it can be bumped back down almost immediately after all of the nodes have initialized. Have others experienced this issue, and how did they get around it? (e.g. VPA; a sketch follows this list)
  • Would the Karpenter team happen to have benchmarks for scheduling many instances at once?
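
Regarding the VPA idea in the first bullet above, a minimal VerticalPodAutoscaler sketch targeting the Karpenter deployment might look like the following. This assumes the VPA components are installed and that Karpenter runs as a Deployment named karpenter in kube-system with a container named controller (adjust for your install); the bounds are illustrative:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: karpenter            # hypothetical name
  namespace: kube-system     # assumes Karpenter runs here
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: karpenter
  updatePolicy:
    updateMode: "Auto"       # VPA evicts and recreates pods to apply new requests
  resourcePolicy:
    containerPolicies:
      - containerName: controller
        minAllowed:
          memory: 2Gi        # illustrative bounds
        maxAllowed:
          memory: 16Gi

One caveat worth noting: in "Auto" mode VPA resizes by evicting the pod, so a restart could land right in the middle of the burst it is reacting to.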

Reproduction Steps (Please include YAML):
This can really be any NodePool/EC2NodeClass. We opted for a relatively small instance-size requirement on the NodePool.
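
For illustration, a minimal NodePool sketch along those lines; the instance-size values and the EC2NodeClass name are illustrative, not our exact config:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        # keep instances small so 1500 replicas fan out across ~1500 nodes
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["medium", "large"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default   # hypothetical EC2NodeClass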

We can use simple pause pods that schedule to different nodes (via the pod anti-affinity below) and just scale a bunch of them (e.g. 1500) at the same time:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      securityContext:
        runAsUser: 1000
        runAsGroup: 3000
        fsGroup: 2000
      affinity:
        # one inflate pod per node, so each replica forces a new node
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: inflate
              topologyKey: "kubernetes.io/hostname"
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            limits:
              memory: 512Mi
            requests:
              cpu: 100m
              memory: 512Mi
          securityContext:
            allowPrivilegeEscalation: false
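
To reproduce the burst, scale the deployment up in one shot, e.g. kubectl scale deployment/inflate --replicas=1500, and watch Karpenter's memory while the resulting NodeClaims sit in Unknown status.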

Versions:

  • Chart Version: v1.3.3
  • Kubernetes Version (kubectl version): v1.32

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Metadata

    Labels

    kind/bug: Categorizes issue or PR as related to a bug.
    needs-priority
    triage/accepted: Indicates an issue or PR is ready to be actively worked on.
    triage/needs-information: Indicates an issue needs more information in order to work on it.
