failed to assign an IP address to container after enabling custom networking #3238

Open
@barsilver

Description

What happened:

Every once in a while since we configured the vpc-cni add-on to use custom networking, we have been hitting the following error, even though there are enough IP addresses in the secondary subnets:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "c3b3c54fed84a015bf5a71d5f78ba19ad1ca7f2dcf73eb25fd3ff0dc7ff7c5ed": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container

When running the following command on this node:
kubectl describe node ip-172-32-1-101.ec2.internal | grep 'pods\|PrivateIPv4Address'

We see that the node's capacity shows:

pods:                    58

However, when running:
./max-pods-calculator.sh --instance-type c6i.2xlarge --cni-version "1.19.2-eksbuild.1" --cni-custom-networking-enabled
The result shows that the maximum number of pods this instance type should be able to run with custom networking enabled is 44.
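
The gap between the two numbers is expected under custom networking: pod IPs are no longer served from the primary ENI, so a kubelet registered with --max-pods=58 overshoots what ipamd can actually hand out. A quick sanity check of both limits, assuming c6i.2xlarge supports 4 ENIs with 15 IPv4 addresses each:

ENIS=4; IPS_PER_ENI=15   # assumed per-instance limits for c6i.2xlarge
echo "default max-pods:            $(( ENIS * (IPS_PER_ENI - 1) + 2 ))"        # 58 -- the value kubelet was given
echo "custom networking max-pods:  $(( (ENIS - 1) * (IPS_PER_ENI - 1) + 2 ))"  # 44 -- what the CNI can actually serve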

In /var/log/user_data.log of the affected node, the following configurations are visible:

INFO: --dns-cluster-ip='10.100.0.10'
2025-03-17T10:51:50+0000 [eks-bootstrap] INFO: --use-max-pods='false'
2025-03-17T10:51:50+0000 [eks-bootstrap] INFO: --kubelet-extra-args='--node-labels="component=entitle,karpenter.k8s.aws/ec2nodeclass=default,karpenter.sh/capacity-type=spot,karpenter.sh/nodepool=spot" --register-with-taints="karpenter.sh/unregistered:NoExecute" --max-pods=58'
2025-03-17T10:51:50+0000 [eks-bootstrap] INFO: Using kubelet version 1.31.5
2025-03-17T10:51:50+0000 [eks-bootstrap] INFO: Using containerd as the container runtime
Created symlink from /etc/systemd/system/multi-user.target.wants/sys-fs-bpf.mount to /etc/systemd/system/sys-fs-bpf.mount.
2025-03-17T10:51:51+0000 [eks-bootstrap] INFO: Using IP family: ipv4
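
The bootstrap log above shows where the 58 comes from: Karpenter injects --max-pods=58 through --kubelet-extra-args, which is the non-custom-networking limit. One possible mitigation, assuming your Karpenter version exposes the reservedENIs global setting, is to have Karpenter subtract the ENI reserved by custom networking when it computes max pods (release name and chart location below are assumptions; adjust to your installation):

helm upgrade karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace kube-system --reuse-values \
  --set settings.reservedENIs=1

Newly launched nodes should then register with --max-pods=44 for this instance type, matching the calculator output.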

The ipamd.log shows errors like these:

{"level":"debug","ts":"2025-03-18T07:15:21.077Z","caller":"datastore/data_store.go:607","msg":"AssignPodIPv4Address: IP address pool stats: total 42, assigned 42"}
{"level":"debug","ts":"2025-03-18T07:15:21.077Z","caller":"datastore/data_store.go:607","msg":"AssignPodIPv4Address: ENI eni-04ddad5f92297dbb6 does not have available addresses"}
{"level":"debug","ts":"2025-03-18T07:15:21.077Z","caller":"datastore/data_store.go:687","msg":"Get free IP from prefix failed no free IP available in the prefix - 100.64.220.11/ffffffff"}
{"level":"debug","ts":"2025-03-18T07:15:21.077Z","caller":"datastore/data_store.go:607","msg":"Unable to get IP address from CIDR: no free IP available in the prefix - 100.64.220.11/ffffffff"}
{"level":"debug","ts":"2025-03-18T07:15:21.077Z","caller":"datastore/data_store.go:687","msg":"Get free IP from prefix failed no free IP available in the prefix - 100.64.130.44/ffffffff"}
{"level":"debug","ts":"2025-03-18T07:15:21.077Z","caller":"datastore/data_store.go:607","msg":"Unable to get IP address from CIDR: no free IP available in the prefix - 100.64.130.44/ffffffff"}
{"level":"debug","ts":"2025-03-18T07:15:21.077Z","caller":"datastore/data_store.go:687","msg":"Get free IP from prefix failed no free IP available in the prefix - 100.64.139.212/ffffffff"}
{"level":"debug","ts":"2025-03-18T07:15:21.077Z","caller":"datastore/data_store.go:607","msg":"Unable to get IP address from CIDR: no free IP available in the prefix - 100.64.139.212/ffffffff"}
{"level":"debug","ts":"2025-03-18T07:15:21.077Z","caller":"datastore/data_store.go:687","msg":"Get free IP from prefix failed no free IP available in the prefix - 100.64.216.129/ffffffff"}

These errors started after attempting to resolve an issue where IP addresses were exhausted in the subnets the cluster was using. The subnets were too small, and both nodes and pods were using the same CIDR range.
Secondary subnets were added using Terraform with the following configuration:

module "custom-netwotking" {
  source         = "../../modules/vpc-cni-custom-networking"
  cluster_name   = module.eks_cluster.eks_cluster_id
  secondary_cidr = "100.64.0.0/16"
  secondary_subnets = {
    us-east-1a = "100.64.0.0/17"
    us-east-1b = "100.64.128.0/17"
  }
}
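
Since the original symptom was "no free IPs" despite roomy /17 subnets, it is also worth checking what EC2 itself reports as still available there (the subnets are selected by CIDR here rather than by hard-coded IDs):

# Free IPv4 addresses remaining in the secondary subnets
aws ec2 describe-subnets \
  --filters "Name=cidr-block,Values=100.64.0.0/17,100.64.128.0/17" \
  --query 'Subnets[].{SubnetId:SubnetId,AZ:AvailabilityZone,FreeIPs:AvailableIpAddressCount}' \
  --output table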

The ENIConfig for the newly created subnets looks like this:

apiVersion: v1
items:
- apiVersion: crd.k8s.amazonaws.com/v1alpha1
  kind: ENIConfig
  metadata:
    annotations:
    generation: 1
    name: us-east-1a
  spec:
    securityGroups:
    - sg-07029
    subnet: subnet-075de
- apiVersion: crd.k8s.amazonaws.com/v1alpha1
  kind: ENIConfig
  metadata:
    annotations:
    generation: 1
    name: us-east-1b
  spec:
    securityGroups:
    - sg-07029
    subnet: subnet-095f8
kind: List
metadata:
  resourceVersion: ""

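For completeness, it is worth confirming that the aws-node daemonset is actually wired to these ENIConfigs; if the custom networking flag or the label selector is missing, ipamd quietly keeps allocating from the node's primary subnet. With ENIConfigs named after availability zones, the usual setup is AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true together with ENI_CONFIG_LABEL_DEF=topology.kubernetes.io/zone:

# Check the custom networking environment variables and that the ENIConfigs exist
kubectl -n kube-system describe daemonset aws-node \
  | grep -E 'AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG|ENI_CONFIG_LABEL_DEF'
kubectl get eniconfigs.crd.k8s.amazonaws.com
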
When I tried setting ENABLE_POD_ENI=true and DISABLE_TCP_EARLY_DEMUX=true and then restarted the nodes, the nodes failed to be removed and reported the following events:

  Normal  ControllerVersionNotice  7s (x14 over 49m)    vpc-resource-controller  The node is managed by VPC resource controller version v1.6.3
  Normal  NodeTrunkFailedInit      6s (x14 over 49m)    vpc-resource-controller  The node failed initializing trunk interface
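
NodeTrunkFailedInit is a separate problem from the IP exhaustion above: with ENABLE_POD_ENI=true the VPC resource controller has to attach an extra trunk ENI, which itself typically consumes one of the instance's ENI slots, so a node whose slots are already fully used by custom networking ENIs can fail this step. Assuming the controller sets the documented label, you can see which nodes actually received a trunk interface:

kubectl get nodes -L vpc.amazonaws.com/has-trunk-attached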

Environment:

  • Node Kubernetes version: v1.31.4-eks-aeac579
  • CNI Version: 1.19.2-eksbuild.1
  • Node AMI: amazon-eks-arm64-node-1.31-v20250123
