Description
When a node is created or rebooted and the BPF dataplane is enabled, the calico-node pod attempts to resolve KUBERNETES_SERVICE_HOST during its startup phase. At this point, however, the node has not yet established connectivity to other nodes, including access to CoreDNS, so the DNS lookup fails and calico-node hangs.
When deploying Calico via the Tigera Operator, the calico-node pod is hardcoded with:
- dnsPolicy: ClusterFirstWithHostNet
- No way to override or customize dnsConfig
This setup results in a deadlock during startup, as calico-node cannot resolve the Kubernetes API server domain via DNS and therefore never completes initialization.
Expected Behavior
calico-node should initialize the dataplane before attempting any DNS lookups, ensuring it has network connectivity (e.g., to CoreDNS) when needed.
OR
The Tigera Operator should allow users to customize the dnsConfig of the calico-node Pod, so that it can be configured to use a reliable DNS resolver as a fallback.
Current Behavior
The install-cni container fails to complete successfully in the current environment due to the following error:
CrashLoopBackOff (back-off 1m20s restarting failed container=install-cni pod=calico-node-fzfrr_calico-system(b0cbeec3-d785-4388-ac40-cdf0b32e846b))
...
2025-07-17 06:08:12.485 [INFO][1] cni-installer/install.go 234: CNI plugin version: v3.30.2
2025-07-17 06:08:12.485 [INFO][1] cni-installer/install.go 186: /host/secondary-bin-dir is not writeable, skipping
2025-07-17 06:08:12.485 [INFO][1] cni-installer/winutils.go 149: Neither --kubeconfig nor --master was specified. Using the inClusterConfig.
2025-07-17 06:08:42.491 [ERROR][1] cni-installer/token_watch.go 108: Unable to create token for CNI kubeconfig error=Post "https://example.com:6443/api/v1/namespaces/calico-system/serviceaccounts/calico-cni-plugin/token": dial tcp: lookup example.com: i/o timeout
2025-07-17 06:08:42.491 [FATAL][1] cni-installer/install.go 499: Unable to create token for CNI kubeconfig error=Post "https://example.com:6443/api/v1/namespaces/calico-system/serviceaccounts/calico-cni-plugin/token": dial tcp: lookup example.com: i/o timeout
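To check whether the name lookup itself is what times out (rather than the TCP connection to the API server), a small diagnostic along these lines could be run in the node's host network namespace. This is only a sketch: `timed_lookup` and its parameters are hypothetical names for illustration and are not part of Calico or the CNI installer.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def timed_lookup(host, port, deadline_s, resolver=socket.getaddrinfo):
    """Resolve host:port and return (status, elapsed_seconds, detail)."""
    pool = ThreadPoolExecutor(max_workers=1)
    start = time.monotonic()
    future = pool.submit(resolver, host, port)
    try:
        result = future.result(timeout=deadline_s)
        # Each getaddrinfo entry ends with a sockaddr whose first item is the IP.
        status, detail = "ok", sorted({info[4][0] for info in result})
    except FutureTimeout:
        status, detail = "timeout", None
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        status, detail = "error", str(exc)
    finally:
        # Do not block on a lookup that is still stuck in the resolver.
        pool.shutdown(wait=False)
    return status, time.monotonic() - start, detail
```

Calling `timed_lookup("example.com", 6443, deadline_s=5.0)` right after a reboot and again once the node is healthy would show whether resolution is the step that stalls.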
I attempted to configure calico-node's dnsConfig via the Tigera Operator by setting the following values.yaml:
kubernetesServiceEndpoint:
  host: "example.com"
  port: "6443"
installation:
  calicoNodeDaemonSet:
    spec:
      template:
        spec:
          dnsConfig:
            nameservers:
              - cluster-dns
              - node-dns-for-fallback
However, this configuration had no effect: the custom dnsConfig was not applied to the calico-node pods:
kubectl get daemonsets calico-node -n calico-system -o yaml
...
      dnsConfig:
        nameservers:
        - 10.104.0.10
        - 2001:cafe:104::a
      dnsPolicy: ClusterFirstWithHostNet
...
Possible Solution
Manually setting a static entry for the Kubernetes API server domain in the node's /etc/hosts file can temporarily mitigate the issue. However, this approach is not sustainable in dynamic environments where the control plane IP might change, as it requires manual updates on every node.
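For illustration, the /etc/hosts workaround amounts to keeping exactly one current mapping for the API server name on each node. The sketch below shows that update as a pure text transformation; `ensure_hosts_entry` is a hypothetical helper, the addresses come from the documentation range 192.0.2.0/24, and actually writing the result back to /etc/hosts on every node is left out.

```python
def ensure_hosts_entry(hosts_text, ip, hostname):
    """Return hosts_text with exactly one entry mapping hostname to ip."""
    kept = []
    for line in hosts_text.splitlines():
        # Ignore trailing comments when looking for the hostname.
        fields = line.split("#", 1)[0].split()
        # fields[0] is the address; the rest are names for that address.
        if hostname in fields[1:]:
            # Drop stale entries. Limitation: other aliases that share
            # the same line are dropped with it.
            continue
        kept.append(line)
    kept.append(f"{ip} {hostname}")
    return "\n".join(kept) + "\n"


current = "127.0.0.1 localhost\n192.0.2.10 example.com\n"
print(ensure_hosts_entry(current, "192.0.2.20", "example.com"), end="")
```

Because the function is idempotent, it could be re-run whenever the control plane address changes, but something still has to trigger it on each node, which is the sustainability problem noted above.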
Deploying NodeLocal DNSCache may help mitigate the issue by providing local DNS resolution before full network connectivity is established. However, in my testing, although calico-node could eventually resolve the correct domain name, it first attempted to resolve the name with the search domains appended, and that initial lookup hit the timeout deadline, causing the pod to fail startup.
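The search-domain ordering can be illustrated with a minimal sketch of how a stub resolver builds its query list. It assumes the `ndots:5` option that kubelet writes into resolv.conf for cluster DNS policies; the search list below is an assumed example for a pod in calico-system, and `candidate_names` is an illustrative helper rather than anything from Calico.

```python
def candidate_names(name, search, ndots=5):
    """Return the order in which a stub resolver would query `name`."""
    # A trailing dot marks the name as fully qualified: no search expansion.
    if name.endswith("."):
        return [name]
    expanded = [f"{name}.{domain.rstrip('.')}." for domain in search]
    absolute = name + "."
    # Names with fewer than `ndots` dots try the search list first.
    if name.count(".") < ndots:
        return expanded + [absolute]
    return [absolute] + expanded


# Assumed search list; the real one comes from the pod's /etc/resolv.conf.
search = ["calico-system.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for query in candidate_names("example.com", search):
    print(query)
```

Under these assumptions, example.com has only one dot, so three search-domain queries are attempted before the bare name. If each of those has to wait for an unreachable resolver, the deadline can expire before the correct query is ever sent.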
Steps to Reproduce (for bugs)
Install the Tigera Operator using the following values.yaml:
kubernetesServiceEndpoint:
  host: "example.com"
  port: "6443"
installation:
  controlPlaneTolerations:
    - key: node.kubernetes.io/network-unavailable
      operator: Exists
  cni:
    type: Calico
  calicoNetwork:
    bgp: Disabled
    containerIPForwarding: Enabled
    linuxDataplane: BPF
    ipPools:
      - blockSize: 26
        cidr: 10.103.0.0/16
        encapsulation: VXLANCrossSubnet
        name: ipv4-ippool
        natOutgoing: Enabled
        nodeSelector: all()
      - blockSize: 122
        cidr: 2001:cafe:103::/56
        encapsulation: VXLANCrossSubnet
        name: ipv6-ippool
        natOutgoing: Enabled
        nodeSelector: all()
  serviceCIDRs:
    - 10.104.0.0/16
    - 2001:cafe:104::00/112
Context
Affects node addition, replacement, and reboot scenarios.
Your Environment
- Calico version: v3.30.2
- Calico dataplane (iptables, windows etc.): BPF
- Orchestrator version (e.g. kubernetes, mesos, rkt): v1.32.5+k3s1
- Operating System and version: Ubuntu 24.04.2 LTS
- Link to your project (optional):