calico-node fails DNS lookup on startup when KUBERNETES_SERVICE_HOST is a domain and dataplane is BPF #10683

@Mmx233

Description

When a node is created or rebooted with the BPF dataplane enabled, the calico-node pod tries to resolve KUBERNETES_SERVICE_HOST during startup. At that point the node has no connectivity to other nodes, and therefore no access to CoreDNS, so the DNS lookup fails and calico-node hangs.

When deploying Calico via the Tigera Operator, the calico-node pod is hardcoded with:

  • dnsPolicy: ClusterFirstWithHostNet
  • No way to override or customize dnsConfig

This setup results in a deadlock during startup, as calico-node cannot resolve the Kubernetes API server domain via DNS and therefore never completes initialization.
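
The deadlock only applies when KUBERNETES_SERVICE_HOST is a name rather than an IP literal, because an IP address needs no resolver. As a minimal sketch for checking which case a given value falls into (the helper name `needs_dns_lookup` is hypothetical and is not part of Calico or the operator):

```python
import ipaddress
import os


def needs_dns_lookup(host: str) -> bool:
    """Return True if `host` is a name that must be resolved via DNS.

    Hypothetical helper for illustration only. IPv4/IPv6 literals
    (optionally in [brackets]) need no resolver, so they return False.
    """
    try:
        ipaddress.ip_address(host.strip("[]"))
        return False
    except ValueError:
        return True


# Falls back to the placeholder domain used in this issue when the
# variable is not set in the current environment.
host = os.environ.get("KUBERNETES_SERVICE_HOST", "example.com")
print(f"{host}: DNS lookup required = {needs_dns_lookup(host)}")
```

A value such as 10.104.0.10 or 2001:cafe:104::a would bypass DNS entirely, which is why the problem is specific to a domain-based endpoint.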

Expected Behavior

calico-node should initialize the dataplane before attempting to perform any DNS lookups, ensuring it has network connectivity (e.g., to CoreDNS) when needed.

OR

The Tigera Operator should allow users to customize the dnsConfig of the calico-node Pod, so that it can be configured to use a reliable DNS resolver as fallback.

Current Behavior

The install-cni container fails with the following error:

CrashLoopBackOff (back-off 1m20s restarting failed container=install-cni pod=calico-node-fzfrr_calico-system(b0cbeec3-d785-4388-ac40-cdf0b32e846b))

...
2025-07-17 06:08:12.485 [INFO][1] cni-installer/install.go 234: CNI plugin version: v3.30.2
2025-07-17 06:08:12.485 [INFO][1] cni-installer/install.go 186: /host/secondary-bin-dir is not writeable, skipping
2025-07-17 06:08:12.485 [INFO][1] cni-installer/winutils.go 149: Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.
2025-07-17 06:08:42.491 [ERROR][1] cni-installer/token_watch.go 108: Unable to create token for CNI kubeconfig error=Post "https://example.com:6443/api/v1/namespaces/calico-system/serviceaccounts/calico-cni-plugin/token": dial tcp: lookup example.com: i/o timeout
2025-07-17 06:08:42.491 [FATAL][1] cni-installer/install.go 499: Unable to create token for CNI kubeconfig error=Post "https://example.com:6443/api/v1/namespaces/calico-system/serviceaccounts/calico-cni-plugin/token": dial tcp: lookup example.com: i/o timeout

I attempted to configure calico-node's dnsConfig via Tigera Operator by setting the following values.yaml:

kubernetesServiceEndpoint:
  host: "example.com"
  port: "6443"
installation:
  calicoNodeDaemonSet:
    spec:
      template:
        spec:
          dnsConfig:
            nameservers:
              - cluster-dns
              - node-dns-for-fallback

However, this configuration had no effect: the custom dnsConfig was not applied to the calico-node pods.

kubectl get daemonsets calico-node -n calico-system -o yaml

...
      dnsConfig:
        nameservers:
        - 10.104.0.10
        - 2001:cafe:104::a
      dnsPolicy: ClusterFirstWithHostNet
...

Possible Solution

Manually adding a static entry for the Kubernetes API server domain to each node’s /etc/hosts file mitigates the issue temporarily. However, this is not sustainable in dynamic environments where the control plane IP might change, because every node must then be updated by hand.

Deploying NodeLocal DNSCache may help by providing local DNS resolution before full network connectivity is established. However, in my testing, although calico-node could eventually resolve the correct domain name, it first tried the name with the search domains appended, and that initial lookup hit the timeout deadline, causing the pod to fail at startup.
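
The search-domain behaviour described above follows from the standard resolver rule: a name with fewer dots than `ndots` is tried with each search domain appended before it is tried as-is. The sketch below models that ordering. The search list and `ndots=5` are assumptions based on typical Kubernetes pod defaults, not values taken from the affected node's resolv.conf, and `candidate_names` is a hypothetical helper.

```python
def candidate_names(name, search, ndots=5):
    """Return the names a stub resolver would try, in order.

    Simplified model of the resolv.conf search/ndots rule:
    - a trailing dot marks the name as fully qualified: no search list;
    - names with at least `ndots` dots are tried as-is first;
    - otherwise every search domain is tried before the bare name.
    """
    if name.endswith("."):
        return [name.rstrip(".")]
    expanded = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        return [name] + expanded
    return expanded + [name]


# Assumed search list for a pod in the calico-system namespace.
search = [
    "calico-system.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
]

for candidate in candidate_names("example.com", search):
    print(candidate)
```

Under these assumptions, example.com is only queried as-is after three search-domain attempts, so a single timed-out attempt early in the list can consume the whole lookup deadline.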

Steps to Reproduce (for bugs)

Install the Tigera Operator using the following values.yaml:

kubernetesServiceEndpoint:
  host: "example.com"
  port: "6443"
installation:
  controlPlaneTolerations:
    - key: node.kubernetes.io/network-unavailable
      operator: Exists
  cni:
    type: Calico
  calicoNetwork:
    bgp: Disabled
    containerIPForwarding: Enabled
    linuxDataplane: BPF
    ipPools:
      - blockSize: 26
        cidr: 10.103.0.0/16
        encapsulation: VXLANCrossSubnet
        name: ipv4-ippool
        natOutgoing: Enabled
        nodeSelector: all()
      - blockSize: 122
        cidr: 2001:cafe:103::/56
        encapsulation: VXLANCrossSubnet
        name: ipv6-ippool
        natOutgoing: Enabled
        nodeSelector: all()
  serviceCIDRs:
    - 10.104.0.0/16
    - 2001:cafe:104::00/112

Context

Affects node addition, replacement, and reboot scenarios.

Your Environment

  • Calico version: v3.30.2
  • Calico dataplane (iptables, windows etc.): BPF
  • Orchestrator version (e.g. kubernetes, mesos, rkt): v1.32.5+k3s1
  • Operating System and version: Ubuntu 24.04.2 LTS
  • Link to your project (optional):
