
Topology updater is failing to collect NUMA information #2145

Open
@dittops

Description


What happened:

I installed NFD version 0.17.3 using Helm. I want to get the NUMA node topology, so I enabled the topology updater during installation, but the NUMA details were not added to the node labels. Checking with lscpu shows that the machine has multiple NUMA nodes (snippet below).
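For reference, the install was roughly along these lines (a minimal sketch assuming the standard NFD Helm chart; the release name nfd and the default namespace are inferred from the pod names in the log below):

$ helm repo add nfd https://kubernetes-sigs.github.io/node-feature-discovery/charts
$ helm repo update
$ helm install nfd nfd/node-feature-discovery \
    --namespace default \
    --set topologyUpdater.enable=true

Here is the topology-updater log: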

sdp@fl4u42:~$ kubectl logs -f nfd-node-feature-discovery-topology-updater-g8wl7
I0430 12:06:36.208275       1 nfd-topology-updater.go:163] "Node Feature Discovery Topology Updater" version="v0.17.3" nodeName="fl4u42"
I0430 12:06:36.208337       1 component.go:34] [core]original dial target is: "/host-var/lib/kubelet-podresources/kubelet.sock"
I0430 12:06:36.208357       1 component.go:34] [core][Channel #1]Channel created
I0430 12:06:36.208371       1 component.go:34] [core][Channel #1]parsed dial target is: resolver.Target{URL:url.URL{Scheme:"passthrough", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"//host-var/lib/kubelet-podresources/kubelet.sock", RawPath:"", OmitHost:false, ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}}
I0430 12:06:36.208375       1 component.go:34] [core][Channel #1]Channel authority set to "%2Fhost-var%2Flib%2Fkubelet-podresources%2Fkubelet.sock"
I0430 12:06:36.208511       1 component.go:34] [core][Channel #1]Resolver state updated: {
  "Addresses": [
    {
      "Addr": "/host-var/lib/kubelet-podresources/kubelet.sock",
      "ServerName": "",
      "Attributes": null,
      "BalancerAttributes": null,
      "Metadata": null
    }
  ],
  "Endpoints": [
    {
      "Addresses": [
        {
          "Addr": "/host-var/lib/kubelet-podresources/kubelet.sock",
          "ServerName": "",
          "Attributes": null,
          "BalancerAttributes": null,
          "Metadata": null
        }
      ],
      "Attributes": null
    }
  ],
  "ServiceConfig": null,
  "Attributes": null
} (resolver returned new addresses)
I0430 12:06:36.208535       1 component.go:34] [core][Channel #1]Channel switches to new LB policy "pick_first"
I0430 12:06:36.208562       1 component.go:34] [core][Channel #1 SubChannel #2]Subchannel created
I0430 12:06:36.208569       1 component.go:34] [core][Channel #1]Channel Connectivity change to CONNECTING
I0430 12:06:36.208577       1 component.go:34] [core][Channel #1]Channel exiting idle mode
2025/04/30 12:06:36 Connected to '"/host-var/lib/kubelet-podresources/kubelet.sock"'!
I0430 12:06:36.208679       1 component.go:34] [core][Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
I0430 12:06:36.208720       1 component.go:34] [core][Channel #1 SubChannel #2]Subchannel picks a new address "/host-var/lib/kubelet-podresources/kubelet.sock" to connect
I0430 12:06:36.208987       1 component.go:34] [core][Channel #1 SubChannel #2]Subchannel Connectivity change to READY
I0430 12:06:36.209010       1 component.go:34] [core][Channel #1]Channel Connectivity change to READY
I0430 12:06:36.209018       1 nfd-topology-updater.go:375] "configuration file parsed" path="/etc/kubernetes/node-feature-discovery/nfd-topology-updater.conf" config={"ExcludeList":null}
I0430 12:06:36.209061       1 podresourcesscanner.go:53] "watching all namespaces"
WARNING: failed to read int from file: open /host-sys/devices/system/node/node0/cpu0/online: no such file or directory
I0430 12:06:36.209247       1 metrics.go:44] "metrics server starting" port=":8081"
I0430 12:06:36.267613       1 component.go:34] [core][Server #4]Server created
I0430 12:06:36.267645       1 nfd-topology-updater.go:145] "gRPC health server serving" port=8082
I0430 12:06:36.267690       1 component.go:34] [core][Server #4 ListenSocket #5]ListenSocket created
I0430 12:07:36.217041       1 podresourcesscanner.go:137] "podFingerprint calculated" status=<
        > processing node ""
        > processing 15 pods
        + aibrix-system/aibrix-kuberay-operator-55f5ddcbf4-vqrwb
        + default/nfd-node-feature-discovery-worker-w5cvn
        + aibrix-system/aibrix-redis-master-7bff9b56f5-hs5k4
        + envoy-gateway-system/envoy-gateway-5bfc954ffc-k4tf7
        + kube-system/metrics-server-5985cbc9d7-vh9pb
        + aibrix-system/aibrix-controller-manager-6489d5b587-hj2bt
        + aibrix-system/aibrix-gateway-plugins-58bdc89d9c-q67pp
        + envoy-gateway-system/envoy-aibrix-system-aibrix-eg-903790dc-54766c9758-l68wh
        + kube-system/helm-install-traefik-crd-kz6kg
        + default/nfd-node-feature-discovery-topology-updater-g8wl7
        + kube-system/svclb-envoy-aibrix-system-aibrix-eg-903790dc-1f213b6c-fdvw4
        + aibrix-system/aibrix-gpu-optimizer-75df97858d-5zb5s
        + kube-system/helm-install-traefik-j89k5
        + aibrix-system/aibrix-metadata-service-66f45c85bc-k8pzx
        + kube-system/local-path-provisioner-5cf85fd84d-hgf67
        = pfp0v0011be09f6ff65dbfe0
 >
I0430 12:07:36.217093       1 podresourcesscanner.go:148] "scanning pod" podName="aibrix-kuberay-operator-55f5ddcbf4-vqrwb"
I0430 12:07:36.217115       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="aibrix-kuberay-operator-55f5ddcbf4-vqrwb"
I0430 12:07:36.223315       1 podresourcesscanner.go:148] "scanning pod" podName="nfd-node-feature-discovery-worker-w5cvn"
I0430 12:07:36.223325       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="nfd-node-feature-discovery-worker-w5cvn"
I0430 12:07:36.225915       1 podresourcesscanner.go:148] "scanning pod" podName="aibrix-redis-master-7bff9b56f5-hs5k4"
I0430 12:07:36.225935       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="aibrix-redis-master-7bff9b56f5-hs5k4"
I0430 12:07:36.228169       1 podresourcesscanner.go:148] "scanning pod" podName="envoy-gateway-5bfc954ffc-k4tf7"
I0430 12:07:36.228195       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="envoy-gateway-5bfc954ffc-k4tf7"
I0430 12:07:36.231774       1 podresourcesscanner.go:148] "scanning pod" podName="metrics-server-5985cbc9d7-vh9pb"
I0430 12:07:36.231788       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="metrics-server-5985cbc9d7-vh9pb"
I0430 12:07:36.233367       1 podresourcesscanner.go:148] "scanning pod" podName="aibrix-controller-manager-6489d5b587-hj2bt"
I0430 12:07:36.233374       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="aibrix-controller-manager-6489d5b587-hj2bt"
I0430 12:07:36.234769       1 podresourcesscanner.go:148] "scanning pod" podName="aibrix-gateway-plugins-58bdc89d9c-q67pp"
I0430 12:07:36.234779       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="aibrix-gateway-plugins-58bdc89d9c-q67pp"
I0430 12:07:36.236354       1 podresourcesscanner.go:148] "scanning pod" podName="envoy-aibrix-system-aibrix-eg-903790dc-54766c9758-l68wh"
I0430 12:07:36.236361       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="envoy-aibrix-system-aibrix-eg-903790dc-54766c9758-l68wh"
I0430 12:07:36.238011       1 podresourcesscanner.go:148] "scanning pod" podName="helm-install-traefik-crd-kz6kg"
I0430 12:07:36.238017       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="helm-install-traefik-crd-kz6kg"
I0430 12:07:36.239514       1 podresourcesscanner.go:148] "scanning pod" podName="nfd-node-feature-discovery-topology-updater-g8wl7"
I0430 12:07:36.239521       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="nfd-node-feature-discovery-topology-updater-g8wl7"
I0430 12:07:36.241754       1 podresourcesscanner.go:148] "scanning pod" podName="svclb-envoy-aibrix-system-aibrix-eg-903790dc-1f213b6c-fdvw4"
I0430 12:07:36.241760       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="svclb-envoy-aibrix-system-aibrix-eg-903790dc-1f213b6c-fdvw4"
I0430 12:07:36.422134       1 podresourcesscanner.go:148] "scanning pod" podName="aibrix-gpu-optimizer-75df97858d-5zb5s"
I0430 12:07:36.422165       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="aibrix-gpu-optimizer-75df97858d-5zb5s"
I0430 12:07:36.621889       1 podresourcesscanner.go:148] "scanning pod" podName="helm-install-traefik-j89k5"
I0430 12:07:36.621923       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="helm-install-traefik-j89k5"
I0430 12:07:36.821266       1 podresourcesscanner.go:148] "scanning pod" podName="aibrix-metadata-service-66f45c85bc-k8pzx"
I0430 12:07:36.821294       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="aibrix-metadata-service-66f45c85bc-k8pzx"
I0430 12:07:37.022025       1 podresourcesscanner.go:148] "scanning pod" podName="local-path-provisioner-5cf85fd84d-hgf67"
I0430 12:07:37.022057       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="local-path-provisioner-5cf85fd84d-hgf67"
I0430 12:07:37.432143       1 metrics.go:51] "stopping metrics server" port=":8081"
I0430 12:07:37.432207       1 metrics.go:45] "metrics server stopped" exitCode="http: Server closed"
E0430 12:07:37.432223       1 main.go:66] "error while running" err="failed to create NodeResourceTopology: the server could not find the requested resource (post noderesourcetopologies.topology.node.k8s.io)"
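The failing call is the last line: the API server reports "could not find the requested resource" for a POST of noderesourcetopologies.topology.node.k8s.io, which normally means the NodeResourceTopology CRD is not registered in the cluster. A quick way to confirm (plain kubectl, nothing NFD-specific):

$ kubectl get crd noderesourcetopologies.topology.node.k8s.io

If that returns NotFound, the CRD has to be installed before the updater can create its objects. The CRD is maintained upstream in the k8stopologyawareschedwg/noderesourcetopology-api project, and the NFD deployment docs cover installing it alongside the topology updater; the exact manifest location varies by release, so treat this as a pointer to verify against the v0.17.3 docs. Once the CRD is present, per-node objects should be listable:

$ kubectl get noderesourcetopologies.topology.node.k8s.io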

lscpu snippet:

NUMA:                    
  NUMA node(s):          8
  NUMA node0 CPU(s):     0-13,112-125
  NUMA node1 CPU(s):     14-27,126-139
  NUMA node2 CPU(s):     28-41,140-153
  NUMA node3 CPU(s):     42-55,154-167
  NUMA node4 CPU(s):     56-69,168-181
  NUMA node5 CPU(s):     70-83,182-195
  NUMA node6 CPU(s):     84-97,196-209
  NUMA node7 CPU(s):     98-111,210-223
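
As a side note, the earlier WARNING about /host-sys/devices/system/node/node0/cpu0/online may be unrelated to the failure: on most kernels cpu0 exposes no online file because the boot CPU cannot be hot-plugged. This can be cross-checked on the host (generic sysfs reads, not NFD-specific):

$ ls /sys/devices/system/cpu/cpu0/ | grep online    # typically no output for cpu0
$ cat /sys/devices/system/cpu/cpu1/online           # prints 1 for other online CPUs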

Environment:

  • Kubernetes version (use kubectl version): v1.31.3+k3s1
  • Cloud provider or hardware configuration: On-prem hardware, Intel(R) Xeon(R) Platinum 8480+, 512 GB
  • OS (e.g: cat /etc/os-release): Ubuntu 23.04
  • Kernel (e.g. uname -a): 6.2.0-39-generic
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

Labels: kind/bug