
gpu-operator daemonsets are terminated when the NFD device labels include PCI device IDs #1469

@LaVLaS

Description


Currently the gpu-operator relies on the Node Feature Discovery operator to apply the feature.node.kubernetes.io/pci-10de.present=true node label to any node that contains NVIDIA hardware.
If the gpu-operator daemonsets are already deployed and the cluster admin modifies the sources.pci.deviceLabelFields property in the NodeFeatureDiscovery CR to add the PCI device ID, the gpu-operator stack is immediately terminated.

apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: openshift-nfd
  namespace: openshift-nfd
spec:
  workerConfig:
    configData: |
      core:
        sleepInterval: 60s
      sources:
        pci:
          deviceClassWhitelist:
            - "0200"
            - "03"
            - "12"
          deviceLabelFields:
            - "vendor"
            - "device"         # <----This blocks OR terminates the NVIDIA gpu-operators stack

Since the NodeFeatureDiscovery CR is a shared, cluster-wide resource, it would be a best practice to move ownership of the gpu-operator driver rollout to a custom node label applied by a NodeFeatureRule.

You already apply the node label nvidia.com/gpu.present=true after feature.node.kubernetes.io/pci-10de.present=true is detected. Could nvidia.com/gpu.present=true be promoted to a NodeFeatureRule that the gpu-operator applies automatically after installation? A sketch of such a rule follows.
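
A minimal sketch of what such a rule could look like, assuming the cluster's NFD version exposes the NodeFeatureRule API (nfd.k8s-sigs.io/v1alpha1); the rule name and the matched device classes are illustrative:

```yaml
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: nvidia-gpu-present
spec:
  rules:
    # Label any node exposing an NVIDIA (vendor 10de) display/3D controller.
    - name: "nvidia.com/gpu.present"
      labels:
        "nvidia.com/gpu.present": "true"
      matchFeatures:
        - feature: pci.device
          matchExpressions:
            vendor: {op: In, value: ["10de"]}
            class: {op: In, value: ["0300", "0302"]}
```

Because the rule matches on the raw pci.device feature rather than on the generated node label, it would keep working no matter how the cluster admin configures deviceLabelFields.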
