
gpu-operator daemonsets are terminated when the NFD device labels include PCI device IDs #1469

@LaVLaS

Description


Currently the gpu-operator relies on the Node Feature Discovery operator to apply the feature.node.kubernetes.io/pci-10de.present=true node label to any node that contains NVIDIA hardware.
If the gpu-operator daemonsets are already deployed and the cluster admin modifies the sources.pci.deviceLabelFields property in the NodeFeatureDiscovery CR to add the PCI device ID, the gpu-operator stack is immediately terminated.

apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: openshift-nfd
  namespace: openshift-nfd
spec:
  workerConfig:
    configData: |
      core:
        sleepInterval: 60s
      sources:
        pci:
          deviceClassWhitelist:
            - "0200"
            - "03"
            - "12"
          deviceLabelFields:
            - "vendor"
            - "device"         # <----This blocks OR terminates the NVIDIA gpu-operators stack

Since the NodeFeatureDiscovery CR is a shared, cluster-wide resource, it would be a best practice to move ownership of the gpu-operator driver rollout to a custom node label applied by a NodeFeatureRule.

You already apply the node label nvidia.com/gpu.present=true after feature.node.kubernetes.io/pci-10de.present=true is detected. Could nvidia.com/gpu.present=true be promoted to a NodeFeatureRule that the gpu-operator applies automatically after installation? A sketch of such a rule follows.
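
A minimal sketch of what such a rule could look like, assuming the cluster's NFD version exposes the NodeFeatureRule API (nfd.k8s-sigs.io/v1alpha1); the rule name and the matched device classes are illustrative:

```yaml
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: nvidia-gpu-present
spec:
  rules:
    # Label any node exposing an NVIDIA (vendor 10de) display/3D controller.
    - name: "nvidia.com/gpu.present"
      labels:
        "nvidia.com/gpu.present": "true"
      matchFeatures:
        - feature: pci.device
          matchExpressions:
            vendor: {op: In, value: ["10de"]}
            class: {op: In, value: ["0300", "0302"]}
```

Because the rule matches on the raw pci.device feature rather than on the generated node label, it would keep working no matter how the cluster admin configures deviceLabelFields.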
