Description
Currently the gpu-operator relies on the Node Feature Discovery (NFD) operator to apply the `feature.node.kubernetes.io/pci-10de.present=true` node label to any node that contains NVIDIA hardware.
If the gpu-operator daemonsets are already deployed and the cluster admin modifies the `sources.pci.deviceLabelFields` property in the `NodeFeatureDiscovery` CR to add the device PCI ID, this causes an immediate termination of the gpu-operator stack:
```yaml
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: openshift-nfd
  namespace: openshift-nfd
spec:
  workerConfig:
    configData: |
      core:
        sleepInterval: 60s
      sources:
        pci:
          deviceClassWhitelist:
            - "0200"
            - "03"
            - "12"
          deviceLabelFields:
            - "vendor"
            - "device" # <---- This blocks OR terminates the NVIDIA gpu-operator stack
```
Since the `NodeFeatureDiscovery` CR is a shared, cluster-wide resource, it would be best practice to move ownership of the gpu-operator driver rollout onto a custom node label applied by a `NodeFeatureRule`.
You already apply the node label `nvidia.com/gpu.present=true` after `feature.node.kubernetes.io/pci-10de.present=true` is detected. Could `nvidia.com/gpu.present=true` be elevated to a `NodeFeatureRule` that the gpu-operator applies automatically after installation?
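As a rough sketch of what such a rule could look like (the rule and metadata names are placeholders, and matching on PCI vendor `10de` is an assumption about how the rule would be keyed):

```yaml
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: nvidia-gpu-present   # placeholder name
spec:
  rules:
    - name: "NVIDIA GPU present"
      # Custom label the gpu-operator stack would select on, decoupled
      # from the cluster-wide deviceLabelFields setting in the NFD CR.
      labels:
        nvidia.com/gpu.present: "true"
      matchFeatures:
        - feature: pci.device
          matchExpressions:
            vendor: {op: In, value: ["10de"]}
```

Because the rule matches on the raw `pci.device` vendor feature rather than on a generated label, a later change to `deviceLabelFields` would no longer be able to pull the label out from under the running daemonsets.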