Description
Describe the bug
We are trying to install the operator on OKD, but we get this error:
{"level":"error","ts":"2025-05-19T13:49:04Z","msg":"Reconciler error","controller":"clusterpolicy-controller","object":{"name":"gpu-cluster-policy"},"namespace":"","name":"gpu-cluster-policy","reconcileID":"a535d3ea-ebc7-4c22-9e62-7c372c6814c0","error":"failed to handle OpenShift Driver Toolkit Daemonset for version 39.20240210.3.0: ERROR: failed to get destination directory for custom repo config: distribution not supported"}
We have an air-gapped environment, so we are trying to use the repoConfig option:
repoConfig:
  configMapName: repo-config
But we noticed that "fedora" is missing from the map in gpu-operator/internal/state/driver_volumes.go, lines 33 to 39 at commit 349cf4f.
Details:
[gpu-operator@gpu-operator-6ffdc677f6-92828 /]$ cat /host-etc/os-release
NAME="Fedora Linux"
VERSION="39.20240210.3.0 (CoreOS)"
ID=fedora
VERSION_ID=39
VERSION_CODENAME=""
PLATFORM_ID="platform:f39"
PRETTY_NAME="Fedora CoreOS 39.20240210.3.0"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:39"
HOME_URL="https://getfedora.org/coreos/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora-coreos/"
SUPPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
BUG_REPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=39
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=39
SUPPORT_END=2024-11-12
VARIANT="CoreOS"
VARIANT_ID=coreos
OSTREE_VERSION='39.20240210.3.0'
The OS ID here is fedora. Is this a bug, or is it intentional? Is OKD on Fedora CoreOS supported?
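For illustration, the kind of lookup the error message points at might look like the sketch below. The variable and function names, map keys, and paths are our assumptions for illustration only, not the actual code in driver_volumes.go; the point is simply that a missing "fedora" key would produce exactly the error we see.

package main

import (
	"fmt"
	"os"
)

// Hypothetical distro -> destination-directory lookup for the custom repo
// config, sketched from the error message. Names, keys, and paths here are
// assumptions, not the actual identifiers in driver_volumes.go.
var repoConfigDirs = map[string]string{
	"ubuntu": "/etc/apt/sources.list.d",
	"rhel":   "/etc/yum.repos.d",
	"rhcos":  "/etc/yum.repos.d",
	// No "fedora" key: an OKD node reporting ID=fedora in /etc/os-release
	// falls through to the "distribution not supported" error below.
	// Presumably it would also map to "/etc/yum.repos.d".
}

func getRepoConfigDir(osID string) (string, error) {
	dir, ok := repoConfigDirs[osID]
	if !ok {
		return "", fmt.Errorf("failed to get destination directory for custom repo config: distribution not supported")
	}
	return dir, nil
}

func main() {
	// "fedora" is what our OKD nodes report, and it reproduces the error.
	if _, err := getRepoConfigDir("fedora"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}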
To Reproduce
Install the operator in an air-gapped environment with a custom repo config.
Expected behavior
Successful install on air-gapped OKD.
Environment (please provide the following information):
- GPU Operator Version: 25.3.0
- OS: Fedora CoreOS 39.20240210.3.0
- Kernel Version: 6.7.4-200.fc39.x86_64
- Container Runtime Version: v1.28.7+6e2789b (crio)
- Kubernetes Distro and Version: OKD, cluster version 4.15.0-0.okd-2024-03-10-010116
Information to attach (optional if deemed irrelevant)
- kubernetes pods status:
kubectl get pods -n OPERATOR_NAMESPACE
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-4xc4h 0/1 Init:0/1 0 4m4s
gpu-operator-6ffdc677f6-92828 1/1 Running 0 53m
nvidia-container-toolkit-daemonset-mn5mj 0/1 Init:0/1 0 4m5s
nvidia-dcgm-exporter-86njk 0/1 Init:0/1 0 4m5s
nvidia-dcgm-qbjfc 0/1 Init:0/1 0 4m5s
nvidia-device-plugin-daemonset-cv24p 0/1 Init:0/1 0 4m5s
nvidia-driver-daemonset-39.20240210.3.0-88f4n 1/2 CrashLoopBackOff 24 (4m11s ago) 108m
nvidia-driver-daemonset-392024021030-88f4n-debug-mrq2c 2/2 Running 0 4m37s
nvidia-node-status-exporter-8kk4n 1/1 Running 0 108m
nvidia-operator-validator-mvr4n 0/1 Init:0/4 0 4m5s
- kubernetes daemonset status:
kubectl get ds -n OPERATOR_NAMESPACE
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-feature-discovery 1 1 0 1 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 3d3h
nvidia-container-toolkit-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.container-toolkit=true 3d3h
nvidia-dcgm 1 1 0 1 0 nvidia.com/gpu.deploy.dcgm=true 3d3h
nvidia-dcgm-exporter 1 1 0 1 0 nvidia.com/gpu.deploy.dcgm-exporter=true 3d3h
nvidia-device-plugin-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.device-plugin=true 3d3h
nvidia-device-plugin-mps-control-daemon 0 0 0 0 0 nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true 3d3h
nvidia-driver-daemonset-39.20240210.3.0 1 1 0 1 0 feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=39.20240210.3.0,nvidia.com/gpu.deploy.driver=true 3d3h
nvidia-mig-manager 0 0 0 0 0 nvidia.com/gpu.deploy.mig-manager=true 3d3h
nvidia-node-status-exporter 1 1 1 1 1 nvidia.com/gpu.deploy.node-status-exporter=true 3d3h
nvidia-operator-validator 1 1 0 1 0 nvidia.com/gpu.deploy.operator-validator=true 3d3h
Thanks a lot.