Description
Background:
We are using gpu-operator on GKE (COS), which already comes with the toolkit and device plugin installed.
I have gone through this page: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/google-gke.html
When we let the GPU operator install and manage the toolkit, we see instability in containerd (frequent restarts) and image pull errors. That is a separate issue of its own, possibly because COS is restrictive and does not tolerate its configuration being changed.
Hence we are setting toolkit.enabled and devicePlugin.enabled to false, to use whatever GKE already provides us.
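For reference, a minimal sketch of the relevant overrides as they would appear in a Helm values file for the gpu-operator chart. Only the two keys named above are shown. If the operator is wrapped as a subchart, these keys would sit under that subchart's name, which is an assumption about the layout and not shown here.

```yaml
# Let GKE's preinstalled components handle these instead of the operator.
toolkit:
  enabled: false      # do not install or manage the NVIDIA container toolkit
devicePlugin:
  enabled: false      # do not deploy the operator's device plugin
```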
The only problem is that toolkit-validation in operator-validator errors out.
Root cause:
toolkit-validation runs with securityContext: privileged and asks for GPUs using NVIDIA_VISIBLE_DEVICES: all.
Normally the toolkit and device plugin honor this and inject nvidia-smi and the GPUs.
But GKE's toolkit does not inject nvidia-smi until the container explicitly requests nvidia.com/gpu > 0.
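To illustrate the difference, here is a rough sketch of a pod that does get nvidia-smi and the GPU on GKE because it requests nvidia.com/gpu explicitly. The pod name, image tag, and the /usr/local/nvidia path are assumptions for illustration and may need adjusting for a given cluster.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test            # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:11.0.3-base-ubuntu20.04   # assumed image; any CUDA base image should do
      command: ["bash", "-c"]
      # Assumes GKE mounts the driver binaries and libraries under /usr/local/nvidia.
      args:
        - export PATH=/usr/local/nvidia/bin:$PATH && nvidia-smi
      resources:
        limits:
          nvidia.com/gpu: 1       # explicit GPU request, which toolkit-validation does not make
```

The toolkit-validation container relies on NVIDIA_VISIBLE_DEVICES alone and carries no such resource request, which would explain why it never sees nvidia-smi here.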
We would like an option to skip toolkit validation, or any individual validation component.
Full set of values here: https://github.com/truefoundry/infra-charts/blob/2c3b765c57acb36fd6cf0a4b7c4f8da47d0ed15b/charts/tfy-gpu-operator/values.yaml#L516-L888