[Feature Request] Option to skip specific validators #1460

Open
@chiragjn

Background:

We are running gpu-operator on GKE (COS) nodes, which already come with a toolkit and device plugin installed.
I have gone through this page: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/google-gke.html

When we let the GPU operator install and manage the toolkit, we see instability in containerd (frequent restarts) and image pull errors. That is a separate issue of its own, possibly caused by COS being restrictive and not tolerating changes to its configuration.

Hence we set toolkit.enabled and devicePlugin.enabled to false, so that we use whatever GKE already provides.
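
For reference, a minimal sketch of the relevant part of our Helm values. Only the two keys named above are shown; everything else in the chart is left at whatever the full values file linked below sets.

```yaml
# Excerpt of gpu-operator Helm values (sketch, not the full file).
# Both components are left to GKE instead of being managed by the operator.
toolkit:
  enabled: false
devicePlugin:
  enabled: false
```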

The only problem is that toolkit-validation in operator-validator errors out.

Root cause:

toolkit-validation runs with securityContext: privileged and asks for GPUs using NVIDIA_VISIBLE_DEVICES: all.
Normally the toolkit and device plugin honor this and inject nvidia-smi and the GPUs.

But GKE's toolkit does not inject nvidia-smi unless the container explicitly requests nvidia.com/gpu > 0.
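
To illustrate the difference, here is a rough sketch of a pod that requests a GPU through resource limits, which is the form GKE appears to require before it injects nvidia-smi. The pod name and image tag are assumptions for illustration and are not taken from the operator; the validator instead relies on the privileged securityContext plus NVIDIA_VISIBLE_DEVICES: all, with no nvidia.com/gpu request.

```yaml
# Sketch only: pod name and image are hypothetical placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test            # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed image; any CUDA base image should do
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1       # explicit GPU request, i.e. nvidia.com/gpu > 0
```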

We would like an option to skip toolkit validation, or any other individual validation component.

Full set of values here: https://github.com/truefoundry/infra-charts/blob/2c3b765c57acb36fd6cf0a4b7c4f8da47d0ed15b/charts/tfy-gpu-operator/values.yaml#L516-L888
