Description
Problem
I am developing a Kubernetes Operator that deploys workloads that use vGPUs. A common error our users hit is that they have misconfigured licensing for the GPU Operator or have run out of seats, but they don't know that and can't immediately diagnose it.
Today, I have to take some awkward operational steps to discover this information, such as running nvidia-smi and parsing its output. This can be particularly difficult in a "getting started" setup where an invalid licensing config does not immediately prevent usage – the vGPU works for a little while, and then slows to a crawl.
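For context, the workaround today looks roughly like the sketch below: get a shell where nvidia-smi is available (e.g. the driver container, via kubectl exec) and scrape the output of `nvidia-smi -q`. This is only an illustration; the "License Status" line and its exact wording are assumptions on my part and vary by vGPU driver version.

```go
// Sketch of today's workaround: run nvidia-smi and scrape the license status
// out of its human-readable output. Assumes it runs somewhere nvidia-smi is
// available (e.g. inside the driver container via `kubectl exec`).
package main

import (
	"fmt"
	"os/exec"
	"regexp"
)

func main() {
	out, err := exec.Command("nvidia-smi", "-q").Output()
	if err != nil {
		panic(err)
	}

	// Assumed output line, e.g.
	// "License Status : Licensed (Expiry: 2025-6-26 21:46:51 GMT)"
	re := regexp.MustCompile(`License Status\s*:\s*(.+)`)
	for _, m := range re.FindAllStringSubmatch(string(out), -1) {
		fmt.Println("license status:", m[1])
	}
}
```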
Proposed solution
I'd like to enhance GPU Operator so that it will surface up-to-date licensing information somewhere in the Kubernetes API.
I can imagine a couple of different places where that could happen:
- An annotation on the Kubernetes Node resources, for example a plain string value:

  ```
  nvidia.com/gpu.0.license-status: "Licensed (Expiry: 2025-6-26 21:46:51 GMT)"
  ```

  or a JSON value:

  ```
  nvidia.com/gpu-license-statuses: '[{ "id": "00000000:02:01.0", "licensed": true, "expiry": "2025-6-26 21:46:51 GMT" }]'
  ```
- Alternatively, maybe this could be an element in `status.conditions` on the `ClusterPolicy` or `Driver` custom resource? For example:

  ```yaml
  status:
    conditions:
      - type: Licensed
        status: "True"
        reason: LicenseOK
        message: All GPUs are licensed (Expiry: 2025-6-26 21:46:51 GMT)
  ```
These are only sketches of the API design; the real field names and shapes could be different.
Regardless, a Kubernetes user could then easily discover the licensing status of their vGPUs, and third-party controllers could check it before attempting to launch a vGPU-requesting workload.
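To illustrate that consumer side, here is a rough sketch of how a third-party controller could gate scheduling on the proposed annotation. The `nvidia.com/gpu-license-statuses` key and its JSON shape come straight from the sketch above and are not an existing API, and the node name is a placeholder.

```go
// Hypothetical consumer of the proposed API: a third-party controller that
// checks the (not-yet-existing) nvidia.com/gpu-license-statuses annotation
// on a Node before launching a vGPU-requesting workload there.
package main

import (
	"context"
	"encoding/json"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// gpuLicenseStatus mirrors the JSON shape sketched in the proposal above.
type gpuLicenseStatus struct {
	ID       string `json:"id"`
	Licensed bool   `json:"licensed"`
	Expiry   string `json:"expiry"`
}

// nodeFullyLicensed reports whether every GPU listed in the node's
// license-status annotation is currently licensed.
func nodeFullyLicensed(ctx context.Context, cs kubernetes.Interface, nodeName string) (bool, error) {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	raw, ok := node.Annotations["nvidia.com/gpu-license-statuses"]
	if !ok {
		return false, nil // no licensing info published for this node
	}
	var statuses []gpuLicenseStatus
	if err := json.Unmarshal([]byte(raw), &statuses); err != nil {
		return false, err
	}
	for _, s := range statuses {
		if !s.Licensed {
			return false, nil
		}
	}
	return len(statuses) > 0, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	ok, err := nodeFullyLicensed(context.Background(), cs, "gpu-node-1") // placeholder node name
	if err != nil {
		panic(err)
	}
	fmt.Println("all vGPUs licensed:", ok)
}
```

If the `status.conditions` form were chosen instead, the equivalent check would be a standard condition lookup on the `ClusterPolicy` status, which existing tooling such as `kubectl wait --for=condition=...` already understands.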