gpu nodegroup may cant trigger scale-up from 0 #8123

suqinglee · 2025-05-13T06:58:44Z

The relevant code is in cluster-autoscaler-1.26.6.

Assume the NVIDIA device plugin (nvdp) cannot start, for example because its image is not found. When a GPU node then joins the node group, `p.nodeInfoCache` caches a node without the `nvidia.com/gpu` resource. If the node group is scaled down to 0 at this point, the cache entry still exists in cluster-autoscaler.
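To make the sequence concrete, here is a minimal sketch of how a cached node without `nvidia.com/gpu` could shadow the node group's own template. It is not the cluster-autoscaler code: `templateCache`, `observe`, `template`, `fits`, `simulate`, the group name `gpu-group`, and the resource quantities are all hypothetical names and values chosen for illustration, and the lookup order (cached entry first, provider template as fallback) is an assumption about the behaviour described in this issue.

```go
package main

import "fmt"

// Resources maps a resource name (for example "nvidia.com/gpu") to a quantity.
type Resources map[string]int64

// templateCache is a hypothetical stand-in for a per-node-group node info
// cache. It is NOT the real cluster-autoscaler type.
type templateCache struct {
	byGroup map[string]Resources
}

func newTemplateCache() *templateCache {
	return &templateCache{byGroup: make(map[string]Resources)}
}

// observe stores a copy of a live node's allocatable resources for the group,
// overwriting any previous entry.
func (c *templateCache) observe(group string, allocatable Resources) {
	cp := make(Resources, len(allocatable))
	for name, qty := range allocatable {
		cp[name] = qty
	}
	c.byGroup[group] = cp
}

// template returns the cached entry for the group if one exists, and falls
// back to the provider-supplied template otherwise (assumed lookup order).
func (c *templateCache) template(group string, provider Resources) Resources {
	if cached, ok := c.byGroup[group]; ok {
		return cached
	}
	return provider
}

// fits reports whether every requested resource is available in the template.
// A resource missing from the template counts as zero.
func fits(requests, template Resources) bool {
	for name, want := range requests {
		if template[name] < want {
			return false
		}
	}
	return true
}

// simulate replays the sequence from the issue and reports whether the GPU pod
// fits the stale cached template and whether it fits the provider template.
func simulate() (staleFits, providerFits bool) {
	const group = "gpu-group" // hypothetical node group name
	provider := Resources{"cpu": 8, "nvidia.com/gpu": 1}
	pod := Resources{"cpu": 1, "nvidia.com/gpu": 1}

	cache := newTemplateCache()
	// 1. A node joins while the device plugin is down: no GPU is advertised.
	cache.observe(group, Resources{"cpu": 8})
	// 2. The group scales down to 0. Nothing removes the cached entry.
	// 3. A GPU pod arrives and the next scale-up consults the cache.
	staleFits = fits(pod, cache.template(group, provider))

	// For comparison: the same check with an empty cache uses the provider template.
	providerFits = fits(pod, newTemplateCache().template(group, provider))
	return staleFits, providerFits
}

func main() {
	stale, provider := simulate()
	fmt.Println("pod fits cached template:  ", stale)
	fmt.Println("pod fits provider template:", provider)
}
```

In this sketch the pod is rejected only because the cached entry lacks `nvidia.com/gpu`; with no cached entry the same pod fits the provider template, which is the behaviour expected after the device plugin has recovered.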

When the next scale-up is triggered, even though nvdp is now healthy, this stale cache entry prevents the scale-up. Describing the pending pod shows:


suqinglee · 2025-05-13T07:10:46Z

This may be the same as #5278.

adrianmoisey · 2025-05-13T09:59:38Z

/area cluster-autoscaler

chansuke · 2025-05-13T16:34:31Z

/assign

suqinglee changed the title ~~gpu ndoegroup may cant trigger scale-up from 0~~ gpu nodegroup may cant trigger scale-up from 0 May 13, 2025

k8s-ci-robot added the area/cluster-autoscaler label May 13, 2025

k8s-ci-robot assigned chansuke May 13, 2025

chansuke linked a pull request May 18, 2025 that will close this issue

Implement cache validation #8140

Open
