This repository was archived by the owner on Mar 14, 2023. It is now read-only.
The current README states this handler is deprecated in favor of the new Graceful Node Shutdown:
⚠️ Deprecation Notice
As of Kubernetes 1.20, Graceful Node Shutdown replaces the need for GCP Node termination handler. GKE on versions 1.20+ enables Graceful Node Shutdown by default. Refer to the GKE documentation and Kubernetes documentation for more info about Graceful Node Shutdown (docs, blog post).
I have been using the Node Termination handler with GKE < 1.20, using pre-emptibles with GPUs. The handler was needed to avoid a race condition on node restart that sometimes caused pods not to correctly recognize the GPU.
I have moved to GKE 1.21.1-gke.2200 and found the same error I would get on versions < 1.20 without the Node Termination handler. The error happens only occasionally, so it seems like potentially the same race condition.
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
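For anyone trying to tell whether a pod has hit this state, a minimal sketch of a startup probe is below. It assumes the error comes from the dynamic loader failing to open `libcuda.so.1` inside the container, and it only detects the condition; it does not fix the underlying race. The function names, retry count, and delay are hypothetical and not part of this repository.

```python
import ctypes
import time


def cuda_driver_available(lib_name="libcuda.so.1"):
    """Return True if the dynamic loader can open the CUDA driver library."""
    try:
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        # Raised when the shared object cannot be found or loaded.
        return False


def wait_for_cuda_driver(retries=5, delay_s=2.0, probe=cuda_driver_available):
    """Poll for the driver library; return True as soon as one probe succeeds.

    The retry count and delay are illustrative assumptions, not tuned values.
    """
    for attempt in range(retries):
        if probe():
            return True
        if attempt < retries - 1:
            time.sleep(delay_s)
    return False
```

A workload could call `wait_for_cuda_driver()` before importing its GPU framework and exit non-zero when it returns False, so that the pod is restarted instead of running without a GPU.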
Hi @chrisroat! The GKE issue was closed recently. Are you still facing any problems with node shutdowns so you still need the node termination handler?
I no longer maintain the (closed-source) project that was hitting the issue. We had forked this repo to add the ability to handle spot instances.
@erichamc would be able to test, though I don't think it would be a high priority to check. For reference, the symptom was that the cluster's GPU workloads would not restart properly after node preemptions. Over time, a cluster that had enough preemptions to trigger the issue might show failing workloads unable to find the NVIDIA libraries. [@erichamc: dropping the termination handler would amount to dropping the null_resource stanzas in infrastructure/apps/k8s/kubectl.tf]
I filed the following GKE issue.
https://issuetracker.google.com/issues/192809336
For the moment, I would ask that this repo not be deprecated.