This repository was archived by the owner on Mar 14, 2023. It is now read-only.

Node Termination handler may still be necessary #43

Open
@chrisroat

The current README states this handler is deprecated in favor of the new Graceful Node Shutdown:

⚠️ Deprecation Notice
As of Kubernetes 1.20, Graceful Node Shutdown replaces the need for GCP Node termination handler. GKE on versions 1.20+ enables Graceful Node Shutdown by default. Refer to the GKE documentation and Kubernetes documentation for more info about Graceful Node Shutdown (docs, blog post).

I have been using the Node Termination handler on GKE versions below 1.20, on preemptible nodes with GPUs. The handler was needed to avoid a race condition on node restart that sometimes left pods unable to recognize the GPU.

I have moved to GKE 1.21.1-gke.2200 and see the same error I used to get on versions below 1.20 without the Node Termination handler. The error happens only occasionally, so it looks like it could be the same race condition.

ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

I filed the following GKE issue.
https://issuetracker.google.com/issues/192809336

For the moment, I would ask that this repo not be deprecated.
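
For anyone trying to confirm they are hitting the same condition, here is a minimal sketch of a startup check that retries loading the driver library before any GPU code is imported. It assumes the ImportError above means libcuda.so.1 is not yet visible to the dynamic loader when the container starts; that is a guess, not a confirmed root cause. The function name wait_for_libcuda and its parameters are hypothetical and not part of the handler or of GKE.

```python
import ctypes
import time


def wait_for_libcuda(timeout_s=60.0, interval_s=2.0, loader=ctypes.CDLL,
                     sleep=time.sleep, clock=time.monotonic):
    """Return True once libcuda.so.1 can be loaded, False on timeout.

    Hypothetical helper: `loader`, `sleep`, and `clock` are injectable
    only so the retry logic can be exercised without a GPU.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            # ctypes.CDLL raises OSError when the shared object
            # cannot be found or opened.
            loader("libcuda.so.1")
            return True
        except OSError:
            # Assumed to be the same underlying failure as the
            # ImportError reported in this issue.
            if clock() >= deadline:
                return False
            sleep(interval_s)
```

Calling wait_for_libcuda() at the top of the container entrypoint and logging the result would at least show whether the library appears after a delay or never appears at all, which may help tell a startup race apart from a missing driver.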
