NOTICE: Containers losing access to GPUs with error: "Failed to initialize NVML: Unknown Error"

## 1. **Executive summary**

Under specific conditions, containers may be abruptly detached from the GPUs they were initially connected to. We have determined the root cause of this issue and identified the environments in which it can occur. Workarounds for the affected environments are provided at the end of this document until a proper fix is released.

## 2. **Summary of the issue**

Containerized GPU workloads may suddenly lose access to their GPUs. This happens when `systemd` manages the cgroups of the container and is triggered to reload any unit files that reference NVIDIA GPUs (e.g. by something as simple as a `systemctl daemon-reload`).

When the container loses access to the GPU, you will see the following error message in the console output:

``Failed to initialize NVML: Unknown Error``

The container needs to be deleted once the issue occurs.

When it is restarted (manually or automatically depending on the use of a container orchestration platform), it will regain access to the GPU.

The issue originates from the fact that recent versions of `runc` require that symlinks be present under `/dev/char` to any device nodes being injected into a container. Unfortunately, these symlinks are not present for NVIDIA devices, and the NVIDIA GPU driver does not (currently) provide a means for them to be created automatically.
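
To see whether a node is missing the symlinks described above, something like the following sketch could be used. It maps each NVIDIA device node to its `/dev/char/<major>:<minor>` path and prints the nodes that have no entry there. It assumes GNU `stat`, and the helper names `dev_char_path` and `missing_dev_char_links` are hypothetical, not part of any NVIDIA tool.

```shell
# Sketch only: helper names are illustrative, not part of any NVIDIA tool.

# Build the /dev/char path for a device from its major and minor numbers
# given in hexadecimal (the format printed by GNU `stat -c '%t %T'`).
dev_char_path() {
    printf '/dev/char/%d:%d\n' "0x$1" "0x$2"
}

# Print every NVIDIA character device node that has no /dev/char entry.
missing_dev_char_links() {
    for node in /dev/nvidia*; do
        [ -c "$node" ] || continue   # skip directories and unmatched globs
        # Split "MAJOR MINOR" (hex) into $1 and $2.
        set -- $(stat -c '%t %T' "$node")
        [ -e "$(dev_char_path "$1" "$2")" ] || echo "$node"
    done
}
```

Calling `missing_dev_char_links` on a GPU node prints nothing when every device node already has its symlink.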

> *A fix will be present in the next patch release of all supported NVIDIA GPU drivers*

## 3. **Affected environments**

Affected environments are those using `runc` with `systemd` cgroup management enabled at the high-level container runtime.

If the system is NOT using `systemd` to manage `cgroups`, then it is NOT subject to this issue.

An exhaustive list of the affected environments is provided below:

-   Docker environment using `containerd` / `runc`:
	- Specific condition:
		- `cgroup` driver enabled with `systemd` (e.g. parameter `"exec-opts": ["native.cgroupdriver=systemd"]` set in `/etc/docker/daemon.json`).
		- A newer Docker version is used where `systemd cgroup` management is the default (e.g. on Ubuntu 22.04).

		**Note**:  To check if Docker uses `systemd cgroup` management, run the following command (the output below indicates that `systemd cgroup` driver is enabled) :
		```
		$ docker info  
		...  
		Cgroup Driver: systemd  
		Cgroup Version: 1
		```

-   K8s environment using `containerd` / `runc`:
	-  Specific condition:
		-  `SystemdCgroup = true` in the `containerd` configuration file (usually located here: `/etc/containerd/config.toml`) as shown below:
			```
			[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
			BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
			...
			SystemdCgroup = true
			```
		**Note**:  To check if containerd uses `systemd cgroup` management, issue the following command:  
		```
		$ sudo crictl info  
		...  
		"runtimes": {
		    "nvidia": {
		        "runtimeType": "io.containerd.runc.v2",
		        ...
		        "options": {
		          "BinaryName": "/usr/local/nvidia/toolkit/nvidia-container-runtime",
		          ...
		          "ShimCgroup": "",
		          "SystemdCgroup": true
		```

-   K8s environment (including OpenShift) using `cri-o` / `runc`:
	-  Specific condition:
		-  `cgroup_manager` enabled with `systemd` in the `cri-o` configuration file (usually located here: `/etc/crio/crio.conf` or `/etc/crio/crio.conf.d/00-default`) as shown below (sample with OpenShift):

			```
			[crio.runtime]
			...
			cgroup_manager = "systemd"

			hooks_dir = [
			"/etc/containers/oci/hooks.d",
			"/run/containers/oci/hooks.d",
			"/usr/share/containers/oci/hooks.d",
			]
			```

		**Note**: Podman environments use `crun` by default and are not subject to this issue unless `runc` is configured as the low-level container runtime to be used.
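
The configuration checks listed above can be scripted. The sketch below is a rough heuristic rather than an official detection tool: it scans a `containerd` or `cri-o` configuration file for the two settings quoted in this section and ignores commented-out lines. It does not detect defaults that are not written in the file. The function name `uses_systemd_cgroups` is hypothetical.

```shell
# Succeed if the given config file explicitly enables systemd cgroup
# management, via either `SystemdCgroup = true` (containerd) or
# `cgroup_manager = "systemd"` (cri-o). Lines starting with '#' never match.
uses_systemd_cgroups() {
    grep -Eq '^[[:space:]]*(SystemdCgroup[[:space:]]*=[[:space:]]*true|cgroup_manager[[:space:]]*=[[:space:]]*"systemd")' "$1"
}
```

For example, `uses_systemd_cgroups /etc/containerd/config.toml && echo "systemd cgroups enabled"`. For Docker, `docker info --format '{{.CgroupDriver}}'` prints the active driver directly.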

## 4. **How to check if you are affected**

You can use the following steps to confirm that your system is affected. After you implement one of the workarounds (mentioned in the next section), you can repeat the steps to confirm that the error is no longer reproducible.

### For Docker environments

Run a test container:
```
$ docker run -d --rm --runtime=nvidia --gpus all \
    --device=/dev/nvidia-uvm \
    --device=/dev/nvidia-uvm-tools \
    --device=/dev/nvidia-modeset \
    --device=/dev/nvidiactl \
    --device=/dev/nvidia0 \
    nvcr.io/nvidia/cuda:12.0.0-base-ubuntu20.04 bash -c "while [ true ]; do nvidia-smi -L; sleep 5; done"  

bc045274b44bdf6ec2e4cc10d2968d1d2a046c47cad0a1d2088dc0a430add24b
```
**Note**: Make sure to mount the different devices as shown above. They are needed to narrow the problem down to this specific issue.

If your system has more than one GPU, add a `--device` option for each additional GPU to the command above. Example for a system that has 2 GPUs:
```
$ docker run -d --rm --runtime=nvidia --gpus all \
    ...
    --device=/dev/nvidia0 \
    --device=/dev/nvidia1 \
    ...
```
Check the logs from the container:
```
$ docker logs bc045274b44bdf6ec2e4cc10d2968d1d2a046c47cad0a1d2088dc0a430add24b

GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
```
Then initiate a `daemon-reload`:
```
$ sudo systemctl daemon-reload
```
Check the logs from the container:
```
$ docker logs bc045274b44bdf6ec2e4cc10d2968d1d2a046c47cad0a1d2088dc0a430add24b

GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
Failed to initialize NVML: Unknown Error
Failed to initialize NVML: Unknown Error
```

### For K8s environments

Run a test pod:
```
$ cat nvidia-smi-loop.yaml

apiVersion: v1
kind: Pod
metadata:
  name: cuda-nvidia-smi-loop
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda
    image: "nvcr.io/nvidia/cuda:12.0.0-base-ubuntu20.04"
    command: ["/bin/sh", "-c"]
    args: ["while true; do nvidia-smi -L; sleep 5; done"]
    resources:
      limits:
        nvidia.com/gpu: 1


$ kubectl apply -f nvidia-smi-loop.yaml  
 
pod/cuda-nvidia-smi-loop created
```

Check the logs from the pod:
```
$ kubectl logs cuda-nvidia-smi-loop

GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-551720f0-caf0-22b7-f525-2a51a6ab478d)
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-551720f0-caf0-22b7-f525-2a51a6ab478d)
```

Then initiate a `daemon-reload`:
```
$ sudo systemctl daemon-reload
```

Check the logs from the pod:
```
$ kubectl logs cuda-nvidia-smi-loop

GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-551720f0-caf0-22b7-f525-2a51a6ab478d)
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-551720f0-caf0-22b7-f525-2a51a6ab478d)
Failed to initialize NVML: Unknown Error
Failed to initialize NVML: Unknown Error
```
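
Both checks above come down to whether the most recent log line is a GPU listing or the NVML error. A minimal sketch of that comparison, reading the logs on standard input, could look like this. The function name `gpu_state` is hypothetical.

```shell
# Classify container logs read from stdin by their last non-empty line:
#   "lost"    - the line is the NVML initialization error
#   "ok"      - the line looks like an `nvidia-smi -L` GPU listing
#   "unknown" - anything else, including empty logs
gpu_state() {
    last=$(grep -v '^[[:space:]]*$' | tail -n 1)
    case "$last" in
        "Failed to initialize NVML: Unknown Error"*) echo lost ;;
        "GPU "*": "*"(UUID: "*) echo ok ;;
        *) echo unknown ;;
    esac
}
```

It can be fed from either environment, for example `docker logs <container-id> 2>&1 | gpu_state` or `kubectl logs cuda-nvidia-smi-loop | gpu_state`, once before and once after the `daemon-reload`.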

## 5. **Workarounds**

The following workarounds are available for both standalone Docker environments and K8s environments. Where multiple options are presented, they are listed in order of preference, with the most recommended one at the top:

### For Docker environments

-  Using the `nvidia-ctk` utility:

	The NVIDIA Container Toolkit v1.12.0 includes a utility for creating symlinks in `/dev/char` for all possible NVIDIA device nodes required for using GPUs in containers. This can be run as follows: 
	```
	sudo nvidia-ctk system create-dev-char-symlinks \
	    --create-all
	```
	This command should be configured to run at boot on each node where GPUs will be used in containers. It requires that the NVIDIA driver kernel modules have been loaded at the point where it is run.
  
	A simple `udev` rule to enforce this can be seen below:
	```
	# This will create /dev/char symlinks to all device nodes
	ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/bin/nvidia-ctk system create-dev-char-symlinks --create-all"
	```
	A good place to install this rule would be:  
	`/lib/udev/rules.d/71-nvidia-dev-char.rules`

	In cases where the NVIDIA GPU Driver Container is used, the path to the driver installation must be specified. In this case the command should be modified to:
	```
	sudo nvidia-ctk system create-dev-char-symlinks \
            --create-all \
            --driver-root={{NVIDIA_DRIVER_ROOT}}
	```
	Where `{{NVIDIA_DRIVER_ROOT}}` is the path to which the NVIDIA GPU Driver container installs the NVIDIA GPU driver and creates the NVIDIA Device Nodes.

- Explicitly disabling systemd cgroup management in Docker
	- Set the parameter  `"exec-opts": ["native.cgroupdriver=cgroupfs"]` in the `/etc/docker/daemon.json` file and restart **docker**.  

- Downgrading to `docker.io` packages where `systemd` is not the default `cgroup` manager (and not overriding that of course).
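
For the first workaround above, installing the `udev` rule could be scripted along these lines. This is a sketch, not an official installer: the function name `install_dev_char_rule` is hypothetical, and the rule text and file name are copied from this section.

```shell
# Write the udev rule from this section into the given rules directory
# (defaults to /lib/udev/rules.d, the location suggested above).
install_dev_char_rule() {
    rules_dir=${1:-/lib/udev/rules.d}
    cat > "$rules_dir/71-nvidia-dev-char.rules" <<'EOF'
# This will create /dev/char symlinks to all device nodes
ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/bin/nvidia-ctk system create-dev-char-symlinks --create-all"
EOF
}
```

Writing to the default directory requires root. Afterwards, `sudo udevadm control --reload` asks `udev` to pick up the new rule file.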

### For K8s environments
- Deploying GPU Operator 22.9.2 will automatically fix the issue on all K8s nodes of the cluster (the fix is integrated inside the validator pod which will run when a new node is deployed or at every reboot of the node).

- For deployments using the standalone `k8s-device-plugin` (i.e. not through the use of the operator), the following steps are required:
	- When installing with the `k8s-device-plugin` Helm chart, pass the `--set compatWithCPUManager=true` parameter. This ensures that the `k8s-device-plugin` pod runs with the env `PASS_DEVICE_SPECS=true` set. Refer to the values [here](https://github.com/NVIDIA/k8s-device-plugin/blob/main/deployments/helm/nvidia-device-plugin/values.yaml#L31). Note that this runs `k8s-device-plugin` in `privileged` mode.

	- When installing from the static YAML spec, pass the env `PASS_DEVICE_SPECS=true` explicitly to the `k8s-device-plugin` DaemonSet. The pod also needs to run with a `privileged` SecurityContext. For an example, see [here](https://github.com/NVIDIA/k8s-device-plugin/blob/main/deployments/static/nvidia-device-plugin-compat-with-cpumanager.yml#L46).

	- Installing a `udev` rule as described in the previous section also works around this issue. Be sure to pass the correct `{{NVIDIA_DRIVER_ROOT}}` in cases where the driver container is also in use.

-   Explicitly disabling `systemd cgroup` management in `containerd` or `cri-o`:
	- Remove the parameter `cgroup_manager = "systemd"` from `cri-o` configuration file (usually located here: `/etc/crio/crio.conf` or `/etc/crio/crio.conf.d/00-default`) and restart `cri-o`.

- Downgrading to a version of the `containerd.io` package where `systemd` is not the default `cgroup` manager (and not overriding that, of course).

- Upgrading `runc` to at least version `1.1.7`. This version has a [fix](https://github.com/opencontainers/runc/releases/tag/v1.1.7) to avoid the issue discussed here. The `systemd` version should also be `>=240`.

- When the NVIDIA driver is installed directly on the host (i.e. without the driver container from the GPU Operator), make sure that the following conditions are met before the device plugin or any other containers run. This ensures that all required devices are injected into the containers with GPU requests.
	- The modules `nvidia`, `nvidia-uvm`, and `nvidia-modeset` are loaded using `modprobe nvidia; modprobe nvidia-uvm; modprobe nvidia-modeset`
	- All necessary control devices are created using `nvidia-modprobe -u -m -c0` and `nvidia-smi`.
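
The version and module requirements in the last two items can be checked before GPU containers are started. The following is a minimal preflight sketch under a few assumptions: GNU `sort -V` is available, and the helper names `version_ge` and `module_loaded` are hypothetical. The commented usage lines assume the usual first-line output of `runc --version` and `systemctl --version`.

```shell
# Succeed if version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Succeed if kernel module $1 appears in the module list file $2
# (default: /proc/modules). The kernel lists module names with
# underscores, so "nvidia-uvm" is looked up as "nvidia_uvm".
module_loaded() {
    name=$(printf '%s' "$1" | tr '-' '_')
    grep -q "^${name} " "${2:-/proc/modules}"
}

# Example usage (assumes "runc version X.Y.Z" and "systemd NNN (...)"
# as the first line of each command's output):
#   version_ge "$(runc --version | awk 'NR==1 {print $3}')" 1.1.7
#   version_ge "$(systemctl --version | awk 'NR==1 {print $2}')" 240
#   for m in nvidia nvidia-uvm nvidia-modeset; do
#       module_loaded "$m" || echo "module $m is not loaded"
#   done
```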

