How to use the "--enable-external-masters" option in the Docker checkpoint feature (integrated with CRIU)? #2472
Comments
@nwpuhkp This is a known problem. Docker, containerd and CRI-O do not currently support checkpoint/restore with NVIDIA GPUs using the CUDA plugin for CRIU.
Okay, it's because I haven't fully understood the scope of CRIU's functionality. Thank you for your reply. I will look for other solutions.
Can you please consider this as a feature request? This feature would be very useful to maximize GPU utilization I think.
@seungduk-yanolja You have to talk to the corresponding projects. This is not something CRIU can solve. It works on the CRIU level.
Oh, I see. Thanks for pointing it out. Sorry about that.
Description
When I tried to create a checkpoint for a Docker container that is using a GPU, I encountered the error:
"Error response from daemon: Cannot checkpoint container xxxxx: nvidia-container-runtime did not terminate successfully: exit status 1: criu failed: type NOTIFY errno 0 path= /run/containerd/io.containerd.runtime.v2.task/moby/a0a79717b8f17ca7c40357c7c16e38e93b2ae7d634c923f2e3d94286a6132abc/criu-dump.log: unknown".
The log file suggests trying the `--enable-external-masters` option, but `docker checkpoint create` does not accept that option. What should I do so that Docker can create the checkpoint normally?
Steps to reproduce the issue:
The command I executed is: `docker checkpoint create xxxxxx checkpoint1 --leave-running=True`
Describe the results you received:
docker checkpoint create autoware-hkp checkpoint1 --leave-running=True --enable-external-masters
unknown flag: --enable-external-masters
See 'docker checkpoint create --help'.
Error (criu/mount.c:1088): mnt: Mount 2450 ./proc/driver/nvidia/gpus/0000:01:00.0 (master_id: 15 shared_id: 0) has unreachable sharing. Try --enable-external-masters.
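For anyone triaging a similar dump log, the mount named in the error can be pulled out programmatically. The following is only a minimal sketch in Python: the regular expression is modelled on the single error line quoted above, so other CRIU versions may word the message differently, and the names `MOUNT_ERROR` and `find_unreachable_mounts` are hypothetical helpers, not part of CRIU or Docker. It does not make the checkpoint succeed; it only lists which mounts CRIU reported.

```python
import re

# Pattern modelled on the one error line quoted in this issue (assumption:
# other CRIU versions may format the message differently).
MOUNT_ERROR = re.compile(
    r"mnt: Mount (?P<mnt_id>\d+) (?P<path>\S+) "
    r"\(master_id: (?P<master_id>\d+) shared_id: (?P<shared_id>\d+)\) "
    r"has unreachable sharing"
)


def find_unreachable_mounts(log_text):
    """Return one dict per 'unreachable sharing' mount error found in log_text."""
    results = []
    for match in MOUNT_ERROR.finditer(log_text):
        results.append({
            "mnt_id": int(match.group("mnt_id")),
            "path": match.group("path"),
            "master_id": int(match.group("master_id")),
            "shared_id": int(match.group("shared_id")),
        })
    return results


# The error line from this report, used as sample input.
sample = (
    "Error (criu/mount.c:1088): mnt: Mount 2450 "
    "./proc/driver/nvidia/gpus/0000:01:00.0 "
    "(master_id: 15 shared_id: 0) has unreachable sharing. "
    "Try --enable-external-masters."
)

for mount in find_unreachable_mounts(sample):
    print(mount)
```

In practice the text of the criu-dump.log file referenced in the daemon error would be passed in place of `sample`.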
CRIU logs and information:
CRIU full dump/restore logs:
Output of `criu --version`:
Output of `criu check --all`:
Additional environment details:
Ubuntu 20.04
Docker version 24.0.5, build 24.0.5-0ubuntu1~20.04.1