How to use the "--enable-external-masters" option in the Docker checkpoint feature (integrated with CRIU)? #2472

Closed
nwpuhkp opened this issue Aug 19, 2024 · 5 comments

Comments

@nwpuhkp

nwpuhkp commented Aug 19, 2024

Description

When I tried to create a checkpoint for a Docker container that is using a GPU, I encountered the error:

"Error response from daemon: Cannot checkpoint container xxxxx: nvidia-container-runtime did not terminate successfully: exit status 1: criu failed: type NOTIFY errno 0 path= /run/containerd/io.containerd.runtime.v2.task/moby/a0a79717b8f17ca7c40357c7c16e38e93b2ae7d634c923f2e3d94286a6132abc/criu-dump.log: unknown".

The log file suggests trying the --enable-external-masters option, but the docker checkpoint create command does not accept it. What should I do so that Docker can create the checkpoint normally?

Steps to reproduce the issue:
The command I executed is: docker checkpoint create xxxxxx checkpoint1 --leave-running=True

Describe the results you received:
docker checkpoint create autoware-hkp checkpoint1 --leave-running=True --enable-external-masters
unknown flag: --enable-external-masters
See 'docker checkpoint create --help'.

Error (criu/mount.c:1088): mnt: Mount 2450 ./proc/driver/nvidia/gpus/0000:01:00.0 (master_id: 15 shared_id: 0) has unreachable sharing. Try --enable-external-masters.

CRIU logs and information:

CRIU full dump/restore logs:

(00.012402) mnt: <--
(00.012403) mnt:        The mount 2449 is bind for 2450 (@./dev/nvidia-uvm-tools -> @./dev/nvidia0)
(00.012404) mnt:        The mount 2448 is bind for 2450 (@./dev/nvidia-uvm -> @./dev/nvidia0)
(00.012405) mnt:        The mount 2447 is bind for 2450 (@./dev/nvidiactl -> @./dev/nvidia0)
(00.012406) mnt:        The mount 2444 is bind for 2445 (@./lib/firmware/nvidia/555.42.06/gsp_ga10x.bin -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012406) mnt:        The mount 2440 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012407) mnt:        The mount 2439 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012408) mnt:        The mount 2438 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012408) mnt:        The mount 2437 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012409) mnt:        The mount 2436 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012410) mnt:        The mount 2435 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvoptix.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012410) mnt:        The mount 2434 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012411) mnt:        The mount 2433 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012411) mnt:        The mount 2432 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012412) mnt:        The mount 2431 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-tls.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012413) mnt:        The mount 2430 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012413) mnt:        The mount 2429 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012414) mnt:        The mount 2428 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012415) mnt:        The mount 2427 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-pkcs11-openssl3.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012415) mnt:        The mount 2426 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-pkcs11.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012416) mnt:        The mount 2425 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012418) mnt:        The mount 2424 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012419) mnt:        The mount 2423 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-gpucomp.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012419) mnt:        The mount 2422 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012420) mnt:        The mount 2421 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libcudadebugger.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012421) mnt:        The mount 2420 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libcuda.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012421) mnt:        The mount 2419 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012422) mnt:        The mount 2418 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-ml.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012422) mnt:        The mount 2417 is bind for 2445 (@./usr/bin/nvidia-cuda-mps-server -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012423) mnt:        The mount 2416 is bind for 2445 (@./usr/bin/nvidia-cuda-mps-control -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012424) mnt:        The mount 2415 is bind for 2445 (@./usr/bin/nvidia-persistenced -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012424) mnt:        The mount 2414 is bind for 2445 (@./usr/bin/nvidia-debugdump -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012425) mnt:        The mount 2413 is bind for 2445 (@./usr/bin/nvidia-smi -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012426) mnt:        The mount 2410 is bind for 2445 (@./usr/share/vulkan/implicit_layer.d/nvidia_layers.json -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012426) mnt:        The mount 2409 is bind for 2445 (@./usr/share/vulkan/icd.d/nvidia_icd.json -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012427) mnt:        The mount 2408 is bind for 2445 (@./usr/share/glvnd/egl_vendor.d/10_nvidia.json -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012427) mnt:        The mount 2407 is bind for 2445 (@./usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012428) mnt:        The mount 2406 is bind for 2445 (@./usr/share/X11/xorg.conf.d/10-nvidia.conf -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012429) mnt:        The mount 2405 is bind for 2445 (@./lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012429) mnt:        The mount 2404 is bind for 2445 (@./lib/x86_64-linux-gnu/nvidia/xorg/libglxserver_nvidia.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012430) mnt:        The mount 2389 is bind for 2445 (@./lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012431) mnt:        The mount 2388 is bind for 2445 (@./home/hkp/.Xauthority -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012431) mnt:        The mount 2387 is bind for 2445 (@./home/autoware/autoware-contents -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012432) mnt:        The mount 2386 is bind for 2445 (@./tmp/fuzzerdata -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012433) mnt:        The mount 2385 is bind for 2445 (@./tmp/.X11-unix -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012433) mnt:        The mount 2384 is bind for 2445 (@./etc/resolv.conf -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012434) mnt:        The mount 2383 is bind for 2445 (@./etc/hosts -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012435) mnt:        The mount 2382 is bind for 2445 (@./etc/hostname -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012435) mnt:        The mount 2442 is bind for 2443 (@./usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.410.129 -> @./usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129)
(00.012436) mnt:        The mount 2441 is bind for 2443 (@./usr/lib/x86_64-linux-gnu/libcuda.so.410.129 -> @./usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129)
(00.012438) mnt:        The mount 2375 is bind for 2443 (@./ -> @./usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129)
(00.012445) mnt: Found /dev/nvidia0 mapping for ./dev/nvidia0 mountpoint
(00.012446) mnt: Found /dev/nvidia-uvm-tools mapping for ./dev/nvidia-uvm-tools mountpoint
(00.012448) mnt: Found /dev/nvidia-uvm mapping for ./dev/nvidia-uvm mountpoint
(00.012449) mnt: Found /dev/nvidiactl mapping for ./dev/nvidiactl mountpoint
(00.012477) mnt: Found /usr/share/vulkan/implicit_layer.d/nvidia_layers.json mapping for ./usr/share/vulkan/implicit_layer.d/nvidia_layers.json mountpoint
(00.012478) mnt: Found /usr/share/vulkan/icd.d/nvidia_icd.json mapping for ./usr/share/vulkan/icd.d/nvidia_icd.json mountpoint
(00.012480) mnt: Found /usr/share/glvnd/egl_vendor.d/10_nvidia.json mapping for ./usr/share/glvnd/egl_vendor.d/10_nvidia.json mountpoint
(00.012481) mnt: Found /usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json mapping for ./usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json mountpoint
(00.012482) mnt: Found /usr/share/X11/xorg.conf.d/10-nvidia.conf mapping for ./usr/share/X11/xorg.conf.d/10-nvidia.conf mountpoint
(00.012483) mnt: Found /lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so mapping for ./lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so mountpoint
(00.012485) mnt: Found /lib/x86_64-linux-gnu/nvidia/xorg/libglxserver_nvidia.so.555.42.06 mapping for ./lib/x86_64-linux-gnu/nvidia/xorg/libglxserver_nvidia.so.555.42.06 mountpoint
(00.012486) mnt: Found /sys/fs/cgroup/cpuset mapping for ./sys/fs/cgroup/cpuset mountpoint
(00.012487) mnt: Found /sys/fs/cgroup/blkio mapping for ./sys/fs/cgroup/blkio mountpoint
(00.012489) mnt: Found /sys/fs/cgroup/rdma mapping for ./sys/fs/cgroup/rdma mountpoint
(00.012490) mnt: Found /sys/fs/cgroup/perf_event mapping for ./sys/fs/cgroup/perf_event mountpoint
(00.012491) mnt: Found /sys/fs/cgroup/pids mapping for ./sys/fs/cgroup/pids mountpoint
(00.012493) mnt: Found /sys/fs/cgroup/net_cls,net_prio mapping for ./sys/fs/cgroup/net_cls,net_prio mountpoint
(00.012494) mnt: Found /sys/fs/cgroup/devices mapping for ./sys/fs/cgroup/devices mountpoint
(00.012495) mnt: Found /sys/fs/cgroup/misc mapping for ./sys/fs/cgroup/misc mountpoint
(00.012496) mnt: Found /sys/fs/cgroup/cpu,cpuacct mapping for ./sys/fs/cgroup/cpu,cpuacct mountpoint
(00.012498) mnt: Found /sys/fs/cgroup/memory mapping for ./sys/fs/cgroup/memory mountpoint
(00.012499) mnt: Found /sys/fs/cgroup/freezer mapping for ./sys/fs/cgroup/freezer mountpoint
(00.012500) mnt: Found /sys/fs/cgroup/hugetlb mapping for ./sys/fs/cgroup/hugetlb mountpoint
(00.012501) mnt: Found /sys/fs/cgroup/systemd mapping for ./sys/fs/cgroup/systemd mountpoint
(00.012503) mnt: Found /lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1 mapping for ./lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1 mountpoint
(00.012505) mnt: Found /home/hkp/.Xauthority mapping for ./home/hkp/.Xauthority mountpoint
(00.012506) mnt: Found /home/autoware/autoware-contents mapping for ./home/autoware/autoware-contents mountpoint
(00.012507) mnt: Found /tmp/fuzzerdata mapping for ./tmp/fuzzerdata mountpoint
(00.012509) mnt: Found /tmp/.X11-unix mapping for ./tmp/.X11-unix mountpoint
(00.012510) mnt: Found /etc/resolv.conf mapping for ./etc/resolv.conf mountpoint
(00.012511) mnt: Found /etc/hosts mapping for ./etc/hosts mountpoint
(00.012513) mnt: Found /etc/hostname mapping for ./etc/hostname mountpoint
(00.012518) mnt: Inspecting sharing on 2451 shared_id 0 master_id 15 (@./proc/driver/nvidia/gpus/0000:01:00.0)
(00.012519) Error (criu/mount.c:1088): mnt: Mount 2451 ./proc/driver/nvidia/gpus/0000:01:00.0 (master_id: 15 shared_id: 0) has unreachable sharing. Try --enable-external-masters.

Output of `criu --version`:

Version: 3.19

Output of `criu check --all`:

Can't check shutdown state of inet socket
Warn  (criu/cr-check.c:1346): Nftables based locking requires libnftables and set concatenations support
Looks good but some kernel features are missing
which, depending on your process tree, may cause
dump or restore failure.

Additional environment details:
Ubuntu 20.04
Docker version 24.0.5, build 24.0.5-0ubuntu1~20.04.1

@rst0git
Member

rst0git commented Aug 19, 2024

Error (criu/mount.c:1088): mnt: Mount 2450 ./proc/driver/nvidia/gpus/0000:01:00.0 (master_id: 15 shared_id: 0) has unreachable sharing. Try --enable-external-masters.

@nwpuhkp This is a known problem. Docker, containerd and CRI-O do not currently support checkpoint/restore with NVIDIA GPUs using the CUDA plugin for CRIU.
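
For readers trying to understand the "unreachable sharing" message itself: a mount with a non-zero master_id is a slave of a shared peer group, and the error shows up when that peer group cannot be found among the mounts being dumped. The following is a minimal sketch, not CRIU's actual code, that approximates this condition by parsing the `shared:N` and `master:N` optional fields of a `/proc/<pid>/mountinfo` file. The names `Mount`, `parse_mountinfo`, `find_unreachable_masters` and the `SAMPLE` text are hypothetical and exist only for illustration; the sample lines are made up, loosely modeled on the paths in the log above.

```python
from dataclasses import dataclass


@dataclass
class Mount:
    mount_id: int
    mount_point: str
    shared_id: int = 0  # peer group this mount belongs to (0 = not shared)
    master_id: int = 0  # peer group this mount is a slave of (0 = not a slave)


def parse_mountinfo(text: str) -> list[Mount]:
    """Parse the text of a /proc/<pid>/mountinfo file into Mount records."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        try:
            # Optional fields sit between field 6 and the lone "-" separator.
            sep = fields.index("-", 6)
        except ValueError:
            continue  # blank or malformed line
        mount = Mount(mount_id=int(fields[0]), mount_point=fields[4])
        for tag in fields[6:sep]:
            key, _, value = tag.partition(":")
            if key == "shared":
                mount.shared_id = int(value)
            elif key == "master":
                mount.master_id = int(value)
        mounts.append(mount)
    return mounts


def find_unreachable_masters(mounts: list[Mount]) -> list[Mount]:
    """Return slave mounts whose master peer group is not visible in this list."""
    visible_groups = {m.shared_id for m in mounts if m.shared_id}
    return [m for m in mounts if m.master_id and m.master_id not in visible_groups]


# Hypothetical sample, NOT taken from the reporter's system.
SAMPLE = """\
2375 2200 0:60 / / rw,relatime - overlay overlay rw
2385 2375 8:2 /tmp/.X11-unix /tmp/.X11-unix rw shared:40 - ext4 /dev/sda2 rw
2386 2375 8:2 /tmp/fuzzerdata /tmp/fuzzerdata rw master:40 - ext4 /dev/sda2 rw
2451 2375 0:22 /driver/nvidia/gpus/0000:01:00.0 /proc/driver/nvidia/gpus/0000:01:00.0 ro,nosuid master:15 - proc proc rw
"""

if __name__ == "__main__":
    for m in find_unreachable_masters(parse_mountinfo(SAMPLE)):
        print(f"mount {m.mount_id} {m.mount_point}: master:{m.master_id} has no visible peer")
```

On a real system the input would be the contents of `/proc/<pid>/mountinfo` for the container's init process. This only helps identify which mounts trigger the message; it does not make Docker checkpoint a GPU container.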

@nwpuhkp
Author

nwpuhkp commented Aug 20, 2024

Error (criu/mount.c:1088): mnt: Mount 2450 ./proc/driver/nvidia/gpus/0000:01:00.0 (master_id: 15 shared_id: 0) has unreachable sharing. Try --enable-external-masters.

@nwpuhkp This is a known problem. Docker, containerd and CRI-O do not currently support checkpoint/restore with NVIDIA GPUs using the CUDA plugin for CRIU.

Okay, it's because I haven't fully understood the scope of CRIU's functionality. Thank you for your reply. I will look for other solutions.

@nwpuhkp nwpuhkp closed this as completed Aug 20, 2024
@seungduk-yanolja

Can you please consider this as a feature request? I think this feature would be very useful for maximizing GPU utilization.

@adrianreber
Member

@seungduk-yanolja You have to talk to the corresponding projects. This is not something CRIU can solve. It works on the CRIU level.

@seungduk-yanolja

@seungduk-yanolja You have to talk to the corresponding projects. This is not something CRIU can solve. It works on the CRIU level.

Oh, I see. Thanks for pointing it out. Sorry about that.
