Skip to content

process hangs on system startup after CRIU restore #1911

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
PavloMykhailyshyn opened this issue Jun 7, 2022 · 31 comments
Open

process hangs on system startup after CRIU restore #1911

PavloMykhailyshyn opened this issue Jun 7, 2022 · 31 comments

Comments

@PavloMykhailyshyn
Copy link

PavloMykhailyshyn commented Jun 7, 2022

Description

My service launches CRIU for each process (it could be the same binary with different arguments) simultaneously.
When all dumps are complete, the service ends. The system gets rebooted.
The system boots, the service starts again and goes through each dump folder. It runs CRIU on every image simultaneously.

Service did dump
/usr/sbin/criu-ns dump --images-dir /var/lib/dumps/images/zeb2ft-s5kpj2-64gu7z --shell-job --ext-unix-sk -v4 --log-file ../../logs/dump/2022-06-07T15:31:40Z_zeb2ft-s5kpj2-64gu7z.log --action-script /usr/local/sbin/criu_action_script.sh --tree 385732 --ghost-limit 1G --tcp-established

and restore
/usr/sbin/criu-ns restore --images-dir /var/lib/dumps/images/zeb2ft-s5kpj2-64gu7z --shell-job --ext-unix-sk -v4 --log-file ../../logs/restore/2022-06-07T15:36:06Z_zeb2ft-s5kpj2-64gu7z.log --action-script /usr/local/sbin/criu_action_script.sh --restore-detached --tcp-close

Dumps and Restores are always successful, but the process hangs. It is only reproducible after the system gets rebooted. While the system is on everything works perfectly.

the process stuck here (not CRIU process, CRIU finished successfully)

Thread 30 (Thread 0x7f01b27fc700 (LWP 2799)):
#0  __libc_read (nbytes=5, buf=0x11bd0f73, fd=6) at ../sysdeps/unix/sysv/linux/read.c:26
#1  __libc_read (fd=6, buf=0x11bd0f73, nbytes=5) at ../sysdeps/unix/sysv/linux/read.c:24
#2  0x00007f01cad0e3d9 in ?? () from target:/lib/x86_64-linux-gnu/libcrypto.so.1.1
#3  0x00007f01cad0967e in ?? () from target:/lib/x86_64-linux-gnu/libcrypto.so.1.1
#4  0x00007f01cad084d4 in ?? () from target:/lib/x86_64-linux-gnu/libcrypto.so.1.1
#5  0x00007f01cad08aa7 in BIO_read () from target:/lib/x86_64-linux-gnu/libcrypto.so.1.1
#6  0x00007f01cabf1b91 in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#7  0x00007f01cabf5e1e in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#8  0x00007f01cabf36d0 in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#9  0x00007f01cabfac45 in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#10 0x00007f01cac05a3f in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#11 0x00007f01cac05b47 in SSL_read () from target:/lib/x86_64-linux-gnu/libssl.so.1.1

How to solve this hanging problem? Maybe the connections were restored somehow wrong?
Is CRIU doing something bad or is my system causing such behavior (these reboots, etc)?

criu --version
Version: 3.16.1

@PavloMykhailyshyn
Copy link
Author

@Snorch and @rst0git
Could you be so kind to suggest something about this?

@rst0git
Copy link
Member

rst0git commented Jun 8, 2022

@PavloMykhailyshyn could you try adding --tcp-close to the dump command as well?
(We recently enabled this option for dump as well.)

@rst0git
Copy link
Member

rst0git commented Jun 8, 2022

#11 0x00007f01cac05b47 in SSL_read () from target:/lib/x86_64-linux-gnu/libssl.so.1.1

It looks like the process is stuck on SSL_read() reading from closed TCP socket.
https://www.openssl.org/docs/man1.1.1/man3/SSL_read.html

@PavloMykhailyshyn
Copy link
Author

still the same

@PavloMykhailyshyn
Copy link
Author

I will try to run several tests and get back to you with criu logs and callstacks

@PavloMykhailyshyn
Copy link
Author

PavloMykhailyshyn commented Jun 16, 2022

/usr/sbin/criu-ns dump --images-dir /var/lib/myservice/agent-dumps/images/image-id --shell-job --ext-unix-sk -v4 --log-file ../../logs/dump/2022-06-16T09:37:55Z_image-id.log --action-script /usr/local/sbin/criu_action_script.sh --tree 3469854 --ghost-limit 1G --tcp-close
dump.txt

/usr/sbin/criu-ns restore --images-dir /var/lib/myservice/agent-dumps/images/image-id --shell-job --ext-unix-sk -v4 --log-file ../../logs/restore/2022-06-16T09:45:38Z_image-id.log --action-script /usr/local/sbin/criu_action_script.sh --restore-detached --tcp-close
restore.txt

Thread 28 (Thread 0x7fed2ffe7700 (LWP 2550)):
#0  __libc_read (nbytes=5, buf=0x7fed3c058f43, fd=6) at ../sysdeps/unix/sysv/linux/read.c:26
#1  __libc_read (fd=6, buf=0x7fed3c058f43, nbytes=5) at ../sysdeps/unix/sysv/linux/read.c:24
#2  0x00007fed48ade3d9 in ?? () from target:/lib/x86_64-linux-gnu/libcrypto.so.1.1
#3  0x00007fed48ad967e in ?? () from target:/lib/x86_64-linux-gnu/libcrypto.so.1.1
#4  0x00007fed48ad84d4 in ?? () from target:/lib/x86_64-linux-gnu/libcrypto.so.1.1
#5  0x00007fed48ad8aa7 in BIO_read () from target:/lib/x86_64-linux-gnu/libcrypto.so.1.1
#6  0x00007fed489c1b91 in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#7  0x00007fed489c5e1e in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#8  0x00007fed489c36d0 in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#9  0x00007fed489cac45 in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#10 0x00007fed489d5a3f in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#11 0x00007fed489d5b47 in SSL_read () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
Thread 10 (Thread 0x7fed42521700 (LWP 2532)):
#0  __libc_recv (flags=<optimised out>, len=1, buf=0x7fed4251eca0, fd=5) at ../sysdeps/unix/sysv/linux/recv.c:28
#1  __libc_recv (fd=5, buf=0x7fed4251eca0, len=1, flags=0) at ../sysdeps/unix/sysv/linux/recv.c:23

everything else waits on futex_wait_cancelable

@PavloMykhailyshyn
Copy link
Author

@rst0git please take a look

@PavloMykhailyshyn
Copy link
Author

PavloMykhailyshyn commented Jun 16, 2022

but I noticed that after some period of time (perhaps 10-50 min, could be more, it depends) everything becomes normal and the process continues to work

@PavloMykhailyshyn
Copy link
Author

I attempted to dump and restore the restored process. Now it is truly stuck.
I did it for two processes:

  • first one
#0  __libc_recv (flags=<optimised out>, len=1, buf=0x7f8b3effaca0, fd=5) at ../sysdeps/unix/sysv/linux/recv.c:28
#1  __libc_recv (fd=5, buf=0x7f8b3effaca0, len=1, flags=0) at ../sysdeps/unix/sysv/linux/recv.c:23
  • second
#0  __libc_recv (flags=<optimised out>, len=1, buf=0x7fa30d7f7ca0, fd=5) at ../sysdeps/unix/sysv/linux/recv.c:28
#1  __libc_recv (fd=5, buf=0x7fa30d7f7ca0, len=1, flags=0) at ../sysdeps/unix/sysv/linux/recv.c:23

@PavloMykhailyshyn
Copy link
Author

processes call such fucntion
recv(m_sock, buf, block_read, 0);

@PavloMykhailyshyn
Copy link
Author

@Snorch & @rst0git
Please suggest something here
Does the bug exist in criu or is it a problem on my end?

@PavloMykhailyshyn
Copy link
Author

after attaching and detaching gdb from the process everything magically started to work

@PavloMykhailyshyn
Copy link
Author

but the process takes much more time to finish its work than before dump and restore
(e.g. in a normal situation it lasts 40 min, after dump and restore almost 5 hours)

@PavloMykhailyshyn
Copy link
Author

any thoughts?
@rst0git @Snorch @avagin @adrianreber
The problem is crucial, I need a suggestion here

@avagin
Copy link
Member

avagin commented Jun 20, 2022

What kernel do you use? Does it support time namespaces? Could you strace (strace -fo strace.log -s 1024 -p PID) the target process after restore? If it starts to work after gdb, it has to start work after strace too. I want to see first a few hundred lines in the log.

@PavloMykhailyshyn
Copy link
Author

@avagin

5.4.0-99-generic
I do not know, where can I find it? From Wikipedia The time namespace allows processes to see different system times in a way similar to the UTS namespace. It was proposed in 2018 and landed on Linux 5.6, which was released in March 2020
It may not be supported by my kernel.

strace after the restore
strace.txt

one more time after previous strace
strace1.txt

the process still doesn't write any logs and probably hangs

@rst0git
Copy link
Member

rst0git commented Jun 21, 2022

@PavloMykhailyshyn you can check timens supported with:

criu check --feature timens

@PavloMykhailyshyn
Copy link
Author

Error (criu/cr-check.c:1315): Time namespaces are not supported

@avagin
Copy link
Member

avagin commented Jun 21, 2022

@PavloMykhailyshyn I think time namespaces should fix this issue. What distro do you use? It looks like Ubuntu or Debian. Could you install the 5.10 kernel and check that the issue isn't reproduced in the new environment.

@PavloMykhailyshyn
Copy link
Author

everything works fine when I updated the kernel to the 5.10 version
is there a way to fix the problem on the 5.4 version? (or is this the only way to solve the hanging problem?)

@avagin
Copy link
Member

avagin commented Jun 21, 2022

@PavloMykhailyshyn I don't know any simple way to solve this issue on old kernels.

@PavloMykhailyshyn
Copy link
Author

@avagin could you elaborate more on these time-namespaces?
Can you explain why this solves my problem?

@PavloMykhailyshyn
Copy link
Author

does it affect my processes in any way?

@PavloMykhailyshyn
Copy link
Author

For example, it will make our processes run with completely different timestamps

@rst0git
Copy link
Member

rst0git commented Jun 22, 2022

@PavloMykhailyshyn Security protocols are often designed with the assumption that the system clock doesn't jump backwards or forwards unexpectedly (e.g., RFC 3161). However, from the perspective of an application that is transparently checkpointed and then restored at later time or on a system with different system clock this assumption doesn't hold. Thus, time namespace was introduced to address this problem. The following wiki page contains more information and links to articles about the time namespace: https://criu.org/Time_namespace

@avagin
Copy link
Member

avagin commented Jun 22, 2022

In Linux, we have two clocks CLOCK_MONOTONIC and CLOCK_BOOTTIME that cannot be set and represents monotonic time since some unspecified starting point:
https://linux.die.net/man/2/clock_gettime

These clocks are reset on reboot. After C/R, we have to guarantee that these clocks will be monotonic and will not jump to far for restored processes.

@PavloMykhailyshyn
Copy link
Author

so to be clear enough there is no way around fixing something on my end (probably fix code in restored processes)
the only way is to upgrade the kernel
am I right? or we can workaround somehow this?

@adrianreber
Copy link
Member

I only followed this issue loosely, but upgrading Kernel seems the only solution.

@PavloMykhailyshyn
Copy link
Author

PavloMykhailyshyn commented Jul 1, 2022

I faced this problem during the restore

(00.166452) 137041: Error (criu/protobuf.c:72): Unexpected EOF on (empty-image)
(00.166468) 137041: Error (criu/files.c:1200): Unable to open fd=2 id=0x17

criu_restore.txt

why did this happen? 78 processes were successfully restored, and only two failed with this error.
@adrianreber @avagin @rst0git

@rst0git
Copy link
Member

rst0git commented Jul 1, 2022

(00.166424) 137041: tcp: Restoring TCP connection
(00.166431) 137041: tcp: Restoring TCP connection id 17 ino 9ac87
(00.166442) 137041: No tcp-stream-9ac87.img image
(00.166452) 137041: Error (criu/protobuf.c:72): Unexpected EOF on (empty-image)

@PavloMykhailyshyn It looks like this error appears because tcp-stream-9ac87.img is missing.

This problem should be fixed in #1579.

Would you be able to test with a more recent version of CRIU?

@github-actions
Copy link

github-actions bot commented Aug 1, 2022

A friendly reminder that this issue had no activity for 30 days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants