process hangs on system startup after CRIU restore #1911
Comments
@PavloMykhailyshyn could you try adding
It looks like the process is stuck on
Still the same.
I will try to run several tests and get back to you with CRIU logs and call stacks.
Everything else waits on
@rst0git please take a look.
But I noticed that after some period of time (perhaps 10-50 min, could be more, it depends) everything becomes normal and the process continues to work.
I attempted to dump and restore the restored process. Now it is truly stuck.
The processes call this function
After attaching and detaching gdb from the process, everything magically started to work,
but the process takes much more time to finish its work than before the dump and restore.
Any thoughts?
What kernel do you use? Does it support time namespaces? Could you strace (strace -fo strace.log -s 1024 -p PID) the target process after restore? If it starts to work after gdb, it should start to work after strace too. I want to see the first few hundred lines of the log.
Kernel: 5.4.0-99-generic
strace after the restore:
One more time:
After the previous strace, the process still doesn't write any logs and probably hangs.
@PavloMykhailyshyn you can check whether timens is supported with: criu check --feature timens
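As a quick complement to the CRIU check above, the kernel side can be probed directly: kernels with time namespace support (Linux 5.6 and later) expose a `time` entry under `/proc/self/ns`. The sketch below only tests for that entry; the function name and the `proc_root` parameter are illustrative assumptions, and it does not replace `criu check`, which also verifies that CRIU itself can use the feature.

```python
import os


def kernel_exposes_timens(proc_root="/proc"):
    """Return True if <proc_root>/self/ns/time exists.

    proc_root is a hypothetical parameter added so the check can be
    pointed at a fake directory tree; on a real system keep the default.
    """
    # lexists() avoids following the namespace symlink itself.
    return os.path.lexists(os.path.join(proc_root, "self", "ns", "time"))


if __name__ == "__main__":
    print("time namespace entry present:", kernel_exposes_timens())
```

On a 5.4 kernel such as the one reported in this thread, the entry is expected to be missing.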
@PavloMykhailyshyn I think time namespaces should fix this issue. What distro do you use? It looks like Ubuntu or Debian. Could you install the 5.10 kernel and check whether the issue still reproduces in the new environment?
Everything works fine after I updated the kernel to version 5.10.
@PavloMykhailyshyn I don't know any simple way to solve this issue on old kernels.
@avagin could you elaborate more on these time namespaces?
Do they affect my processes in any way?
For example, will they make our processes run with completely different timestamps?
@PavloMykhailyshyn Security protocols are often designed with the assumption that the system clock doesn't jump backwards or forwards unexpectedly (e.g., RFC 3161). However, from the perspective of an application that is transparently checkpointed and then restored at a later time, or on a system with a different system clock, this assumption doesn't hold. The time namespace was introduced to address this problem. The following wiki page contains more information and links to articles about the time namespace: https://criu.org/Time_namespace
In Linux, we have two clocks, CLOCK_MONOTONIC and CLOCK_BOOTTIME, that cannot be set and represent monotonic time since some unspecified starting point. These clocks are reset on reboot. After C/R, we have to guarantee that these clocks remain monotonic and do not jump too far for restored processes.
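To make the clock reset concrete, here is a minimal sketch with made-up numbers. It assumes the restored process holds an absolute deadline computed against CLOCK_MONOTONIC before the reboot; that is one plausible reading of the symptoms, not something confirmed in this thread. `remaining_wait` and all the values are hypothetical.

```python
import time


def remaining_wait(deadline, now):
    """Seconds left until an absolute monotonic deadline (never negative)."""
    return max(0.0, deadline - now)


def read_clocks():
    """Read the two clocks discussed above (Linux only)."""
    return {
        "CLOCK_MONOTONIC": time.clock_gettime(time.CLOCK_MONOTONIC),
        "CLOCK_BOOTTIME": time.clock_gettime(time.CLOCK_BOOTTIME),
    }


# Hypothetical values, in seconds since boot.
uptime_at_dump = 1800.0            # host had been up 30 minutes at dump time
deadline = uptime_at_dump + 5.0    # process had armed a 5-second timeout
uptime_at_restore = 120.0          # restored 2 minutes after the reboot

wait_before_reboot = remaining_wait(deadline, uptime_at_dump)     # 5.0
wait_after_reboot = remaining_wait(deadline, uptime_at_restore)   # 1685.0

if __name__ == "__main__":
    print("wait before reboot:", wait_before_reboot)
    print("wait after reboot: ", wait_after_reboot)
    print(read_clocks())
```

With these numbers a 5-second timeout turns into a wait of roughly 28 minutes, because the clock restarted near zero while the stored deadline did not. A time namespace lets the restored processes see per-namespace offsets for these two clocks, so their view of the clocks continues from the dump-time values.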
So, to be clear, there is no workaround I can apply on my end (for example, fixing code in the restored processes)?
I only followed this issue loosely, but upgrading the kernel seems to be the only solution.
I faced this problem during the restore
Why did this happen? 78 processes were successfully restored, and only two failed with this error.
@PavloMykhailyshyn It looks like this error appears because
This problem should be fixed in #1579. Would you be able to test with a more recent version of CRIU?
A friendly reminder that this issue had no activity for 30 days. |
Description
My service launches CRIU for each process (it could be the same binary with different arguments) simultaneously.
When all dumps are complete, the service ends. The system gets rebooted.
The system boots, the service starts again and goes through each dump folder. It runs CRIU on every image simultaneously.
The service ran the dump with
/usr/sbin/criu-ns dump --images-dir /var/lib/dumps/images/zeb2ft-s5kpj2-64gu7z --shell-job --ext-unix-sk -v4 --log-file ../../logs/dump/2022-06-07T15:31:40Z_zeb2ft-s5kpj2-64gu7z.log --action-script /usr/local/sbin/criu_action_script.sh --tree 385732 --ghost-limit 1G --tcp-established
and the restore with
/usr/sbin/criu-ns restore --images-dir /var/lib/dumps/images/zeb2ft-s5kpj2-64gu7z --shell-job --ext-unix-sk -v4 --log-file ../../logs/restore/2022-06-07T15:36:06Z_zeb2ft-s5kpj2-64gu7z.log --action-script /usr/local/sbin/criu_action_script.sh --restore-detached --tcp-close
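For readers who want to reproduce the setup, the per-folder restore step can be sketched as follows. This is not the reporter's service: `build_restore_cmd`, `restore_all`, and the `log_dir` layout are hypothetical, and only the command-line flags are copied from the restore command above.

```python
import os
import subprocess


def build_restore_cmd(images_root, dump_id, log_file, action_script):
    """Assemble the criu-ns restore command for one dump folder."""
    return [
        "/usr/sbin/criu-ns", "restore",
        "--images-dir", os.path.join(images_root, dump_id),
        "--shell-job", "--ext-unix-sk", "-v4",
        "--log-file", log_file,
        "--action-script", action_script,
        "--restore-detached", "--tcp-close",
    ]


def restore_all(images_root, log_dir, action_script):
    """Start one restore per dump folder, then collect the exit codes."""
    procs = []
    for dump_id in sorted(os.listdir(images_root)):
        log_file = os.path.join(log_dir, dump_id + ".log")  # assumed naming
        cmd = build_restore_cmd(images_root, dump_id, log_file, action_script)
        procs.append((dump_id, subprocess.Popen(cmd)))
    # All restores run concurrently; wait() only gathers their results.
    return {dump_id: proc.wait() for dump_id, proc in procs}
```

Collecting the exit code per folder makes it easy to tell a failed restore apart from a restore that succeeded but left the process hanging, which is the case described here.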
Dumps and restores always succeed, but the restored process hangs. This is only reproducible after the system has been rebooted. As long as the system stays up, everything works perfectly.
The process is stuck here (not the CRIU process; CRIU finished successfully)
How can I solve this hang? Maybe the connections were restored incorrectly?
Is CRIU doing something wrong, or is my system causing this behavior (the reboots, etc.)?
criu --version
Version: 3.16.1