ZED still crashing after PR-11928 fix integrated #11963

Closed
don-brady opened this issue Apr 28, 2021 · 6 comments
Labels
Status: Triage Needed New issue which needs to be triaged Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@don-brady
Contributor

System information

Type Version/Name
Distribution Name Ubuntu
Distribution Version 18.04
Linux Kernel 5.4.0-1045
Architecture x86_64
ZFS Version 2021.04.27.17-3b038a4
SPL Version 2021.04.27.17-3b038a4

Describe the problem you're observing

ZED is encountering an assertion failure due to adding a duplicate key into an AVL tree.
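
For context, here is a rough sketch of the node and comparator involved, inferred from the backtraces and gdb output in this issue rather than copied from the source: _launched_processes appears to be keyed by the child PID, so inserting a second entry for a PID that still has a stale entry in the tree trips the avl_add() assertion.

struct launched_process_node {
	avl_node_t	node;
	pid_t		pid;
	uint64_t	eid;
	char		*name;
};

/* Assumed shape of _launched_process_node_compare: ordering is by pid only,
 * so two entries with the same pid (even with different eids) collide. */
static int
_launched_process_node_compare(const void *x1, const void *x2)
{
	pid_t p1 = ((const struct launched_process_node *)x1)->pid;
	pid_t p2 = ((const struct launched_process_node *)x2)->pid;

	return (p1 < p2 ? -1 : (p1 > p2 ? 1 : 0));
}

/* avl_add() asserts avl_find(tree, new_node, &where) == NULL before linking
 * the node in, which is the assertion seen in the backtrace below. */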

Describe how to reproduce the problem

We are just using stock ZED in our automated testing.

Include any warning/errors/backtraces from the system logs

#1  0x00007feb6a1e5921 in __GI_abort () at abort.c:79
#2  0x00007feb6b20d805 in libspl_assertf (file=0x7feb6b21c199 "../../module/avl/avl.c", func=0x7feb6b21c1b0 <__FUNCTION__.5089> "avl_add", line=640, format=<optimized out>)
    at assert.c:45
#3  0x00007feb6b215cfc in libspl_assert (line=640, func=<synthetic pointer>, file=0x7feb6b21c199 "../../module/avl/avl.c",
    buf=0x7feb6b21c170 "avl_find(tree, new_node, &where) == NULL") at ../../lib/libspl/include/assert.h:46
#4  0x00007feb6b21601b in avl_add (tree=tree@entry=0x5639784be320 <_launched_processes>, new_node=new_node@entry=0x563979d54de0) at ../../module/avl/avl.c:640
#5  0x00005639782b14f9 in _zed_exec_fork_child (eid=eid@entry=1644, dir=<optimized out>, prog=prog@entry=0x563979d4e3d0 "all-syslog.sh", env=env@entry=0x563979d57bc0,
    zfd=7, in_foreground=B_TRUE) at zed_exec.c:180
#6  0x00005639782b1b5f in zed_exec_process (eid=1644, class=<optimized out>, subclass=subclass@entry=0x563979d551b0 "history_event", zcp=zcp@entry=0x7ffe923a8940,
    envs=envs@entry=0x563979d577f0) at zed_exec.c:361
#7  0x00005639782b1253 in zed_event_service (zcp=0x7ffe923a8940) at zed_event.c:979
#8  0x00005639782ac7e0 in main (argc=<optimized out>, argv=<optimized out>) at zed.c:291
(gdb) frame 4
#4  0x00007feb6b21601b in avl_add (tree=tree@entry=0x5639784be320 <_launched_processes>, new_node=new_node@entry=0x563979d54de0) at ../../module/avl/avl.c:640
640	../../module/avl/avl.c: No such file or directory.
(gdb)  p *(struct launched_process_node *)0x563979d54de0
$1 = {node = {avl_child = {0x0, 0x0}, avl_pcb = 0}, pid = 24641, eid = 1644, name = 0x563979d576b0 "all-syslog.sh"}
(gdb) p *tree
$2 = {avl_root = 0x563979d574a0, avl_compar = 0x5639782b12a0 <_launched_process_node_compare>, avl_offset = 0, avl_numnodes = 1, avl_size = 48}
(gdb) p *(struct avl_node *) 0x563979d574a0
$3 = {avl_child = {0x0, 0x0}, avl_pcb = 1}
(gdb) frame 5
#5  0x00005639782b14f9 in _zed_exec_fork_child (eid=eid@entry=1644, dir=<optimized out>, prog=prog@entry=0x563979d4e3d0 "all-syslog.sh", env=env@entry=0x563979d57bc0,
    zfd=7, in_foreground=B_TRUE) at zed_exec.c:180
180	zed_exec.c: No such file or directory.
(gdb) p _launched_processes_lock
$4 = pthread_mutex_t = {Type = Normal, Status = Acquired, possibly with no waiters, Owner ID = 1851, Robust = No, Shared = No, Protocol = None}
(gdb) p pid
$5 = 24641
(gdb) p eid
$6 = 1644
(gdb) thread 3
[Switching to thread 3 (Thread 0x7feb637fe700 (LWP 1936))]
#0  0x00007feb6a5a7fc2 in __libc_pause () at ../sysdeps/unix/sysv/linux/pause.c:30
30	../sysdeps/unix/sysv/linux/pause.c: No such file or directory.
(gdb) bt
#0  0x00007feb6a5a7fc2 in __libc_pause () at ../sysdeps/unix/sysv/linux/pause.c:30
#1  0x00005639782b18b5 in _reap_children (arg=<optimized out>) at zed_exec.c:214
#2  0x00007feb6a59d6db in start_thread (arg=0x7feb637fe700) at pthread_create.c:463
#3  0x00007feb6a2c671f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) frame 1
#1  0x00005639782b18b5 in _reap_children (arg=<optimized out>) at zed_exec.c:214
214	zed_exec.c: No such file or directory.
(gdb) p pid
$7 = -1
@don-brady don-brady added Type: Defect Incorrect behavior (e.g. crash, hang) Status: Triage Needed New issue which needs to be triaged labels Apr 28, 2021
@behlendorf
Contributor

cc: @nabijaczleweli

@nabijaczleweli
Contributor

nabijaczleweli commented Apr 28, 2021

blergh. so, apparently, and this is fun:

A process ID shall not be reused by the system until the process lifetime ends.

good, great, this is defined as

The period of time that begins when a process is created and ends when its process ID is returned to the system.

and then

The remains of a live process (see Live Process) after it terminates (see Process Termination) and before its status information (see XSH Status Information) is consumed by its parent process.

and a Live Process is

An address space with one or more threads executing within that address space, and the required system resources for those threads.
Many of the system resources defined by POSIX.1-2017 are shared among all of the threads within a process. These include the process ID, [&c.].

However, according to Linux man-pages' wait(2):

The kernel maintains a minimal set of information about the zombie process (PID, termination status, resource usage information) in order to allow the parent to later perform a wait to obtain information about the child. As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes.

So, according to POSIX, re-using the PID for an already existing zombie might or might not be legal, depending on how you choose to read it (it appears to be defined as "it's reused when it can be", which is very helpful), but according to Linux man-pages, it should not happen, i.e. a Live Process cannot share a PID with a Zombie Process.

And, since the reaper thread is in the idle state with pid = -1, the last time it waited there were no processes to wait for, since wait() returned ECHILD.

a riddle for the ages
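
For readers following along, a minimal sketch of the reaper-poll shape this refers to (simplified, not the exact code; see _reap_children in zed_exec.c for the real loop): with WNOHANG, wait4() returns the PID of a reaped child, 0 if children exist but none have exited yet, or -1 with errno == ECHILD if there are no children at all, which is how the thread ends up idle with pid = -1.

#include <errno.h>
#include <pthread.h>
#include <sys/resource.h>
#include <sys/wait.h>

static void
reap_once(pthread_mutex_t *lock)
{
	struct rusage usage;
	int status;
	pid_t pid;

	(void) pthread_mutex_lock(lock);
	pid = wait4(0, &status, WNOHANG, &usage);
	if (pid > 0) {
		/* a child was reaped: look its pid up in the tree of
		 * launched processes and remove the matching node */
	} else if (pid == -1 && errno == ECHILD) {
		/* no children left at all -- the thread goes idle */
	}
	(void) pthread_mutex_unlock(lock);
}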

nabijaczleweli added commits to nabijaczleweli/zfs that referenced this issue Apr 28, 2021
@nabijaczleweli
Contributor

@don-brady can you try #11965 maybe? If it works, then great, but if not, then maybe it's time to come up with an alternative solution.

@h1z1

h1z1 commented May 1, 2021

"its reused when it can be"

Not quite. It's reused after it's reaped, which can be done by the parent or by init. I don't think even SunOS recycled a zombie PID. It is a common mistake to assume the PID you saved will in fact refer to the same process later.

The child and parent may not even exist in the same namespaces :)
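
A quick standalone demo of that point on Linux (just an illustration, not zed code): once the parent has reaped the child, the PID becomes available again, and on an otherwise idle system the kernel hands it back out after cycling through the PID space.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
	pid_t victim, p;
	long i;

	victim = fork();
	if (victim < 0)
		return (1);
	if (victim == 0)
		_exit(0);			/* child exits immediately */
	(void) waitpid(victim, NULL, 0);	/* reap it: the PID is now free */

	/* Fork-and-reap until the kernel hands the same PID back. */
	for (i = 1; i <= 200000; i++) {
		p = fork();
		if (p < 0)
			return (1);
		if (p == 0)
			_exit(0);
		(void) waitpid(p, NULL, 0);
		if (p == victim) {
			printf("PID %d reused after %ld forks\n",
			    (int)victim, i);
			return (0);
		}
	}
	printf("PID %d not reused within %ld forks\n", (int)victim, i - 1);
	return (1);
}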

@nabijaczleweli
Contributor

nabijaczleweli commented May 1, 2021

Well, the POSIX definition is circular, but I really don't expect any system to reuse zombie PIDs, which is why it was so surprising to hit this.
But considering we are the parent and track the PID from its conception until we rip apart its steaming corpse, that is guaranteed to be the same process. Which, again, is why it was so surprising to hit this.

Thinking about it again, we could race fork-die-reap-metadata_insertion, but that (and more) is handled in #11965.

Also, changing namespaces alters the view from within, not from without.

@nabijaczleweli
Contributor

nabijaczleweli commented May 1, 2021

Great news! I managed to reproduce this by starving the system of PIDs and doing:

diff --git a/cmd/zed/zed_exec.c b/cmd/zed/zed_exec.c
index 9b0775a74..9948517a7 100644
--- a/cmd/zed/zed_exec.c
+++ b/cmd/zed/zed_exec.c
@@ -206,6 +208,7 @@ _reap_children(void *arg)

        for (_reap_children_stop = B_FALSE; !_reap_children_stop; ) {
                (void) pthread_mutex_lock(&_launched_processes_lock);
+               sleep(1);
                pid = wait4(0, &status, WNOHANG, &usage);

It manifests the same way as yours does:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0xf7d26515 in __GI_abort () at abort.c:79
#2  0xf7f46e05 in ?? () from /lib/x86_64-linux-gnux32/libzfs_core.so.3
#3  0xf7f4ed58 in avl_add () from /lib/x86_64-linux-gnux32/libzfs_core.so.3
#4  0x00408a96 in _zed_exec_fork_child (eid=eid@entry=2485, dir=<optimized out>,
    prog=prog@entry=0x41f310 "history_event-zfs-list-cacher.sh", env=env@entry=0x5ec4d0, zfd=<optimized out>,
    in_foreground=B_TRUE) at zed_exec.c:182
#5  0x00408e40 in zed_exec_process (eid=2485, class=<optimized out>, subclass=subclass@entry=0x5e7d08 "history_event",
    zcp=zcp@entry=0xffffd530, envs=envs@entry=0x5eabe0) at zed_exec.c:364
#6  0x00408536 in zed_event_service (zcp=zcp@entry=0xffffd530) at zed_event.c:979
#7  0x00404090 in main (argc=<optimized out>, argv=<optimized out>) at zed.c:291

With this node being inserted:

(gdb) p *(struct launched_process_node *)0x5e9b40
$14 = {node = {avl_child = {0x0, 0x0}, avl_parent = 0x0, avl_child_index = 0, avl_balance = 0}, pid = 353, eid = 2485,
  name = 0x5e9a30 "history_event-zfs-list-cacher.sh"}

and this one already existing:

(gdb) p *(struct launched_process_node *)avl_find(&_launched_processes, (struct launched_process_node *)0x5e9b40, (void*)0)
$13 = {node = {avl_child = {0x5e8030, 0x5e84a0}, avl_parent = 0x5ea790, avl_child_index = 0, avl_balance = -1},
  pid = 353, eid = 2477, name = 0x5e9080 "all-debug.sh"}
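
(The comment above doesn't record how the PID space was starved; for anyone else trying to reproduce, one hypothetical way, purely an assumption and not necessarily what was used here, is to shrink kernel.pid_max so the kernel recycles PIDs almost immediately, e.g. with a root-only helper like this:)

#include <stdio.h>

int
main(void)
{
	/* 400 is an arbitrary small value; the kernel enforces a lower
	 * bound of about 300, so anything near that makes PIDs wrap fast. */
	FILE *f = fopen("/proc/sys/kernel/pid_max", "w");

	if (f == NULL) {
		perror("/proc/sys/kernel/pid_max");
		return (1);
	}
	fprintf(f, "400\n");
	return (fclose(f) == 0 ? 0 : 1);
}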

nabijaczleweli added a commit to nabijaczleweli/zfs that referenced this issue May 1, 2021
This can be very easily triggered by adding a sleep(1) before
the wait4() on a PID-starved system: the reaper thread would wait
for a child before its entry appeared, letting old entries accumulate:
  Invoking "all-debug.sh" eid=3021 pid=391
  Finished "(null)" eid=0 pid=391 time=0.002432s exit=0
  Invoking "all-syslog.sh" eid=3021 pid=336
  Finished "(null)" eid=0 pid=336 time=0.002432s exit=0
  Invoking "history_event-zfs-list-cacher.sh" eid=3021 pid=347
  Invoking "all-debug.sh" eid=3022 pid=349
  Finished "history_event-zfs-list-cacher.sh" eid=3021 pid=347 time=0.001669s exit=0
  Finished "(null)" eid=0 pid=349 time=0.002404s exit=0
  Invoking "all-syslog.sh" eid=3022 pid=370
  Finished "(null)" eid=0 pid=370 time=0.002427s exit=0
  Invoking "history_event-zfs-list-cacher.sh" eid=3022 pid=391
  avl_find(tree, new_node, &where) == NULL
  ASSERT at ../../module/avl/avl.c:641:avl_add()
  Thread 1 "zed" received signal SIGABRT, Aborted.

By employing this wider lock,
we atomise [wait, remove] and [fork, add]:
slowing down the reaper thread now just causes some zombies to accumulate until it can get to them

Signed-off-by: Ahelenia Ziemiańska <[email protected]>
Closes openzfs#11963
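
To make the "wider lock" concrete, here is the shape of the locking the commit message describes, sketched against the zed_exec.c names rather than quoted from the patch: both the forking path and the reaper hold _launched_processes_lock across the whole pair of operations, so a child can no longer be reaped before its tree entry exists, and a stale entry can no longer be left behind to collide with a reused PID. Slowing the reaper down then only delays zombie reaping instead of corrupting the tree.

/* forking path (_zed_exec_fork_child), sketched */
(void) pthread_mutex_lock(&_launched_processes_lock);
pid = fork();
if (pid == 0) {
	/* child: exec the zedlet; never returns */
} else if (pid > 0) {
	node->pid = pid;
	node->eid = eid;
	avl_add(&_launched_processes, node);	/* entry exists before anyone can reap */
}
(void) pthread_mutex_unlock(&_launched_processes_lock);

/* reaper thread (_reap_children), sketched */
(void) pthread_mutex_lock(&_launched_processes_lock);
pid = wait4(0, &status, WNOHANG, &usage);
if (pid > 0) {
	search.pid = pid;
	node = avl_find(&_launched_processes, &search, NULL);
	if (node != NULL)
		avl_remove(&_launched_processes, node);
}
(void) pthread_mutex_unlock(&_launched_processes_lock);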
behlendorf pushed a commit to behlendorf/zfs that referenced this issue May 10, 2021
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Don Brady <[email protected]>
Signed-off-by: Ahelenia Ziemiańska <[email protected]>
Closes openzfs#11963
Closes openzfs#11965
sempervictus pushed a commit to sempervictus/zfs that referenced this issue May 31, 2021