Detect a slow raidz child during reads #17227

Open
wants to merge 6 commits into master from raidz-detect-slow-outlier

Conversation

@robn robn (Member) commented Apr 8, 2025

Motivation and Context

Replacing #16900, which was almost finished with review updates but has stalled. I've been asked to take it over.

See original PR for details.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

A single slow responding disk can affect the overall read
performance of a raidz group.  When a raidz child disk is
determined to be a persistent slow outlier, then have it
sit out during reads for a period of time. The raidz group
can use parity to reconstruct the data that was skipped.

Each time a slow disk is placed into a sit out period, its
`vdev_stat.vs_slow_ios` count is incremented and a zevent of
class `ereport.fs.zfs.delay` is posted.

The length of the sit out period can be changed using the
`raidz_read_sit_out_secs` module parameter.  Setting it to
zero disables slow outlier detection.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Don Brady <[email protected]>
Also added a test from Tony.

Signed-off-by: Don Brady <[email protected]>
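
As an illustration of the mechanism described above, here is a minimal sketch of the bookkeeping performed when a child is placed into a sit-out period. It is not the patch's actual code: the helper name vdev_raidz_sit_out_child is hypothetical, while vdev_read_sit_out_expire, vs_slow_ios, and the raidz_read_sit_out_secs tunable are names taken from this PR's discussion.

/*
 * Hypothetical sketch only: place a persistently slow raidz child
 * into a read sit-out period, as the commit message describes.
 */
static void
vdev_raidz_sit_out_child(vdev_t *cvd)
{
	/* A sit-out length of zero disables slow-outlier detection. */
	if (raidz_read_sit_out_secs == 0)
		return;

	/* Reads will skip this child until the expiry time passes. */
	cvd->vdev_read_sit_out_expire = gethrestime_sec() +
	    raidz_read_sit_out_secs;

	/* Account the event in the child's vdev stats ... */
	mutex_enter(&cvd->vdev_stat_lock);
	cvd->vdev_stat.vs_slow_ios++;
	mutex_exit(&cvd->vdev_stat_lock);

	/* ... and post an ereport.fs.zfs.delay zevent (call omitted). */
}
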
@robn robn (Member, Author) commented Apr 8, 2025

Because there will be a little bouncing between two PRs, and because there's two different authors involved, I'll be pushing fixup commits to this branch. Once everyone is happy with review, I will squash them down for merge.

I'll close out the remaining review comments on #16900, and would like it if new comments could be added here. Thanks all for your patience; I know it's a bit fiddly (it'd be nicer if GitHub would allow a PR to change branches, alas).

don-brady and others added 4 commits April 8, 2025 11:53
Also changed
'raidz_read_sit_out_secs' to 'vdev_read_sit_out_secs'

Signed-off-by: Don Brady <[email protected]>
Review feedback.

Signed-off-by: Rob Norris <[email protected]>
@robn robn force-pushed the raidz-detect-slow-outlier branch from 905c466 to 4a1a3d3 on April 8, 2025 04:15
@amotin amotin added the "Status: Code Review Needed" (Ready for review and testing) label on Apr 8, 2025
@tonyhutter tonyhutter (Contributor) commented:

I haven't looked into it, but I see all the FreeBSD runners have:

Tests with results other than PASS that are unexpected:
    SKIP events/setup (expected PASS)
    SKIP events/slow_vdev_sit_out (expected PASS)

@tonyhutter tonyhutter (Contributor) commented May 15, 2025

@robn this fixed the FreeBSD CI errors for me: tonyhutter@a491397. Try squashing that in and rebasing.

@behlendorf behlendorf (Contributor) left a comment

Thanks for picking up this work! It'd be great if we can refine this a bit more and get it integrated.

@@ -1589,6 +1599,7 @@ function create_pool #pool devs_list

if is_global_zone ; then
[[ -d /$pool ]] && rm -rf /$pool
echo zpool create -f $pool $@

Why the extra "echo" when log_must already logs this? This looks like some stray debugging which should be dropped.

for raidtype in raidz raidz2 raidz3 draid1 draid2 draid3 ; do
log_must zpool create $TESTPOOL2 $raidtype $TEST_BASE_DIR/vdev.$$.{0..9}
log_must dd if=/dev/urandom of=/$TESTPOOL2/bigfile bs=1M count=100
log_must zpool export $TESTPOOL2

Let's add a comment above this zpool export / zpool import to indicate this is done to flush the ARC.

fi
done

log_must test "$(get_vdev_prop sit_out $TESTPOOL2 $BAD_VDEV)" == "on"

Let's close any potential race here and test against $sit_out from above instead of reading it from the pool again.

log_must test "$sit_out" == "on"

Moreover, it looks like all three test cases implement a version of this. It could be implemented as a function and moved to redundancy/redundancy.kshlib; at a minimum, though, let's make sure they're all implemented the same way.

# Wait for us to exit our sit out period
log_must wait_sit_out $TESTPOOL2 $BAD_VDEV 10

log_must test "$(get_vdev_prop sit_out $TESTPOOL2 $BAD_VDEV)" == "off"

This check looks redundant since wait_sit_out returns an error if the timeout is reached.

save_tunable READ_SIT_OUT_SECS
set_tunable32 READ_SIT_OUT_SECS 5

log_must truncate -s 150M $TEST_BASE_DIR/vdev.$$.{0..9}

nit: It shouldn't matter but it'd be nice to mix up the geometry here at least a little bit. Say use a vdev count in the range of 4-10 and then randomly select one of the vdevs to be bad.

@@ -2745,6 +2846,156 @@ vdev_raidz_worst_error(raidz_row_t *rr)
return (error);
}

/*
* Find the median value from a set of n values

 * Find the median value from a set of n values.  For sets of an even size
 * and no exact middle value, the average of the two middle values is used.
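
To make the suggested wording concrete, the median over an already-sorted sample set behaves roughly as below; this is a sketch with a hypothetical name, not the PR's actual helper.

/*
 * Sketch: median of n sorted values.  For an even-sized set with no
 * exact middle value, the two middle values are averaged, per the
 * comment suggested above.
 */
static uint64_t
latency_median_sketch(const uint64_t *sorted, size_t n)
{
	if (n % 2 == 1)
		return (sorted[n / 2]);
	return ((sorted[n / 2 - 1] + sorted[n / 2]) / 2);
}
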

/*
* Check for any latency outlier from latest set of child reads.
*
* Uses a Tukey's fence, with K = 2, for detecting extreme outliers. This

@don-brady was K=2 experimentally determined to be a good value for the fence? K=3 is usually used to characterize extreme outliers, but perhaps that didn't catch enough? It'd be nice to have some more visibility into this so we could collect some useful statistics from real systems with problem drives.
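
For reference, an upper Tukey fence over the sorted samples works out to Q3 + K * (Q3 - Q1). The sketch below uses K = 2 with simplified (non-interpolating) quartile indexing; it is illustrative only, not the PR's exact latency_quartiles_fence() implementation.

/*
 * Rough sketch of an upper Tukey fence over sorted latency samples:
 * fence = Q3 + K * (Q3 - Q1).  A larger K raises the fence and flags
 * fewer samples, so K = 2 catches more outliers than the usual K = 3.
 */
#define	FENCE_K	2

static uint64_t
latency_fence_sketch(const uint64_t *sorted, size_t n)
{
	uint64_t q1 = sorted[n / 4];		/* ~25th percentile */
	uint64_t q3 = sorted[(3 * n) / 4];	/* ~75th percentile */

	return (q3 + FENCE_K * (q3 - q1));
}
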


qsort((void *)lat_data, samples, sizeof (uint64_t), latency_compare);
uint64_t fence = latency_quartiles_fence(lat_data, samples);
if (lat_data[samples - 1] > fence) {

We should be able to sit out multiple drives. What I think we want is to identify the worst outliers here (up to the parity level) and set ->vdev_read_sit_out_expire on them. Then, in the vdev_*_io_start_read* functions, we make the last-minute decision about whether we need to issue an IO to this child. It's only when issuing the IO that we can fully take into account failed drives, known missing data, scrub/resilver IOs, IO errors, etc. Setting ->vdev_read_sit_out_expire should be considered additional advisory information, and we can (and should) read this device if we need to.
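
A sketch of the advisory check described above, as it might be applied at the moment the child read is issued. Only vdev_read_sit_out_expire is taken from the PR; the function and parameter names here are placeholders.

/*
 * Sketch: decide at IO-issue time whether to honor a child's sit-out.
 * The sit-out is advisory and is ignored whenever skipping the child
 * would leave the row without enough redundancy to reconstruct it.
 */
static boolean_t
raidz_child_should_sit_out(vdev_t *cvd, int cols_we_can_still_lose)
{
	/* No active sit-out period: issue the read normally. */
	if (cvd->vdev_read_sit_out_expire == 0 ||
	    gethrestime_sec() >= cvd->vdev_read_sit_out_expire)
		return (B_FALSE);

	/*
	 * Failed drives, known-missing data, scrub/resilver reads, and
	 * prior IO errors all consume redundancy; only skip this child
	 * if the row can still be reconstructed without it.
	 */
	return (cols_we_can_still_lose > 0);
}
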

# Test file size in MB
count=200

for type in "raidz1" "raidz2" "raidz3" ; do

We should be testing draid[1-3] here as well. If the test run time gets too bad, one trick we play in other tests is to randomly pick one of these to test. We have enough CI test runners that we still get good coverage.

# 1. Create various raidz/draid pools
# 2. Inject delays into one of the disks
# 3. Verify disk is set to 'sit out' for awhile.
# 4. Wait for READ_SIT_OUT_SECS and verify sit out state is lifted.

We really should also include some test coverage for raidz[2,3] and draid[2-3] pools with failed drives.

Labels
Status: Code Review Needed (Ready for review and testing)

5 participants