Zpool can start allocating from metaslab before TRIMs have completed #15395
Merged
Conversation
amotin approved these changes on Oct 11, 2023
behlendorf approved these changes on Oct 11, 2023
ikozhukhov approved these changes on Oct 11, 2023
@jasonbking can you force update this PR and add your |
When doing a manual TRIM on a zpool, the metaslab being TRIMmed is potentially re-enabled before all queued TRIM zios for that metaslab have completed. Since TRIM zios have the lowest priority, it is possible to get into a situation where allocations occur from the just re-enabled metaslab and cut ahead of queued TRIMs to the same metaslab. If the ranges overlap, this will cause corruption.

We were able to trigger this pretty consistently with a small single top-level vdev zpool (i.e. small number of metaslabs) with heavy parallel write activity while performing a manual TRIM against a somewhat 'slow' device (so TRIMs took a bit of time to complete). With the patch, we've not been able to recreate it since.

It was on illumos, but inspection of the OpenZFS trim code looks like the relevant pieces are largely unchanged and so it appears it would be vulnerable to the same issue. The illumos bug for this is illumos#15939.

Signed-off-by: Jason King <[email protected]>
behlendorf pushed a commit that referenced this pull request on Oct 12, 2023
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Jason King <[email protected]>
Illumos-issue: https://www.illumos.org/issues/15939
Closes #15395
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request on Nov 8, 2023
defaziogiancarlo pushed a commit to LLNL/zfs that referenced this pull request on Nov 17, 2023
tonyhutter pushed a commit that referenced this pull request on Nov 30, 2023
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request on Dec 12, 2023
Wait for all TRIM zios for a metaslab to complete before re-enabling it.
Motivation and Context
During some load testing on a small, single-disk zpool, we encountered consistent pool corruption when running multiple parallel write workloads while performing a manual TRIM (zpool trim) on the pool.
Description
When doing a manual TRIM on a zpool, the metaslab being TRIMmed is potentially re-enabled before all queued TRIM zios for that metaslab have completed. Since TRIM zios have the lowest priority, it is possible to get into a situation where allocations occur from the just re-enabled metaslab and cut ahead of queued TRIMs to the same metaslab. If the ranges overlap, this will cause corruption.
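To make the ordering concrete, here is a minimal sketch of the vulnerable sequence. Every identifier in it (ms_alloc_disable, queue_trim_zio, and so on) is an illustrative stand-in rather than an actual OpenZFS symbol; it only shows the ordering described above.

```c
/*
 * Illustrative sketch only: every identifier below is a hypothetical
 * stand-in, not an actual OpenZFS or illumos symbol.
 */
typedef struct metaslab metaslab_t;       /* opaque metaslab handle */
typedef struct free_range free_range_t;   /* one free (allocatable) range */

void ms_alloc_disable(metaslab_t *);      /* exclude metaslab from allocation */
void ms_alloc_enable(metaslab_t *);       /* make it allocatable again */
free_range_t *ms_first_free(metaslab_t *);
free_range_t *ms_next_free(metaslab_t *, free_range_t *);
void queue_trim_zio(metaslab_t *, free_range_t *); /* async, lowest priority */

static void
trim_metaslab_racy(metaslab_t *msp)
{
	ms_alloc_disable(msp);

	/* Queue one asynchronous TRIM zio for each free range in the metaslab. */
	for (free_range_t *fr = ms_first_free(msp); fr != NULL;
	    fr = ms_next_free(msp, fr))
		queue_trim_zio(msp, fr);

	/*
	 * BUG: the metaslab is re-enabled while its TRIM zios may still be
	 * queued behind higher-priority I/O.  A concurrent write can now
	 * allocate and write a range whose TRIM has not yet reached the
	 * device; if that TRIM lands after the write, the newly written
	 * data is discarded.
	 */
	ms_alloc_enable(msp);
}
```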
This was discovered on illumos (illumos#15939); however, inspection of the OpenZFS codebase shows that the code in question is largely unchanged and appears vulnerable to the same corruption.
The fix is fairly simple: once TRIMs for all of the allocatable (i.e. free) ranges in a metaslab have been issued, wait for them to complete before re-enabling the metaslab and moving on to the next metaslab to TRIM.
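Below is a minimal sketch of that corrected ordering, assuming a per-metaslab count of in-flight TRIM zios protected by a mutex and condition variable. It reuses the illustrative stand-in names from the sketch above and does not reflect the exact symbols used by the patch.

```c
/*
 * Hypothetical names throughout; kmutex_t/kcondvar_t are the usual
 * illumos/OpenZFS kernel locking primitives.
 */
typedef struct metaslab metaslab_t;

typedef struct trim_state {
	kmutex_t	ts_lock;
	kcondvar_t	ts_cv;
	uint64_t	ts_inflight;	/* TRIM zios issued but not yet completed */
} trim_state_t;

void ms_alloc_disable(metaslab_t *);
void ms_alloc_enable(metaslab_t *);
/* Queues a TRIM zio for every free range, bumping ts_inflight once per zio;
 * each zio calls trim_zio_done() when it completes. */
void issue_trim_zios(metaslab_t *, trim_state_t *);

/* Completion callback for each TRIM zio: drop the count and wake the waiter. */
static void
trim_zio_done(trim_state_t *ts)
{
	mutex_enter(&ts->ts_lock);
	ts->ts_inflight--;
	cv_broadcast(&ts->ts_cv);
	mutex_exit(&ts->ts_lock);
}

static void
trim_metaslab_fixed(metaslab_t *msp, trim_state_t *ts)
{
	ms_alloc_disable(msp);

	issue_trim_zios(msp, ts);

	/*
	 * The fix: do not hand the metaslab back to the allocator until
	 * every TRIM zio issued against it has completed, so that later
	 * allocations cannot overlap a still-pending TRIM.
	 */
	mutex_enter(&ts->ts_lock);
	while (ts->ts_inflight != 0)
		cv_wait(&ts->ts_cv, &ts->ts_lock);
	mutex_exit(&ts->ts_lock);

	ms_alloc_enable(msp);
}
```

Only the TRIM thread blocks on this wait; writers are unaffected beyond the metaslab remaining disabled slightly longer while its queued TRIMs drain.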
How Has This Been Tested?
Running the same workload (multiple write streams to a small, single-disk zpool while running TRIM) that had consistently (and quickly) produced corruption no longer did so once the change was applied.
Types of changes
Checklist:
Signed-off-by.