Skip to content

[FIXED] Lost sequences after hard kill in fs.removeMsgBlock of lmb #6778

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 11, 2025

Conversation

souravagrawal
Copy link
Contributor

Fixes an edge case where the stream state could reset during recovery in WorkQueue (WQ) streams.

In WQ streams, messages are removed immediately upon ACK. Once a message block becomes empty, NATS deletes the corresponding block file. After deleting the last message block (lmb), NATS also creates a new block file that carries the latest sequence and timestamp from lmb.

If NATS crashes between deleting the lmb and creating the new block file, on NATS reboot the recovery logic cannot restore the stream state due to the absence of any block files. This leads to a reset to sequence 0, potentially causing inconsistencies such as the consumer sequence being higher than the stream sequence.

This change updates the logic to first create the new block file before deleting the lmb. This ensures that there is always at least one valid block on disk, allowing for consistent recovery even if NATS is terminated unexpectedly during the transition.

PR fixes : #6600

Signed-off-by: Sourabh Agrawal [email protected]

@souravagrawal souravagrawal marked this pull request as ready for review April 9, 2025 13:58
@souravagrawal souravagrawal requested a review from a team as a code owner April 9, 2025 13:58
@shiv4289
Copy link

@souravagrawal Thank you for the patch. Can we put some debug logs when we delete and create new blk files please?

@shiv4289
Copy link

@neilalexander Can you please take a look at the code proposed here? We have a way to test this but it requires deleting blk file manually to trigger the recovery code path. Let us know if you have a better idea to write the test.

@MauriceVanVeen
Copy link
Member

MauriceVanVeen commented Apr 11, 2025

Thank you for the contribution!

Have opened #6789, to include some tests proving the server does the right thing with this behavior change.
We included some tests in this PR.

@neilalexander neilalexander reopened this Apr 11, 2025
@MauriceVanVeen MauriceVanVeen changed the title [FIXED] Consumer Sequence ahead of Stream Sequence [FIXED] Lost sequences after hard kill in fs.removeMsgBlock of lmb Apr 11, 2025
Copy link
Member

@neilalexander neilalexander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the contribution (and thanks @MauriceVanVeen for the tests)!

@neilalexander neilalexander merged commit 85bcba7 into nats-io:main Apr 11, 2025
22 checks passed
@sourabhaggrawal
Copy link

Thank you for merging this PR! I really appreciate the prompt review and also the addition of the test case — that was super helpful and thoughtful. 🙌

neilalexander added a commit that referenced this pull request Apr 17, 2025
Includes the following (already cherry-picked) PRs:

- #6587
- #6607
- #6612
- #6609
- #6620
- #6668
- #6674
- #6647
- #6684
- #6691
- #6697
- #6705
- #6706
- #6704
- #6714
- #6720
- #6727
- #6730
- #6726
- #6732
- #6759
- #6753
- #6685
- #6769
- #6777
- #6785
- #6786
- #6778
- #6790
- #6791
- #6798
- #6794
- #6801

Signed-off-by: Neil Twigg <[email protected]>

Signed-off-by: Neil Twigg <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consumer Sequence ahead of Stream Sequence
5 participants