Skip to content

(2.12) NRG: Don't respond success to catchup messages & ignore response if not leader #6944

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jun 6, 2025

Conversation

MauriceVanVeen
Copy link
Member

@MauriceVanVeen MauriceVanVeen commented Jun 5, 2025

Desync was observed in the Raft layer in the following scenario:

  1. Server 1 becomes leader for term=1 and proposes 100 items, n.pterm=1.
  2. Server 2 and 3 only receive the first 50 items and store them in their logs. Server 1 will not know yet these can be committed.
  3. Server 2 becomes leader for term=2 and proposes 1 item, stores it in its log but it doesn't get quorum, n.pterm=2.
  4. Server 1 becomes leader for term=3, catches up Server 3 for its missing messages from step 1 and gets quorum/commits them. Server 3 does not know yet these can be committed.
  5. Server 2 only has 51 items in its log, the initial 50 items from Server 1, and the one item it tried to propose. It tries to become leader with n.pterm=2, n.pindex=51 and Server 3 allows it to become leader because it has n.pterm=1, n.pindex=100. Server 3 votes because the pterm is higher, even though the pindex is lower.
  6. Server 2 truncates part of Server 3's log (that Server 1 already committed). This results in desync.

This happened because Server 3 provided quorum to Server 1 during catchup. The issue is resolved if we simply don't respond back to Server 1 and make it commit those entries. We need to fully catchup, and only afterward can we participate in achieving quorum on those entries.

This fixes some cases of desync, for example stream desync with last sequence mismatch.

Signed-off-by: Maurice van Veen [email protected]

@MauriceVanVeen MauriceVanVeen requested a review from a team as a code owner June 5, 2025 14:43
@MauriceVanVeen MauriceVanVeen force-pushed the maurice/nrg-reelect-desync branch from c10a3e9 to 648bf8f Compare June 5, 2025 14:44
@MauriceVanVeen MauriceVanVeen changed the title NRG: Truncate down to committed on step down NRG: Truncate non-quorum entries from leader's term on step down Jun 5, 2025
@MauriceVanVeen MauriceVanVeen changed the title NRG: Truncate non-quorum entries from leader's term on step down (2.12) NRG: Truncate non-quorum entries from leader's term on step down Jun 5, 2025
Copy link
Member

@derekcollison derekcollison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - One minor naming nit

@MauriceVanVeen MauriceVanVeen force-pushed the maurice/nrg-reelect-desync branch 2 times, most recently from 2a943cf to 20a5d6f Compare June 5, 2025 15:59
Copy link
Member

@neilalexander neilalexander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM in principle but TestNRGSnapshotAndTruncateToApplied looks unhappy.

@MauriceVanVeen MauriceVanVeen marked this pull request as draft June 5, 2025 16:50
@MauriceVanVeen
Copy link
Member Author

Yep, seems some thinks broke since the last push. Put in draft for now.

@MauriceVanVeen MauriceVanVeen force-pushed the maurice/nrg-reelect-desync branch from 20a5d6f to 344604c Compare June 5, 2025 20:13
@MauriceVanVeen MauriceVanVeen changed the title (2.12) NRG: Truncate non-quorum entries from leader's term on step down (2.12) NRG: Don't respond success to catchup messages & ignore response if not leader Jun 5, 2025
@MauriceVanVeen MauriceVanVeen force-pushed the maurice/nrg-reelect-desync branch from 344604c to 7fa9e1b Compare June 5, 2025 20:16
@MauriceVanVeen MauriceVanVeen force-pushed the maurice/nrg-reelect-desync branch from 7fa9e1b to a7b6481 Compare June 6, 2025 09:28
@MauriceVanVeen MauriceVanVeen marked this pull request as ready for review June 6, 2025 10:14
Copy link
Member

@neilalexander neilalexander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@neilalexander neilalexander merged commit a1f267d into main Jun 6, 2025
69 of 70 checks passed
@neilalexander neilalexander deleted the maurice/nrg-reelect-desync branch June 6, 2025 10:23
neilalexander added a commit that referenced this pull request Jun 17, 2025
Includes the following:

- #6922
- #6931
- #6933
- #6934
- #6939
- #6938
- #6940
- #6941
- #6942
- #6943
- #6945
- #6944
- #6947
- #6948
- #6949
- #6956
- #6960
- #6961
- #6951
- #6965
- #6968
- #6981
- #6983
- #6984

Signed-off-by: Neil Twigg <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants