-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
(2.12) NRG: Don't respond success to catchup messages & ignore response if not leader #6944
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c10a3e9
to
648bf8f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - One minor naming nit
2a943cf
to
20a5d6f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM in principle but TestNRGSnapshotAndTruncateToApplied
looks unhappy.
Yep, seems some thinks broke since the last push. Put in draft for now. |
20a5d6f
to
344604c
Compare
344604c
to
7fa9e1b
Compare
Signed-off-by: Maurice van Veen <[email protected]>
Signed-off-by: Maurice van Veen <[email protected]>
7fa9e1b
to
a7b6481
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Desync was observed in the Raft layer in the following scenario:
term=1
and proposes 100 items,n.pterm=1
.term=2
and proposes 1 item, stores it in its log but it doesn't get quorum,n.pterm=2
.term=3
, catches up Server 3 for its missing messages from step 1 and gets quorum/commits them. Server 3 does not know yet these can be committed.n.pterm=2, n.pindex=51
and Server 3 allows it to become leader because it hasn.pterm=1, n.pindex=100
. Server 3 votes because thepterm
is higher, even though thepindex
is lower.This happened because Server 3 provided quorum to Server 1 during catchup. The issue is resolved if we simply don't respond back to Server 1 and make it commit those entries. We need to fully catchup, and only afterward can we participate in achieving quorum on those entries.
This fixes some cases of desync, for example stream desync with
last sequence mismatch
.Signed-off-by: Maurice van Veen [email protected]