-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
[FIXED] Desync after partial catchup from old leader #6943
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
server/jetstream_cluster.go
Outdated
mset.mu.RLock() | ||
lseq := mset.lseq | ||
mset.mu.RUnlock() | ||
if lseq == snap.LastSeq { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if it can happen but I'm wondering if >=
is safer here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
d0f57b3
to
994c846
Compare
Had to remove |
This feels like a step backward though. And I am ok with removing tests, but for very good reasons, this feels like we should sort this out before merging this PR.
|
Signed-off-by: Maurice van Veen <[email protected]>
Re-added The outdated server now doesn't respond at all. And if it's leader still, it steps down. That fixes the above test, because it allows for a new leader to be elected and resolve the desync/make the server current. (This is still assuming a non-desynced leader can be elected.) That makes our current behavior better, instead of knowingly remove data from the second server. We recognize the current leader has lost upper layer state and needs to step down. |
994c846
to
97a807e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
When catching up based on a JetStream stream snapshot, the server would not confirm upon receiving EOF that it had received all the data expected by the snapshot. This could result in an old leader responding with EOF too early, the server accepting this as the end of catchup, continuing, and running into
last sequence mismatch
, resetting replication state.Receiving catchup messages from any other server, although less than ideal, should not result in issues. All servers should have agreed on the message contents at any given sequence. The server should recognize it was partially caught up, and retry.
Signed-off-by: Maurice van Veen [email protected]