[FIXED] Stuck consumer after leader change #6469

Merged
merged 1 commit into from
Feb 7, 2025

MauriceVanVeen
Member

When a client requests messages, o.deliverMsg does the following:

	// Update delivered first.
	o.updateDelivered(dseq, seq, dc, ts)

	// Send message.
	o.outq.send(pmsg)

o.updateDelivered needs to establish quorum so that all servers know the message was delivered, and this can fail. The message is sent regardless, so the client can receive messages that a new leader does not know were ever delivered.

Once a new leader is elected and receives an ACK for a message it doesn't know was delivered, it would move o.sseq ahead. This is incorrect: since the message is not in o.pending, the ack floors are not updated. And if any messages before the acked one were not acknowledged or NAK-ed, those messages would never be redelivered, resulting in the stuck consumer symptom.

Signed-off-by: Maurice van Veen [email protected]
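
The failure mode described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `toyConsumer`, `ackBuggy`, `ackStrict`, and `skipped` are hypothetical names invented for this illustration, the state is reduced to three fields named after `o.sseq`, `o.pending`, and the ack floor, and ignoring the unknown ACK is just one possible alternative, not necessarily what this PR does.

```go
package main

import "fmt"

// toyConsumer is a hypothetical, heavily simplified model of consumer state.
// It is NOT the nats-server implementation.
type toyConsumer struct {
	sseq     uint64          // next stream sequence to deliver
	ackFloor uint64          // everything at or below this is acknowledged
	pending  map[uint64]bool // delivered but not yet acknowledged
}

// newToyConsumer models a freshly elected leader that never learned about
// any deliveries made by the previous leader.
func newToyConsumer() *toyConsumer {
	return &toyConsumer{sseq: 1, pending: make(map[uint64]bool)}
}

// ackKnown removes a pending message and advances the ack floor over every
// sequence that is no longer pending.
func (c *toyConsumer) ackKnown(seq uint64) {
	delete(c.pending, seq)
	for c.ackFloor+1 < c.sseq && !c.pending[c.ackFloor+1] {
		c.ackFloor++
	}
}

// ackBuggy models the problematic behavior: an ACK for an unknown message
// moves sseq ahead, but the ack floor is left untouched.
func (c *toyConsumer) ackBuggy(seq uint64) {
	if !c.pending[seq] {
		if seq >= c.sseq {
			c.sseq = seq + 1
		}
		return
	}
	c.ackKnown(seq)
}

// ackStrict is an assumed alternative: an ACK for an unknown message leaves
// all state untouched.
func (c *toyConsumer) ackStrict(seq uint64) {
	if !c.pending[seq] {
		return
	}
	c.ackKnown(seq)
}

// skipped lists sequences that are below sseq (never delivered fresh), not
// pending (never redelivered), and above the ack floor.
func (c *toyConsumer) skipped() []uint64 {
	var out []uint64
	for seq := c.ackFloor + 1; seq < c.sseq; seq++ {
		if !c.pending[seq] {
			out = append(out, seq)
		}
	}
	return out
}

func main() {
	buggy, strict := newToyConsumer(), newToyConsumer()
	buggy.ackBuggy(3) // the old leader delivered 1..3; the client acks only 3
	strict.ackStrict(3)
	fmt.Println("buggy: ", buggy.sseq, buggy.skipped())
	fmt.Println("strict:", strict.sseq, strict.skipped())
}
```

In this model the buggy path leaves sequences stranded between the ack floor and `sseq`, while the strict path keeps `sseq` where it was, so the same messages remain eligible for delivery.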

@MauriceVanVeen MauriceVanVeen requested a review from a team as a code owner February 7, 2025 11:12
Member

@neilalexander neilalexander left a comment

LGTM!

@derekcollison derekcollison merged commit cd93fef into main Feb 7, 2025
5 checks passed
@derekcollison derekcollison deleted the maurice/stuck-consumer-after-leader-change branch February 7, 2025 15:22
neilalexander added a commit that referenced this pull request Feb 10, 2025
Includes the following:

- #6465
- #6464
- #6469
- #6471
- #6472
- #6474
- #6477
- #6480
- #6487
- #6488

Signed-off-by: Neil Twigg <[email protected]>
neilalexander added a commit that referenced this pull request Apr 17, 2025
Related to #6469, about the
following code:
```go
	// Update delivered first.
	o.updateDelivered(dseq, seq, dc, ts)

	// Send message.
	o.outq.send(pmsg)
```

`o.updateDelivered` requires proposing delivered state through Raft, and
even if the proposal fails, we immediately send the message to the client.
This is great for performance, but really bad for properly replicating
this piece of data. Before the aforementioned PR there would be a
bunch of nasty side effects, such as stuck consumers and perceived data
loss through missed redeliveries, because clients could get messages
that a new leader wouldn't know about if proposals failed.

The core issue is that we should only send the message AFTER we have
quorum on updating delivered state. Otherwise the following could
happen: the message gets sent to the client, the `updateDelivered`
proposal fails, the leader changes, and `AckSync` will now time out
indefinitely, even with retries, because the new leader doesn't know
this message was even delivered.

Signed-off-by: Maurice van Veen <[email protected]>
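
The ordering argued for in the commit message above (send only after delivered state is replicated) can be sketched roughly as follows. This is a minimal illustration, not the server's code: `proposer`, `sender`, `deliverAfterQuorum`, and `errNoQuorum` are hypothetical names, and the proposal is modeled as a synchronous call that returns an error, which is a simplifying assumption about how replication through Raft would report failure.

```go
package main

import (
	"errors"
	"fmt"
)

// proposer is a hypothetical stand-in for replicating delivered state.
// It returns nil only once the update is assumed to have quorum.
type proposer func(seq uint64) error

// sender is a hypothetical stand-in for the outbound queue.
type sender func(seq uint64)

// errNoQuorum is an illustrative error for a failed proposal.
var errNoQuorum = errors.New("no quorum")

// deliverAfterQuorum sends the message only if the delivered-state proposal
// succeeded, so a client never holds a message the cluster does not know
// about.
func deliverAfterQuorum(seq uint64, propose proposer, send sender) error {
	if err := propose(seq); err != nil {
		return fmt.Errorf("delivered state for seq %d not replicated: %w", seq, err)
	}
	send(seq)
	return nil
}

func main() {
	var sent []uint64
	send := func(seq uint64) { sent = append(sent, seq) }
	ok := func(uint64) error { return nil }
	fail := func(uint64) error { return errNoQuorum }

	fmt.Println(deliverAfterQuorum(1, ok, send), sent)
	fmt.Println(deliverAfterQuorum(2, fail, send), sent)
}
```

The trade-off is the one the commit message names: waiting for quorum before sending costs latency, but a failed proposal can no longer leave a client holding a message that a new leader has no record of.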
3 participants