fix(pubsub): set default stream ACK deadline to subscriptions' #9268
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Closes #9252.
This PR mitigates the reported issue by using the subscription's max ACK deadline when establishing a streaming pull, instead of the previous optimistic ACK deadline of 10 seconds.
(when creating a streaming pull, the ACK timing histogram has no data yet, thus its 99 % value defaults to 10 seconds... also see the commit message for some extra info)
Additionally, it adds a debug log message with the ACK deadline used for the streaming pull, as that important piece of information was missing previously.
How to test
Run the example from the issue description, and verify that the backend does not send any duplicate messages, and the subscription's ACK deadline is respected (300 s in that particular case).
Details
The cause of the bug was that the client overrode the max ACK deadline on the subscription with the default minimum deadline of 10 seconds.
This became problematic for the received messages that exceeded the
FlowControl.max_messages
setting, and were thus put on hold in a buffer. Since these messages were not lease-managed, and their ACK deadlines were not being extended, the overridden ACK deadline of 10 seconds was easily hit, resulting in re-delivery.There are several options on how to deal with these on hold messages:
Discard them immediately, and set their ACK deadlines to 0.
This would cause the backend to re-send them immediately, which would work well with multiple subscribers. However, if using a single subscriber, low
max_messages
setting, and processing each message takes a lot of time, large published message batches would cause a lot of unnecessary traffic.The excessive messages in a batch (those above
max_messages
) would be discarded and re-delivered over and over again, because each time the client would dispatch only a small portion of the batch to user's callbacks.Add all received messages to lease management, including those put on hold.
This would avoid ACK timeouts and message re-delivery, but could shoot the leaser's load well over
1.0
. When user code would process and acknowledge some of the messages, the client would remove them from the lease management, but the load could still stay above1.0
Since the leaser would still be overloaded, no additional messages would be dispatched to the callbacks, and the stream would not be resumed --> deadlock (there would be nothing to trigger resuming the streaming pull).
While there could be a way around that, it would likely require rewriting significant chunks of the streaming pull logic, risking new bugs.
Ignoring the messages on hold until the load allows processing them.
This is the path chosen. Only the messages that are not put on hold are added to the lease management, while the rest sit in the buffer, waiting to be released when the load allows for it. If they are released soon enough, the auto-lease management steps in.
While these messages could still remain in the buffer for too long, missing the ACK deadline, those messages would be re-delivered to other subscribers, as the overloaded subscriber's stream would be paused in the meantime (the subscriber does not resume the stream, if it has any messages on hold).
In the case of a single subscriber client, the messages would indeed be re-delivered to it, but on the other hand the ACK histogram times would pick that up and gradually extend the ACK deadlines that the auto-leaser sets for the messages.