fix memory queue stuck in removed state edge case #5388
Merged
Description
There is an edge case in the queue manager / memory queue: when the memory queue transitions from Idle to Removed, the action can get stuck indefinitely if it starts receiving activations again. This happens because, on the transition from Idle to Removed, the `QueueRemoved` message from the MemoryQueue to the QueueManager is never sent. That message is what removes the actor's entry from the `QueuePool` trie map in the QueueManager. If the entry remains in `QueuePool`, the manager can still forward activations to the child memory queue FSM. The QueueManager, on receiving `QueueRemoved`, is also responsible for sending `QueueRemovedCompleted` back to the MemoryQueue FSM, which is what actually makes the memory queue stop itself. Since this series of events never occurs when transitioning from Idle to Removed, the Removed state depends on the `StateTimeout` firing to actually stop the actor, which then sends the `QueueRemoved` message to the parent so that it is removed from the `QueuePool`. It's important to note that the whole etcd series of events to remove the queue keys succeeds and the queue manager acknowledges it with `WatchEndpointRemoved`, but that does not remove the entry from `QueuePool`.

Here is where things get catastrophic. Since the FSM actor entry will remain in the `QueuePool`
trie map and the memory queue never sends the `QueueRemoved` message, activations will continue to be sent to the memory queue if a new activation arrives within the five-second window that the default configuration allows before the removed state times out. Akka FSM state timeouts work such that the timeout message is only sent if no message is received; the timer is reset every time a new message arrives. When the removed state receives an activation, it forwards it back to the queue manager, which just sends it back to the memory queue, creating an indefinite cycle until the activation times out. If the action has a multi-minute gap between activations, it will self-heal: once the activation times out and no more come in, the state timeout fires after five seconds. However, if that gap never happens, the action remains stuck, never executing activations, until the service is restarted. This makes the bug particularly hard to track down, because sometimes it self-heals and sometimes it remains stuck forever.

This `StateTimeout` behavior of resetting the timer on each new message is already accounted for in the `Flushing` state, so I've added the same safeguard here to guarantee self-recovery and to ensure the queue shuts down correctly after the stop grace time. However, that is just a safeguard. I think there is still a remaining issue to solve: `QueueRemoved` needs to be sent to the `QueueManager` somewhere on the transition from `Idle` to `Removed`, not only once the `Removed` state times out; otherwise the `Idle` -> `Removed` transition will always rely on the `StateTimeout` for the FSM to stop itself.
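
To make the timer behavior described above concrete, here is a minimal sketch in plain Scala. It is a simplified model of an idle timeout that resets on every message, compared with a one-shot deadline armed on state entry; it is not Akka or OpenWhisk code. The object name, `stopTime`, and its parameters are hypothetical, and using the same five-second value for both the idle timeout and the fixed deadline is an assumption made for brevity.

```scala
// Simplified model of the Removed-state timeout, NOT Akka or OpenWhisk code.
// All names here are hypothetical and exist only for illustration.
object RemovedStateTimeoutSketch {
  // Mirrors the five-second window mentioned in the description (assumed for both timers).
  val timeoutMs = 5000L

  /** Time (ms since state entry) at which the queue stops, or None if it does
    * not stop within the simulated horizon.
    *
    * @param arrivals      sorted timestamps (ms) of activations reaching the state
    * @param horizonMs     end of the simulation
    * @param fixedDeadline when true, also arm a one-shot deadline on state entry
    *                      that incoming messages do not reset (the safeguard)
    */
  def stopTime(arrivals: Seq[Long], horizonMs: Long, fixedDeadline: Boolean): Option[Long] = {
    // Idle timeout: each message resets the timer, but only if it has not fired yet.
    val lastReset = arrivals.foldLeft(0L) { (last, t) =>
      if (t < last + timeoutMs) t else last
    }
    val idleStop = lastReset + timeoutMs

    // The safeguard adds a deadline measured from state entry, unaffected by messages.
    val candidates = idleStop +: (if (fixedDeadline) Seq(timeoutMs) else Nil)
    Some(candidates.min).filter(_ <= horizonMs)
  }

  def main(args: Array[String]): Unit = {
    // One activation per second for a minute: the idle timer never gets to fire.
    val steady: Seq[Long] = 1000L to 60000L by 1000L
    println(stopTime(steady, 60000L, fixedDeadline = false)) // None
    println(stopTime(steady, 60000L, fixedDeadline = true))  // Some(5000)
    // Two activations and then silence: the queue self-heals five seconds later.
    println(stopTime(Seq(1000L, 2000L), 60000L, fixedDeadline = false)) // Some(7000)
  }
}
```

The first case corresponds to the stuck action, the second to the safeguard, and the third to the self-healing case where activations stop arriving.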