fix memory queue stuck in removed state edge case #5388

Merged
merged 1 commit into from
Apr 3, 2023

bdoyle0182
Contributor

Description

There is an edge case in the queue manager / memory queue: when the memory queue transitions from Idle to Removed, the action can get stuck indefinitely if it starts receiving activations again. On the Idle to Removed transition, the MemoryQueue never sends the QueueRemoved message to the QueueManager, and that message is what removes the actor's entry from the QueuePool trie map in the QueueManager. While the entry remains in QueuePool, the manager can still forward activations to the child memory queue FSM. Receiving QueueRemoved is also what makes the QueueManager send QueueRemovedCompleted back to the MemoryQueue FSM, which is what actually makes the memory queue stop itself. Since none of this happens when transitioning from Idle to Removed, the Removed state depends on the StateTimeout to stop the actor, and only then is QueueRemoved sent to the parent so that the entry is removed from the QueuePool. Note that the etcd steps that remove the queue keys all succeed and the queue manager acknowledges them with WatchEndpointRemoved, but that does not remove the entry from QueuePool.
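The handshake described above can be modelled without Akka. This is a minimal sketch under stated assumptions, not the OpenWhisk implementation: only the message names QueueRemoved and QueueRemovedCompleted and the QueuePool trie map come from the description, while `Manager`, `Queue`, `transitionToRemoved` and the string action name are hypothetical stand-ins.

```scala
import scala.collection.concurrent.TrieMap

object QueuePoolSketch {
  // Message names follow the PR description; everything else is a simplified stand-in.
  sealed trait Msg
  case object QueueRemoved extends Msg
  case object QueueRemovedCompleted extends Msg

  // Hypothetical stand-in for the memory queue FSM.
  final class Queue(val action: String) {
    var stopped = false
    def receive(msg: Msg): Unit = msg match {
      case QueueRemovedCompleted => stopped = true // only this acknowledgement stops the queue
      case _                     => ()
    }
  }

  // Hypothetical stand-in for the queue manager.
  final class Manager {
    val queuePool = TrieMap.empty[String, Queue]

    def register(q: Queue): Unit = queuePool(q.action) = q

    // The pool entry is removed, and the acknowledgement sent, only when QueueRemoved arrives.
    def receive(from: Queue, msg: Msg): Unit = msg match {
      case QueueRemoved =>
        queuePool.remove(from.action)
        from.receive(QueueRemovedCompleted)
      case _ => ()
    }
  }

  // Returns (entry still in pool, queue stopped) after a transition to Removed.
  def transitionToRemoved(notifyManager: Boolean): (Boolean, Boolean) = {
    val manager = new Manager
    val queue = new Queue("ns/action")
    manager.register(queue)
    if (notifyManager) manager.receive(queue, QueueRemoved)
    (manager.queuePool.contains("ns/action"), queue.stopped)
  }
}
```

With `notifyManager = false`, which corresponds to the Idle to Removed path described above, the entry stays in the pool and the queue is not stopped, so the manager would keep routing activations to it.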

Here is where things get catastrophic. Because the FSM actor's entry remains in the QueuePool trie map and the memory queue never sends QueueRemoved, activations keep being sent to the memory queue if a new activation arrives within the window before the Removed state times out (five seconds in the default configuration). Akka FSM state timeouts work such that the timeout message is sent only if no message is received; the timer is reset every time a new message arrives. In the Removed state, an incoming activation is forwarded back to the queue manager, which just sends it back to the memory queue, creating a cycle that lasts until the activation times out. If the action has a multi-minute gap between activations, it self-heals: once the activation times out and no more come in, the Removed state times out five seconds later. If such a gap never occurs, the action stays stuck and never executes activations until the service is restarted. This makes the bug particularly hard to track down, because sometimes it self-heals and sometimes it stays stuck forever.
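The timer-reset behaviour can be illustrated with a small model that does not use Akka. It is a sketch only: `firesAt`, its parameters and the millisecond timeline are hypothetical, and the queue is assumed to enter the Removed state at time 0.

```scala
object InactivityTimeoutSketch {
  // Models a timeout that fires only after `timeoutMs` without any message:
  // every arrival restarts the countdown. Returns the time at which the
  // timeout fires, or None if it has not fired by `horizonMs`.
  def firesAt(arrivalsMs: Seq[Long], timeoutMs: Long, horizonMs: Long): Option[Long] = {
    // The countdown starts at time 0 (state entry) and restarts on every arrival.
    val restarts = 0L +: arrivalsMs.sorted
    // Pair each restart with the next arrival; the last one has no successor.
    val nextEvents = restarts.tail :+ Long.MaxValue
    restarts.zip(nextEvents).collectFirst {
      case (start, next) if start + timeoutMs < next && start + timeoutMs <= horizonMs =>
        start + timeoutMs
    }
  }
}
```

For example, with a 5000 ms timeout and a message arriving every 4000 ms, `firesAt` returns `None` for any horizon covered by the arrivals, which is the stuck case; with no arrivals at all it returns `Some(5000)`, which is the self-healing case.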

The Flushing state already accounts for the StateTimeout timer being reset on each new message, so I've added the same safeguard here to guarantee self-recovery: the queue will now definitely shut down after the stop grace time. That is only a safeguard, though. I think there is still a remaining issue: QueueRemoved needs to be sent to the QueueManager somewhere on the transition from Idle to Removed, not only once the Removed state times out; otherwise the Idle -> Removed transition will always rely on the StateTimeout for the FSM to stop itself.
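The difference between relying on the state timeout and arming a timer once on entering the state can be sketched as follows. This is a plain Scala model under assumptions, not the code from this PR: `stopTimeMs` and its parameters are hypothetical, and the queue is assumed to enter Removed at time 0. In an Akka classic FSM the fixed variant would typically be a named timer (for example one started with `startSingleTimer`), which is an assumption here since the description does not show the change itself.

```scala
object RemovedSafeguardSketch {
  // Hypothetical model: returns the time (ms) at which the queue stops itself
  // after entering Removed at time 0 and seeing activations at `activationsMs`.
  def stopTimeMs(activationsMs: Seq[Long], timeoutMs: Long, fixedTimer: Boolean): Long =
    if (fixedTimer) {
      // Safeguard: the timer is armed once on entry and arrivals never move it.
      timeoutMs
    } else {
      // Inactivity timeout: each arrival before the deadline pushes the deadline out.
      activationsMs.sorted.foldLeft(timeoutMs) { (deadline, t) =>
        if (t < deadline) t + timeoutMs else deadline
      }
    }
}
```

With one activation per second for ten minutes and a 5000 ms timeout, the inactivity variant stops only at 605000 ms, after the traffic ends, while the fixed variant stops at 5000 ms regardless of traffic.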

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Scheduler
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@bdoyle0182 bdoyle0182 requested a review from style95 March 25, 2023 19:29
@@ -456,6 +456,9 @@ class MemoryQueue(private val etcdClient: EtcdClient,

// This is not supposed to happen. This will ensure the queue does not run forever.

@style95 I don't think this is actually true. There is no context.parent ! queueRemovedMsg when the Idle state times out, which happens fairly frequently, so QueueRemoved is not sent to the manager until the Removed state times out, and that timeout wasn't guaranteed.

@codecov-commenter

codecov-commenter commented Mar 25, 2023

Codecov Report

Merging #5388 (6747370) into master (60ca660) will decrease coverage by 0.28%.
The diff coverage is 50.00%.

❗ Current head 6747370 differs from pull request most recent head 768a3cb. Consider uploading reports for the commit 768a3cb to get more accurate results

@@            Coverage Diff             @@
##           master    #5388      +/-   ##
==========================================
- Coverage   76.91%   76.64%   -0.28%     
==========================================
  Files         240      240              
  Lines       14588    14590       +2     
  Branches      629      624       -5     
==========================================
- Hits        11221    11182      -39     
- Misses       3367     3408      +41     
Impacted Files Coverage Δ
...e/openwhisk/core/scheduler/queue/MemoryQueue.scala 82.28% <50.00%> (-0.12%) ⬇️

... and 10 files with indirect coverage changes

Member

@style95 style95 left a comment

LGTM

@bdoyle0182 bdoyle0182 merged commit fedf022 into apache:master Apr 3, 2023
mtt-merz pushed a commit to mtt-merz/openwhisk that referenced this pull request Oct 22, 2023
Co-authored-by: Brendan Doyle <[email protected]>
(cherry picked from commit fedf022)

3 participants