Prevent cycle in the QueueManager #5332


Merged
merged 1 commit into apache:master on Oct 14, 2022
Conversation

@style95 (Member) commented Oct 10, 2022

Description

This is basically based on #5251.
I just added a small change to improve the performance.

Without the performance optimization:
(screenshot)

With the performance optimization:
(screenshot)

Related issue and scope

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Scheduler
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.


private def handleCycle(msg: ActivationMessage)(implicit transid: TransactionId): Unit = {
  val action = msg.action
  QueuePool.keys.find(_.docInfo.id == action.toDocId) match {
@style95 (Member, Author) commented on this hunk:
Now it tries to recover a queue iff there is no reference to it in the QueuePool.
Previously, it always sent a RecoverQueue message whenever the sender of an activation was the queue manager itself, even if a new queue had already been added to the queue pool.
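The guard above can be sketched without the Akka machinery as a lookup on the pool keys — a minimal illustration, assuming a simplified `QueuePool` modeled as a set of doc ids (names such as `CycleGuard` and `decide` are illustrative, not the actual QueueManager API):

```scala
// Minimal model of the guard: recover only when no queue for the
// action's doc id is already referenced in the pool.
object CycleGuard {
  // Hypothetical simplified pool: the set of doc ids with live queues.
  final case class QueuePool(docIds: Set[String])

  sealed trait Decision
  case object RecoverQueue extends Decision
  case object ReuseExistingQueue extends Decision

  // Recover iff there is no reference in the pool (the fix);
  // previously RecoverQueue was sent unconditionally on a cycle.
  def decide(pool: QueuePool, actionDocId: String): Decision =
    pool.docIds.find(_ == actionDocId) match {
      case Some(_) => ReuseExistingQueue
      case None    => RecoverQueue
    }
}
```

With this shape, an activation looping back to the same queue manager reuses the freshly created queue instead of issuing another RecoverQueue.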

@codecov-commenter commented Oct 10, 2022

Codecov Report

Attention: Patch coverage is 79.36508% with 13 lines in your changes missing coverage. Please review.

Project coverage is 76.69%. Comparing base (236ca5e) to head (2e2e640).
Report is 81 commits behind head on master.

Files with missing lines Patch % Lines
.../openwhisk/core/scheduler/queue/QueueManager.scala 79.03% 13 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (236ca5e) and HEAD (2e2e640).

HEAD has 22 fewer uploads than BASE:
BASE (236ca5e): 27 uploads, HEAD (2e2e640): 5 uploads
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5332      +/-   ##
==========================================
- Coverage   81.69%   76.69%   -5.01%     
==========================================
  Files         240      240              
  Lines       14264    14310      +46     
  Branches      598      591       -7     
==========================================
- Hits        11653    10975     -678     
- Misses       2611     3335     +724     


@ningyougang (Contributor) commented:

LGTM

@style95 (Member, Author) commented Oct 11, 2022

I found a regression in this change.
Please do not merge this.

@style95 style95 changed the title Prevent cycle in the QueueManager [WIP] Prevent cycle in the QueueManager Oct 11, 2022
@bdoyle0182 (Contributor) commented:

What is the regression?

@style95 (Member, Author) commented Oct 12, 2022

1. Disable the scheduler

[2022-10-11T09:31:48.791Z] [WARN] [#tid_BOjlyL4vneiwdeJc422InK80fwojMF0e] [FPCSchedulerServer] Scheduler is disabled

2. Shutting down the queue manager

[2022-10-11T09:31:48.802Z] [INFO] [#tid_sid_unknown] [QueueManager] Gracefully shutdown the queue manager

3. Shutting down the queue

[2022-10-11T09:31:48.805Z] [INFO] [#tid_sid_unknown] [MemoryQueue] [guest:guest/[email protected]:Running] Gracefully shutdown the memory queue.

4. Queue endpoints are removed

[2022-10-11T09:31:48.805Z] [INFO] [#tid_sid_unknown] [WatcherService] unwatch endpoint: UnwatchEndpoint(bmt-1480/queue/guest/guest/noop0/leader,false,data-management-service,true)
[2022-10-11T09:31:48.805Z] [INFO] [#tid_sid_unknown] [WatcherService] unwatch endpoint: UnwatchEndpoint(bmt-1480/throttling/guest,false,data-management-service,true)
[2022-10-11T09:31:48.805Z] [INFO] [#tid_sid_unknown] [WatcherService] unwatch endpoint: UnwatchEndpoint(bmt-1480/throttling/guest/guest/noop0,false,data-management-service,true)
[2022-10-11T09:31:48.805Z] [INFO] [#tid_sid_unknown] [QueueManager] Remove init revision map cause queue is removed, key: bmt-1480/queue/guest/guest/noop0/leader

5. A queue is in the removing state

It's waiting for buffered activations to be handled.

[2022-10-11T09:31:48.808Z] [INFO] [#tid_sid_unknown] [MemoryQueue] [guest:guest/[email protected]:Removing] Queue is going to stop but there are still 0 activations and 17 request buffered.
[2022-10-11T09:31:48.811Z] [INFO] [#tid_sid_unknown] [MemoryQueue] [guest:guest/[email protected]:Removing] This queue is shutdown by `/disable` api, do nothing here.

6. A new activation comes and recovers the queue

In the meantime, a new activation arrives and triggers recovery of the queue.

[2022-10-11T09:31:48.807Z] [INFO] [#tid_3168d300edca4fda4e2927be9dd35bf8] [QueueManager] Got activation message b6cec1514fa545828ec1514fa5558251 for Namespace(guest,23bc46b1-71f6-4ed5-8c54-816aa4f8c502)/guest/[email protected] from kafka.
[2022-10-11T09:31:48.807Z] [INFO] [#tid_3168d300edca4fda4e2927be9dd35bf8] [QueueManager] [b6cec1514fa545828ec1514fa5558251] the key bmt-1480/queue/guest/guest/noop0/leader is not in the initRevisionMap. revision: 15-5aa72ce8d0070b26994803ac8c22de81
[2022-10-11T09:31:48.807Z] [INFO] [#tid_85ddc3a977a53a14f6a709d503534b92] [QueueManager] Recover a queue for guest/[email protected],

7. All incoming activations trigger the recovery procedure

[2022-10-11T09:31:48.822Z] [INFO] [#tid_c4bc33dcb78ef6f5794af9e424d0c7c8] [QueueManager] Got activation message 77d486061de24c869486061de28c8606 for Namespace(guest,23bc46b1-71f6-4ed5-8c54-816aa4f8c502)/guest/[email protected] from remote queue manager.
[2022-10-11T09:31:48.822Z] [INFO] [#tid_c4bc33dcb78ef6f5794af9e424d0c7c8] [QueueManager] Recover a queue for guest/[email protected],
[2022-10-11T09:31:48.823Z] [INFO] [#tid_3168d300edca4fda4e2927be9dd35bf8] [QueueManager] Got activation message b6cec1514fa545828ec1514fa5558251 for Namespace(guest,23bc46b1-71f6-4ed5-8c54-816aa4f8c502)/guest/[email protected] from remote queue manager.
[2022-10-11T09:31:48.823Z] [INFO] [#tid_3168d300edca4fda4e2927be9dd35bf8] [QueueManager] Recover a queue for guest/[email protected],
[2022-10-11T09:31:48.823Z] [INFO] [#tid_3ab104600fa6cc0628d56b4e2c56c52d] [QueueManager] Got activation message 4864ed562a78408fa4ed562a78308f75 for Namespace(guest,23bc46b1-71f6-4ed5-8c54-816aa4f8c502)/guest/[email protected] from remote queue manager.
[2022-10-11T09:31:48.823Z] [INFO] [#tid_3ab104600fa6cc0628d56b4e2c56c52d] [QueueManager] Recover a queue for guest/[email protected],
[2022-10-11T09:31:48.823Z] [INFO] [#tid_b4f3a44cfe6aac7fd242d03b4aa791c2] [QueueManager] Got activation message 0db275dac7b84458b275dac7b8b45842 for Namespace(guest,23bc46b1-71f6-4ed5-8c54-816aa4f8c502)/guest/[email protected] from remote queue manager.
[2022-10-11T09:31:48.823Z] [INFO] [#tid_b4f3a44cfe6aac7fd242d03b4aa791c2] [QueueManager] Recover a queue for guest/[email protected],
[2022-10-11T09:31:48.824Z] [INFO] [#tid_c85be9165297bc619860e1a26cc04649] [QueueManager] Got activation message 3ca322823217472fa322823217072f15 for Namespace(guest,23bc46b1-71f6-4ed5-8c54-816aa4f8c502)/guest/[email protected] from remote queue manager.
[2022-10-11T09:31:48.824Z] [INFO] [#tid_c85be9165297bc619860e1a26cc04649] [QueueManager] Recover a queue for guest/[email protected],
[2022-10-11T09:31:48.824Z] [INFO] [#tid_91cb27040fff0411ca0b4bb985973e74] [QueueManager] Got activation message b0e69df6995b4962a69df6995ba96246 for Namespace(guest,23bc46b1-71f6-4ed5-8c54-816aa4f8c502)/guest/[email protected] from remote queue manager.
[2022-10-11T09:31:48.824Z] [INFO] [#tid_91cb27040fff0411ca0b4bb985973e74] [QueueManager] Recover a queue for guest/[email protected],
[2022-10-11T09:31:48.824Z] [INFO] [#tid_a9fa02e03083ae8a1328a37047e5c6be] [QueueManager] Got activation message 6d5c80bb62eb4a7f9c80bb62eb8a7f92 for Namespace(guest,23bc46b1-71f6-4ed5-8c54-816aa4f8c502)/guest/[email protected] from remote queue manager.
[2022-10-11T09:31:48.824Z] [INFO] [#tid_a9fa02e03083ae8a1328a37047e5c6be] [QueueManager] Recover a queue for guest/[email protected],

8. The queue is recreated

The queue is also added to the init revision map, so subsequent recovery attempts are ignored.

[2022-10-11T09:31:48.884Z] [WARN] [#tid_3ab104600fa6cc0628d56b4e2c56c52d] [QueueManager] recreate queue for guest/[email protected]
[2022-10-11T09:31:48.884Z] [INFO] [#tid_sid_unknown] [QueueManager] Update init revision map, key: bmt-1480/queue/guest/guest/noop0/leader, rev: 15-5aa72ce8d0070b26994803ac8c22de81
[2022-10-11T09:31:48.885Z] [INFO] [#tid_sid_unknown] [MemoryQueue] [guest:guest/[email protected]:Uninitialized] a new queue is created, retentionTimeout: 60000, kind: nodejs:10.
[2022-10-11T09:31:48.885Z] [INFO] [#tid_a9fa02e03083ae8a1328a37047e5c6be] [QueueManager] Queue for action guest/[email protected] is already recovered, skip
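The "already recovered, skip" behavior in the last log line can be sketched as map-based deduplication keyed by the queue's leader key — a hypothetical model (class and method names are illustrative, not the actual QueueManager API):

```scala
object RecoveryDedup {
  // Hypothetical model of the init revision map: once a leader key is
  // registered with a revision, further recovery attempts are skipped.
  final class InitRevisionMap {
    private var revisions = Map.empty[String, String]

    // Returns true when this call performed the recovery, false when a
    // prior attempt already registered the key and this one is skipped.
    def tryRecover(leaderKey: String, revision: String): Boolean =
      if (revisions.contains(leaderKey)) false
      else {
        revisions += leaderKey -> revision
        true
      }
  }
}
```

This is why only one of the many concurrent recovery attempts in step 7 actually recreates the queue.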

9. The queue is shut down again

The newly created queue even received an activation, but it was sent GracefulShutdown again.
And its queue endpoint was removed as well.

[2022-10-11T09:31:48.885Z] [INFO] [#tid_3ab104600fa6cc0628d56b4e2c56c52d] [MemoryQueue] [guest:guest/[email protected]:Running] got a new activation message 4864ed562a78408fa4ed562a78308f75
[2022-10-11T09:31:48.885Z] [INFO] [#tid_sid_unknown] [MemoryQueue] [guest:guest/[email protected]:Running] Gracefully shutdown the memory queue.
[2022-10-11T09:31:48.885Z] [INFO] [#tid_sid_unknown] [WatcherService] unwatch endpoint: UnwatchEndpoint(bmt-1480/queue/guest/guest/noop0/leader,false,data-management-service,true)

This is because the scheduler is in the shutting-down state.

  private def recreateQueue(action: FullyQualifiedEntityName,
                            msg: ActivationMessage,
                            actionMetaData: WhiskActionMetaData): Unit = {
    logging.warn(this, s"recreate queue for ${msg.action}")(msg.transid)
    val queue = createAndStartQueue(msg.user.namespace.name.asString, action, msg.revision, actionMetaData)
    queue ! msg
    msg.transid.mark(this, LoggingMarkers.SCHEDULER_QUEUE_RECOVER)
    if (isShuttingDown) {
      queue ! GracefulShutdown // here
    }
  }
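The resulting loop can be modeled in isolation: while `isShuttingDown` holds, every recreated queue is immediately shut down and leaves the pool, so the next activation recreates one again. A minimal, hypothetical simulation (not the actual actor code):

```scala
object RecreateLoop {
  // Count how many queues get created (and immediately shut down) when
  // `activations` messages arrive while the scheduler is shutting down,
  // assuming each shutdown removes the queue before the next message.
  def queuesCreated(activations: Int, isShuttingDown: Boolean): Int = {
    var created = 0
    var queueAlive = false
    for (_ <- 0 until activations) {
      if (!queueAlive) {
        created += 1        // recreateQueue
        queueAlive = true
      }
      if (isShuttingDown) { // queue ! GracefulShutdown
        queueAlive = false  // queue leaves the pool; the next activation recreates it
      }
    }
    created
  }
}
```

Under this model, a healthy scheduler creates one queue no matter how many activations arrive, while a shutting-down scheduler creates one queue per activation — matching the repeated "recreate queue" log lines above.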

10. While the queue is stopping, another queue is created again

And that queue is shut down again as well.

[2022-10-11T09:31:48.886Z] [WARN] [#tid_738382b188d8c89f5f517a02c4bdd2c6] [QueueManager] recreate queue for guest/[email protected]
[2022-10-11T09:31:48.888Z] [INFO] [#tid_sid_unknown] [MemoryQueue] [guest:guest/[email protected]:Running] Gracefully shutdown the memory queue.
[2022-10-11T09:31:48.888Z] [INFO] [#tid_sid_unknown] [QueueManager] Remove init revision map cause queue is removed, key: bmt-1480/queue/guest/guest/noop0/leader

11. The feed is shut down

This problem kept repeating until the feed was also shut down and no more activations were coming.

[2022-10-11T09:31:49.306Z] [INFO] [#tid_sid_unknown] [KafkaConsumerConnector] wakeup consumer for 'creationAck0'
[2022-10-11T09:31:49.306Z] [INFO] [#tid_sid_unknown] [KafkaConsumerConnector] WakeupException happened when do commit action for topic creationAck0
[2022-10-11T09:31:49.307Z] [INFO] [#tid_sid_unknown] [KafkaConsumerConnector] closing consumer for 'creationAck0'

12. The scheduler is stopped while activations are still in the queues

Since no more activations were coming, all created queues (for the same action) were in the removing state and no queue reference was left in the QueuePool. Because the current deployment script only checks whether there is a queue in the scheduler, it redeployed the scheduler while there were still activations in the queues.

We can see non-zero current values.

[2022-10-11T09:31:53.109Z] [INFO] [#tid_sid_unknown] [SchedulingDecisionMaker] [queue snapshot] initialized: false, avgDuration:Some(1.9665046118449014), stateName:Removing, incoming:0, current:2, existing:0, in-progress:2, stale:2, existingInNs:94, in-progressInNs:81
[2022-10-11T09:31:53.109Z] [INFO] [#tid_sid_unknown] [SchedulingDecisionMaker] [queue snapshot] initialized: true, avgDuration:Some(1.9665114532673995), stateName:Removing, incoming:0, current:2, existing:0, in-progress:2, stale:2, existingInNs:94, in-progressInNs:81
[2022-10-11T09:31:53.109Z] [INFO] [#tid_sid_unknown] [SchedulingDecisionMaker] [queue snapshot] initialized: true, avgDuration:Some(1.9665046118449014), stateName:Removing, incoming:0, current:3, existing:0, in-progress:3, stale:3, existingInNs:94, in-progressInNs:81
[2022-10-11T09:31:53.109Z] [INFO] [#tid_sid_unknown] [SchedulingDecisionMaker] [queue snapshot] initialized: true, avgDuration:Some(1.9664630890896282), stateName:Removing, incoming:0, current:3, existing:0, in-progress:3, stale:3, existingInNs:94, in-progressInNs:81
[2022-10-11T09:31:53.109Z] [INFO] [#tid_sid_unknown] [SchedulingDecisionMaker] [queue snapshot] initialized: true, avgDuration:Some(1.966497872446949), stateName:Removing, incoming:0, current:18, existing:0, in-progress:18, stale:18, existingInNs:94, in-progressInNs:81
[2022-10-11T09:31:53.181Z] [INFO] Cluster Node [akka://[email protected]:25520] - Leader is moving node [akka://[email protected]:25520] to [Exiting]
[2022-10-11T09:31:53.183Z] [INFO] Cluster Node [akka://[email protected]:25520] - Exiting completed
[2022-10-11T09:31:53.186Z] [INFO] Cluster Node [akka://[email protected]:25520] - Shutting down...
[2022-10-11T09:31:53.187Z] [INFO] Cluster Node [akka://[email protected]:25520] - Successfully shut down
[2022-10-11T09:31:53.198Z] [INFO] [#tid_sid_unknown] [Scheduler] Shutting down Kamon with coordinated shutdown
[2022-10-11T09:31:53.201Z] [INFO] Shutting down remote daemon.
[2022-10-11T09:31:53.205Z] [INFO] Remote daemon shut down; proceeding with flushing remote transports.

When the scheduler was stopped, there were around 10 different queues for the same action, with 33 activations still in them.

[2022-10-11T09:31:51.309Z]initialized: true, avgDuration:Some(1.9665047142459378), stateName:Removing, incoming:0, current:2, existing:29, in-progress:88, stale:2, existingInNs:70, in-progressInNs:88
[2022-10-11T09:31:51.649Z]initialized: false, avgDuration:Some(1.9665046118449014), stateName:Removing, incoming:0, current:2, existing:0, in-progress:2, stale:2, existingInNs:72, in-progressInNs:88
[2022-10-11T09:31:51.548Z]initialized: true, avgDuration:Some(1.9664630890896282), stateName:Removing, incoming:0, current:3, existing:0, in-progress:3, stale:3, existingInNs:71, in-progressInNs:88
[2022-10-11T09:31:51.649Z]initialized: true, avgDuration:Some(1.966497872446949), stateName:Removing, incoming:0, current:18, existing:0, in-progress:18, stale:18, existingInNs:72, in-progressInNs:88
[2022-10-11T09:31:51.548Z]initialized: true, avgDuration:Some(1.9665114532673995), stateName:Removing, incoming:0, current:2, existing:0, in-progress:2, stale:2, existingInNs:71, in-progressInNs:88
[2022-10-11T09:31:51.548Z]initialized: true, avgDuration:Some(1.9664458988566114), stateName:Removing, incoming:0, current:2, existing:0, in-progress:2, stale:2, existingInNs:71, in-progressInNs:88
[2022-10-11T09:31:51.548Z]initialized: true, avgDuration:Some(1.9664526436773975), stateName:Removing, incoming:0, current:1, existing:0, in-progress:1, stale:1, existingInNs:71, in-progressInNs:88
[2022-10-11T09:31:51.548Z]initialized: false, avgDuration:Some(1.9664874298803876), stateName:Removing, incoming:0, current:2, existing:0, in-progress:2, stale:2, existingInNs:71, in-progressInNs:88
[2022-10-11T09:31:51.548Z]initialized: true, avgDuration:Some(1.9665462308159125), stateName:Removing, incoming:0, current:1, existing:0, in-progress:1, stale:1, existingInNs:71, in-progressInNs:88
[2022-10-11T09:31:51.548Z]initialized: true, avgDuration:Some(-48.25), stateName:Removing, incoming:0, current:0, existing:47, in-progress:0, stale:0, existingInNs:71, in-progressInNs:88
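The deployment-check gap can be sketched as follows — a hypothetical predicate over queue snapshots (the field names mirror the log lines above, but the API is illustrative): a check that only asks whether any queue exists passes once all queues are gone from the pool, while a stricter check would also require zero remaining activations.

```scala
object DrainCheck {
  // Hypothetical snapshot of one queue, mirroring the log fields above.
  final case class QueueSnapshot(stateName: String, current: Int, inProgress: Int)

  // What the deployment script effectively checked: is any queue left?
  def hasQueues(snapshots: Seq[QueueSnapshot]): Boolean = snapshots.nonEmpty

  // A stricter check: only stop the scheduler when no activations remain
  // anywhere, including in queues that are already in the Removing state.
  def safeToStop(snapshots: Seq[QueueSnapshot]): Boolean =
    snapshots.forall(s => s.current == 0 && s.inProgress == 0)
}
```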

@bdoyle0182 (Contributor) commented:

So to summarize the issue, the problem is that it's possible with this change for a scheduler to be stopped with activations still stuck in them even after being disabled? Are you sure that's introduced just with this change?

@style95 (Member, Author) commented Oct 13, 2022

So to summarize the issue, the problem is that it's possible with this change for a scheduler to be stopped with activations still stuck in them even after being disabled? Are you sure that's introduced just with this change?

Previously, queues were only created by controllers. When a scheduler was shut down or a queue was removed for some reason, controllers were in charge of creating them again.
So once the scheduler was disabled, no queue was recreated and the regression above did not happen.

@style95 style95 changed the title [WIP] Prevent cycle in the QueueManager Prevent cycle in the QueueManager Oct 14, 2022
@style95 (Member, Author) commented Oct 14, 2022

I feel it's better to merge this PR and build on it, as the cycle has to be addressed anyway.
Since the regression only happens when the scheduler is disabled, I think it is OK to merge this for now.

@style95 style95 merged commit ef725a6 into apache:master Oct 14, 2022
@style95 style95 mentioned this pull request Oct 19, 2022
msciabarra pushed a commit to nuvolaris/openwhisk that referenced this pull request Nov 23, 2022