Prevent cycle in the QueueManager #5332


Merged
merged 1 commit into apache:master on Oct 14, 2022
Conversation

@style95 (Member) commented Oct 10, 2022

Description

This is basically based on #5251.
I just added a small change to improve the performance.

Without the performance optimization:
(screenshot)

With the performance optimization:
(screenshot)

Related issue and scope

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Scheduler
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.


private def handleCycle(msg: ActivationMessage)(implicit transid: TransactionId): Unit = {
  val action = msg.action
  QueuePool.keys.find(_.docInfo.id == action.toDocId) match {
@style95 (Member, Author) commented on this hunk:
Now it tries to recover a queue iff there is no reference to it in the QueuePool.
Previously, it always sent a RecoverQueue message whenever the sender of an activation was the queue manager itself, even if a new queue had already been added to the queue pool.
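The guard above can be sketched without the Akka machinery as a lookup on the pool keys — a minimal illustration, assuming a simplified `QueuePool` modeled as a set of doc ids (names such as `CycleGuard` and `decide` are illustrative, not the actual QueueManager API):

```scala
// Minimal model of the guard: recover only when no queue for the
// action's doc id is already referenced in the pool.
object CycleGuard {
  // Hypothetical simplified pool: the set of doc ids with live queues.
  final case class QueuePool(docIds: Set[String])

  sealed trait Decision
  case object RecoverQueue extends Decision
  case object ReuseExistingQueue extends Decision

  // Recover iff there is no reference in the pool (the fix);
  // previously RecoverQueue was sent unconditionally on a cycle.
  def decide(pool: QueuePool, actionDocId: String): Decision =
    pool.docIds.find(_ == actionDocId) match {
      case Some(_) => ReuseExistingQueue
      case None    => RecoverQueue
    }
}
```

With this shape, an activation looping back to the same queue manager reuses the freshly created queue instead of issuing another RecoverQueue.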

@codecov-commenter commented Oct 10, 2022

Codecov Report

Attention: Patch coverage is 79.36508% with 13 lines in your changes missing coverage. Please review.

Project coverage is 76.69%. Comparing base (236ca5e) to head (2e2e640).
Report is 81 commits behind head on master.

Files with missing lines Patch % Lines
.../openwhisk/core/scheduler/queue/QueueManager.scala 79.03% 13 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (236ca5e) and HEAD (2e2e640).

HEAD has 22 fewer uploads than BASE:
BASE (236ca5e): 27 uploads, HEAD (2e2e640): 5 uploads
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5332      +/-   ##
==========================================
- Coverage   81.69%   76.69%   -5.01%     
==========================================
  Files         240      240              
  Lines       14264    14310      +46     
  Branches      598      591       -7     
==========================================
- Hits        11653    10975     -678     
- Misses       2611     3335     +724     


@ningyougang (Contributor) commented:

LGTM

@style95 (Member, Author) commented Oct 11, 2022

I found a regression in this change.
Please do not merge this.

@style95 style95 changed the title Prevent cycle in the QueueManager [WIP] Prevent cycle in the QueueManager Oct 11, 2022
@bdoyle0182 (Contributor) commented:

What is the regression?

@style95 (Member, Author) commented Oct 12, 2022

1. Disable the scheduler

[2022-10-11T09:31:48.791Z] [WARN] [#tid_BOjlyL4vneiwdeJc422InK80fwojMF0e] [FPCSchedulerServer] Scheduler is disabled

2. Shutting down the queue manager

[2022-10-11T09:31:48.802Z] [INFO] [#tid_sid_unknown] [QueueManager] Gracefully shutdown the queue manager

3. Shutting down the queue

[2022-10-11T09:31:48.805Z] [INFO] [#tid_sid_unknown] [MemoryQueue] [guest:guest/[email protected]:Running] Gracefully shutdown the memory queue.

4. Queue endpoints are removed

[2022-10-11T09:31:48.805Z] [INFO] [#tid_sid_unknown] [WatcherService] unwatch endpoint: UnwatchEndpoint(bmt-1480/queue/guest/guest/noop0/leader,false,data-management-service,true)
[2022-10-11T09:31:48.805Z] [INFO] [#tid_sid_unknown] [WatcherService] unwatch endpoint: UnwatchEndpoint(bmt-1480/throttling/guest,false,data-management-service,true)
[2022-10-11T09:31:48.805Z] [INFO] [#tid_sid_unknown] [WatcherService] unwatch endpoint: UnwatchEndpoint(bmt-1480/throttling/guest/guest/noop0,false,data-management-service,true)
[2022-10-11T09:31:48.805Z] [INFO] [#tid_sid_unknown] [QueueManager] Remove init revision map cause queue is removed, key: bmt-1480/queue/guest/guest/noop0/leader

5. A queue is in the removing state

It's waiting for buffered activations to be handled.

[2022-10-11T09:31:48.808Z] [INFO] [#tid_sid_unknown] [MemoryQueue] [guest:guest/[email protected]:Removing] Queue is going to stop but there are still 0 activations and 17 request buffered.
[2022-10-11T09:31:48.811Z] [INFO] [#tid_sid_unknown] [MemoryQueue] [guest:guest/[email protected]:Removing] This queue is shutdown by `/disable` api, do nothing here.

6. A new activation comes and recovers the queue

In the meantime, a new activation arrives and triggers recovery of the queue.

[2022-10-11T09:31:48.807Z] [INFO] [#tid_3168d300edca4fda4e2927be9dd35bf8] [QueueManager] Got activation message b6cec1514fa545828ec1514fa5558251 for Namespace(guest,23bc46b1-71f6-4ed5-8c54-816aa4f8c502)/guest/[email protected] from kafka.
[2022-10-11T09:31:48.807Z] [INFO] [#tid_3168d300edca4fda4e2927be9dd35bf8] [QueueManager] [b6cec1514fa545828ec1514fa5558251] the key bmt-1480/queue/guest/guest/noop0/leader is not in the initRevisionMap. revision: 15-5aa72ce8d0070b26994803ac8c22de81
[2022-10-11T09:31:48.807Z] [INFO] [#tid_85ddc3a977a53a14f6a709d503534b92] [QueueManager] Recover a queue for guest/[email protected],

7. All incoming activations trigger the recovery procedure

[2022-10-11T09:31:48.822Z] [INFO] [#tid_c4bc33dcb78ef6f5794af9e424d0c7c8] [QueueManager] Got activation message 77d486061de24c869486061de28c8606 for Namespace(guest,23bc46b1-71f6-4ed5-8c54-816aa4f8c502)/guest/[email protected] from remote queue manager.
[2022-10-11T09:31:48.822Z] [INFO] [#tid_c4bc33dcb78ef6f5794af9e424d0c7c8] [QueueManager] Recover a queue for guest/[email protected],
[2022-10-11T09:31:48.823Z] [INFO] [#tid_3168d300edca4fda4e2927be9dd35bf8] [QueueManager] Got activation message b6cec1514fa545828ec1514fa5558251 for Namespace(guest,23bc46b1-71f6-4ed5-8c54-816aa4f8c502)/guest/[email protected] from remote queue manager.
[2022-10-11T09:31:48.823Z] [INFO] [#tid_3168d300edca4fda4e2927be9dd35bf8] [QueueManager] Recover a queue for guest/[email protected],
[2022-10-11T09:31:48.823Z] [INFO] [#tid_3ab104600fa6cc0628d56b4e2c56c52d] [QueueManager] Got activation message 4864ed562a78408fa4ed562a78308f75 for Namespace(guest,23bc46b1-71f6-4ed5-8c54-816aa4f8c502)/guest/[email protected] from remote queue manager.
[2022-10-11T09:31:48.823Z] [INFO] [#tid_3ab104600fa6cc0628d56b4e2c56c52d] [QueueManager] Recover a queue for guest/[email protected],
[2022-10-11T09:31:48.823Z] [INFO] [#tid_b4f3a44cfe6aac7fd242d03b4aa791c2] [QueueManager] Got activation message 0db275dac7b84458b275dac7b8b45842 for Namespace(guest,23bc46b1-71f6-4ed5-8c54-816aa4f8c502)/guest/[email protected] from remote queue manager.
[2022-10-11T09:31:48.823Z] [INFO] [#tid_b4f3a44cfe6aac7fd242d03b4aa791c2] [QueueManager] Recover a queue for guest/[email protected],
[2022-10-11T09:31:48.824Z] [INFO] [#tid_c85be9165297bc619860e1a26cc04649] [QueueManager] Got activation message 3ca322823217472fa322823217072f15 for Namespace(guest,23bc46b1-71f6-4ed5-8c54-816aa4f8c502)/guest/[email protected] from remote queue manager.
[2022-10-11T09:31:48.824Z] [INFO] [#tid_c85be9165297bc619860e1a26cc04649] [QueueManager] Recover a queue for guest/[email protected],
[2022-10-11T09:31:48.824Z] [INFO] [#tid_91cb27040fff0411ca0b4bb985973e74] [QueueManager] Got activation message b0e69df6995b4962a69df6995ba96246 for Namespace(guest,23bc46b1-71f6-4ed5-8c54-816aa4f8c502)/guest/[email protected] from remote queue manager.
[2022-10-11T09:31:48.824Z] [INFO] [#tid_91cb27040fff0411ca0b4bb985973e74] [QueueManager] Recover a queue for guest/[email protected],
[2022-10-11T09:31:48.824Z] [INFO] [#tid_a9fa02e03083ae8a1328a37047e5c6be] [QueueManager] Got activation message 6d5c80bb62eb4a7f9c80bb62eb8a7f92 for Namespace(guest,23bc46b1-71f6-4ed5-8c54-816aa4f8c502)/guest/[email protected] from remote queue manager.
[2022-10-11T09:31:48.824Z] [INFO] [#tid_a9fa02e03083ae8a1328a37047e5c6be] [QueueManager] Recover a queue for guest/[email protected],

8. The queue is recreated

The queue is also added to the init revision map, so subsequent recovery attempts are ignored.

[2022-10-11T09:31:48.884Z] [WARN] [#tid_3ab104600fa6cc0628d56b4e2c56c52d] [QueueManager] recreate queue for guest/[email protected]
[2022-10-11T09:31:48.884Z] [INFO] [#tid_sid_unknown] [QueueManager] Update init revision map, key: bmt-1480/queue/guest/guest/noop0/leader, rev: 15-5aa72ce8d0070b26994803ac8c22de81
[2022-10-11T09:31:48.885Z] [INFO] [#tid_sid_unknown] [MemoryQueue] [guest:guest/[email protected]:Uninitialized] a new queue is created, retentionTimeout: 60000, kind: nodejs:10.
[2022-10-11T09:31:48.885Z] [INFO] [#tid_a9fa02e03083ae8a1328a37047e5c6be] [QueueManager] Queue for action guest/[email protected] is already recovered, skip
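The "already recovered, skip" behavior in the last log line can be sketched as map-based deduplication keyed by the queue's leader key — a hypothetical model (class and method names are illustrative, not the actual QueueManager API):

```scala
object RecoveryDedup {
  // Hypothetical model of the init revision map: once a leader key is
  // registered with a revision, further recovery attempts are skipped.
  final class InitRevisionMap {
    private var revisions = Map.empty[String, String]

    // Returns true when this call performed the recovery, false when a
    // prior attempt already registered the key and this one is skipped.
    def tryRecover(leaderKey: String, revision: String): Boolean =
      if (revisions.contains(leaderKey)) false
      else {
        revisions += leaderKey -> revision
        true
      }
  }
}
```

This is why only one of the many concurrent recovery attempts in step 7 actually recreates the queue.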

9. The queue is shut down again

The newly created queue even received an activation, but it was sent GracefulShutdown again.
And its queue endpoint was removed as well.

[2022-10-11T09:31:48.885Z] [INFO] [#tid_3ab104600fa6cc0628d56b4e2c56c52d] [MemoryQueue] [guest:guest/[email protected]:Running] got a new activation message 4864ed562a78408fa4ed562a78308f75
[2022-10-11T09:31:48.885Z] [INFO] [#tid_sid_unknown] [MemoryQueue] [guest:guest/[email protected]:Running] Gracefully shutdown the memory queue.
[2022-10-11T09:31:48.885Z] [INFO] [#tid_sid_unknown] [WatcherService] unwatch endpoint: UnwatchEndpoint(bmt-1480/queue/guest/guest/noop0/leader,false,data-management-service,true)

This is because the scheduler is in the shutting-down state.

  private def recreateQueue(action: FullyQualifiedEntityName,
                            msg: ActivationMessage,
                            actionMetaData: WhiskActionMetaData): Unit = {
    logging.warn(this, s"recreate queue for ${msg.action}")(msg.transid)
    val queue = createAndStartQueue(msg.user.namespace.name.asString, action, msg.revision, actionMetaData)
    queue ! msg
    msg.transid.mark(this, LoggingMarkers.SCHEDULER_QUEUE_RECOVER)
    if (isShuttingDown) {
      queue ! GracefulShutdown // here
    }
  }
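The resulting loop can be modeled in isolation: while `isShuttingDown` holds, every recreated queue is immediately shut down and leaves the pool, so the next activation recreates one again. A minimal, hypothetical simulation (not the actual actor code):

```scala
object RecreateLoop {
  // Count how many queues get created (and immediately shut down) when
  // `activations` messages arrive while the scheduler is shutting down,
  // assuming each shutdown removes the queue before the next message.
  def queuesCreated(activations: Int, isShuttingDown: Boolean): Int = {
    var created = 0
    var queueAlive = false
    for (_ <- 0 until activations) {
      if (!queueAlive) {
        created += 1        // recreateQueue
        queueAlive = true
      }
      if (isShuttingDown) { // queue ! GracefulShutdown
        queueAlive = false  // queue leaves the pool; the next activation recreates it
      }
    }
    created
  }
}
```

Under this model, a healthy scheduler creates one queue no matter how many activations arrive, while a shutting-down scheduler creates one queue per activation — matching the repeated "recreate queue" log lines above.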

10. While the queue is stopping, another queue is created again

And that queue is shut down again as well.

[2022-10-11T09:31:48.886Z] [WARN] [#tid_738382b188d8c89f5f517a02c4bdd2c6] [QueueManager] recreate queue for guest/[email protected]
[2022-10-11T09:31:48.888Z] [INFO] [#tid_sid_unknown] [MemoryQueue] [guest:guest/[email protected]:Running] Gracefully shutdown the memory queue.
[2022-10-11T09:31:48.888Z] [INFO] [#tid_sid_unknown] [QueueManager] Remove init revision map cause queue is removed, key: bmt-1480/queue/guest/guest/noop0/leader

11. The feed is shut down

This problem kept repeating until the feed was also shut down and no more activations were coming.

[2022-10-11T09:31:49.306Z] [INFO] [#tid_sid_unknown] [KafkaConsumerConnector] wakeup consumer for 'creationAck0'
[2022-10-11T09:31:49.306Z] [INFO] [#tid_sid_unknown] [KafkaConsumerConnector] WakeupException happened when do commit action for topic creationAck0
[2022-10-11T09:31:49.307Z] [INFO] [#tid_sid_unknown] [KafkaConsumerConnector] closing consumer for 'creationAck0'

12. The scheduler is stopped while activations are still in the queues

Since no more activations were coming, all created queues (for the same action) were in the removing state and no queue reference was left in the QueuePool. Because the current deployment script only checks whether there is a queue in the scheduler, it redeployed the scheduler while there were still activations in the queues.

We can see non-zero current values.

[2022-10-11T09:31:53.109Z] [INFO] [#tid_sid_unknown] [SchedulingDecisionMaker] [queue snapshot] initialized: false, avgDuration:Some(1.9665046118449014), stateName:Removing, incoming:0, current:2, existing:0, in-progress:2, stale:2, existingInNs:94, in-progressInNs:81
[2022-10-11T09:31:53.109Z] [INFO] [#tid_sid_unknown] [SchedulingDecisionMaker] [queue snapshot] initialized: true, avgDuration:Some(1.9665114532673995), stateName:Removing, incoming:0, current:2, existing:0, in-progress:2, stale:2, existingInNs:94, in-progressInNs:81
[2022-10-11T09:31:53.109Z] [INFO] [#tid_sid_unknown] [SchedulingDecisionMaker] [queue snapshot] initialized: true, avgDuration:Some(1.9665046118449014), stateName:Removing, incoming:0, current:3, existing:0, in-progress:3, stale:3, existingInNs:94, in-progressInNs:81
[2022-10-11T09:31:53.109Z] [INFO] [#tid_sid_unknown] [SchedulingDecisionMaker] [queue snapshot] initialized: true, avgDuration:Some(1.9664630890896282), stateName:Removing, incoming:0, current:3, existing:0, in-progress:3, stale:3, existingInNs:94, in-progressInNs:81
[2022-10-11T09:31:53.109Z] [INFO] [#tid_sid_unknown] [SchedulingDecisionMaker] [queue snapshot] initialized: true, avgDuration:Some(1.966497872446949), stateName:Removing, incoming:0, current:18, existing:0, in-progress:18, stale:18, existingInNs:94, in-progressInNs:81
[2022-10-11T09:31:53.181Z] [INFO] Cluster Node [akka://[email protected]:25520] - Leader is moving node [akka://[email protected]:25520] to [Exiting]
[2022-10-11T09:31:53.183Z] [INFO] Cluster Node [akka://[email protected]:25520] - Exiting completed
[2022-10-11T09:31:53.186Z] [INFO] Cluster Node [akka://[email protected]:25520] - Shutting down...
[2022-10-11T09:31:53.187Z] [INFO] Cluster Node [akka://[email protected]:25520] - Successfully shut down
[2022-10-11T09:31:53.198Z] [INFO] [#tid_sid_unknown] [Scheduler] Shutting down Kamon with coordinated shutdown
[2022-10-11T09:31:53.201Z] [INFO] Shutting down remote daemon.
[2022-10-11T09:31:53.205Z] [INFO] Remote daemon shut down; proceeding with flushing remote transports.

When the scheduler was stopped, there were around 10 different queues for the same action, with 33 activations still in them.

[2022-10-11T09:31:51.309Z]initialized: true, avgDuration:Some(1.9665047142459378), stateName:Removing, incoming:0, current:2, existing:29, in-progress:88, stale:2, existingInNs:70, in-progressInNs:88
[2022-10-11T09:31:51.649Z]initialized: false, avgDuration:Some(1.9665046118449014), stateName:Removing, incoming:0, current:2, existing:0, in-progress:2, stale:2, existingInNs:72, in-progressInNs:88
[2022-10-11T09:31:51.548Z]initialized: true, avgDuration:Some(1.9664630890896282), stateName:Removing, incoming:0, current:3, existing:0, in-progress:3, stale:3, existingInNs:71, in-progressInNs:88
[2022-10-11T09:31:51.649Z]initialized: true, avgDuration:Some(1.966497872446949), stateName:Removing, incoming:0, current:18, existing:0, in-progress:18, stale:18, existingInNs:72, in-progressInNs:88
[2022-10-11T09:31:51.548Z]initialized: true, avgDuration:Some(1.9665114532673995), stateName:Removing, incoming:0, current:2, existing:0, in-progress:2, stale:2, existingInNs:71, in-progressInNs:88
[2022-10-11T09:31:51.548Z]initialized: true, avgDuration:Some(1.9664458988566114), stateName:Removing, incoming:0, current:2, existing:0, in-progress:2, stale:2, existingInNs:71, in-progressInNs:88
[2022-10-11T09:31:51.548Z]initialized: true, avgDuration:Some(1.9664526436773975), stateName:Removing, incoming:0, current:1, existing:0, in-progress:1, stale:1, existingInNs:71, in-progressInNs:88
[2022-10-11T09:31:51.548Z]initialized: false, avgDuration:Some(1.9664874298803876), stateName:Removing, incoming:0, current:2, existing:0, in-progress:2, stale:2, existingInNs:71, in-progressInNs:88
[2022-10-11T09:31:51.548Z]initialized: true, avgDuration:Some(1.9665462308159125), stateName:Removing, incoming:0, current:1, existing:0, in-progress:1, stale:1, existingInNs:71, in-progressInNs:88
[2022-10-11T09:31:51.548Z]initialized: true, avgDuration:Some(-48.25), stateName:Removing, incoming:0, current:0, existing:47, in-progress:0, stale:0, existingInNs:71, in-progressInNs:88
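The deployment-check gap can be sketched as follows — a hypothetical predicate over queue snapshots (the field names mirror the log lines above, but the API is illustrative): a check that only asks whether any queue exists passes once all queues are gone from the pool, while a stricter check would also require zero remaining activations.

```scala
object DrainCheck {
  // Hypothetical snapshot of one queue, mirroring the log fields above.
  final case class QueueSnapshot(stateName: String, current: Int, inProgress: Int)

  // What the deployment script effectively checked: is any queue left?
  def hasQueues(snapshots: Seq[QueueSnapshot]): Boolean = snapshots.nonEmpty

  // A stricter check: only stop the scheduler when no activations remain
  // anywhere, including in queues that are already in the Removing state.
  def safeToStop(snapshots: Seq[QueueSnapshot]): Boolean =
    snapshots.forall(s => s.current == 0 && s.inProgress == 0)
}
```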

@bdoyle0182 (Contributor) commented:

So to summarize the issue, the problem is that it's possible with this change for a scheduler to be stopped with activations still stuck in them even after being disabled? Are you sure that's introduced just with this change?

@style95 (Member, Author) commented Oct 13, 2022

So to summarize the issue, the problem is that it's possible with this change for a scheduler to be stopped with activations still stuck in them even after being disabled? Are you sure that's introduced just with this change?

Previously, queues were only created by controllers. When a scheduler was shut down or a queue was removed for some reason, controllers were in charge of creating them again.
So once the scheduler was disabled, no queue was recreated and the regression above did not happen.

@style95 style95 changed the title [WIP] Prevent cycle in the QueueManager Prevent cycle in the QueueManager Oct 14, 2022
@style95 (Member, Author) commented Oct 14, 2022

I feel it's better to merge this PR and build on it, as the cycle has to be addressed anyway.
Since the regression only happens when the scheduler is disabled, I think it is OK to merge this for now.

@style95 style95 merged commit ef725a6 into apache:master Oct 14, 2022
@style95 style95 mentioned this pull request Oct 19, 2022
msciabarra pushed a commit to nuvolaris/openwhisk that referenced this pull request Nov 23, 2022