Add zero downtime deployment #5338

Merged: 21 commits into apache:master on Nov 1, 2022

Conversation

@style95 (Member) commented Oct 19, 2022

Description

This change deploys OpenWhisk without downtime on the same nodes.
It also addresses the regression mentioned here.

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Scheduler
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@style95 added the wip label Oct 19, 2022
```scala
 */
protected[controller] val activationStatus = {
  implicit val executionContext = actorSystem.dispatcher
  (path("activation") & get) {
```

@bdoyle0182 (Contributor) commented Oct 19, 2022

would something like /inflightActivations and /inflightActivations/instance make more sense?

@style95 (Member, Author)

This is based on my experience: simple, more memorable URLs that avoid case-sensitive words tend to work better for APIs in many cases.

```scala
@@ -413,6 +413,36 @@ class FunctionPullingContainerPool(
      // Reset the prewarmCreateFailedCount value when doing the expiration check, and backfill prewarm containers if possible
      prewarmCreateFailedCount.set(0)
      adjustPrewarmedContainer(false, true)

    case StatusQuery =>
```

Contributor

I think we can combine this part with my PR #5334

@style95 (Member, Author)

That looks great; I'll apply it.

```scala
} ~ (path(FPCSchedulerServer.queuePathPrefix / "status") & get) {
  complete {
    scheduler.getQueueStatusData.map(s => s.toJson)
} ~ (path("activation" / "count") & get) {
```

Contributor

Nice, this will be great to have for the schedulers. One other thing we need to address for true zero downtime is Kafka for the scheduler, since activations could get stuck waiting on a specific scheduler topic while that scheduler is being restarted.

@style95 (Member, Author)

Currently, that is handled in this way (a polling sketch of step 4 follows the list):

  1. When a scheduler is disabled, it first removes its endpoint from etcd.
  2. Controllers then send no new requests to that scheduler, as there is no endpoint for it.
  3. Messages already in Kafka can still be delivered to the scheduler because its consumer does not shut down.
  4. The scheduler is not redeployed until the number of in-flight activations reaches 0.
  5. Once all activations are handled, the scheduler is redeployed.
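
To make step 4 concrete: below is a minimal polling sketch (in Scala, using the JDK HTTP client) of what a deployment monitor could do with the activation/count route added in this PR. The endpoint string and the bare-integer response body are assumptions for illustration, and the real FPCSchedulerServer routes sit behind basic auth, which is omitted here.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object SchedulerDrainMonitor {
  private val client = HttpClient.newHttpClient()

  // Poll the scheduler's /activation/count route (added in this PR).
  // Assumes the body is a bare integer; basic auth is omitted for brevity.
  def inFlightActivations(schedulerEndpoint: String): Int = {
    val request = HttpRequest
      .newBuilder(URI.create(s"http://$schedulerEndpoint/activation/count"))
      .GET()
      .build()
    client.send(request, HttpResponse.BodyHandlers.ofString()).body().trim.toInt
  }

  // Block until the disabled scheduler has drained all in-flight activations;
  // only then is it safe to redeploy it (steps 4 and 5 above).
  def waitUntilDrained(schedulerEndpoint: String, pollIntervalMs: Long = 5000L): Unit = {
    while (inFlightActivations(schedulerEndpoint) > 0) {
      Thread.sleep(pollIntervalMs)
    }
  }
}
```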

```scala
 *
 * @return running activation
 */
protected[controller] val activationStatus = {
```

Contributor

Can we make it so that GracefulShutdown of the controller actor system requires in-flight activations to go down to zero? I think that would simplify external monitoring for determining whether it's safe to shut down, so that checking these routes isn't a necessity of the deployment-monitor scripting.
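
For reference, a minimal sketch of how such a requirement could be wired up with Akka's CoordinatedShutdown; the inFlightActivations counter is a hypothetical stand-in, not code from this PR.

```scala
import akka.Done
import akka.actor.{ActorSystem, CoordinatedShutdown}
import akka.pattern.after
import scala.concurrent.Future
import scala.concurrent.duration._

// Register a shutdown task that completes only once the (hypothetical)
// in-flight activation counter reaches zero, so the actor system cannot
// finish shutting down while activations are still running.
def requireDrainOnShutdown(system: ActorSystem, inFlightActivations: () => Int): Unit = {
  implicit val ec = system.dispatcher
  CoordinatedShutdown(system)
    .addTask(CoordinatedShutdown.PhaseServiceRequestsDone, "wait-for-in-flight-activations") { () =>
      def poll(): Future[Done] =
        if (inFlightActivations() <= 0) Future.successful(Done)
        else after(1.second, system.scheduler)(poll())
      poll()
    }
}
```

Note that each CoordinatedShutdown phase has a configurable timeout (akka.coordinated-shutdown.phases.service-requests-done.timeout), so a long drain would need that raised accordingly.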

@style95 force-pushed the add-zero-downtime-deployment branch 2 times, most recently from e2219df to 8444e58 (October 20, 2022 04:51)
@codecov-commenter commented Oct 20, 2022

Codecov Report

Attention: Patch coverage is 38.75969% with 79 lines in your changes missing coverage. Please review.

Project coverage is 76.30%. Comparing base (ef725a6) to head (e7a100b).
Report is 79 commits behind head on master.

Files with missing lines Patch % Lines
...e/openwhisk/core/scheduler/queue/MemoryQueue.scala 33.33% 12 Missing ⚠️
.../apache/openwhisk/core/controller/Controller.scala 42.10% 11 Missing ⚠️
...ontainerpool/v2/FunctionPullingContainerPool.scala 9.09% 10 Missing ⚠️
...ache/openwhisk/core/invoker/FPCInvokerServer.scala 28.57% 10 Missing ⚠️
...pache/openwhisk/core/invoker/InvokerReactive.scala 0.00% 7 Missing ⚠️
...he/openwhisk/core/invoker/FPCInvokerReactive.scala 0.00% 6 Missing ⚠️
.../openwhisk/core/scheduler/FPCSchedulerServer.scala 60.00% 6 Missing ⚠️
...apache/openwhisk/core/service/WatcherService.scala 0.00% 5 Missing ⚠️
.../core/containerpool/v2/ActivationClientProxy.scala 70.58% 5 Missing ⚠️
...enwhisk/core/loadBalancer/CommonLoadBalancer.scala 0.00% 2 Missing ⚠️
... and 3 more

❗ The number of uploaded reports differs between BASE (ef725a6) and HEAD (e7a100b): HEAD has 40 fewer uploads than BASE (BASE: 45, HEAD: 5).
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5338      +/-   ##
==========================================
- Coverage   81.66%   76.30%   -5.37%     
==========================================
  Files         240      240              
  Lines       14313    14374      +61     
  Branches      610      596      -14     
==========================================
- Hits        11689    10968     -721     
- Misses       2624     3406     +782     


@style95 force-pushed the add-zero-downtime-deployment branch 3 times, most recently from eac1c79 to eabb2bc (October 21, 2022 07:17)
@style95 force-pushed the add-zero-downtime-deployment branch 7 times, most recently from cddb631 to 37d6172 (October 21, 2022 11:04)
@style95 force-pushed the add-zero-downtime-deployment branch from 37d6172 to e76938b (October 21, 2022 11:25)
@style95 force-pushed the add-zero-downtime-deployment branch from c8471fc to ec626d8 (October 22, 2022 07:01)
```diff
@@ -250,7 +251,7 @@ trait InvokerCore {
   def disable(): String
   def isEnabled(): String
   def backfillPrewarm(): Route
-  def status(): Future[Map[String, List[String]]]
+  def getPoolState(): Future[Either[NotSupportedPoolState, TotalContainerPoolState]]
```

@style95 (Member, Author)

This is one small change from the original code.
I wanted to isolate the Akka-HTTP dependency in the invoker server layer,
so each InvokerReactive just returns non-Akka-HTTP data,
and only the invoker server depends on Akka-HTTP.
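
A minimal sketch of that separation, with stub case classes standing in for the PR's pool-state types; the path matches the /pool/count route mentioned in the commits below, but the response shape and status code here are illustrative.

```scala
import akka.http.scaladsl.model.StatusCodes
import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.server.Route
import scala.concurrent.Future

// Stub types standing in for the PR's pool-state classes.
final case class TotalContainerPoolState(totalContainers: Int)
final case class NotSupportedPoolState()

// The core layer returns plain data (an Either); only this server-layer
// function knows about Akka-HTTP and turns the data into a response.
def poolCountRoute(getPoolState: () => Future[Either[NotSupportedPoolState, TotalContainerPoolState]]): Route =
  (path("pool" / "count") & get) {
    onSuccess(getPoolState()) {
      case Right(poolState) => complete(poolState.totalContainers.toString)
      case Left(_)          => complete(StatusCodes.NotImplemented)
    }
  }
```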

```scala
complete {
  invoker.getPoolState().map {
    case Right(poolState) =>
      poolState.totalContainers.toJson.compactPrint
```

Contributor

The total containers in the container-pool state object includes prewarm containers, so I'm not sure you want those included in this route when determining whether the invoker is cleared; unless disabling immediately tears down prewarms, in which case it's probably fine. If that isn't adequate, I think what you want for determining whether the invoker is safe to shut down is to sum the totalContainers values of the busyPool and warmedPool.
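
In other words, something like the following, sketched with hypothetical pool-state shapes rather than the PR's exact classes:

```scala
// Hypothetical shapes, for illustration only.
final case class PoolSnapshot(totalContainers: Int)
final case class PoolStateView(busyPool: PoolSnapshot, warmedPool: PoolSnapshot, prewarmedPool: PoolSnapshot)

// Count only the containers that matter for a safe shutdown (busy + warmed),
// leaving prewarm containers out since they hold no user activations.
def inUseContainers(state: PoolStateView): Int =
  state.busyPool.totalContainers + state.warmedPool.totalContainers
```

The PR ultimately resolves this by excluding the prewarm-container count from the /pool/count route (see the commit list at the end).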

@style95 force-pushed the add-zero-downtime-deployment branch from 44f4baa to 0d4ed5b (October 29, 2022 07:20)
@style95 force-pushed the add-zero-downtime-deployment branch from 0d4ed5b to bad5904 (October 29, 2022 08:07)
@style95 changed the title from [WIP] Add zero downtime deployment to Add zero downtime deployment on Oct 29, 2022
@style95 (Member, Author) left a comment

I think this is ready for review.

I will test the zero-downtime deployment once more.

```diff
@@ -154,8 +168,8 @@ class MemoryQueue(private val etcdClient: EtcdClient,
   private val staleQueueRemovedMsg = QueueRemoved(invocationNamespace, action.toDocId.asDocInfo(revision), None)
   private val actionRetentionTimeout = MemoryQueue.getRetentionTimeout(actionMetaData, queueConfig)
 
-  private[queue] var containers = Set.empty[String]
-  private[queue] var creationIds = Set.empty[String]
+  private[queue] var containers = java.util.concurrent.ConcurrentHashMap.newKeySet[String]().asScala
```

@style95 (Member, Author)

This was one of the causes of failed activations during deployment:
the Set was not thread-safe.
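
For clarity, a minimal sketch of the replacement shown in the diff above; the converter import is the Scala 2.13 one (scala.jdk.CollectionConverters), while older Scala versions used scala.collection.JavaConverters.

```scala
import scala.jdk.CollectionConverters._

// A mutable Scala Set view backed by a ConcurrentHashMap key set:
// mutations from multiple threads are safe, unlike the previous
// plain `var containers = Set.empty[String]`.
val containers: scala.collection.mutable.Set[String] =
  java.util.concurrent.ConcurrentHashMap.newKeySet[String]().asScala

containers += "container-1" // thread-safe add
containers -= "container-1" // thread-safe remove
```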

Contributor

This was a concern of mine in the past with all of the vars in MemoryQueue. Are we sure this covers all the var cases now, so that it's thread-safe?

I'm trying to debug an issue where etcd keys are orphaned for running containers (not warmed) that no longer exist when I disable a scheduler and do a rolling deployment. I can reproduce it any time I disable a scheduler right now. It's a new issue, and I think it may have been introduced with the cycle-sending commit ef725a6; I'm wondering if this actually fixes it.

@style95 (Member, Author)

Hm, first, the etcd keys for running containers are handled by invokers, so it's surprising that you hit this issue during a rolling deployment of schedulers. I think you need to trace the problematic container in the invoker.

I haven't looked into it deeply yet, but if there are any issues, they could be because of this, though I haven't faced any so far. I'm not sure there are any easily replaceable alternatives:

```scala
private var requestBuffer = mutable.PriorityQueue.empty[BufferedRequest]
```

The others are basically immutable collections, so I think it is OK to use them in the context of an Akka actor.
If this really needs to be updated, I hope it is handled in a subsequent PR.

@style95 (Member, Author) commented Oct 29, 2022

It's ready to merge.
No errors occurred during the deployment.
[image: deployment result]

```diff
@@ -79,7 +85,7 @@ object FPCSchedulerServer {
 
   private val schedulerUsername = loadConfigOrThrow[String](ConfigKeys.whiskSchedulerUsername)
   private val schedulerPassword = loadConfigOrThrow[String](ConfigKeys.whiskSchedulerPassword)
-  private val queuePathPrefix = "queue"
+  private val queuePathPrefix = "queues"
```

Contributor

Since the queuePathPrefix is changed, does the first deployment still support zero-downtime deployment?

@style95 (Member, Author) commented Oct 31, 2022

The scheduler deployment no longer uses the number of queues; it uses the number of in-flight activations instead.
https://github.com/apache/openwhisk/pull/5338/files#diff-168a0313e0b46601cf796cd58db3a422a426015b91de821de1951bc88c62db7cR297

@ningyougang (Contributor) commented Oct 31, 2022

When we deploy this feature for the first time, the old component has no activation/count API, so requests from the ansible script against the old component will fail. What happens in that case?

@style95 (Member, Author)

I see your point; it's a fair one.
Since zero-downtime deployment was not supported in the upstream version, I think there would be no issue upstream.

It only matters in our downstream, where I think we need some manual operations: since all the APIs for each component differ from our downstream's, we would need to manually disable each component and wait until it is ready before deploying it.

@ningyougang (Contributor) commented Oct 31, 2022

  • A small question: Add zero downtime deployment #5338 (comment)
  • Deploying one component at a time on VMs works, but with many components the deployment may take a long time. So, can we deploy several components at the same time, as with the old half-by-half approach?
  • If we want to deploy in a k8s environment, is any change needed in the openwhisk-deploy-kube repo?

@style95 removed the wip label Oct 31, 2022
@style95 (Member, Author) commented Oct 31, 2022

  • So, can we deploy several components at the same time, as with the old half-by-half approach?

Yes. I deployed them one by one only to minimize the impact of the deployment.
For example, if we deploy two schedulers at once, only one scheduler is left running in my environment, so all queues would be created in that one scheduler. That was fine under the moderate load I used, but I wanted to avoid the situation anyway.
We can deploy multiple components at once as long as we reverse the order in the second deployment.

  • If we want to deploy in a k8s environment, is any change needed in the openwhisk-deploy-kube repo?

When we deploy OpenWhisk on K8s, we generally use a blue/green deployment with a new Helm release, so I think it would be fine.

@bdoyle0182 (Contributor)

Just have one question, otherwise LGTM

@ningyougang (Contributor)

LGTM with a small question: #5338 (comment)

@style95 (Member, Author) commented Oct 31, 2022

I will merge this tomorrow.
Please leave any remaining comments before then.

@style95 merged commit 651a2e9 into apache:master on Nov 1, 2022
msciabarra pushed a commit to nuvolaris/openwhisk that referenced this pull request Nov 23, 2022
* Deploy controllers without downtime

* Deploy invokers without downtime

* Deploy schedulers without downtime

* Fix typo

* Fix typo

* Add a disable API to controllers

* Remove unnecessary steps

* Add more logs for container liveness

* Change Set to thread-safe one

* Use the transaction ID of the activation

* Gracefully shutdown activation client proxy

* Update core/invoker/src/main/scala/org/apache/openwhisk/core/containerpool/v2/ActivationClientProxy.scala

Apply suggestion

Co-authored-by: Brendan Doyle <[email protected]>

* Update core/invoker/src/main/scala/org/apache/openwhisk/core/containerpool/v2/ActivationClientProxy.scala

Apply suggestion

Co-authored-by: Brendan Doyle <[email protected]>

* Update core/invoker/src/main/scala/org/apache/openwhisk/core/containerpool/v2/ActivationClientProxy.scala

Co-authored-by: Brendan Doyle <[email protected]>

* Apply apache#5334

* Remove akka-http dependency from the invoker reactive

* Exclude the prewarm containers count from the /pool/count route

* Add missing import

* Make it compatible with scala-2.13

In scala-2.13 mapValues returns a MapView, and it cannot be cast to Map by default.

* Fix test cases

* Add container id to the logs of ActivationClientProxy

Co-authored-by: Brendan Doyle <[email protected]>