Add zero downtime deployment #5338

Merged: 21 commits into apache:master on Nov 1, 2022

Conversation

@style95 (Member) commented Oct 19, 2022

Description

This change deploys OpenWhisk without downtime on the same nodes.
It also addresses the regression mentioned here.

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Scheduler
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@style95 added the wip label Oct 19, 2022
```scala
 */
protected[controller] val activationStatus = {
  implicit val executionContext = actorSystem.dispatcher
  (path("activation") & get) {
```

@bdoyle0182 (Contributor) commented Oct 19, 2022

would something like /inflightActivations and /inflightActivations/instance make more sense?

@style95 (Member, Author)

This is based on my experience: simple, more memorable URLs that avoid case-sensitive words tend to work better for APIs in many cases.

```scala
@@ -413,6 +413,36 @@ class FunctionPullingContainerPool(
      // Reset the prewarmCreateFailedCount value when doing the expiration check, and backfill prewarm containers if possible
      prewarmCreateFailedCount.set(0)
      adjustPrewarmedContainer(false, true)

    case StatusQuery =>
```

Contributor

I think we can combine this part with my PR #5334

@style95 (Member, Author)

That looks great; I'll apply it.

```scala
} ~ (path(FPCSchedulerServer.queuePathPrefix / "status") & get) {
  complete {
    scheduler.getQueueStatusData.map(s => s.toJson)
} ~ (path("activation" / "count") & get) {
```

Contributor

Nice, this will be great to have for the schedulers. One other thing we need to address for true zero downtime is Kafka for the scheduler, since activations could get stuck waiting on a specific scheduler topic while that scheduler is being restarted.

@style95 (Member, Author)

Currently, that is handled in this way (a polling sketch of step 4 follows the list):

  1. When a scheduler is disabled, it first removes its endpoint from etcd.
  2. Controllers then send no new requests to that scheduler, as there is no endpoint for it.
  3. Messages already in Kafka can still be delivered to the scheduler because its consumer does not shut down.
  4. The scheduler is not redeployed until the number of in-flight activations reaches 0.
  5. Once all activations are handled, the scheduler is redeployed.
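
To make step 4 concrete: below is a minimal polling sketch (in Scala, using the JDK HTTP client) of what a deployment monitor could do with the activation/count route added in this PR. The endpoint string and the bare-integer response body are assumptions for illustration, and the real FPCSchedulerServer routes sit behind basic auth, which is omitted here.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object SchedulerDrainMonitor {
  private val client = HttpClient.newHttpClient()

  // Poll the scheduler's /activation/count route (added in this PR).
  // Assumes the body is a bare integer; basic auth is omitted for brevity.
  def inFlightActivations(schedulerEndpoint: String): Int = {
    val request = HttpRequest
      .newBuilder(URI.create(s"http://$schedulerEndpoint/activation/count"))
      .GET()
      .build()
    client.send(request, HttpResponse.BodyHandlers.ofString()).body().trim.toInt
  }

  // Block until the disabled scheduler has drained all in-flight activations;
  // only then is it safe to redeploy it (steps 4 and 5 above).
  def waitUntilDrained(schedulerEndpoint: String, pollIntervalMs: Long = 5000L): Unit = {
    while (inFlightActivations(schedulerEndpoint) > 0) {
      Thread.sleep(pollIntervalMs)
    }
  }
}
```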

```scala
 *
 * @return running activation
 */
protected[controller] val activationStatus = {
```

Contributor

Can we make it so that GracefulShutdown of the controller actor system requires in-flight activations to go down to zero? I think that would simplify external monitoring for determining whether it's safe to shut down, so that checking these routes isn't a necessity of the deployment-monitor scripting.
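
For reference, a minimal sketch of how such a requirement could be wired up with Akka's CoordinatedShutdown; the inFlightActivations counter is a hypothetical stand-in, not code from this PR.

```scala
import akka.Done
import akka.actor.{ActorSystem, CoordinatedShutdown}
import akka.pattern.after
import scala.concurrent.Future
import scala.concurrent.duration._

// Register a shutdown task that completes only once the (hypothetical)
// in-flight activation counter reaches zero, so the actor system cannot
// finish shutting down while activations are still running.
def requireDrainOnShutdown(system: ActorSystem, inFlightActivations: () => Int): Unit = {
  implicit val ec = system.dispatcher
  CoordinatedShutdown(system)
    .addTask(CoordinatedShutdown.PhaseServiceRequestsDone, "wait-for-in-flight-activations") { () =>
      def poll(): Future[Done] =
        if (inFlightActivations() <= 0) Future.successful(Done)
        else after(1.second, system.scheduler)(poll())
      poll()
    }
}
```

Note that each CoordinatedShutdown phase has a configurable timeout (akka.coordinated-shutdown.phases.service-requests-done.timeout), so a long drain would need that raised accordingly.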

@style95 force-pushed the add-zero-downtime-deployment branch 2 times, most recently from e2219df to 8444e58 (October 20, 2022 04:51)
@codecov-commenter commented Oct 20, 2022

Codecov Report

Attention: Patch coverage is 38.75969% with 79 lines in your changes missing coverage. Please review.

Project coverage is 76.30%. Comparing base (ef725a6) to head (e7a100b).
Report is 79 commits behind head on master.

Files with missing lines Patch % Lines
...e/openwhisk/core/scheduler/queue/MemoryQueue.scala 33.33% 12 Missing ⚠️
.../apache/openwhisk/core/controller/Controller.scala 42.10% 11 Missing ⚠️
...ontainerpool/v2/FunctionPullingContainerPool.scala 9.09% 10 Missing ⚠️
...ache/openwhisk/core/invoker/FPCInvokerServer.scala 28.57% 10 Missing ⚠️
...pache/openwhisk/core/invoker/InvokerReactive.scala 0.00% 7 Missing ⚠️
...he/openwhisk/core/invoker/FPCInvokerReactive.scala 0.00% 6 Missing ⚠️
.../openwhisk/core/scheduler/FPCSchedulerServer.scala 60.00% 6 Missing ⚠️
...apache/openwhisk/core/service/WatcherService.scala 0.00% 5 Missing ⚠️
.../core/containerpool/v2/ActivationClientProxy.scala 70.58% 5 Missing ⚠️
...enwhisk/core/loadBalancer/CommonLoadBalancer.scala 0.00% 2 Missing ⚠️
... and 3 more

❗ The number of uploaded reports differs between BASE (ef725a6) and HEAD (e7a100b): HEAD has 40 fewer uploads than BASE (BASE: 45, HEAD: 5).
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5338      +/-   ##
==========================================
- Coverage   81.66%   76.30%   -5.37%     
==========================================
  Files         240      240              
  Lines       14313    14374      +61     
  Branches      610      596      -14     
==========================================
- Hits        11689    10968     -721     
- Misses       2624     3406     +782     


@style95 force-pushed the add-zero-downtime-deployment branch 3 times, most recently from eac1c79 to eabb2bc (October 21, 2022 07:17)
@style95 force-pushed the add-zero-downtime-deployment branch 7 times, most recently from cddb631 to 37d6172 (October 21, 2022 11:04)
@style95 force-pushed the add-zero-downtime-deployment branch from 37d6172 to e76938b (October 21, 2022 11:25)
@style95 force-pushed the add-zero-downtime-deployment branch from c8471fc to ec626d8 (October 22, 2022 07:01)
```diff
@@ -250,7 +251,7 @@ trait InvokerCore {
   def disable(): String
   def isEnabled(): String
   def backfillPrewarm(): Route
-  def status(): Future[Map[String, List[String]]]
+  def getPoolState(): Future[Either[NotSupportedPoolState, TotalContainerPoolState]]
```

@style95 (Member, Author)

This is one small change from the original code.
I wanted to isolate the Akka-HTTP dependency in the invoker server layer,
so each InvokerReactive just returns non-Akka-HTTP data,
and only the invoker server depends on Akka-HTTP.
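
A minimal sketch of that separation, with stub case classes standing in for the PR's pool-state types; the path matches the /pool/count route mentioned in the commits below, but the response shape and status code here are illustrative.

```scala
import akka.http.scaladsl.model.StatusCodes
import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.server.Route
import scala.concurrent.Future

// Stub types standing in for the PR's pool-state classes.
final case class TotalContainerPoolState(totalContainers: Int)
final case class NotSupportedPoolState()

// The core layer returns plain data (an Either); only this server-layer
// function knows about Akka-HTTP and turns the data into a response.
def poolCountRoute(getPoolState: () => Future[Either[NotSupportedPoolState, TotalContainerPoolState]]): Route =
  (path("pool" / "count") & get) {
    onSuccess(getPoolState()) {
      case Right(poolState) => complete(poolState.totalContainers.toString)
      case Left(_)          => complete(StatusCodes.NotImplemented)
    }
  }
```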

```scala
complete {
  invoker.getPoolState().map {
    case Right(poolState) =>
      poolState.totalContainers.toJson.compactPrint
```

Contributor

The total containers in the container-pool state object includes prewarm containers, so I'm not sure you want those included in this route when determining whether the invoker is cleared; unless disabling immediately tears down prewarms, in which case it's probably fine. If that isn't adequate, I think what you want for determining whether the invoker is safe to shut down is to sum the totalContainers values of the busyPool and warmedPool.
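
In other words, something like the following, sketched with hypothetical pool-state shapes rather than the PR's exact classes:

```scala
// Hypothetical shapes, for illustration only.
final case class PoolSnapshot(totalContainers: Int)
final case class PoolStateView(busyPool: PoolSnapshot, warmedPool: PoolSnapshot, prewarmedPool: PoolSnapshot)

// Count only the containers that matter for a safe shutdown (busy + warmed),
// leaving prewarm containers out since they hold no user activations.
def inUseContainers(state: PoolStateView): Int =
  state.busyPool.totalContainers + state.warmedPool.totalContainers
```

The PR ultimately resolves this by excluding the prewarm-container count from the /pool/count route (see the commit list at the end).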

@style95 force-pushed the add-zero-downtime-deployment branch from 44f4baa to 0d4ed5b (October 29, 2022 07:20)
@style95 force-pushed the add-zero-downtime-deployment branch from 0d4ed5b to bad5904 (October 29, 2022 08:07)
@style95 changed the title from [WIP] Add zero downtime deployment to Add zero downtime deployment on Oct 29, 2022
@style95 (Member, Author) left a comment

I think this is ready for review.

I will test the zero-downtime deployment once more.

```diff
@@ -154,8 +168,8 @@ class MemoryQueue(private val etcdClient: EtcdClient,
   private val staleQueueRemovedMsg = QueueRemoved(invocationNamespace, action.toDocId.asDocInfo(revision), None)
   private val actionRetentionTimeout = MemoryQueue.getRetentionTimeout(actionMetaData, queueConfig)
 
-  private[queue] var containers = Set.empty[String]
-  private[queue] var creationIds = Set.empty[String]
+  private[queue] var containers = java.util.concurrent.ConcurrentHashMap.newKeySet[String]().asScala
```

@style95 (Member, Author)

This was one of the causes of failed activations during deployment:
the Set was not thread-safe.
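
For clarity, a minimal sketch of the replacement shown in the diff above; the converter import is the Scala 2.13 one (scala.jdk.CollectionConverters), while older Scala versions used scala.collection.JavaConverters.

```scala
import scala.jdk.CollectionConverters._

// A mutable Scala Set view backed by a ConcurrentHashMap key set:
// mutations from multiple threads are safe, unlike the previous
// plain `var containers = Set.empty[String]`.
val containers: scala.collection.mutable.Set[String] =
  java.util.concurrent.ConcurrentHashMap.newKeySet[String]().asScala

containers += "container-1" // thread-safe add
containers -= "container-1" // thread-safe remove
```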

Contributor

This was a concern of mine in the past with all of the vars in MemoryQueue. Are we sure this covers all the var cases now, so that it's thread-safe?

I'm trying to debug an issue where etcd keys are orphaned for running containers (not warmed) that no longer exist when I disable a scheduler and do a rolling deployment. I can reproduce it any time I disable a scheduler right now. It's a new issue, and I think it may have been introduced with the cycle-sending commit ef725a6; I'm wondering if this actually fixes it.

@style95 (Member, Author)

Hm, first, the etcd keys for running containers are handled by invokers, so it's surprising that you hit this issue during a rolling deployment of schedulers. I think you need to trace the problematic container in the invoker.

I haven't looked into it deeply yet, but if there are any issues, they could be because of this, though I haven't faced any so far. I'm not sure there are any easily replaceable alternatives:

```scala
private var requestBuffer = mutable.PriorityQueue.empty[BufferedRequest]
```

The others are basically immutable collections, so I think it is OK to use them in the context of an Akka actor.
If this really needs to be updated, I hope it is handled in a subsequent PR.

@style95 (Member, Author) commented Oct 29, 2022

It's ready to merge.
No errors occurred during the deployment.
[image: deployment result]

```diff
@@ -79,7 +85,7 @@ object FPCSchedulerServer {
 
   private val schedulerUsername = loadConfigOrThrow[String](ConfigKeys.whiskSchedulerUsername)
   private val schedulerPassword = loadConfigOrThrow[String](ConfigKeys.whiskSchedulerPassword)
-  private val queuePathPrefix = "queue"
+  private val queuePathPrefix = "queues"
```

Contributor

Since the queuePathPrefix is changed, does the first deployment still support zero-downtime deployment?

@style95 (Member, Author) commented Oct 31, 2022

The scheduler deployment no longer uses the number of queues; it uses the number of in-flight activations instead.
https://github.com/apache/openwhisk/pull/5338/files#diff-168a0313e0b46601cf796cd58db3a422a426015b91de821de1951bc88c62db7cR297

@ningyougang (Contributor) commented Oct 31, 2022

When we deploy this feature for the first time, the old component has no activation/count API, so requests from the ansible script against the old component will fail. What happens in that case?

@style95 (Member, Author)

I see your point; it's a fair one.
Since zero-downtime deployment was not supported in the upstream version, I think there would be no issue upstream.

It only matters in our downstream, where I think we need some manual operations: since all the APIs for each component differ from our downstream's, we would need to manually disable each component and wait until it is ready before deploying it.

@ningyougang (Contributor) commented Oct 31, 2022

  • A small question: Add zero downtime deployment #5338 (comment)
  • Deploying one component at a time on VMs works, but with many components the deployment may take a long time. So, can we deploy several components at the same time, as with the old half-by-half approach?
  • If we want to deploy in a k8s environment, is any change needed in the openwhisk-deploy-kube repo?

@style95 removed the wip label Oct 31, 2022
@style95 (Member, Author) commented Oct 31, 2022

  • So, can we deploy several components at the same time, as with the old half-by-half approach?

Yes. I deployed them one by one only to minimize the impact of the deployment.
For example, if we deploy two schedulers at once, only one scheduler is left running in my environment, so all queues would be created in that one scheduler. That was fine under the moderate load I used, but I wanted to avoid the situation anyway.
We can deploy multiple components at once as long as we reverse the order in the second deployment.

  • If we want to deploy in a k8s environment, is any change needed in the openwhisk-deploy-kube repo?

When we deploy OpenWhisk on K8s, we generally use a blue/green deployment with a new Helm release, so I think it would be fine.

@bdoyle0182 (Contributor)

Just have one question, otherwise LGTM

@ningyougang (Contributor)

LGTM with a small question: #5338 (comment)

@style95 (Member, Author) commented Oct 31, 2022

I will merge this tomorrow.
Please leave any remaining comments before then.

@style95 merged commit 651a2e9 into apache:master on Nov 1, 2022
msciabarra pushed a commit to nuvolaris/openwhisk that referenced this pull request Nov 23, 2022
* Deploy controllers without downtime

* Deploy invokers without downtime

* Deploy schedulers without downtime

* Fix typo

* Fix typo

* Add a disable API to controllers

* Remove unnecessary steps

* Add more logs for container liveness

* Change Set to thread-safe one

* Use the transaction ID of the activation

* Gracefully shutdown activation client proxy

* Update core/invoker/src/main/scala/org/apache/openwhisk/core/containerpool/v2/ActivationClientProxy.scala

Apply suggestion

Co-authored-by: Brendan Doyle <[email protected]>

* Update core/invoker/src/main/scala/org/apache/openwhisk/core/containerpool/v2/ActivationClientProxy.scala

Apply suggestion

Co-authored-by: Brendan Doyle <[email protected]>

* Update core/invoker/src/main/scala/org/apache/openwhisk/core/containerpool/v2/ActivationClientProxy.scala

Co-authored-by: Brendan Doyle <[email protected]>

* Apply apache#5334

* Remove akka-http dependency from the invoker reactive

* Exclude the prewarm containers count from the /pool/count route

* Add missing import

* Make it compatible with scala-2.13

In scala-2.13 mapValues returns a MapView, and it cannot be cast to Map by default.

* Fix test cases

* Add container id to the logs of ActivationClientProxy

Co-authored-by: Brendan Doyle <[email protected]>