Add zero downtime deployment #5338
Conversation
*/
protected[controller] val activationStatus = {
implicit val executionContext = actorSystem.dispatcher
(path("activation") & get) {
Would something like /inflightActivations and /inflightActivations/instance make more sense?
This is based on my experience that simple, memorable URLs without case-sensitive words work better for APIs in many cases.
@@ -413,6 +413,36 @@ class FunctionPullingContainerPool(
// Reset the prewarmCreateCount value when do expiration check and backfill prewarm if possible
prewarmCreateFailedCount.set(0)
adjustPrewarmedContainer(false, true)

case StatusQuery =>
I think we can combine this part with my PR #5334
That looks great, I'll apply it.
} ~ (path(FPCSchedulerServer.queuePathPrefix / "status") & get) {
complete {
scheduler.getQueueStatusData.map(s => s.toJson)
} ~ (path("activation" / "count") & get) {
Nice, this will be great to have for the schedulers. One other thing we need to address for true zero downtime is Kafka for the scheduler, since activations could get stuck waiting on a specific scheduler topic while that scheduler is being restarted.
Currently, that is handled in this way:
- When a scheduler is disabled, it first removes its endpoint from etcd.
- Controllers no longer send any new requests to that scheduler because there is no endpoint.
- Messages already in Kafka can still be delivered to the scheduler because the consumer does not shut down.
- The scheduler is not redeployed until the number of in-flight activations reaches 0.
- Once all activations are handled, the scheduler is redeployed.
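A minimal sketch of that drain flow, assuming hypothetical helpers `disable()`, `inflightActivations()`, and `redeploy()` standing in for the real disable API, the activation count route, and the deployment tooling:

```scala
// Sketch only: the disable/drain/redeploy sequence described above.
// disable(), inflightActivations(), and redeploy() are hypothetical stand-ins.
def drainAndRedeploy(disable: () => Unit,
                     inflightActivations: () => Int,
                     redeploy: () => Unit,
                     pollIntervalMs: Long = 5000L): Unit = {
  disable() // removes the scheduler endpoint from etcd; the Kafka consumer keeps running
  while (inflightActivations() > 0) {
    Thread.sleep(pollIntervalMs) // wait until all in-flight activations are handled
  }
  redeploy() // safe to redeploy once the count reaches zero
}
```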
*
* @return running activation
*/
protected[controller] val activationStatus = {
Can we make it so that GracefulShutdown of the controller actor system requires in-flight activations to go down to zero? I think that would simplify external monitoring for determining whether it's safe to shut down, so that checking these routes isn't a necessity of the deployment monitoring scripts.
Yes, that is handled by https://github.com/apache/openwhisk/pull/5338/files#diff-379956107a78f18274d3895bbbb268b964cc5f89ac3601333c4255cc0bb7632dR380
It will use the number of in-flight activations.
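For reference, a hedged sketch of how an Akka CoordinatedShutdown task could hold the actor-system shutdown until an in-flight counter drains; `inflightActivations` here is a hypothetical placeholder, not the PR's actual counter:

```scala
import java.util.concurrent.atomic.AtomicInteger

import akka.Done
import akka.actor.{ActorSystem, CoordinatedShutdown}
import scala.concurrent.Future

object GracefulShutdownSketch extends App {
  implicit val system: ActorSystem = ActorSystem("controller")
  import system.dispatcher

  // Hypothetical in-flight activation counter.
  val inflightActivations = new AtomicInteger(0)

  // Register a task that holds the shutdown sequence until the counter reaches zero.
  CoordinatedShutdown(system).addTask(
    CoordinatedShutdown.PhaseBeforeServiceUnbind, "wait-for-inflight-activations") { () =>
    Future {
      while (inflightActivations.get() > 0) Thread.sleep(1000)
      Done
    }
  }
}
```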
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5338      +/-   ##
==========================================
- Coverage   81.66%   76.30%    -5.37%
==========================================
  Files         240      240
  Lines       14313    14374      +61
  Branches      610      596      -14
==========================================
- Hits        11689    10968     -721
- Misses       2624     3406     +782

☔ View full report in Codecov by Sentry.
@@ -250,7 +251,7 @@ trait InvokerCore {
def disable(): String
def isEnabled(): String
def backfillPrewarm(): Route
def status(): Future[Map[String, List[String]]]
def getPoolState(): Future[Either[NotSupportedPoolState, TotalContainerPoolState]]
This is one small change from the original code.
I wanted to isolate the Akka-HTTP dependency to the invoker server layer.
So each InvokerReactive just returns data that has no Akka-HTTP types in it.
Only the invoker server depends on Akka-HTTP.
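A simplified sketch of that separation; the types below are pared-down stand-ins for the PR's actual NotSupportedPoolState and TotalContainerPoolState, and the route line is only an illustration:

```scala
import scala.concurrent.Future

// Pared-down stand-ins for the real pool-state types.
final case class TotalContainerPoolState(totalContainers: Int)
final case class NotSupportedPoolState(reason: String = "not supported")

// The core layer exposes plain data only -- no Akka-HTTP types appear here.
trait InvokerCoreSketch {
  def getPoolState(): Future[Either[NotSupportedPoolState, TotalContainerPoolState]]
}

// Only the server layer depends on Akka-HTTP; it would wrap the data in a Route, e.g.
//   (path("pool" / "count") & get) { complete(core.getPoolState().map(...)) }
```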
complete {
invoker.getPoolState().map {
case Right(poolState) =>
poolState.totalContainers.toJson.compactPrint
The total containers in the container pool state object includes prewarm containers, so I'm not sure you want those included in this route when determining whether the invoker is cleared; unless disabling immediately tears down prewarms, in which case it's probably fine. If that isn't adequate, I think what you want for determining whether the invoker is safe to shut down is summing the totalContainers values of the busyPool and warmedPool.
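A hedged sketch of that suggestion; the field names are assumptions about the pool-state shape, not the PR's exact definitions:

```scala
// Assumed shapes: each sub-pool reports its own container count.
final case class ContainerPoolStateSketch(totalContainers: Int)
final case class TotalContainerPoolStateSketch(busyPool: ContainerPoolStateSketch,
                                               warmedPool: ContainerPoolStateSketch,
                                               prewarmedPool: ContainerPoolStateSketch) {
  // Count only containers that still hold user workloads; prewarm containers are excluded.
  def inUseContainers: Int = busyPool.totalContainers + warmedPool.totalContainers
}
```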
In Scala 2.13, mapValues returns a MapView, which cannot be cast to a Map by default.
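For illustration, the usual Scala 2.13 workaround is to go through a view and materialize the result explicitly (the map below is made up for the example):

```scala
val containersPerInvoker = Map("invoker0" -> List("c1", "c2"), "invoker1" -> List("c3"))

// Scala 2.13: mapValues on a view returns a MapView, so call .toMap to get a strict Map.
val counts: Map[String, Int] = containersPerInvoker.view.mapValues(_.size).toMap
```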
I think this is ready for review.
I will test the zero downtime deployment again.
@@ -154,8 +168,8 @@ class MemoryQueue(private val etcdClient: EtcdClient,
private val staleQueueRemovedMsg = QueueRemoved(invocationNamespace, action.toDocId.asDocInfo(revision), None)
private val actionRetentionTimeout = MemoryQueue.getRetentionTimeout(actionMetaData, queueConfig)

private[queue] var containers = Set.empty[String]
private[queue] var creationIds = Set.empty[String]
private[queue] var containers = java.util.concurrent.ConcurrentHashMap.newKeySet[String]().asScala
This was one of the reasons for failed activations during the deployment: the Set was not thread-safe.
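A minimal sketch of the thread-safe replacement, assuming the Scala 2.12 JavaConverters import (on 2.13 it would be scala.jdk.CollectionConverters._):

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._ // scala.jdk.CollectionConverters._ on Scala 2.13

// A mutable Set[String] backed by ConcurrentHashMap, safe for concurrent updates
// from multiple threads, unlike a plain var holding an immutable Set.
val containers: scala.collection.mutable.Set[String] =
  ConcurrentHashMap.newKeySet[String]().asScala

containers += "container-1" // add
containers -= "container-1" // remove
```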
This was a concern for me in the past with all of the vars in MemoryQueue. Are we sure this covers all var cases now so that it's thread-safe?
I'm trying to debug an issue where etcd keys are orphaned for running containers (not warmed) that do not exist when I disable a scheduler and do a rolling deployment. I can reproduce it anytime I disable a scheduler right now. It's a new issue, and I think it may have been introduced with the cycle-sending commit ef725a6; I'm wondering if this actually fixes it.
Hm, first, the etcd keys for running containers are handled by invokers, so it's surprising you face the issue when you do a rolling deployment of schedulers. I think you need to trace the problematic container in the invoker.
I haven't looked into it deeply yet, but if there are any issues, they could be because of this, though I haven't faced any so far.
Not sure if there are any easily replaceable alternatives for
private var requestBuffer = mutable.PriorityQueue.empty[BufferedRequest]
The others are basically immutable collections, so I think it would be ok to use them in the context of an Akka actor.
If this really needs to be updated, I hope it is handled in a subsequent PR.
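If a replacement ever did become necessary, one hedged option is java.util.concurrent.PriorityBlockingQueue; the BufferedRequest below is a hypothetical stand-in for the real type, and the ordering direction would need checking against the existing behaviour:

```scala
import java.util.concurrent.PriorityBlockingQueue

// Hypothetical stand-in for the real BufferedRequest.
final case class BufferedRequest(priority: Int, payload: String)

// Scala's Ordering is a java.util.Comparator, so it can be passed directly.
// Note: PriorityBlockingQueue polls the smallest element first, so the ordering
// may need to be reversed relative to mutable.PriorityQueue (which dequeues the largest).
val ordering: Ordering[BufferedRequest] = Ordering.by[BufferedRequest, Int](_.priority)

// Thread-safe priority queue (11 is the JDK's default initial capacity).
val requestBuffer = new PriorityBlockingQueue[BufferedRequest](11, ordering)
requestBuffer.put(BufferedRequest(priority = 1, payload = "warm-up"))
val next: BufferedRequest = requestBuffer.poll()
```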
@@ -79,7 +85,7 @@ object FPCSchedulerServer {

private val schedulerUsername = loadConfigOrThrow[String](ConfigKeys.whiskSchedulerUsername)
private val schedulerPassword = loadConfigOrThrow[String](ConfigKeys.whiskSchedulerPassword)
private val queuePathPrefix = "queue"
private val queuePathPrefix = "queues"
Because the queuePathPrefix is changed, does the first deployment still support zero downtime deployment?
The scheduler deployment no longer uses the number of queues.
It uses the number of activations instead.
https://github.com/apache/openwhisk/pull/5338/files#diff-168a0313e0b46601cf796cd58db3a422a426015b91de821de1951bc88c62db7cR297
When we deploy this feature for the first time, there is no activation/count API in the old component, so the request against the old component from the Ansible script would fail. In that case, what will happen?
I got your point; that is a fair point.
Since zero downtime deployment was not supported in the upstream version, I think there would be no issue with it there.
It only matters in our downstream, where I think we need some manual operations.
Since all the APIs for each component are different from our downstream, I think we need to manually disable each component and wait until it is ready before deploying it.
Yes, I just deployed them one by one to minimize the impact of the deployment.
When we deploy OW on K8S, we generally use blue/green deployment with a new helm release, so I think it would be ok.
Just have one question, otherwise LGTM.
LGTM with a small question: #5338 (comment)
I will merge this tomorrow.
* Deploy controllers without downtime
* Deploy invokers without downtime
* Deploy schedulers without downtime
* Fix typo
* Fix typo
* Add a disable API to controllers
* Remove unnecessary steps
* Add more logs for container liveness
* Change Set to thread-safe one
* Use the transaction ID of the activation
* Gracefully shutdown activation client proxy
* Update core/invoker/src/main/scala/org/apache/openwhisk/core/containerpool/v2/ActivationClientProxy.scala
  Apply suggestion
  Co-authored-by: Brendan Doyle <[email protected]>
* Update core/invoker/src/main/scala/org/apache/openwhisk/core/containerpool/v2/ActivationClientProxy.scala
  Apply suggestion
  Co-authored-by: Brendan Doyle <[email protected]>
* Update core/invoker/src/main/scala/org/apache/openwhisk/core/containerpool/v2/ActivationClientProxy.scala
  Co-authored-by: Brendan Doyle <[email protected]>
* Apply apache#5334
* Remove akka-http dependency from the invoker reactive
* Exclude the prewarm containers count from the /pool/count route
* Add missing import
* Make it compatible with scala-2.13
  In scala-2.13 mapValues returns a MapView, and it cannot be cast to Map by default.
* Fix test cases
* Add container id to the logs of ActivationClientProxy

Co-authored-by: Brendan Doyle <[email protected]>
Description
This is to deploy OpenWhisk without downtime on the same nodes.
This will also address the regression mentioned here.