
Handle container cleanup from ActivationClient shutdown gracefully #5348


Merged: style95 merged 5 commits into apache:master on Nov 4, 2022

Conversation

@style95 (Member) commented Nov 2, 2022

Description

This fixes the regression introduced in #5338.

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Scheduler
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@style95 changed the title from "Cleanup container data" to "Fix regression" on Nov 2, 2022
@style95 (Member Author) left a comment

I will test whether it works as expected and see if I can add test cases covering these changes tomorrow.

c.activationClient.close().andThen {
  case _ => self ! ClientClosed
  case _ =>
    context.parent ! FailureMessage(new RuntimeException(errorMsg))
@style95 (Member Author)

As GracefulShutdown is introduced at the ContainerProxy layer, the clean-up process is generally initiated by ContainerProxy: it cleans up ETCD data first, sends GracefulShutdown to ActivationClientProxy, and waits for the ClientClosed message.
But in some cases like this one, ActivationClientProxy needs to initiate the clean-up process itself.
In those cases, ActivationClientProxy should let ContainerProxy (its parent) know about the situation so the parent can start its clean-up immediately while the client proxy shuts itself down.

So now we send a FailureMessage to the parent, and the parent then cleans up ETCD data and removes the container immediately via this logic:
https://github.com/apache/openwhisk/pull/5348/files#diff-23fc1c1634cd8a2e99b4cfbf342527248a53f5987911af80ab5d910ce7864d70R368
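
To make the two shutdown paths concrete, here is a minimal, hedged sketch. The message names mirror the discussion above, but the ClientError trigger and the closeClient parameter are illustrative assumptions, not the production code:

  import akka.actor.Actor
  import scala.concurrent.Future

  // Hypothetical messages mirroring the handshake described above (names assumed for the sketch).
  case object GracefulShutdown
  case object ClientClosed
  final case class FailureMessage(cause: Throwable)
  final case class ClientError(cause: Throwable) // hypothetical trigger for a self-initiated shutdown

  // Sketch of the ActivationClientProxy side of the handshake.
  class ClientProxySketch(closeClient: () => Future[Unit]) extends Actor {
    import context.dispatcher

    def receive: Receive = {
      // Parent-initiated shutdown: ContainerProxy has already cleaned up ETCD data,
      // so just close the activation client and acknowledge with ClientClosed.
      case GracefulShutdown =>
        closeClient().andThen { case _ => context.parent ! ClientClosed }

      // Self-initiated shutdown: tell the parent first so it can clean up ETCD data
      // and remove the container immediately, then close and acknowledge as usual.
      case ClientError(t) =>
        context.parent ! FailureMessage(t)
        closeClient().andThen { case _ => context.parent ! ClientClosed }
    }
  }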

@@ -272,8 +272,7 @@ class FunctionPullingContainerProxy(
job.rpcPort,
container.containerId)) match {
case Success(clientProxy) =>
clientProxy ! StartClient
@style95 (Member Author) · Nov 2, 2022

Previously, this future chaining caused a timing issue, which is why we needed this case:

    case Event(ClientCreationCompleted(proxy), _: NonexistentData) =>
      akka.pattern.after(3.milliseconds, actorSystem.scheduler) {
        self ! ClientCreationCompleted(proxy.orElse(Some(sender())))
        Future.successful({})
      }

https://github.com/apache/openwhisk/pull/5348/files#diff-23fc1c1634cd8a2e99b4cfbf342527248a53f5987911af80ab5d910ce7864d70L341

It made the logic nondeterministic and less efficient, so I refactored it.

Contributor

great

@@ -334,41 +335,27 @@ class FunctionPullingContainerProxy(

when(CreatingClient) {
// wait for client creation when cold start
case Event(job: ContainerCreatedData, _: NonexistentData) =>
stay() using job
case Event(job: InitializedData, _) =>
@style95 (Member Author)

Now it sends StartClient to the ActivationClientProxy only after it receives InitializedData, so there is no longer a timing issue.
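
A simplified sketch of the reordered flow, assuming a pared-down FSM. The real FunctionPullingContainerProxy states and data carry much more; the ClientCreated state and the data shapes here are illustrative:

  import akka.actor.{ActorRef, FSM}

  // Simplified states, data, and messages for illustration only.
  object CreatingClientSketch {
    sealed trait State
    case object CreatingClient extends State
    case object ClientCreated extends State

    sealed trait Data
    case object NonexistentData extends Data
    final case class InitializedData(clientProxy: ActorRef) extends Data

    case object StartClient
    case object ClientCreationCompleted
  }

  class ContainerProxySketch extends FSM[CreatingClientSketch.State, CreatingClientSketch.Data] {
    import CreatingClientSketch._

    startWith(CreatingClient, NonexistentData)

    when(CreatingClient) {
      // StartClient is only sent once InitializedData has arrived, so the client
      // proxy can never be started before the container is ready: no race.
      case Event(init: InitializedData, _) =>
        init.clientProxy ! StartClient
        stay() using init

      // The client proxy acknowledges with a plain marker message (see the
      // ClientCreationCompleted change further down) and we move on.
      case Event(ClientCreationCompleted, init: InitializedData) =>
        goto(ClientCreated) using init
    }

    initialize()
  }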

@@ -518,6 +505,8 @@ class FunctionPullingContainerProxy(
data.action.fullyQualifiedName(withVersion = true),
data.action.rev,
Some(data.clientProxy))

case x: Event if x.event != PingCache => delay
@style95 (Member Author)

This will make sure GracefulShutdown is properly handled in all states.
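
For context on what delay does here: it stashes the message and stays in the current state, and the stash is replayed on the next state transition, as discussed in the review thread below. A minimal, hedged sketch of that pattern, with illustrative state and message names rather than the production ones:

  import akka.actor.{FSM, Stash}

  // Illustrative states and messages; not the production names.
  object DelaySketch {
    sealed trait State
    case object Transitional extends State
    case object Removing extends State

    case object PingCache
    case object ProceedToRemoving
  }

  class DelayingProxySketch extends FSM[DelaySketch.State, Unit] with Stash {
    import DelaySketch._

    startWith(Transitional, ())

    // Stash the current message and keep the current state; the message is
    // delivered again after unstashAll() on the next state transition.
    private def delay = { stash(); stay() }

    when(Transitional) {
      case Event(PingCache, _)         => stay()          // safe to ignore while transitioning
      case Event(ProceedToRemoving, _) => goto(Removing)  // move on to clean-up
      case Event(_, _)                 => delay           // e.g. GracefulShutdown is not lost
    }

    when(Removing) {
      case Event(_, _) => stay()
    }

    onTransition {
      case Transitional -> Removing => unstashAll() // replay everything that was delayed
    }

    initialize()
  }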

Contributor

My concern with stashing a message for all events and then unstashing on state transition is that this pattern was the root cause of the edge case with orphaned etcd data for pausing/unpausing containers: a generic FailureMessage would be stashed and then, once the proxy transitioned to Running, it would be unstashed, leading to the bug / unexpected behavior.

I just want to make sure that we're 100% sure a catch-all here will not have any unknown side effects.

@style95 (Member Author)

This state is a temporary state before removing the container and the proxy itself.
So the next state is always Removing, since all cases in this state lead to it and both paths clean up etcd data properly.

In the Removing state, there is no case that moves back to other states; it either stays or stops.

  when(Removing, unusedTimeout) {
    // only if ClientProxy is closed, ContainerProxy stops. So it is important for ClientProxy to send ClientClosed.
    case Event(ClientClosed, _) =>
      stop()

    // even if any error occurs, it still waits for ClientClosed event in order to be stopped after the client is closed.
    case Event(t: FailureMessage, _) =>
      logging.error(this, s"unable to delete a container due to ${t}")

      stay

    case Event(StateTimeout, _) =>
      logging.error(this, s"could not receive ClientClosed for ${unusedTimeout}, so just stop the container proxy.")

      stop()

    case Event(Remove | GracefulShutdown, _) =>
      stay()

    case Event(DetermineKeepContainer(_), _) =>
      stay()
  }

So I suppose it would be OK to delay.

@@ -36,7 +36,7 @@ import scala.concurrent.Future
import scala.util.{Success, Try}

// Event send by the actor
case class ClientCreationCompleted(client: Option[ActorRef] = None)
case object ClientCreationCompleted
@style95 (Member Author)

Since we now store the client proxy reference after creation, we no longer need to pass/receive it via this message, so it became a plain object.

https://github.com/apache/openwhisk/pull/5348/files#diff-23fc1c1634cd8a2e99b4cfbf342527248a53f5987911af80ab5d910ce7864d70R315
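
Roughly, the change looks like this (a hedged sketch; the real FSM data types carry more fields than this hypothetical holder):

  import akka.actor.ActorRef

  // Before: the completion event had to carry the client proxy reference,
  // because the reference was not yet stored anywhere else.
  final case class ClientCreationCompletedBefore(client: Option[ActorRef] = None)

  // After: the reference is already kept in the proxy's state data as soon as the
  // client proxy is created, so the completion event can be a plain marker.
  case object ClientCreationCompletedAfter

  // Hypothetical, simplified state data holding the reference instead.
  final case class ClientProxyData(clientProxy: ActorRef)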

@@ -96,7 +96,9 @@ class SchedulingDecisionMaker(
this,
s"there is no capacity activations will be dropped or throttled, (availableMsg: $availableMsg totalContainers: $totalContainers, limit: $limit, namespaceContainers: ${existingContainerCountInNs}, namespaceInProgressContainer: ${inProgressContainerCountInNs}) [$invocationNamespace:$action]")
Future.successful(DecisionResults(EnableNamespaceThrottling(dropMsg = totalContainers == 0), 0))
case NamespaceThrottled if schedulingConfig.allowOverProvisionBeforeThrottle && ceiling(limit * schedulingConfig.namespaceOverProvisionBeforeThrottleRatio) - existingContainerCountInNs - inProgressContainerCountInNs > 0 =>
case NamespaceThrottled
if schedulingConfig.allowOverProvisionBeforeThrottle && ceiling(
@style95 (Member Author)

It seems our CI does not run checkScalaFmtAll; some code already on the master branch is incorrectly formatted.

@codecov-commenter commented Nov 2, 2022

Codecov Report

Attention: Patch coverage is 86.95652% with 3 lines in your changes missing coverage. Please review.

Project coverage is 76.35%. Comparing base (077fb6d) to head (5bf63a5).
Report is 74 commits behind head on master.

Files with missing lines                               | Patch %  | Lines
.../core/containerpool/v2/ActivationClientProxy.scala | 90.00%   | 1 Missing ⚠️
...ntainerpool/v2/FunctionPullingContainerProxy.scala | 90.00%   | 1 Missing ⚠️
...k/core/containerpool/v2/InvokerHealthManager.scala | 0.00%    | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5348      +/-   ##
==========================================
- Coverage   81.04%   76.35%   -4.69%     
==========================================
  Files         240      240              
  Lines       14391    14393       +2     
  Branches      605      603       -2     
==========================================
- Hits        11663    10990     -673     
- Misses       2728     3403     +675     


c.activationClient.close().andThen {
  case _ => self ! ClientClosed
  case _ =>
@bdoyle0182 (Contributor) · Nov 2, 2022

I was also seeing the orphaned-data issue for a test action that I update every minute, outside of the scheduler deployments. Does it stand to reason that this was really the same issue and this change will also fix it?

The issue started happening around the same time as the October 13th commit that introduced the regression, so it seems likely to me.

@style95 (Member Author)

I think so.
As I stated above, when ActivationClientProxy initiated the clean-up process, it didn't let ContainerProxy know. ETCD data clean-up is handled by ContainerProxy, so the data was never removed.

@bdoyle0182 (Contributor) commented Nov 2, 2022

Just have two comments to address; otherwise I think it LGTM!

@bdoyle0182 (Contributor) commented Nov 2, 2022

nit: can we rename the PR to "Handle container cleanup from ActivationClient shutdown gracefully"?

@bdoyle0182 (Contributor)
I ran a test with the changes here and can no longer reproduce the issue!

@style95 changed the title from "Fix regression" to "Handle container cleanup from ActivationClient shutdown gracefully" on Nov 3, 2022
@style95 (Member Author) commented Nov 3, 2022

I confirmed it supports zero-downtime deployment.
[image]

No activations failed during the deployment.

@bdoyle0182 (Contributor)
LGTM

@style95 merged commit 44791f3 into apache:master on Nov 4, 2022
msciabarra pushed a commit to nuvolaris/openwhisk that referenced this pull request Nov 23, 2022
…pache#5348)

* Fix the regression

* Apply scalaFmt

* Fix test cases

* Make the MemoryQueueTests stable

* Make the ActivationClientProxyTests stable