Handle container cleanup from ActivationClient shutdown gracefully #5348
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will test whether this works as expected and see if I can add test cases covering these changes tomorrow.
c.activationClient.close().andThen {
  case _ => self ! ClientClosed
  case _ =>
    context.parent ! FailureMessage(new RuntimeException(errorMsg))
Since GracefulShutdown is introduced at the ContainerProxy layer, the clean-up process is generally initiated by ContainerProxy: it cleans up the ETCD data first, sends GracefulShutdown to ActivationClientProxy, and waits for the ClientClosed message.
But in some cases like this one, ActivationClientProxy needs to initiate the clean-up process itself.
In those cases, ActivationClientProxy should let ContainerProxy (its parent) know about the situation so the parent can start the clean-up process immediately while the client proxy shuts down.
So we now send a FailureMessage to the parent; the parent then cleans up the ETCD data and removes the container immediately via this logic:
https://github.com/apache/openwhisk/pull/5348/files#diff-23fc1c1634cd8a2e99b4cfbf342527248a53f5987911af80ab5d910ce7864d70R368
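As a rough illustration of that hand-off, here is a self-contained sketch with hypothetical message and class names (ClientProxySketch, ClientError, closeClient); it is not the actual OpenWhisk source, only the shape of the interaction described above:

import akka.actor.Actor
import scala.concurrent.Future

// Hypothetical stand-ins for the messages discussed in this PR.
case object GracefulShutdown
case object ClientClosed
final case class FailureMessage(cause: Throwable)
final case class ClientError(errorMsg: String)

// Sketch of a client proxy that sometimes has to initiate the clean-up itself.
class ClientProxySketch(closeClient: () => Future[Unit]) extends Actor {
  import context.dispatcher

  def receive: Receive = {
    // normal path: the parent (ContainerProxy) initiates the shutdown
    case GracefulShutdown =>
      closeClient().andThen { case _ => self ! ClientClosed }

    // client-initiated path: surface the failure to the parent first, so it can
    // clean up the ETCD data and remove the container immediately, then close.
    case ClientError(errorMsg) =>
      context.parent ! FailureMessage(new RuntimeException(errorMsg))
      closeClient().andThen { case _ => self ! ClientClosed }

    // in both paths the parent ultimately receives ClientClosed and can stop.
    case ClientClosed =>
      context.parent ! ClientClosed
      context.stop(self)
  }
}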
@@ -272,8 +272,7 @@ class FunctionPullingContainerProxy(
        job.rpcPort,
        container.containerId)) match {
      case Success(clientProxy) =>
        clientProxy ! StartClient
Previously, this chaining of futures caused a timing issue, which is why we needed this case:
case Event(ClientCreationCompleted(proxy), _: NonexistentData) =>
  akka.pattern.after(3.milliseconds, actorSystem.scheduler) {
    self ! ClientCreationCompleted(proxy.orElse(Some(sender())))
    Future.successful({})
  }
It made the logic non-deterministic and less efficient, so I refactored it.
great
@@ -334,41 +335,27 @@ class FunctionPullingContainerProxy(

    when(CreatingClient) {
      // wait for client creation when cold start
      case Event(job: ContainerCreatedData, _: NonexistentData) =>
        stay() using job
      case Event(job: InitializedData, _) =>
Now, it sends StartClient to the ActivationClientProxy only after it receives InitializedData, so there is no timing issue anymore.
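A minimal akka FSM sketch of that ordering follows; the states, data classes, and messages (ProxyState, WithClient, ContainerProxySketch) are simplified, hypothetical stand-ins, and the real proxy keeps considerably more state:

import akka.actor.{ActorRef, FSM}

// Hypothetical, simplified states, data, and messages to illustrate the ordering.
sealed trait ProxyState
case object CreatingClient extends ProxyState
case object ClientCreated extends ProxyState

sealed trait ProxyData
case object NoData extends ProxyData
final case class WithClient(clientProxy: ActorRef) extends ProxyData

case object StartClient
case object ClientCreationCompleted
final case class InitializedData(clientProxy: ActorRef)

class ContainerProxySketch extends FSM[ProxyState, ProxyData] {
  startWith(CreatingClient, NoData)

  when(CreatingClient) {
    // StartClient is only sent once the container is initialized, so the
    // completion message can never race ahead of the stored client reference.
    case Event(init: InitializedData, _) =>
      init.clientProxy ! StartClient
      stay() using WithClient(init.clientProxy)

    case Event(ClientCreationCompleted, data: WithClient) =>
      goto(ClientCreated) using data
  }

  when(ClientCreated) {
    case Event(_, _) => stay()
  }

  initialize()
}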
@@ -518,6 +505,8 @@ class FunctionPullingContainerProxy(
        data.action.fullyQualifiedName(withVersion = true),
        data.action.rev,
        Some(data.clientProxy))

    case x: Event if x.event != PingCache => delay
This will make sure GracefulShutdown is properly handled in all states.
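Here, delay essentially stashes the message so it is replayed on a later state transition. A self-contained sketch of that pattern (with hypothetical state names such as PausedLike; this is an illustration of the stash/unstash mechanism, not the OpenWhisk proxy itself):

import akka.actor.{FSM, Stash}

sealed trait SketchState
case object PausedLike extends SketchState
case object RunningLike extends SketchState

case object PingCache
case object Resume
case object GracefulShutdown

// In a transient state, stash everything except the periodic ping and replay
// the stashed messages on the next transition, so messages such as
// GracefulShutdown are delayed rather than dropped.
class DelaySketch extends FSM[SketchState, Unit] with Stash {
  startWith(PausedLike, ())

  when(PausedLike) {
    case Event(Resume, _) =>
      goto(RunningLike)
    case Event(PingCache, _) =>
      stay() // periodic self-message, handled in place
    case Event(_, _) =>
      stash() // e.g. GracefulShutdown is delayed, not lost
      stay()
  }

  when(RunningLike) {
    case Event(GracefulShutdown, _) =>
      stop()
    case Event(_, _) =>
      stay()
  }

  onTransition {
    case PausedLike -> RunningLike => unstashAll() // replay the delayed messages
  }

  initialize()
}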
My concern with stashing all events and then unstashing them on a state transition is that this was the root cause of the edge case with orphaned etcd data for pausing/unpausing containers: a generic FailureMessage would be stashed, and when the proxy transitioned to Running it would unstash the FailureMessage, leading to the bug / unexpected behavior.
I just want to make sure we're 100% certain a catch-all here will not have any unknown side effects.
This state is a temporary state before removing the container and the proxy itself.
So the next state is always Removing, as all cases in this state lead to it, and both cases clean up etcd data properly.
In the Removing state there is no case that moves back to another state; it either stays or stops.
when(Removing, unusedTimeout) {
  // only if ClientProxy is closed, ContainerProxy stops. So it is important for ClientProxy to send ClientClosed.
  case Event(ClientClosed, _) =>
    stop()
  // even if any error occurs, it still waits for the ClientClosed event in order to be stopped after the client is closed.
  case Event(t: FailureMessage, _) =>
    logging.error(this, s"unable to delete a container due to ${t}")
    stay
  case Event(StateTimeout, _) =>
    logging.error(this, s"could not receive ClientClosed for ${unusedTimeout}, so just stop the container proxy.")
    stop()
  case Event(Remove | GracefulShutdown, _) =>
    stay()
  case Event(DetermineKeepContainer(_), _) =>
    stay()
}
So I suppose it would be OK to delay.
@@ -36,7 +36,7 @@ import scala.concurrent.Future
import scala.util.{Success, Try}

// Event send by the actor
case class ClientCreationCompleted(client: Option[ActorRef] = None)
case object ClientCreationCompleted
Since we store the reference to the client proxy after creation, we no longer need to pass the client-proxy reference via this message, so it became an object.
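Conceptually the change is just the following (illustrative names; the real handler now reads the reference from its state data instead of from the message):

import akka.actor.ActorRef

// Before: the completion message carried the (optional) client proxy reference.
final case class ClientCreationCompletedWithRef(client: Option[ActorRef] = None)

// After: the reference is already kept in the proxy's state data,
// so a plain marker object is enough.
case object ClientCreationCompleted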
@@ -96,7 +96,9 @@ class SchedulingDecisionMaker(
            this,
            s"there is no capacity activations will be dropped or throttled, (availableMsg: $availableMsg totalContainers: $totalContainers, limit: $limit, namespaceContainers: ${existingContainerCountInNs}, namespaceInProgressContainer: ${inProgressContainerCountInNs}) [$invocationNamespace:$action]")
          Future.successful(DecisionResults(EnableNamespaceThrottling(dropMsg = totalContainers == 0), 0))
        case NamespaceThrottled if schedulingConfig.allowOverProvisionBeforeThrottle && ceiling(limit * schedulingConfig.namespaceOverProvisionBeforeThrottleRatio) - existingContainerCountInNs - inProgressContainerCountInNs > 0 =>
        case NamespaceThrottled
            if schedulingConfig.allowOverProvisionBeforeThrottle && ceiling(
It seems our CI does not run checkScalaFmtAll.
Some code already merged into the master branch is incorrectly formatted.
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5348      +/-   ##
==========================================
- Coverage   81.04%   76.35%   -4.69%
==========================================
  Files         240      240
  Lines       14391    14393       +2
  Branches      605      603       -2
==========================================
- Hits        11663    10990     -673
- Misses       2728     3403     +675

☔ View full report in Codecov by Sentry.
c.activationClient.close().andThen {
  case _ => self ! ClientClosed
  case _ =>
I was also seeing the issue of orphaned data for a test action that I frequently update (every minute) outside of the scheduler deployments. Does it stand to reason that this was really the same issue and that this change will also fix it?
The issue started happening, I want to say, around the same time as the October 13th commit that introduced the regression, so it seems likely to me.
I think so.
As I stated here, when the ActivationClientProxy initiated the clean-up process, it didn't let the ContainerProxy know. Since the ETCD data clean-up is handled by ContainerProxy, the data was accordingly not removed.
Just have two comments to address, otherwise I think it LGTM!
nit: can we rename the pr to
I ran a test with the changes here and can no longer reproduce the issue!
LGTM
…pache#5348)
* Fix the regression
* Apply scalaFmt
* Fix test cases
* Make the MemoryQueueTests stable
* Make the ActivationClientProxyTests stable
Description
This fixes the regression introduced in #5338.
Related issue and scope
My changes affect the following components
Types of changes
Checklist: