Delete ETCD data first when disabling the invoker #5333
Conversation
data.action.fullyQualifiedName(withVersion = true),
data.action.rev,
None)
cleanUp(data.container, None, false)
This deletes only the container, as the data is already removed here:
https://github.com/apache/openwhisk/pull/5333/files#diff-23fc1c1634cd8a2e99b4cfbf342527248a53f5987911af80ab5d910ce7864d70R623
It's guaranteed that the etcd data will be cleaned up, right, since the data management service will retry indefinitely once it receives the request? Is there any case where this condition can be hit without the GracefulShutdown event occurring?
Sort of a side question, but what happens to the etcd data if an invoker instance is lost unexpectedly? Is the data stuck indefinitely until an invoker with the same id is restarted and wipes it on startup?
All data is supposed to be removed after the lease timeout, because no more keepalive requests are sent when an invoker is unexpectedly terminated.
The default timeout is 10s, so stale data can exist for up to 10s.
It's a tradeoff: if the timeout is too short, a temporary network disruption that prevents keepalives from reaching ETCD can cause the data to be removed unexpectedly.
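For reference, a minimal sketch of the lease/keepalive behavior using the jetcd Java client from Scala. The endpoint, key, TTL, and builder names are illustrative assumptions (they may vary by jetcd version) and are not taken from this PR:

```scala
import java.nio.charset.StandardCharsets.UTF_8
import io.etcd.jetcd.{ByteSequence, Client}
import io.etcd.jetcd.lease.LeaseKeepAliveResponse
import io.etcd.jetcd.options.PutOption
import io.grpc.stub.StreamObserver

object LeaseSketch extends App {
  // Hypothetical local etcd endpoint.
  val client = Client.builder().endpoints("http://localhost:2379").build()

  // Grant a lease with a 10s TTL -- the same order of magnitude as the default discussed above.
  val leaseId = client.getLeaseClient.grant(10L).get().getID

  // Register a key under that lease, the way an invoker registers its data.
  val key   = ByteSequence.from("whisk/invokers/0", UTF_8)
  val value = ByteSequence.from("up", UTF_8)
  client.getKVClient.put(key, value, PutOption.newBuilder().withLeaseId(leaseId).build()).get()

  // While keepalive requests flow, the key stays. If the invoker dies and
  // keepalives stop, etcd removes the key once the 10s TTL elapses.
  val keepAlive = client.getLeaseClient.keepAlive(leaseId, new StreamObserver[LeaseKeepAliveResponse] {
    override def onNext(r: LeaseKeepAliveResponse): Unit = ()
    override def onError(t: Throwable): Unit = ()
    override def onCompleted(): Unit = ()
  })

  // Stopping the keepalive simulates an unexpectedly terminated invoker.
  keepAlive.close()
  client.close()
}
```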
Codecov Report: all modified and coverable lines are covered by tests ✅

@@            Coverage Diff             @@
##           master    #5333      +/-   ##
==========================================
- Coverage   81.69%   76.76%   -4.94%
==========================================
  Files         240      240
  Lines       14264    14267       +3
  Branches      598      617      +19
==========================================
- Hits        11653    10952     -701
- Misses       2611     3315     +704

View full report in Codecov by Sentry.
@@ -154,6 +154,7 @@
"CONFIG_whisk_info_date": "{{ whisk.version.date }}"
"CONFIG_whisk_info_buildNo": "{{ docker.image.tag }}"
"CONFIG_whisk_cluster_name": "{{ whisk.cluster_name | lower }}"
This is one of the mandatory configs.
It is used as a prefix for the ETCD data.
When we deploy a new version, the components of each version can communicate with each other based on this information.
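To make the prefix idea concrete, a tiny illustrative sketch; the helper names and key layout below are hypothetical, the real layout lives in OpenWhisk's etcd key helpers:

```scala
object ClusterPrefixSketch extends App {
  // CONFIG_whisk_cluster_name becomes the leading segment of every etcd key,
  // so two deployments with different names never see each other's data.
  def schedulerEndpointKey(clusterName: String, schedulerId: Int): String =
    s"$clusterName/scheduler/$schedulerId"

  def invokerKey(clusterName: String, invokerId: Int): String =
    s"$clusterName/invokers/$invokerId"

  println(invokerKey("whisk", 0))        // whisk/invokers/0
  println(invokerKey("whisk-blue", 0))   // whisk-blue/invokers/0 -- a separate deployment
}
```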
Can we default this to whisk in the .conf? That's what is used right now, right? Or is it like the kafka prefix, where it would go in front of the whisk prefix, so if it's not defined in the config it just uses ''?
If no configuration is provided, the default value whisk will be used.
But the problem is that if you use different values for each deployment to distinguish them, it wouldn't work, as the scheduler data in ETCD only applies the prefix.
Controllers fetch scheduler endpoints with this prefix, and schedulers fetch available invokers with this prefix.
Yeah, that makes sense. But if you don't care about that, you don't have to define this, right? If you did one deployment with no configuration it will be whisk, and then you could do a new deployment that defines the config, and there won't be any clash.
Exactly.
@@ -961,7 +961,7 @@ class FunctionPullingContainerProxyTests
}
client.send(machine, ClientClosed)

probe.expectMsgAllOf(ContainerRemoved(true), Transition(machine, Running, Removing))
probe.expectMsgAllOf(ContainerRemoved(false), Transition(machine, Running, Removing))
Since true here means that it replaces a prewarm container, false is the correct value.
So am I understanding correctly that there won't be as much whiplash of spawning new containers on a new invoker when an invoker is disabled now? Since activations will still keep up with demand on the disabled invoker until it's shut down, the scheduler should more gradually schedule new containers on a new invoker?
@@ -624,6 +619,15 @@ class FunctionPullingContainerProxy(
// ContainerProxy will be terminated by StateTimeout if there is no further activation
case Event(GracefulShutdown, data: WarmData) =>
logging.info(this, s"receive GracefulShutdown for action: ${data.action}")
// clean up the etcd data first so that the scheduler can provision more containers in advance.
dataManagementService ! UnregisterData(
Tangential, but would it be useful to add an API to the invoker to show 1. how many action containers are still active in the container pool and 2. how many containers there are per action? That way you could use the API to check whether the disabling process has completed, once all containers are shut down and it's safe to shut down the service.
Yes, that would be great.
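A rough sketch of what such an endpoint could look like with akka-http and spray-json; the route names, port, and counters are hypothetical and not part of this PR:

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.marshallers.sprayjson.SprayJsonSupport._
import akka.http.scaladsl.server.Directives._
import spray.json.DefaultJsonProtocol._

object PoolStatusApiSketch extends App {
  implicit val system: ActorSystem = ActorSystem("invoker-sketch")

  // Stand-ins for real container-pool state.
  def activeContainerCount(): Int = 3
  def countsByAction(): Map[String, Int] = Map("guest/hello" -> 2, "guest/echo" -> 1)

  val routes =
    pathPrefix("pool") {
      // Total containers still active: 0 would mean the disabled invoker has
      // drained and is safe to shut down.
      path("count") {
        get { complete(Map("active" -> activeContainerCount())) }
      } ~
      // Per-action breakdown of the remaining containers.
      path("actions") {
        get { complete(countsByAction()) }
      }
    }

  Http().newServerAt("0.0.0.0", 8081).bind(routes)
}
```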
Couple questions but LGTM.
Yes, that would be the best.
* Delete ETCD data first when disabling the invoker
* Add the cluster name to controllers and invokers
* Handle unhandled message in the Removing state
Description
This is to delete ETCD data first when disabling an invoker.
Previously, data was removed only after an activation client proxy was completely closed.
But when there are many containers running for a given action and we disable an invoker, this can cause a sudden container-creation spike: since some containers are suddenly removed, a scheduler would decide to provision a similar number of containers at once.
If we delete the ETCD data first, schedulers can initiate container provisioning while the disabled containers are still processing their last activations.
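For reference, a self-contained toy that models the ordering described above; all class and key names below are hypothetical stand-ins, the real change lives in FunctionPullingContainerProxy:

```scala
import akka.actor.{Actor, ActorRef, ActorSystem, Props}

// Toy model of "unregister etcd data first, then remove the container".
case object GracefulShutdown
final case class UnregisterData(key: String)

class DataManagementServiceStub extends Actor {
  def receive: Receive = {
    case UnregisterData(key) =>
      // The real service retries until etcd acknowledges the deletion.
      println(s"etcd: unregistered $key")
  }
}

class ContainerProxyStub(dataManagementService: ActorRef, etcdKey: String) extends Actor {
  def receive: Receive = {
    case GracefulShutdown =>
      // 1. Unregister the etcd entry first, so schedulers notice the capacity
      //    drop immediately and can provision replacements in advance.
      dataManagementService ! UnregisterData(etcdKey)
      // 2. Only then remove the container, after its last activation completes.
      println("container: removed")
      context.stop(self)
  }
}

object GracefulShutdownSketch extends App {
  val system = ActorSystem("sketch")
  val dms   = system.actorOf(Props(new DataManagementServiceStub), "dms")
  val proxy = system.actorOf(Props(new ContainerProxyStub(dms, "whisk/namespace/action/container-0")), "proxy")
  proxy ! GracefulShutdown
}
```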
Related issue and scope
My changes affect the following components
Types of changes
Checklist: