Delete ETCD data first when disabling the invoker #5333
Conversation
data.action.fullyQualifiedName(withVersion = true),
data.action.rev,
None)
cleanUp(data.container, None, false)
This deletes only the container, as the data is already removed here:
https://github.com/apache/openwhisk/pull/5333/files#diff-23fc1c1634cd8a2e99b4cfbf342527248a53f5987911af80ab5d910ce7864d70R623
It's guaranteed that the etcd data will be cleaned up, right, since the data management service will retry indefinitely once it receives the request? Is there any case where this condition can be hit without the GracefulShutdown event occurring?
Sort of a side question, but what happens to the etcd data if an invoker instance is lost unexpectedly? Is the data stuck indefinitely until an invoker with the same id is restarted and wipes it on startup?
All data is supposed to be removed after the lease timeout, because no more keepalive requests are sent when an invoker is unexpectedly terminated.
The default timeout is 10s, so stale data can exist for up to 10s.
It's a tradeoff: if the timeout is too short, a temporary network disruption that prevents keepalives from reaching ETCD can cause the data to be removed unexpectedly.
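For reference, a minimal sketch of the lease/keepalive behavior using the jetcd Java client from Scala. The endpoint, key, TTL, and builder names are illustrative assumptions (they may vary by jetcd version) and are not taken from this PR:

```scala
import java.nio.charset.StandardCharsets.UTF_8
import io.etcd.jetcd.{ByteSequence, Client}
import io.etcd.jetcd.lease.LeaseKeepAliveResponse
import io.etcd.jetcd.options.PutOption
import io.grpc.stub.StreamObserver

object LeaseSketch extends App {
  // Hypothetical local etcd endpoint.
  val client = Client.builder().endpoints("http://localhost:2379").build()

  // Grant a lease with a 10s TTL -- the same order of magnitude as the default discussed above.
  val leaseId = client.getLeaseClient.grant(10L).get().getID

  // Register a key under that lease, the way an invoker registers its data.
  val key   = ByteSequence.from("whisk/invokers/0", UTF_8)
  val value = ByteSequence.from("up", UTF_8)
  client.getKVClient.put(key, value, PutOption.newBuilder().withLeaseId(leaseId).build()).get()

  // While keepalive requests flow, the key stays. If the invoker dies and
  // keepalives stop, etcd removes the key once the 10s TTL elapses.
  val keepAlive = client.getLeaseClient.keepAlive(leaseId, new StreamObserver[LeaseKeepAliveResponse] {
    override def onNext(r: LeaseKeepAliveResponse): Unit = ()
    override def onError(t: Throwable): Unit = ()
    override def onCompleted(): Unit = ()
  })

  // Stopping the keepalive simulates an unexpectedly terminated invoker.
  keepAlive.close()
  client.close()
}
```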
Codecov Report: all modified and coverable lines are covered by tests ✅

@@            Coverage Diff             @@
##           master    #5333      +/-   ##
==========================================
- Coverage   81.69%   76.76%   -4.94%
==========================================
  Files         240      240
  Lines       14264    14267       +3
  Branches      598      617      +19
==========================================
- Hits        11653    10952     -701
- Misses       2611     3315     +704

View full report in Codecov by Sentry.
@@ -154,6 +154,7 @@
"CONFIG_whisk_info_date": "{{ whisk.version.date }}"
"CONFIG_whisk_info_buildNo": "{{ docker.image.tag }}"
"CONFIG_whisk_cluster_name": "{{ whisk.cluster_name | lower }}"
This is one of the mandatory configs.
It is used as a prefix for the ETCD data.
When we deploy a new version, the components of each version can communicate with each other based on this information.
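To make the prefix idea concrete, a tiny illustrative sketch; the helper names and key layout below are hypothetical, the real layout lives in OpenWhisk's etcd key helpers:

```scala
object ClusterPrefixSketch extends App {
  // CONFIG_whisk_cluster_name becomes the leading segment of every etcd key,
  // so two deployments with different names never see each other's data.
  def schedulerEndpointKey(clusterName: String, schedulerId: Int): String =
    s"$clusterName/scheduler/$schedulerId"

  def invokerKey(clusterName: String, invokerId: Int): String =
    s"$clusterName/invokers/$invokerId"

  println(invokerKey("whisk", 0))        // whisk/invokers/0
  println(invokerKey("whisk-blue", 0))   // whisk-blue/invokers/0 -- a separate deployment
}
```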
Can we default this to whisk in the .conf? That's what is used right now, right? Or is it like the kafka prefix, where it would go in front of the whisk prefix, so if it's not defined in the config it just uses ''?
If no configuration is provided, the default value whisk will be used.
But the problem is that if you use different values for each deployment to distinguish them, it wouldn't work, as the scheduler data in ETCD only applies the prefix.
Controllers fetch scheduler endpoints with this prefix, and schedulers fetch available invokers with this prefix.
Yeah, that makes sense. But if you don't care about that, you don't have to define this, right? If you did one deployment with no configuration it will be whisk, and then you could do a new deployment that defines the config, and there won't be any clash.
Exactly.
@@ -961,7 +961,7 @@ class FunctionPullingContainerProxyTests
}
client.send(machine, ClientClosed)

probe.expectMsgAllOf(ContainerRemoved(true), Transition(machine, Running, Removing))
probe.expectMsgAllOf(ContainerRemoved(false), Transition(machine, Running, Removing))
Since true here means that it replaces a prewarm container, false is the correct value.
So am I understanding correctly that there won't be as much whiplash of spawning new containers on a new invoker when an invoker is disabled now? Since activations will still keep up with demand on the disabled invoker until it's shut down, the scheduler should more gradually schedule new containers on a new invoker?
@@ -624,6 +619,15 @@ class FunctionPullingContainerProxy(
// ContainerProxy will be terminated by StateTimeout if there is no further activation
case Event(GracefulShutdown, data: WarmData) =>
logging.info(this, s"receive GracefulShutdown for action: ${data.action}")
// clean up the etcd data first so that the scheduler can provision more containers in advance.
dataManagementService ! UnregisterData(
Tangential, but would it be useful to add an API to the invoker to show 1. how many action containers are still active in the container pool and 2. how many containers there are per action? That way you could use the API to check whether the disabling process has completed, once all containers are shut down and it's safe to shut down the service.
Yes, that would be great.
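A rough sketch of what such an endpoint could look like with akka-http and spray-json; the route names, port, and counters are hypothetical and not part of this PR:

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.marshallers.sprayjson.SprayJsonSupport._
import akka.http.scaladsl.server.Directives._
import spray.json.DefaultJsonProtocol._

object PoolStatusApiSketch extends App {
  implicit val system: ActorSystem = ActorSystem("invoker-sketch")

  // Stand-ins for real container-pool state.
  def activeContainerCount(): Int = 3
  def countsByAction(): Map[String, Int] = Map("guest/hello" -> 2, "guest/echo" -> 1)

  val routes =
    pathPrefix("pool") {
      // Total containers still active: 0 would mean the disabled invoker has
      // drained and is safe to shut down.
      path("count") {
        get { complete(Map("active" -> activeContainerCount())) }
      } ~
      // Per-action breakdown of the remaining containers.
      path("actions") {
        get { complete(countsByAction()) }
      }
    }

  Http().newServerAt("0.0.0.0", 8081).bind(routes)
}
```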
Couple questions but LGTM.
Yes, that would be the best.
* Delete ETCD data first when disabling the invoker
* Add the cluster name to controllers and invokers
* Handle unhandled message in the Removing state
Description
This is to delete ETCD data first when disabling an invoker.
Previously, data was removed only after an activation client proxy was completely closed.
But when there are many containers running for a given action and we disable an invoker, this can cause a sudden container-creation spike: since some containers are suddenly removed, a scheduler would decide to provision a similar number of containers at once.
If we delete the ETCD data first, schedulers can initiate container provisioning while the disabled containers are still processing their last activations.
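For reference, a self-contained toy that models the ordering described above; all class and key names below are hypothetical stand-ins, the real change lives in FunctionPullingContainerProxy:

```scala
import akka.actor.{Actor, ActorRef, ActorSystem, Props}

// Toy model of "unregister etcd data first, then remove the container".
case object GracefulShutdown
final case class UnregisterData(key: String)

class DataManagementServiceStub extends Actor {
  def receive: Receive = {
    case UnregisterData(key) =>
      // The real service retries until etcd acknowledges the deletion.
      println(s"etcd: unregistered $key")
  }
}

class ContainerProxyStub(dataManagementService: ActorRef, etcdKey: String) extends Actor {
  def receive: Receive = {
    case GracefulShutdown =>
      // 1. Unregister the etcd entry first, so schedulers notice the capacity
      //    drop immediately and can provision replacements in advance.
      dataManagementService ! UnregisterData(etcdKey)
      // 2. Only then remove the container, after its last activation completes.
      println("container: removed")
      context.stop(self)
  }
}

object GracefulShutdownSketch extends App {
  val system = ActorSystem("sketch")
  val dms   = system.actorOf(Props(new DataManagementServiceStub), "dms")
  val proxy = system.actorOf(Props(new ContainerProxyStub(dms, "whisk/namespace/action/container-0")), "proxy")
  proxy ! GracefulShutdown
}
```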
Related issue and scope
My changes affect the following components
Types of changes
Checklist: