
Delete ETCD data first when disabling the invoker #5333


Merged

3 commits merged into apache:master on Oct 14, 2022

Conversation

style95 (Member) commented Oct 10, 2022

Description

This change deletes the ETCD data first when disabling an invoker. Previously, the data was removed only after the activation client proxy was completely closed.

When many containers are running for a given action and we disable an invoker, this could cause a sudden container creation spike: since some containers are removed all at once, the scheduler would decide to provision a similar number of containers at the same time. If we delete the ETCD data first, schedulers can initiate container provisioning while the disabled containers are still processing their last activations. A standalone sketch of the reordering follows.
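The core of the change is the order of two steps in the invoker's container proxy: unregister the container's ETCD entry, then tear the container down. A minimal standalone sketch of that ordering (hypothetical names, not the actual OpenWhisk classes; the real diff appears later in this thread):

```scala
// Minimal sketch of the ordering change; names here are illustrative,
// not the actual OpenWhisk classes.
object GracefulShutdownSketch {
  def unregisterEtcdData(containerId: String): Unit =
    println(s"etcd: unregister $containerId") // scheduler now sees the capacity drop

  def closeClientAndRemoveContainer(containerId: String): Unit =
    println(s"container: finish in-flight activation, then remove $containerId")

  def disable(containerIds: Seq[String]): Unit =
    containerIds.foreach { id =>
      // New order: delete the etcd entry FIRST, so the scheduler can start
      // provisioning replacements while this container is still draining.
      unregisterEtcdData(id)
      // Previously the etcd cleanup happened only after this step, so the
      // scheduler noticed the capacity drop for all containers at once.
      closeClientAndRemoveContainer(id)
    }

  def main(args: Array[String]): Unit = disable(Seq("c1", "c2", "c3"))
}
```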

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Scheduler
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

Diff context for the review thread below (the enclosing call is truncated by the page):

```scala
data.action.fullyQualifiedName(withVersion = true),
data.action.rev,
None)
cleanUp(data.container, None, false)
```

Contributor:

It's guaranteed that the etcd data will be cleaned up, right, since the data management service will retry indefinitely once it receives the request? Is there any case where this condition can be hit without the GracefulShutdown event occurring?

Sort of a side question, but what happens to the etcd data if an invoker instance is lost unexpectedly? Is the data stuck indefinitely until an invoker with the same id is restarted to wipe it on startup?

style95 (Member Author):

All data is supposed to be removed after the lease timeout, because no more keepalive requests are sent once an invoker is unexpectedly terminated. The default timeout is 10s, so stale data can linger for up to 10s. It's a tradeoff: if the timeout is too short, a temporary network disruption that blocks keepalives to ETCD can cause the data to be removed unexpectedly. A sketch of the lease mechanism follows.
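For context, this lease mechanism can be sketched with jetcd (the Java etcd client). The endpoint, key name, and value below are illustrative, and depending on the jetcd version the option builder may be `PutOption.builder()` instead:

```scala
import java.nio.charset.StandardCharsets
import io.etcd.jetcd.{ByteSequence, Client}
import io.etcd.jetcd.options.PutOption

object LeaseSketch {
  def main(args: Array[String]): Unit = {
    val client = Client.builder().endpoints("http://localhost:2379").build()

    // Grant a lease with a 10s TTL (the default invoker lease timeout
    // discussed above).
    val leaseId = client.getLeaseClient.grant(10L).get().getID

    val key   = ByteSequence.from("whisk/invoker/0", StandardCharsets.UTF_8)
    val value = ByteSequence.from("up", StandardCharsets.UTF_8)

    // Attach the key to the lease: if the invoker dies and stops sending
    // keepalives, etcd deletes the key automatically once the TTL expires.
    client.getKVClient
      .put(key, value, PutOption.newBuilder().withLeaseId(leaseId).build())
      .get()

    // A live invoker would keep the lease alive periodically, e.g. via
    // client.getLeaseClient.keepAlive(leaseId, observer).
    client.close()
  }
}
```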

codecov-commenter commented Oct 10, 2022

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 76.76%. Comparing base (236ca5e) to head (4620203).
Report is 81 commits behind head on master.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master    #5333      +/-   ##
==========================================
- Coverage   81.69%   76.76%   -4.94%
==========================================
  Files         240      240
  Lines       14264    14267       +3
  Branches      598      617      +19
==========================================
- Hits        11653    10952     -701
- Misses       2611     3315     +704
```


```diff
@@ -154,6 +154,7 @@
   "CONFIG_whisk_info_date": "{{ whisk.version.date }}"
   "CONFIG_whisk_info_buildNo": "{{ docker.image.tag }}"
+  "CONFIG_whisk_cluster_name": "{{ whisk.cluster_name | lower }}"
```
style95 (Member Author):

This is one of the mandatory configs. It is used as a prefix for the ETCD data. When we deploy a new version, the components of each version can communicate with each other based on this information.

Contributor:

Can we default this to `whisk` in the .conf? That's what is used right now, right? Or is it like the Kafka prefix, where it would go in front of the `whisk` prefix, so that if it's not defined in the config it just uses ''?

style95 (Member Author):

If no configuration is provided, the default value `whisk` will be used. But the problem is that if you use different values for each deployment to distinguish them, it wouldn't work, as only the scheduler data in ETCD applies the prefix. Controllers fetch scheduler endpoints with this prefix, and schedulers fetch available invokers with this prefix. A sketch of the prefixing scheme follows.
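For illustration, a hypothetical sketch of the prefix-based discovery just described (the key shapes and the env-var lookup are assumptions, not the exact OpenWhisk schema):

```scala
// Hypothetical sketch: a per-deployment cluster name prefixes the etcd keys,
// so two deployments' data never collide. Key shapes are illustrative only.
object ClusterPrefixSketch {
  // Falls back to the default "whisk" when no cluster name is configured.
  val clusterName: String = sys.env.getOrElse("WHISK_CLUSTER_NAME", "whisk")

  def schedulerEndpointKey(id: Int): String = s"$clusterName/scheduler/$id"
  def invokerKey(id: Int): String = s"$clusterName/invoker/$id"

  def main(args: Array[String]): Unit = {
    // Controllers range-read keys under s"$clusterName/scheduler/" to discover
    // schedulers; schedulers range-read s"$clusterName/invoker/" for invokers.
    println(schedulerEndpointKey(0)) // e.g. "whisk/scheduler/0"
    println(invokerKey(0))           // e.g. "whisk/invoker/0"
  }
}
```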

Contributor:

Yeah, that makes sense. It's just that if you don't care about that, you don't have to define this, right? If you did one deployment with no configuration, it would be `whisk`, and then you could do a new deployment that defines the config, and there won't be any clash.

style95 (Member Author):

Exactly.

```diff
@@ -961,7 +961,7 @@ class FunctionPullingContainerProxyTests
     }
     client.send(machine, ClientClosed)

-    probe.expectMsgAllOf(ContainerRemoved(true), Transition(machine, Running, Removing))
+    probe.expectMsgAllOf(ContainerRemoved(false), Transition(machine, Running, Removing))
```
style95 (Member Author):

Since `true` here means it replaces a prewarm container, `false` is the correct value. (A sketch of the assumed message shape is below.)
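For context, a sketch of what the boolean flag on `ContainerRemoved` is taken to mean here (the field name is an assumption based on the comment above, not taken from the source):

```scala
// Assumed shape: the flag indicates whether a prewarm container should be
// re-provisioned to replace the removed container.
case class ContainerRemoved(replacePrewarm: Boolean)
```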

bdoyle0182 (Contributor) commented:

So, if I'm understanding correctly: there won't be as much whiplash of spawning new containers on a new invoker when an invoker is disabled. Since activations will still keep up with demand on the disabled invoker until it shuts down, the scheduler should more gradually schedule new containers on a new invoker?

The change itself (hunk truncated by the page):

```
@@ -624,6 +619,15 @@ class FunctionPullingContainerProxy(
// ContainerProxy will be terminated by StateTimeout if there is no further activation
case Event(GracefulShutdown, data: WarmData) =>
  logging.info(this, s"receive GracefulShutdown for action: ${data.action}")
  // clean up the etcd data first so that the scheduler can provision more containers in advance.
  dataManagementService ! UnregisterData(
```
Contributor:

Tangential, but would it be useful to add an API to the invoker that shows (1) how many action containers are still active in the container pool and (2) how many containers exist per action? That way you could use the API to check whether the disabling process has completed, i.e. all containers are shut down and it's safe to shut down the service.

style95 (Member Author):

Yes, that would be great.

bdoyle0182 (Contributor) commented:

A couple of questions, but LGTM.

style95 (Member Author) commented Oct 12, 2022

> So, if I'm understanding correctly: there won't be as much whiplash of spawning new containers on a new invoker when an invoker is disabled. Since activations will still keep up with demand on the disabled invoker until it shuts down, the scheduler should more gradually schedule new containers on a new invoker?

Yes, that would be the best case.
My intention with this PR is to spawn the required containers in advance while some of the disabled containers are still "active".

style95 merged commit 145971b into apache:master on Oct 14, 2022
bdoyle0182 mentioned this pull request on Nov 1, 2022.
msciabarra pushed a commit to nuvolaris/openwhisk that referenced this pull request Nov 23, 2022
* Delete ETCD data first when disabling the invoker

* Add the cluster name to controllers and invokers

* Handle unhandled message in the Removing state