[POEM] Allow Users To Configure Max Action Container Concurrency Under Their Namespace Limit #5288

Merged
82 changes: 82 additions & 0 deletions proposals/POEM-4-action-concurrency-limit-within-namespace.md
@@ -0,0 +1,82 @@
<!--
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
-->

# Title
User Defined Action Level Concurrency Limits Within Confines of Global Namespace Limit

## Status
* Current state: In-progress
* Author(s): @bdoyle0182 (GitHub ID)

## Summary and Motivation

Currently, OpenWhisk has a single concurrency limit for managing autoscaling within a namespace. System administrators rightly manage this
per-namespace limit to maintain a good balance between the namespaces of the system and the system's total resources.

However, this does not let users control how their applications scale within the namespace they operate in. There is no
fairness across functions within a namespace. The semantics of a namespace can vary heavily depending on how OpenWhisk is being used: a namespace
could represent an organization in a public cloud, a group within an organization, an application made up of functions, or a logical grouping of applications
(for example, putting all of your interactions with Slack in one namespace).

The problem is that a single function can run away and end up using all of the namespace's resources. It shouldn't fall to the system administrators
to provide this fairness, since it depends on the application and on what the user wants. Users may want to keep the existing behavior, in which any action
can scale up to the namespace's total resources. They may want to restrict one lower-priority function to a smaller threshold so that it can't consume
the entire namespace's resources, while still allowing high-priority functions access to all of them. Or they may want to give
all of their actions limits that add up to the namespace limit, which guarantees that each action can scale up to its defined
action concurrency limit, similar to other FaaS providers' concept of reserved concurrency.

With the major revision to how OpenWhisk processes activations in the new scheduler, such a feature becomes extremely easy to implement by adding
a single new limit that users can configure on their action document.

## Proposed changes: Architecture Diagram (optional), and Design

Add an optional `maxContainerConcurrency` limit field to the limits section of action documents. This limit will be used in the scheduler when deciding
if there is capacity for the action to scale up more containers. Previously, the scheduler was completely naive of functions across a namespace when provisioning
Member

Suggested change
if there is capacity for the action to scale up more containers. Previously, the scheduler was completely naive of functions across a namespace when provisioning
if there is capacity for the action to scale out more containers. Previously, the scheduler was completely naive of functions across a namespace when provisioning

more containers, but if this limit is defined, the scheduler will only provision containers up to the defined action limit (which must be less than or equal to the namespace limit).
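
As a rough illustration of the shape described above, the sketch below builds a limits section containing the new field and applies the rule that the action limit may not exceed the namespace limit. It is written in Python for brevity rather than in OpenWhisk's own Scala. The function `validate_action_limit`, the `BadRequest` class, the error wording, and all limit values are hypothetical and are not taken from the implementation PR.

```python
from typing import Optional


class BadRequest(Exception):
    """Hypothetical stand-in for the BadRequest response the API would return."""


def validate_action_limit(max_container_concurrency: Optional[int],
                          namespace_limit: int) -> None:
    """Reject an action whose container limit exceeds the namespace limit.

    A missing limit (None) is always accepted because the field is optional.
    """
    if max_container_concurrency is None:
        return
    if max_container_concurrency > namespace_limit:
        raise BadRequest(
            f"maxContainerConcurrency {max_container_concurrency} exceeds "
            f"the namespace concurrency limit of {namespace_limit}"
        )


# Illustrative limits section of an action document; all values are placeholders.
limits = {
    "timeout": 60000,
    "memory": 256,
    "logs": 10,
    "concurrency": 1,
    "maxContainerConcurrency": 5,  # the new, optional field
}

# Passes silently: 5 is within an assumed namespace limit of 20.
validate_action_limit(limits.get("maxContainerConcurrency"), namespace_limit=20)
```

A limit equal to the namespace limit is accepted, matching the "less than or equal to" rule stated above.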

### Implementation details

A working PR for this POEM already exists, and the implementation details can be reviewed there, but I describe the main implementation considerations here. Once the POEM is approved,
I will incorporate any feedback from the POEM and add tests and documentation.

- The scheduler decision maker uses the min of the action container concurrency limit and the namespace concurrency limit. If the action limit is less than the namespace
limit, it checks that the action hasn't used up all of its own capacity and, if it hasn't, that the namespace still has capacity.
- The new limit `maxContainerConcurrency` on the action document is an optional field. If the field does not exist, the action limit used by the system is
the namespace limit, making this an opt-in feature.
- The one thing not yet included in the implementation is a parameter on the create action API that lets the user delete the limit field so that
the action relies on the namespace limit again.
- When creating an action, the API validates that the action container concurrency limit is less than or equal to the namespace concurrency limit. If it is greater,
the upload fails with a BadRequest and an error message stating that the limit must not exceed the namespace limit, with the namespace limit value included in the message.
- If the system admin lowers a namespace's concurrency limit below the limit that an existing action document has already configured, it will not break the action.
Member

BTW, if the sum of the maxContainerConcurrency of all actions exceeds the namespace limit, how does it guarantee fairness?
Do we leave it to the users?

Contributor Author

@bdoyle0182 bdoyle0182 Jul 22, 2022

Yes, it's the user's responsibility. The idea is that users can give high-priority functions a high limit so that they can still potentially get more capacity than a low-priority function, and that lower-priority action may not reach its lower limit. As an example, take a namespace with a limit of 20 and three actions A, B, and C:

  • A configures limit of 5
  • B configures limit of 5
  • C configures limit of 15

If C gets a burst of traffic that uses all 15 of its limit, only 5 remain to be shared between A and B. That's fine: a user might want this level of scaling for C at the expense of A and B, which can now scale to only 2 or 3 each instead of 5.

Or the user can configure their limits so that they sum exactly to the namespace limit, guaranteeing fairness: A gets 5, B gets 5, and C gets 10. I think giving the user this level of flexibility is a good thing.

I didn't want to reference AWS Lambda explicitly, but compared to its concept of reserved concurrency for individual functions, I think this provides more flexibility to the user. Reserved concurrency on Lambda takes away from the total pool when configured on a function, so if I assign 5 to a function and the account limit is 20, there is now a capacity of 15 for other functions at all times. But what if that function is idle most of the time? You've permanently taken capacity away from your pool for a function that barely runs, in exchange for a guarantee that it can always scale up to 5. In this proposal, you can still give yourself that guarantee by provisioning limits evenly across your namespace, but you also have the flexibility to be smarter about your traffic patterns by giving yourself additional overprovisioned, high-priority capacity across your functions (which is sort of what serverless is all about).

Member

ok, so this is mainly to limit an action that may consume huge resources.
What is the relationship between actions with/without maxContainerConcurrency when they are invoked concurrently?
Even if action A, B, and C are configured like the above limits, any actions can still be invoked as long as there is enough capacity in the namespace, is that correct?

Contributor Author

@bdoyle0182 bdoyle0182 Jul 22, 2022

Yes, every action except A, B, and C. A, B, and C can only have up to their configured limit, regardless of whether there is still additional capacity in the namespace. If action D came in with no limit configured for itself, the scheduler infers its action limit to be the namespace limit, so it can have up to the full 20.

Contributor Author

And thus, if no limits are configured for any action in a namespace, the behavior remains identical to the existing behavior of OpenWhisk: any action can have up to the full namespace limit, making this feature completely opt-in and non-breaking to the existing paradigm.

Since the scheduler determines capacity from the min of the namespace limit and the action limit, it will simply use
the namespace limit as the capacity limit in this case. The system admin therefore needs no extra action or coordination, and faces no side effects, when lowering the namespace limit.
However, if the user wants to redeploy the same function with the same limit, which is now over the namespace limit, the API will reject the action upload until the action limit
is lowered to or below the new namespace limit.
- A user may want to update their action to go back to just relying on the namespace limit. Since updates to action documents copy over limits in the update even if not
Member

Would this be included in the PR too?

Contributor Author

Yes it will, just haven't gotten a chance to add it yet

supplied on the request object, a boolean param will need to be added to the create action API so that the field is not copied in the update. **This is the one thing I still
need to add in the source code of the implementation PR.**
- In the scheduler, if the action limit is hit and new containers cannot be provisioned for the action but the namespace still has capacity available, namespace throttling
will not be turned on. In this case, the action queue relies on action throttling if the queue grows too large. Namespace throttling will still be turned on if
the total number of containers hits the namespace limit.
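
The scheduler rules in the list above can be sketched roughly as follows. This is an illustrative Python model and not the scheduler's actual Scala code: `Decision`, `effective_limit`, and `decide` are hypothetical names, and the container counts are passed in directly instead of being read from scheduler state. The example values reuse the A, B, C scenario from the review discussion, with a namespace limit of 20.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    can_provision: bool        # may one more container be created for this action?
    namespace_throttled: bool  # should namespace throttling be turned on?


def effective_limit(action_limit: Optional[int], namespace_limit: int) -> int:
    """An unset action limit falls back to the namespace limit."""
    if action_limit is None:
        return namespace_limit
    return min(action_limit, namespace_limit)


def decide(action_containers: int, namespace_containers: int,
           action_limit: Optional[int], namespace_limit: int) -> Decision:
    """Decide whether one more container may be provisioned for an action."""
    action_has_room = action_containers < effective_limit(action_limit, namespace_limit)
    namespace_has_room = namespace_containers < namespace_limit
    return Decision(
        can_provision=action_has_room and namespace_has_room,
        # Namespace throttling applies only when the namespace itself is full;
        # an action at its own limit is left to action-level throttling.
        namespace_throttled=not namespace_has_room,
    )


namespace_limit = 20
limits = {"A": 5, "B": 5, "C": 15, "D": None}  # D has no limit configured
running = {"A": 0, "B": 0, "C": 15, "D": 0}    # C has used its full limit
total = sum(running.values())

# C is at its own limit while the namespace still has room.
c = decide(running["C"], total, limits["C"], namespace_limit)
# A can still scale, drawing on the 5 containers left in the namespace.
a = decide(running["A"], total, limits["A"], namespace_limit)
```

Because the effective limit is a min, an action limit that ends up above a lowered namespace limit is capped by the namespace limit without any change to the action document.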

## Integration and Migration plan (optional)

The feature is fully backwards compatible with existing action documents since the new limit is an optional field. If the limit is not defined on an action document,
the existing behavior applies, in which the action can have up to the namespace concurrency limit, so behavior does not change if the feature is not used.
If the old scheduler is in use and the limit is defined on the action document, the limit has no effect until the deployment migrates to the new scheduler.
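
To make the compatibility claim concrete, here is a small Python sketch of reading the optional field from both an older document and a newer one. The document shapes, the helper names `read_action_limit` and `capacity_limit`, and the values are assumptions for illustration only.

```python
from typing import Any, Dict, Optional


def read_action_limit(action_doc: Dict[str, Any]) -> Optional[int]:
    """Return the optional limit, or None for documents that lack the field."""
    return action_doc.get("limits", {}).get("maxContainerConcurrency")


def capacity_limit(action_doc: Dict[str, Any], namespace_limit: int) -> int:
    """Documents without the field behave exactly as they did before."""
    limit = read_action_limit(action_doc)
    return namespace_limit if limit is None else min(limit, namespace_limit)


# Hypothetical documents: one written before the field existed, one after.
legacy_doc = {"name": "hello", "limits": {"memory": 256}}
new_doc = {"name": "hello", "limits": {"memory": 256, "maxContainerConcurrency": 5}}
```

With an assumed namespace limit of 20, `capacity_limit(legacy_doc, 20)` evaluates to 20 and `capacity_limit(new_doc, 20)` to 5, so no migration of stored documents is needed.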