Add Scheduler Queue Metric for Not Processing Any Activations #5386


Merged: 3 commits into apache:master on Mar 7, 2023

Conversation

bdoyle0182 (Contributor) commented Mar 6, 2023

Description

The memory queue will now track the last time an activation was pulled by a gRPC request to be sent to an invoker. If an activation is dropped from the queue for aging out, and not a single activation was grabbed by an invoker for the entire duration that the activation sat in the queue, this gauge will fire. It is needed as a failsafe to surface edge cases where etcd data gets out of sync and the scheduler believes containers exist that do not. Simply looking at activations timing out is not enough to identify the issue, since the action may just be hitting its throttling limits for number of containers. A sketch of the check appears after the list below.

So this should fire in only one case:

  1. There's a bug in the scheduler that requires a restart of the invokers and/or the schedulers to get the queue back into a healthy state.
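
A minimal sketch of that check, reusing the names that appear in the diff later in this thread (lastActivationExecutedTime, maxRetentionMs, MetricEmitter); this is an illustration of the idea, not the exact MemoryQueue code:

  import java.util.concurrent.atomic.AtomicLong

  // Updated whenever an invoker pulls an activation over gRPC.
  val lastActivationExecutedTime = new AtomicLong(System.currentTimeMillis())

  // Called when an aged-out activation is dropped from the queue.
  def onActivationDropped(maxRetentionMs: Long): Unit = {
    val currentTime = System.currentTimeMillis()
    // If nothing was pulled for the entire retention window, the queue is
    // stalled: emit the "not processing" gauge (value 1), otherwise stay at 0.
    if (currentTime - lastActivationExecutedTime.get() > maxRetentionMs) {
      // MetricEmitter.emitGaugeMetric(LoggingMarkers.SCHEDULER_QUEUE_NOT_PROCESSING(...), 1)
    }
  }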

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Scheduler
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@@ -574,7 +576,7 @@ class MemoryQueue(private val etcdClient: EtcdClient,
     case Event(DropOld, _) =>
       if (queue.nonEmpty && Duration
             .between(queue.head.timestamp, clock.now())
-            .compareTo(Duration.ofMillis(actionRetentionTimeout)) < 0) {
+            .compareTo(Duration.ofMillis(actionRetentionTimeout)) >= 0) {
bdoyle0182 (Contributor, Author):
Based on the same comparison in the checkToDropStaleActivation function, I think this is actually a bug: the error log below this would never get emitted, so it couldn't be used to debug the container state of the queue when the not-processing case occurs. Could use a sanity check.
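
For context on the fix: java.time.Duration.compareTo returns a negative, zero, or positive value, so >= 0 means the activation's age has reached or exceeded the retention timeout. A self-contained snippet illustrating the corrected condition (a standalone example, not the MemoryQueue code):

  import java.time.{Duration, Instant}

  val enqueuedAt = Instant.now().minusSeconds(120)   // activation sat for 2 minutes
  val retention = Duration.ofMillis(60000L)          // 1 minute retention timeout

  val age = Duration.between(enqueuedAt, Instant.now())
  // >= 0: the age has reached or exceeded retention, so drop (and log).
  val shouldDrop = age.compareTo(retention) >= 0     // true here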

codecov-commenter commented Mar 7, 2023

Codecov Report

Merging #5386 (62d7eb0) into master (96ff327) will decrease coverage by 72.13%.
The diff coverage is 0.00%.

❗ Current head 62d7eb0 differs from pull request most recent head 830da9c. Consider uploading reports for the commit 830da9c to get more accurate results

@@            Coverage Diff             @@
##           master   #5386       +/-   ##
==========================================
- Coverage   76.65%   4.52%   -72.13%     
==========================================
  Files         240     240               
  Lines       14574   14588       +14     
  Branches      646     629       -17     
==========================================
- Hits        11171     660    -10511     
- Misses       3403   13928    +10525     
Impacted Files Coverage Δ
...in/scala/org/apache/openwhisk/common/Logging.scala 40.08% <0.00%> (-39.66%) ⬇️
...e/openwhisk/core/scheduler/queue/MemoryQueue.scala 0.00% <0.00%> (-81.55%) ⬇️
.../main/scala/org/apache/openwhisk/core/WarmUp.scala 0.00% <0.00%> (-100.00%) ⬇️
...ain/scala/org/apache/openwhisk/spi/SpiLoader.scala 0.00% <0.00%> (-100.00%) ⬇️
...n/scala/org/apache/openwhisk/utils/JsHelpers.scala 0.00% <0.00%> (-100.00%) ⬇️
...scala/org/apache/openwhisk/common/Prometheus.scala 0.00% <0.00%> (-100.00%) ⬇️
...scala/org/apache/openwhisk/common/time/Clock.scala 0.00% <0.00%> (-100.00%) ⬇️
...scala/org/apache/openwhisk/core/FeatureFlags.scala 0.00% <0.00%> (-100.00%) ⬇️
...scala/org/apache/openwhisk/http/CorsSettings.scala 0.00% <0.00%> (-100.00%) ⬇️
...ala/org/apache/openwhisk/common/ConfigMXBean.scala 0.00% <0.00%> (-100.00%) ⬇️
... and 196 more


if (currentTime - lastActivationExecutedTime.get() > maxRetentionMs) {
MetricEmitter.emitGaugeMetric(
LoggingMarkers
.SCHEDULER_QUEUE_NOT_PROCESSING(invocationNamespace, action.asString, action.toStringWithoutVersion),
Reviewer (Member):
I agree with this PR.
One question: if we collect the counter metrics here, what kind of information can we get?
Can we differentiate the case where an activation request does not arrive because there are not enough containers from the case where dangling containers make the system unable to handle activations?

Do we also need to collect a counter of activation requests that actually got activations?

bdoyle0182 (Contributor, Author):
For this metric, it should not fire when an activation request does not arrive simply because there are not enough containers to meet the throughput of the queue. In that case some activations are still making progress, so the check that the time since the last executed activation exceeds the queue's retention timeout won't pass. The value of the metric is thus a boolean 1 or 0: either the action is having problems or it's not.

I can think about whether there's additional data we should emit when hitting this case, but per my comment above I believe I also have a bug fix for a log that should emit the state of the queue but isn't properly getting logged when activations time out.
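
In other words, any successful pull resets the timer, so the gauge can only fire when the queue is completely stalled. A sketch of that reset path (hypothetical method name, same fields as the sketch above):

  // On every successful gRPC fetch by an invoker: progress resets the timer.
  def onActivationPulled(): Unit =
    lastActivationExecutedTime.set(System.currentTimeMillis())

  // Under throttling, pulls still happen occasionally, so
  // currentTime - lastActivationExecutedTime.get() stays below
  // maxRetentionMs and the gauge never fires; it fires only when no
  // pull happened for the whole retention window.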

Reviewer (Member):
But if the execution time of an action is longer than the threshold and there are not enough containers, doesn't the metric fire?

bdoyle0182 (Contributor, Author):
Right, I noted that in the description:

If the system-configured queue retention timeout is not longer than the max timeout of an action, then all available containers can be in use up to the limit and validly in use for longer than the queue's max retention timeout, and the system operator should take action anyway to correct that.

I could also look into making the queue retention time dynamic to the action's configured timeout.

bdoyle0182 (Contributor, Author):
Okay, I added an additional check: the time since the last executed activation must also be greater than the action timeout. That covers the fair case where the system is configured to support long-running activations but you still don't want the queue retention time to be that high. A sketch of the combined condition follows.
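
A sketch of the combined condition after that change; actionTimeoutMs is a hypothetical name for the action's configured timeout, and the gauge value of 1 follows the boolean semantics described above:

  val sinceLastExecution = currentTime - lastActivationExecutedTime.get()

  // Fire only when the stall exceeds BOTH the queue retention window and
  // the action's own timeout, so a single long-running activation cannot
  // trigger a false positive.
  if (sinceLastExecution > maxRetentionMs && sinceLastExecution > actionTimeoutMs) {
    MetricEmitter.emitGaugeMetric(
      LoggingMarkers
        .SCHEDULER_QUEUE_NOT_PROCESSING(invocationNamespace, action.asString, action.toStringWithoutVersion),
      1)
  }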

@bdoyle0182 bdoyle0182 merged commit 60ca660 into apache:master Mar 7, 2023
mtt-merz pushed a commit to mtt-merz/openwhisk that referenced this pull request Oct 22, 2023
…#5386)

* Add Scheduler Queue Metric for Not Processing Any Activations

* fix timeout comparison

* account for action timeout being longer than queue retention

---------

Co-authored-by: Brendan Doyle <[email protected]>
(cherry picked from commit 60ca660)