Add Scheduler Queue Metric for Not Processing Any Activations #5386


Merged: 3 commits into apache:master on Mar 7, 2023

Conversation

bdoyle0182 (Contributor) commented Mar 6, 2023

Description

The memory queue will now track the last time an activation was pulled by a gRPC request to be sent to an invoker. If an activation is dropped from the queue for aging out, and not a single activation was grabbed by an invoker for the entire duration that the activation sat in the queue, this gauge will fire. It is needed as a failsafe to surface edge cases where etcd data gets out of sync and the scheduler believes containers exist that do not. Simply looking at activations timing out is not enough to identify the issue, since the action may just be hitting its throttling limits for number of containers. A sketch of the check appears after the list below.

So this should fire in only one case:

  1. There's a bug in the scheduler that requires a restart of the invokers and/or the schedulers to get the queue back into a healthy state.
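
A minimal sketch of that check, reusing the names that appear in the diff later in this thread (lastActivationExecutedTime, maxRetentionMs, MetricEmitter); this is an illustration of the idea, not the exact MemoryQueue code:

  import java.util.concurrent.atomic.AtomicLong

  // Updated whenever an invoker pulls an activation over gRPC.
  val lastActivationExecutedTime = new AtomicLong(System.currentTimeMillis())

  // Called when an aged-out activation is dropped from the queue.
  def onActivationDropped(maxRetentionMs: Long): Unit = {
    val currentTime = System.currentTimeMillis()
    // If nothing was pulled for the entire retention window, the queue is
    // stalled: emit the "not processing" gauge (value 1), otherwise stay at 0.
    if (currentTime - lastActivationExecutedTime.get() > maxRetentionMs) {
      // MetricEmitter.emitGaugeMetric(LoggingMarkers.SCHEDULER_QUEUE_NOT_PROCESSING(...), 1)
    }
  }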

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Scheduler
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@@ -574,7 +576,7 @@ class MemoryQueue(private val etcdClient: EtcdClient,
     case Event(DropOld, _) =>
       if (queue.nonEmpty && Duration
             .between(queue.head.timestamp, clock.now())
-            .compareTo(Duration.ofMillis(actionRetentionTimeout)) < 0) {
+            .compareTo(Duration.ofMillis(actionRetentionTimeout)) >= 0) {
bdoyle0182 (Contributor, Author):
Based on the same comparison in the checkToDropStaleActivation function, I think this is actually a bug: the error log below this would never get emitted, so it couldn't be used to debug the container state of the queue when the not-processing case occurs. Could use a sanity check.
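
For context on the fix: java.time.Duration.compareTo returns a negative, zero, or positive value, so >= 0 means the activation's age has reached or exceeded the retention timeout. A self-contained snippet illustrating the corrected condition (a standalone example, not the MemoryQueue code):

  import java.time.{Duration, Instant}

  val enqueuedAt = Instant.now().minusSeconds(120)   // activation sat for 2 minutes
  val retention = Duration.ofMillis(60000L)          // 1 minute retention timeout

  val age = Duration.between(enqueuedAt, Instant.now())
  // >= 0: the age has reached or exceeded retention, so drop (and log).
  val shouldDrop = age.compareTo(retention) >= 0     // true here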

codecov-commenter commented Mar 7, 2023

Codecov Report

Merging #5386 (62d7eb0) into master (96ff327) will decrease coverage by 72.13%.
The diff coverage is 0.00%.

❗ Current head 62d7eb0 differs from pull request most recent head 830da9c. Consider uploading reports for the commit 830da9c to get more accurate results

@@            Coverage Diff             @@
##           master   #5386       +/-   ##
==========================================
- Coverage   76.65%   4.52%   -72.13%     
==========================================
  Files         240     240               
  Lines       14574   14588       +14     
  Branches      646     629       -17     
==========================================
- Hits        11171     660    -10511     
- Misses       3403   13928    +10525     
Impacted Files Coverage Δ
...in/scala/org/apache/openwhisk/common/Logging.scala 40.08% <0.00%> (-39.66%) ⬇️
...e/openwhisk/core/scheduler/queue/MemoryQueue.scala 0.00% <0.00%> (-81.55%) ⬇️
.../main/scala/org/apache/openwhisk/core/WarmUp.scala 0.00% <0.00%> (-100.00%) ⬇️
...ain/scala/org/apache/openwhisk/spi/SpiLoader.scala 0.00% <0.00%> (-100.00%) ⬇️
...n/scala/org/apache/openwhisk/utils/JsHelpers.scala 0.00% <0.00%> (-100.00%) ⬇️
...scala/org/apache/openwhisk/common/Prometheus.scala 0.00% <0.00%> (-100.00%) ⬇️
...scala/org/apache/openwhisk/common/time/Clock.scala 0.00% <0.00%> (-100.00%) ⬇️
...scala/org/apache/openwhisk/core/FeatureFlags.scala 0.00% <0.00%> (-100.00%) ⬇️
...scala/org/apache/openwhisk/http/CorsSettings.scala 0.00% <0.00%> (-100.00%) ⬇️
...ala/org/apache/openwhisk/common/ConfigMXBean.scala 0.00% <0.00%> (-100.00%) ⬇️
... and 196 more


if (currentTime - lastActivationExecutedTime.get() > maxRetentionMs) {
MetricEmitter.emitGaugeMetric(
LoggingMarkers
.SCHEDULER_QUEUE_NOT_PROCESSING(invocationNamespace, action.asString, action.toStringWithoutVersion),
Reviewer (Member):
I agree with this PR.
One question: if we collect the counter metrics here, what kind of information can we get?
Can we differentiate the case where an activation request does not arrive because there are not enough containers from the case where dangling containers make the system unable to handle activations?

Do we also need to collect a counter of activation requests that actually got activations?

bdoyle0182 (Contributor, Author):
For this metric, it should not fire when an activation request does not arrive simply because there are not enough containers to meet the throughput of the queue. In that case some activations are still making progress, so the check that the time since the last executed activation exceeds the queue's retention timeout won't pass. The value of the metric is thus a boolean 1 or 0: either the action is having problems or it's not.

I can think about whether there's additional data we should emit when hitting this case, but per my comment above I believe I also have a bug fix for a log that should emit the state of the queue but isn't properly getting logged when activations time out.
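
In other words, any successful pull resets the timer, so the gauge can only fire when the queue is completely stalled. A sketch of that reset path (hypothetical method name, same fields as the sketch above):

  // On every successful gRPC fetch by an invoker: progress resets the timer.
  def onActivationPulled(): Unit =
    lastActivationExecutedTime.set(System.currentTimeMillis())

  // Under throttling, pulls still happen occasionally, so
  // currentTime - lastActivationExecutedTime.get() stays below
  // maxRetentionMs and the gauge never fires; it fires only when no
  // pull happened for the whole retention window.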

Reviewer (Member):
But if the execution time of an action is longer than the threshold and there are not enough containers, doesn't the metric fire?

bdoyle0182 (Contributor, Author):
Right, I noted that in the description:

If the system-configured queue retention timeout is not longer than the max timeout of an action, then all available containers can be in use up to the limit and validly in use for longer than the queue's max retention timeout, and the system operator should take action anyway to correct that.

I could also look into making the queue retention time dynamic to the action's configured timeout.

bdoyle0182 (Contributor, Author):
Okay, I added an additional check: the time since the last executed activation must also be greater than the action timeout. That covers the fair case where the system is configured to support long-running activations but you still don't want the queue retention time to be that high. A sketch of the combined condition follows.
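
A sketch of the combined condition after that change; actionTimeoutMs is a hypothetical name for the action's configured timeout, and the gauge value of 1 follows the boolean semantics described above:

  val sinceLastExecution = currentTime - lastActivationExecutedTime.get()

  // Fire only when the stall exceeds BOTH the queue retention window and
  // the action's own timeout, so a single long-running activation cannot
  // trigger a false positive.
  if (sinceLastExecution > maxRetentionMs && sinceLastExecution > actionTimeoutMs) {
    MetricEmitter.emitGaugeMetric(
      LoggingMarkers
        .SCHEDULER_QUEUE_NOT_PROCESSING(invocationNamespace, action.asString, action.toStringWithoutVersion),
      1)
  }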

@bdoyle0182 bdoyle0182 merged commit 60ca660 into apache:master Mar 7, 2023
mtt-merz pushed a commit to mtt-merz/openwhisk that referenced this pull request Oct 22, 2023
…#5386)

* Add Scheduler Queue Metric for Not Processing Any Activations

* fix timeout comparison

* account for action timeout being longer than queue retention

---------

Co-authored-by: Brendan Doyle <[email protected]>
(cherry picked from commit 60ca660)