Add Scheduler Queue Metric for Not Processing Any Activations #5386
Conversation
@@ -574,7 +576,7 @@ class MemoryQueue(private val etcdClient: EtcdClient,
     case Event(DropOld, _) =>
       if (queue.nonEmpty && Duration
             .between(queue.head.timestamp, clock.now())
-            .compareTo(Duration.ofMillis(actionRetentionTimeout)) < 0) {
+            .compareTo(Duration.ofMillis(actionRetentionTimeout)) >= 0) {
Based on the same comparison in the checkToDropStaleActivation function, I think this is actually a bug: the error log below this would never be emitted, so the container state of the queue couldn't be debugged when the not-processing case occurs. Could use a sanity check.
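To see why the comparison direction matters, here is a minimal sketch of the stale check in Java (the actual code is Scala, but it calls the same java.time.Duration API; the method and constant names here are hypothetical). An activation is stale when the time it has sat in the queue is at least the retention timeout, which requires `compareTo(...) >= 0`, not `< 0`:

```java
import java.time.Duration;
import java.time.Instant;

public class DropOldCheck {
    // Hypothetical retention timeout; in OpenWhisk this comes from configuration.
    static final long RETENTION_MS = 5 * 60 * 1000;

    // True when the elapsed time since the queue head was enqueued is at
    // least the retention timeout, i.e. the stale branch (with its error
    // log) should run.
    static boolean isStale(Instant queueHeadTimestamp, Instant now) {
        return Duration.between(queueHeadTimestamp, now)
            .compareTo(Duration.ofMillis(RETENTION_MS)) >= 0;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        // Sat in the queue for 10 minutes -> stale.
        System.out.println(isStale(now.minusSeconds(600), now)); // true
        // Sat in the queue for 1 minute -> not stale.
        System.out.println(isStale(now.minusSeconds(60), now)); // false
    }
}
```

With the original `< 0`, the branch would instead run only for fresh activations, so the drop-and-log path for aged-out activations would never execute.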
Codecov Report
@@ Coverage Diff @@
## master #5386 +/- ##
==========================================
- Coverage 76.65% 4.52% -72.13%
==========================================
Files 240 240
Lines 14574 14588 +14
Branches 646 629 -17
==========================================
- Hits 11171 660 -10511
- Misses 3403 13928 +10525
if (currentTime - lastActivationExecutedTime.get() > maxRetentionMs) {
  MetricEmitter.emitGaugeMetric(
    LoggingMarkers
      .SCHEDULER_QUEUE_NOT_PROCESSING(invocationNamespace, action.asString, action.toStringWithoutVersion),
I agree with this PR.
One question: if we collect the counter metrics here, what kinds of information can we get?
Can we differentiate the case where an activation request does not arrive because there are not enough containers from the case where dangling containers make the system unable to handle activations?
Do we also need to collect a counter of activation requests that actually got activations?
For this metric, it should not fire when an activation request does not arrive simply because there are not enough containers to meet the throughput demanded by the queue. In that case some activations are still making progress, so the check that the time since the last executed activation exceeds the queue's retention timeout won't pass. The value of the metric is thus effectively a boolean, 1 or 0: either the action is having problems or it's not.
I can try to think about whether there's additional data we should emit when hitting this case, but in my comment above I think I also found a bug fix: a log emitting the state of the queue isn't getting logged when activations time out.
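The boolean nature of the gauge can be sketched as follows (Java sketch with hypothetical names; the real code is Scala but uses Java's AtomicLong the same way). The key point is that a merely throttled action keeps advancing the last-executed timestamp, so it never trips the check:

```java
import java.util.concurrent.atomic.AtomicLong;

public class NotProcessingCheck {
    // Updated every time an invoker pulls an activation; stand-in for the
    // queue's lastActivationExecutedTime.
    final AtomicLong lastActivationExecutedTime =
        new AtomicLong(System.currentTimeMillis());

    // Effectively a boolean gauge: true only when the queue has made zero
    // progress for longer than the whole retention window.
    boolean shouldEmitNotProcessing(long currentTimeMs, long maxRetentionMs) {
        return currentTimeMs - lastActivationExecutedTime.get() > maxRetentionMs;
    }
}
```

A throttled-but-progressing action resets `lastActivationExecutedTime` on each pull, so only a queue where nothing was pulled for the entire retention period fires the metric.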
But if the execution time of an action is bigger than the threshold and there are not enough containers, doesn't the metric fire?
Right, I noted that in the description:
The system-configured queue retention timeout is not longer than the max timeout of an action, such that all available containers are in use up to the limit and are validly in use for longer than the queue's retention timeout; in that situation the system operator should take action to correct the configuration anyway.
I could also look into making the queue retention time dynamic based on the action timeout of the function.
Okay, I added an additional check that the time since the last activation executed must also be greater than the action timeout. I think that's a fair case: if the system is configured to support long-running activations, you may still not want the queue retention time to be that high.
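The combined condition described above can be sketched like this (Java sketch; the names are hypothetical, not the actual OpenWhisk identifiers). The idle time must exceed both the queue retention timeout and the action's own timeout, so a long-running action that is legitimately occupying its containers cannot trip the metric:

```java
public class RetentionVsActionTimeout {
    // Fire only when the queue has been idle longer than BOTH limits:
    // the queue retention timeout and the action's configured timeout.
    static boolean shouldFire(long idleMs, long maxRetentionMs, long actionTimeoutMs) {
        return idleMs > maxRetentionMs && idleMs > actionTimeoutMs;
    }
}
```

For example, with a 5-minute retention timeout and a 10-minute action timeout, 7 minutes of idleness does not fire the metric, while anything past 10 minutes does.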
…#5386)
* Add Scheduler Queue Metric for Not Processing Any Activations
* fix timeout comparison
* account for action timeout being longer than queue retention
Co-authored-by: Brendan Doyle <[email protected]>
(cherry picked from commit 60ca660)
Description
The memory queue will now track the last time an activation was pulled by a gRPC request to send to an invoker. If an activation is dropped from the queue for aging out, and not a single activation was grabbed by an invoker for the entire duration that the activation sat in the queue, then this gauge fires. This is needed as a fail-safe to surface any edge cases in the system where etcd data gets out of sync and the scheduler believes containers exist that do not. Simply looking at activations timing out is not enough to determine an issue, since the action may just be hitting throttling limits on the number of containers.
So this should fire in only one case: