Add FS health check Failure metric #17949
base: main
Conversation
❌ Gradle check result for ebe8756: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
@arpitpatawat related failures. Can you fix them and we should be good to get this in?
bacbcd9 to 4450222 (compare)
❌ Gradle check result for 4450222: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
❌ Gradle check result for 166088c: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
@@ -213,6 +229,7 @@ private void monitorFSHealth() {
        } catch (IllegalStateException e) {
            logger.error("health check failed", e);
            brokenLock = true;
            fsHealthFailCounter.add(1.0, Tags.create().addTag(NODE_ID, nodeEnv.nodeId()));
Any reason to add the metric during the exception as well? Ideally this is not an FS issue; we are just not able to get the nodeDataPaths.
Although the exception is due to a failure while retrieving the nodeDataPaths, it is logged as a health check failure and sets the broken lock to true. Looking at the getHealth() method in the same class, it marks the status as unhealthy when the broken lock is true. Another reason: if there is an error while retrieving the paths, we cannot proceed with creating the temp file and checking the FS health status, so I added the failure metric in the exception path as well. Let me know your thoughts otherwise.
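For reference, a simplified, illustrative sketch of the behaviour described above (names follow the discussion; this is not the verbatim upstream method): the broken lock set in the catch block is what later drives the unhealthy status.

    // Sketch only: brokenLock set by the IllegalStateException path above makes
    // getHealth() report UNHEALTHY even though no temp-file check could run.
    StatusInfo getHealth() {
        if (brokenLock) {
            return new StatusInfo(UNHEALTHY, "health check failed due to broken node lock");
        }
        // ... otherwise inspect the recorded unhealthy paths and report accordingly
        return new StatusInfo(HEALTHY, "health check passed");
    }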
@@ -423,7 +426,11 @@ public String executor() {
    }

    void failNode(String reason) {
        clusterManagerMetrics.incrementCounter(clusterManagerMetrics.followerChecksFailureCounter, 1.0);
        clusterManagerMetrics.incrementCounter(
Let's have both metrics, as this would otherwise be a breaking change for downstream.
import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.Consumer;
import java.util.function.Predicate;

import static org.opensearch.common.util.concurrent.ConcurrentCollections.newConcurrentMap;
import static org.opensearch.monitor.StatusInfo.Status.UNHEALTHY;
import static org.opensearch.telemetry.tracing.AttributeNames.NODE_ID;
Why are you using a tracing attribute here?
I wanted to use the key node_id for the tag attribute, and I found it already existing in the codebase, so I consumed it.
This is for traces, not metrics.
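One way to address this, sketched under the assumption that a metrics-local constant is acceptable (the constant name and its placement are illustrative, not part of this PR):

    // Define a tag key for metrics instead of reusing the tracing AttributeNames constant.
    public static final String NODE_ID_TAG = "node_id";

    fsHealthFailCounter.add(1.0, Tags.create().addTag(NODE_ID_TAG, nodeEnv.nodeId()));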
@@ -423,7 +426,11 @@ public String executor() {
    }

    void failNode(String reason) {
        clusterManagerMetrics.incrementCounter(clusterManagerMetrics.followerChecksFailureCounter, 1.0);
        clusterManagerMetrics.incrementCounter(
This is a breaking change. Can we add a new metric instead?
ok, I will add a new metric for both.
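A minimal sketch of the non-breaking approach agreed above, keeping the existing untagged counter and adding a separate tagged one; the field name followerChecksFailureNodeCounter and the literal tag key are hypothetical:

    void failNode(String reason) {
        // existing counter, left unchanged for downstream consumers
        clusterManagerMetrics.incrementCounter(clusterManagerMetrics.followerChecksFailureCounter, 1.0);
        // new counter carrying the node tag (requires java.util.Optional)
        clusterManagerMetrics.incrementCounter(
            clusterManagerMetrics.followerChecksFailureNodeCounter,
            1.0,
            Optional.ofNullable(Tags.create().addTag("node_id", discoveryNode.getId()))
        );
        // ... existing failNode logic continues
    }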
@Bukhtawar @rajiv-kv Can you please take a look at this PR?
clusterManagerMetrics.incrementCounter(
    clusterManagerMetrics.followerChecksFailureCounter,
    1.0,
    Optional.ofNullable(Tags.create().addTag(NODE_ID, discoveryNode.getId()))
Why do we need to tag the metric with node_id? Given that the attribute value will be unique per node, how would this be useful for aggregation / monitoring?
NodeId will help us know on which node the check has failed, as this will be captured on the cluster-manager. Do let me know your thoughts @rajiv-kv.
@nishchay21 the node IP is present in the logs of the cluster-manager. Would that information suffice?
If a tag is associated with the metric, I assume this will translate to one follower-check metric for every node in the cluster. Wouldn't this incur additional memory overhead in the active cluster-manager, besides the cost of emitting a per-node metric?
I agree, but this would only be populated when a node is lagging, and only for those nodes. We mainly want to capture that there was a node drop and correlate the reason to lagging. We also looked at the node-left path to enable metrics on the reason. Do you feel it is better to add it there and then correlate from here?
if (currentUnhealthyPaths != null && !currentUnhealthyPaths.isEmpty()) {
    fsHealthFailCounter.add(1.0, Tags.create().addTag(NODE_ID, nodeEnv.nodeId()));
}
Would it be better to emit this metric from the caller on failures? That way we would also have the context of the leader/follower/preVoteCollector on which the FS check failed.
If the caller is not able to get a response back from the health check, I am not sure we would be able to return this to the caller either. Wouldn't it be more helpful to build this correlation outside?
For the follower/leader checks, emitting on the caller might still be helpful; thoughts?
Additionally, the monitorFSHealth() method is called periodically, so the metric will be published as soon as the issue is detected, independent of the caller.
I think in that case, FsHealthMonitor might be the correct place to emit this metric, since the periodicity of the metric will be the same across all the nodes in the cluster. The FollowerChecker and LeaderChecker frequency depends on the settings configuration and the type of node (follower / leader).
+1, I agree the instrumentation should be done independently. Consider the case where the node has been kicked out of the cluster (no follower checker): the join requests to the cluster get blocked if the node isn't healthy. Instrumenting at the monitor level will give us actual metrics on FS health.
❌ Gradle check result for c4c65b3: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
❌ Gradle check result for 3522d20: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.Consumer;
import java.util.function.Predicate;

import static org.opensearch.common.util.concurrent.ConcurrentCollections.newConcurrentMap;
import static org.opensearch.monitor.StatusInfo.Status.UNHEALTHY;
import static org.opensearch.telemetry.tracing.AttributeNames.NODE_ID;
This is for traces, not metrics.
long differenceInVersion = version - appliedVersion;
clusterManagerMetrics.incrementCounter(
    clusterManagerMetrics.lagCounter,
    (double) differenceInVersion,
We shouldn't publish the version diff.
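A hedged alternative implied by this feedback would be to count lag detections rather than publishing the raw version difference, for example:

    // Increment by one per lag detection; the version delta itself is not emitted.
    clusterManagerMetrics.incrementCounter(clusterManagerMetrics.lagCounter, 1.0);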
@@ -83,7 +96,11 @@ public ClusterManagerMetrics(MetricsRegistry metricsRegistry) {
            "Counter for number of successful async fetches",
            COUNTER_METRICS_UNIT
        );

        lagCounter = metricsRegistry.createCounter(
What is the use case of emitting this metric? We will have to consider Term and Version together for determining the cluster-state version.
To ensure consistency, at the end of the periodic run should we add a method emitMetrics which invokes the getHealth method to get a consistent view of the health and emit metrics accordingly? I feel that would simplify the code and make it more maintainable.
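A minimal sketch of the suggested emitMetrics() hook, assuming the fsHealthFailCounter added in this PR and the existing getHealth() method; the literal tag key is illustrative:

    // Called at the end of the periodic run so the emitted metric reflects the
    // same state that getHealth() reports to callers.
    private void emitMetrics() {
        StatusInfo healthStatus = getHealth();
        if (healthStatus.getStatus() == UNHEALTHY) {
            fsHealthFailCounter.add(1.0, Tags.create().addTag("node_id", nodeEnv.nodeId()));
        }
    }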
Signed-off-by: Arpit Patawat <[email protected]>
Signed-off-by: Arpit Patawat <[email protected]>
Signed-off-by: Arpit Patawat <[email protected]>
3522d20 to 518c343 (compare)
❌ Gradle check result for 518c343: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
@@ -127,6 +129,11 @@ public class Coordinator extends AbstractLifecycleComponent implements Discovery
    public static final long ZEN1_BWC_TERM = 0;

    private static final Logger logger = LogManager.getLogger(Coordinator.class);
    public static final String NODE_LEFT_REASON_LAGGING = "lagging";
    public static final String NODE_LEFT_REASON_DISCONNECTED = "disconnected";
    public static final String NODE_LEFT_REASON_HEALTHCHECK_FAIL = "health check failed";
Can we please add this as a single string and follow telemetry standards, e.g. health.check.failed?
These are just the names for the actual reasons; however, the requested convention has been applied.
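For illustration only, the reason strings after applying the requested dotted convention might look as follows; the final values in the PR may differ:

    public static final String NODE_LEFT_REASON_LAGGING = "lagging";
    public static final String NODE_LEFT_REASON_DISCONNECTED = "disconnected";
    public static final String NODE_LEFT_REASON_HEALTHCHECK_FAIL = "health.check.failed";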
server/src/main/java/org/opensearch/cluster/coordination/LagDetector.java (outdated, resolved)
server/src/main/java/org/opensearch/cluster/coordination/FollowersChecker.java (outdated, resolved)
clusterManagerMetrics.incrementCounter(
Let's move this logic outside.
As per the requested changes, this metric is no longer required; I have removed these changes.
Signed-off-by: Arpit Patawat <[email protected]>
Signed-off-by: Arpit Patawat <[email protected]>
Signed-off-by: Arpit Patawat <[email protected]>
Signed-off-by: Arpit Patawat <[email protected]>
Signed-off-by: Arpit Patawat <[email protected]>
78c2721 to c3ad4a8 (compare)
Signed-off-by: Arpit Patawat <[email protected]>
c3ad4a8 to ff5e7ed (compare)
❕ Gradle check result for ff5e7ed: UNSTABLE. Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
##               main   #17949      +/-   ##
============================================
+ Coverage     72.47%   72.57%   +0.10%
- Complexity    67034    67150     +116
============================================
  Files          5478     5473       -5
  Lines        310132   310102      -30
  Branches      45087    45058      -29
============================================
+ Hits         224775   225069     +294
+ Misses        66960    66697     -263
+ Partials      18397    18336      -61

☔ View full report in Codecov by Sentry.
Description
This PR adds a metric that is emitted when an FS health check fails. The tag contains details about the node whose FS health check failed. This PR also adds the node tag to the leader/follower check failure metrics.
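As a rough illustration of the kind of change described (the metric name is hypothetical; the createCounter signature and COUNTER_METRICS_UNIT follow the usage already shown in the diff above):

    fsHealthFailCounter = metricsRegistry.createCounter(
        "fshealth.failure.count",
        "Counter for number of FS health check failures",
        COUNTER_METRICS_UNIT
    );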
Related Issues
None
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.