Add FS health check Failure metric #17949

Open · wants to merge 9 commits into main

Conversation

arpitpatawat

Description

This PR adds a metric that is emitted when an FS health check fails. The metric's tag identifies the node whose FS health check failed. The PR also adds the node tag to the leader/follower check failure metrics.
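
A hedged sketch of the counter this implies follows; the metric name, description string, and unit are assumptions, while the createCounter(String, String, String) signature matches the org.opensearch.telemetry.metrics.MetricsRegistry calls visible in the test logs further down:

    // In FsHealthService: register the failure counter once at construction.
    // The name and unit here are hypothetical, not the merged values.
    this.fsHealthFailCounter = metricsRegistry.createCounter(
        "fs.health.failure.count",              // hypothetical metric name
        "Counter for FS health check failures",
        "1"                                     // dimensionless unit
    );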

Related Issues

None

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Contributor

❌ Gradle check result for ebe8756: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@owaiskazi19
Member

@arpitpatawat these look like related failures. Can you fix them? Then we should be good to get this in.

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.monitor.fs.FsHealthServiceTests.testLoggingOnHungIO" -Dtests.seed=18F8ECCBCAA54DA1 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=pa-Arab-PK -Dtests.timezone=PNT -Druntime.java=21

FsHealthServiceTests > testLoggingOnHungIO FAILED
    java.lang.NullPointerException: Cannot invoke "org.opensearch.telemetry.metrics.MetricsRegistry.createCounter(String, String, String)" because "this.metricsRegistry" is null
        at __randomizedtesting.SeedInfo.seed([18F8ECCBCAA54DA1:52392A5319A37360]:0)
        at org.opensearch.monitor.fs.FsHealthServiceTests.createObjects(FsHealthServiceTests.java:93)

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnHungIOBeyondHealthyTimeout" -Dtests.seed=18F8ECCBCAA54DA1 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=pa-Arab-PK -Dtests.timezone=PNT -Druntime.java=21

FsHealthServiceTests > testFailsHealthOnHungIOBeyondHealthyTimeout FAILED
    java.lang.NullPointerException: Cannot invoke "org.opensearch.telemetry.metrics.MetricsRegistry.createCounter(String, String, String)" because "this.metricsRegistry" is null
        at __randomizedtesting.SeedInfo.seed([18F8ECCBCAA54DA1:9A8CB2C2BE593BE9]:0)
        at org.opensearch.monitor.fs.FsHealthServiceTests.createObjects(FsHealthServiceTests.java:93)

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.monitor.fs.FsHealthServiceTests.testSchedulesHealthCheckAtRefreshIntervals" -Dtests.seed=18F8ECCBCAA54DA1 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=pa-Arab-PK -Dtests.timezone=PNT -Druntime.java=21

FsHealthServiceTests > testSchedulesHealthCheckAtRefreshIntervals FAILED
    java.lang.NullPointerException: Cannot invoke "org.opensearch.telemetry.metrics.MetricsRegistry.createCounter(String, String, String)" because "this.metricsRegistry" is null
        at __randomizedtesting.SeedInfo.seed([18F8ECCBCAA54DA1:6755A9BE8FD65332]:0)
        at org.opensearch.monitor.fs.FsHealthServiceTests.createObjects(FsHealthServiceTests.java:93)

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnSinglePathWriteFailure" -Dtests.seed=18F8ECCBCAA54DA1 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=pa-Arab-PK -Dtests.timezone=PNT -Druntime.java=21

FsHealthServiceTests > testFailsHealthOnSinglePathWriteFailure FAILED
    java.lang.NullPointerException: Cannot invoke "org.opensearch.telemetry.metrics.MetricsRegistry.createCounter(String, String, String)" because "this.metricsRegistry" is null
        at __randomizedtesting.SeedInfo.seed([18F8ECCBCAA54DA1:49BF9A3A99B7457D]:0)
        at org.opensearch.monitor.fs.FsHealthServiceTests.createObjects(FsHealthServiceTests.java:93)

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnIOException" -Dtests.seed=18F8ECCBCAA54DA1 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=pa-Arab-PK -Dtests.timezone=PNT -Druntime.java=21

FsHealthServiceTests > testFailsHealthOnIOException FAILED
    java.lang.NullPointerException: Cannot invoke "org.opensearch.telemetry.metrics.MetricsRegistry.createCounter(String, String, String)" because "this.metricsRegistry" is null
        at __randomizedtesting.SeedInfo.seed([18F8ECCBCAA54DA1:8FB87904E30A3C74]:0)
        at org.opensearch.monitor.fs.FsHealthServiceTests.createObjects(FsHealthServiceTests.java:93)

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnSinglePathFsyncFailure" -Dtests.seed=18F8ECCBCAA54DA1 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=pa-Arab-PK -Dtests.timezone=PNT -Druntime.java=21

FsHealthServiceTests > testFailsHealthOnSinglePathFsyncFailure FAILED
    java.lang.NullPointerException: Cannot invoke "org.opensearch.telemetry.metrics.MetricsRegistry.createCounter(String, String, String)" because "this.metricsRegistry" is null
        at __randomizedtesting.SeedInfo.seed([18F8ECCBCAA54DA1:57DE9DA58242E08]:0)
        at org.opensearch.monitor.fs.FsHealthServiceTests.createObjects(FsHealthServiceTests.java:93)
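
All six failures share one root cause: the test fixture builds FsHealthService without a MetricsRegistry, so registering the new counter dereferences null. A minimal sketch of a fix, assuming the PR adds the registry as a constructor parameter and that the telemetry module's no-op NoopMetricsRegistry.INSTANCE is acceptable in tests:

    import org.opensearch.telemetry.metrics.noop.NoopMetricsRegistry;

    // In FsHealthServiceTests.createObjects(): pass a no-op registry so that
    // createCounter() never executes against a null metricsRegistry.
    fsHealthService = new FsHealthService(settings, clusterSettings, threadPool, env, NoopMetricsRegistry.INSTANCE);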

Contributor

❌ Gradle check result for 4450222: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

❌ Gradle check result for 166088c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@@ -213,6 +229,7 @@ private void monitorFSHealth() {
} catch (IllegalStateException e) {
logger.error("health check failed", e);
brokenLock = true;
fsHealthFailCounter.add(1.0, Tags.create().addTag(NODE_ID, nodeEnv.nodeId()));
Contributor

Any reason to add the metric during the exception as well? Ideally this is not an FS issue; we are just unable to get the nodeDataPaths.

Author

Although the exception comes from a failure while retrieving the nodeDataPaths, it is logged as a health check failure and sets brokenLock to true. The getHealth() method in the same class marks the status as unhealthy when brokenLock is true. Also, if there is an error while retrieving the paths, we cannot proceed with creating the temp file and checking FS health, so I added the failure metric in the exception handler as well. Let me know your thoughts otherwise.
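
For context, a simplified sketch of the getHealth() behavior being described; the real FsHealthService differs in detail, and the field names here are assumptions:

    public StatusInfo getHealth() {
        if (brokenLock) {
            // A failed nodeDataPaths() lookup ends up here, which is why the
            // exception handler treats it as a health check failure.
            return new StatusInfo(UNHEALTHY, "health check failed due to broken node lock");
        }
        if (unhealthyPaths != null && unhealthyPaths.isEmpty() == false) {
            return new StatusInfo(UNHEALTHY, "health check failed on some paths");
        }
        return new StatusInfo(HEALTHY, "health check passed");
    }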

@@ -423,7 +426,11 @@ public String executor() {
}

void failNode(String reason) {
clusterManagerMetrics.incrementCounter(clusterManagerMetrics.followerChecksFailureCounter, 1.0);
clusterManagerMetrics.incrementCounter(
Contributor

Let's keep both metrics, since replacing the existing one would be a breaking change for downstream consumers.

import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.Consumer;
import java.util.function.Predicate;

import static org.opensearch.common.util.concurrent.ConcurrentCollections.newConcurrentMap;
import static org.opensearch.monitor.StatusInfo.Status.UNHEALTHY;
import static org.opensearch.telemetry.tracing.AttributeNames.NODE_ID;
Contributor

Why are you using a tracing attribute here?

Author

I wanted to use the key node_id for the tag attribute, and I found this constant already in the codebase, so I reused it.

Contributor

This is for traces, not metrics.
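
A hedged sketch of the alternative the reviewer is pointing at: a metrics-local tag key instead of the tracing AttributeNames constant. The constant's name and placement are assumptions:

    // Defined alongside the other metrics constants, rather than importing
    // org.opensearch.telemetry.tracing.AttributeNames.NODE_ID:
    public static final String NODE_ID = "node_id";

    // The call site stays the same:
    fsHealthFailCounter.add(1.0, Tags.create().addTag(NODE_ID, nodeEnv.nodeId()));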

@@ -423,7 +426,11 @@ public String executor() {
}

void failNode(String reason) {
clusterManagerMetrics.incrementCounter(clusterManagerMetrics.followerChecksFailureCounter, 1.0);
clusterManagerMetrics.incrementCounter(
Contributor

This is a breaking change. Can we add a new metric instead?

Author

OK, I will add a new metric for both.
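
A hedged sketch of the non-breaking shape agreed on here: keep the existing untagged counter and emit a separately named, node-tagged counter next to it (the new counter's field name is hypothetical):

    void failNode(String reason) {
        // existing metric, unchanged for downstream consumers
        clusterManagerMetrics.incrementCounter(clusterManagerMetrics.followerChecksFailureCounter, 1.0);
        // new node-tagged metric emitted alongside it
        clusterManagerMetrics.incrementCounter(
            clusterManagerMetrics.followerChecksFailureNodeCounter, // hypothetical new field
            1.0,
            Optional.of(Tags.create().addTag(NODE_ID, discoveryNode.getId()))
        );
        // ... rest of the existing failNode logic
    }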

@Gaganjuneja
Contributor

@Bukhtawar @rajiv-kv Can you please take a look at this PR?

clusterManagerMetrics.incrementCounter(
clusterManagerMetrics.followerChecksFailureCounter,
1.0,
Optional.ofNullable(Tags.create().addTag(NODE_ID, discoveryNode.getId()))
Contributor

Why do we need to tag the metric with node_id? Given that the attribute value will be unique per node, how would this be useful for aggregation/monitoring?

@nishchay21 · Apr 18, 2025
Contributor

NodeId will help us know on which node the check has failed, since this is captured on the cluster manager. Do let me know your thoughts @rajiv-kv.

Contributor

@nishchay21 node-ip is present in the cluster-manager logs. Would that information suffice?

If a tag is associated with the metric, I assume this translates to one follower-check time series for every node in the cluster. Wouldn't this incur additional memory overhead on the active cluster-manager, besides the cost of emitting a per-node metric?

Contributor

I agree, but this would only be populated if a node is lagging, and only for those nodes. We mainly want to capture that a node dropped and correlate the reason to lagging. We also looked at the node-left path as a place to emit the reason metric. Do you feel it is better to add it there and correlate from that side?

Comment on lines 279 to 282

if (currentUnhealthyPaths != null && !currentUnhealthyPaths.isEmpty()) {
fsHealthFailCounter.add(1.0, Tags.create().addTag(NODE_ID, nodeEnv.nodeId()));
}
Contributor

Would it be better to emit this metric from the caller on failures? That way we would also have the context of the leader/follower/preVoteCollector on which the FS check failed.

Contributor

If the caller is not able to get a response back from the health check, I'm not sure we would be able to return this to the caller either. Wouldn't it be more helpful to build this correlation outside? Emitting from the caller might still be useful for the leader/follower checks, though; thoughts?

Author

Additionally, the monitorFSHealth() method is called periodically, so the metric is published as soon as the issue is detected, independently of any caller.
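
For reference, a sketch of the periodic scheduling being described; it mirrors how the service starts its monitor, though the exact names here are assumptions:

    // In doStart(): run the monitor every refreshInterval on the generic
    // thread pool, so a failure increments the counter within one interval
    // of being detected, with no caller involved.
    scheduledFuture = threadPool.scheduleWithFixedDelay(new FsHealthMonitor(), refreshInterval, ThreadPool.Names.GENERIC);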

Contributor

I think in that case FsHealthMonitor might be the correct place to emit this metric, since the periodicity of the metric will be the same across all the nodes in the cluster.

The FollowerChecker and LeaderChecker frequency depends on the settings configuration and the type of node (follower/leader).

Contributor

+1, I agree the instrumentation should be done independently. Consider the case where the node has been kicked out of the cluster (so there is no follower checker): join requests to the cluster get blocked if the node isn't healthy. Instrumenting at the monitor level will give us actual metrics on FS health.

Contributor

❌ Gradle check result for c4c65b3: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

❌ Gradle check result for 3522d20: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

long differenceInVersion = version - appliedVersion;
clusterManagerMetrics.incrementCounter(
clusterManagerMetrics.lagCounter,
(double) differenceInVersion,
Contributor

We shouldn't publish the version diff.
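
A hedged sketch of the suggested change: record that a lag event occurred instead of publishing the raw version difference as the increment (tagging shown as in the surrounding diff):

    // Count the lagging event; the magnitude of the version gap is not emitted.
    clusterManagerMetrics.incrementCounter(
        clusterManagerMetrics.lagCounter,
        1.0,
        Optional.ofNullable(Tags.create().addTag(NODE_ID, discoveryNode.getId()))
    );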

@@ -83,7 +96,11 @@ public ClusterManagerMetrics(MetricsRegistry metricsRegistry) {
"Counter for number of successful async fetches",
COUNTER_METRICS_UNIT
);

lagCounter = metricsRegistry.createCounter(
Contributor

What is the use case for emitting this metric? We would have to consider term and version together to determine the cluster-state version.

@Bukhtawar
Contributor

To ensure consistency, should we add a method emitMetrics at the end of the periodic run that invokes the getHealth() method to get a consistent view of the health state and emits the metric accordingly?
I feel that would simplify the code and make it more maintainable.
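
A hedged sketch of that suggestion; the method name comes from the comment above, while the field names and placement are assumptions:

    // Called once at the end of each periodic FsHealthMonitor run.
    private void emitMetrics() {
        StatusInfo health = getHealth(); // one consistent snapshot of the health state
        if (health.getStatus() == UNHEALTHY) {
            fsHealthFailCounter.add(1.0, Tags.create().addTag(NODE_ID, nodeEnv.nodeId()));
        }
    }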

Signed-off-by: Arpit Patawat <[email protected]>
Signed-off-by: Arpit Patawat <[email protected]>
Signed-off-by: Arpit Patawat <[email protected]>
Contributor

❌ Gradle check result for 518c343: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@@ -127,6 +129,11 @@ public class Coordinator extends AbstractLifecycleComponent implements Discovery
public static final long ZEN1_BWC_TERM = 0;

private static final Logger logger = LogManager.getLogger(Coordinator.class);
public static final String NODE_LEFT_REASON_LAGGING = "lagging";
public static final String NODE_LEFT_REASON_DISCONNECTED = "disconnected";
public static final String NODE_LEFT_REASON_HEALTHCHECK_FAIL = "health check failed";
Contributor

Can we please make this a single string and follow telemetry standards, like health.check.failed?

Author

These are just the names for the actual reasons; regardless, the requested convention has been applied.
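
A hedged sketch of the constants after applying the requested convention (the exact merged values are assumptions):

    public static final String NODE_LEFT_REASON_LAGGING = "lagging";
    public static final String NODE_LEFT_REASON_DISCONNECTED = "disconnected";
    // a single dotted string per the telemetry convention, replacing "health check failed"
    public static final String NODE_LEFT_REASON_HEALTHCHECK_FAIL = "health.check.failed";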

Comment on lines 215 to 216

clusterManagerMetrics.incrementCounter(
Contributor

Let's move this logic outside.

Author

As per the requested changes, this metric is no longer required; I have removed these changes.

Signed-off-by: Arpit Patawat <[email protected]>
Contributor

❕ Gradle check result for ff5e7ed: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.


codecov bot commented Apr 29, 2025

Codecov Report

Attention: Patch coverage is 95.83333% with 1 line in your changes missing coverage. Please review.

Project coverage is 72.57%. Comparing base (54e02a7) to head (ff5e7ed).
Report is 25 commits behind head on main.

Files with missing lines Patch % Lines
...g/opensearch/cluster/coordination/Coordinator.java 90.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #17949      +/-   ##
============================================
+ Coverage     72.47%   72.57%   +0.10%     
- Complexity    67034    67150     +116     
============================================
  Files          5478     5473       -5     
  Lines        310132   310102      -30     
  Branches      45087    45058      -29     
============================================
+ Hits         224775   225069     +294     
+ Misses        66960    66697     -263     
+ Partials      18397    18336      -61     

