Add FS health check Failure metric #17949

Open · wants to merge 9 commits into main

Conversation

arpitpatawat

Description

This PR adds a metric that is emitted when an FS health check fails. The metric's tag identifies the node whose FS health check failed. The PR also adds the node tag to the leader/follower check failure metrics.
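
A hedged sketch of the counter this implies follows; the metric name, description string, and unit are assumptions, while the createCounter(String, String, String) signature matches the org.opensearch.telemetry.metrics.MetricsRegistry calls visible in the test logs further down:

    // In FsHealthService: register the failure counter once at construction.
    // The name and unit here are hypothetical, not the merged values.
    this.fsHealthFailCounter = metricsRegistry.createCounter(
        "fs.health.failure.count",              // hypothetical metric name
        "Counter for FS health check failures",
        "1"                                     // dimensionless unit
    );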

Related Issues

None

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Contributor

❌ Gradle check result for ebe8756: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@owaiskazi19
Member

@arpitpatawat these look like related failures. Can you fix them? Then we should be good to get this in.

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.monitor.fs.FsHealthServiceTests.testLoggingOnHungIO" -Dtests.seed=18F8ECCBCAA54DA1 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=pa-Arab-PK -Dtests.timezone=PNT -Druntime.java=21

FsHealthServiceTests > testLoggingOnHungIO FAILED
    java.lang.NullPointerException: Cannot invoke "org.opensearch.telemetry.metrics.MetricsRegistry.createCounter(String, String, String)" because "this.metricsRegistry" is null
        at __randomizedtesting.SeedInfo.seed([18F8ECCBCAA54DA1:52392A5319A37360]:0)
        at org.opensearch.monitor.fs.FsHealthServiceTests.createObjects(FsHealthServiceTests.java:93)

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnHungIOBeyondHealthyTimeout" -Dtests.seed=18F8ECCBCAA54DA1 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=pa-Arab-PK -Dtests.timezone=PNT -Druntime.java=21

FsHealthServiceTests > testFailsHealthOnHungIOBeyondHealthyTimeout FAILED
    java.lang.NullPointerException: Cannot invoke "org.opensearch.telemetry.metrics.MetricsRegistry.createCounter(String, String, String)" because "this.metricsRegistry" is null
        at __randomizedtesting.SeedInfo.seed([18F8ECCBCAA54DA1:9A8CB2C2BE593BE9]:0)
        at org.opensearch.monitor.fs.FsHealthServiceTests.createObjects(FsHealthServiceTests.java:93)

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.monitor.fs.FsHealthServiceTests.testSchedulesHealthCheckAtRefreshIntervals" -Dtests.seed=18F8ECCBCAA54DA1 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=pa-Arab-PK -Dtests.timezone=PNT -Druntime.java=21

FsHealthServiceTests > testSchedulesHealthCheckAtRefreshIntervals FAILED
    java.lang.NullPointerException: Cannot invoke "org.opensearch.telemetry.metrics.MetricsRegistry.createCounter(String, String, String)" because "this.metricsRegistry" is null
        at __randomizedtesting.SeedInfo.seed([18F8ECCBCAA54DA1:6755A9BE8FD65332]:0)
        at org.opensearch.monitor.fs.FsHealthServiceTests.createObjects(FsHealthServiceTests.java:93)

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnSinglePathWriteFailure" -Dtests.seed=18F8ECCBCAA54DA1 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=pa-Arab-PK -Dtests.timezone=PNT -Druntime.java=21

FsHealthServiceTests > testFailsHealthOnSinglePathWriteFailure FAILED
    java.lang.NullPointerException: Cannot invoke "org.opensearch.telemetry.metrics.MetricsRegistry.createCounter(String, String, String)" because "this.metricsRegistry" is null
        at __randomizedtesting.SeedInfo.seed([18F8ECCBCAA54DA1:49BF9A3A99B7457D]:0)
        at org.opensearch.monitor.fs.FsHealthServiceTests.createObjects(FsHealthServiceTests.java:93)

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnIOException" -Dtests.seed=18F8ECCBCAA54DA1 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=pa-Arab-PK -Dtests.timezone=PNT -Druntime.java=21

FsHealthServiceTests > testFailsHealthOnIOException FAILED
    java.lang.NullPointerException: Cannot invoke "org.opensearch.telemetry.metrics.MetricsRegistry.createCounter(String, String, String)" because "this.metricsRegistry" is null
        at __randomizedtesting.SeedInfo.seed([18F8ECCBCAA54DA1:8FB87904E30A3C74]:0)
        at org.opensearch.monitor.fs.FsHealthServiceTests.createObjects(FsHealthServiceTests.java:93)

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnSinglePathFsyncFailure" -Dtests.seed=18F8ECCBCAA54DA1 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=pa-Arab-PK -Dtests.timezone=PNT -Druntime.java=21

FsHealthServiceTests > testFailsHealthOnSinglePathFsyncFailure FAILED
    java.lang.NullPointerException: Cannot invoke "org.opensearch.telemetry.metrics.MetricsRegistry.createCounter(String, String, String)" because "this.metricsRegistry" is null
        at __randomizedtesting.SeedInfo.seed([18F8ECCBCAA54DA1:57DE9DA58242E08]:0)
        at org.opensearch.monitor.fs.FsHealthServiceTests.createObjects(FsHealthServiceTests.java:93)
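
All six failures share one root cause: the test fixture builds FsHealthService without a MetricsRegistry, so registering the new counter dereferences null. A minimal sketch of a fix, assuming the PR adds the registry as a constructor parameter and that the telemetry module's no-op NoopMetricsRegistry.INSTANCE is acceptable in tests:

    import org.opensearch.telemetry.metrics.noop.NoopMetricsRegistry;

    // In FsHealthServiceTests.createObjects(): pass a no-op registry so that
    // createCounter() never executes against a null metricsRegistry.
    fsHealthService = new FsHealthService(settings, clusterSettings, threadPool, env, NoopMetricsRegistry.INSTANCE);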

Contributor

❌ Gradle check result for 4450222: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

❌ Gradle check result for 166088c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@@ -213,6 +229,7 @@ private void monitorFSHealth() {
} catch (IllegalStateException e) {
logger.error("health check failed", e);
brokenLock = true;
fsHealthFailCounter.add(1.0, Tags.create().addTag(NODE_ID, nodeEnv.nodeId()));
Contributor

Any reason to add the metric during the exception as well? Ideally this is not an FS issue; we are just unable to get the nodeDataPaths.

Author

Although the exception comes from a failure while retrieving the nodeDataPaths, it is logged as a health check failure and sets brokenLock to true. The getHealth() method in the same class marks the status as unhealthy when brokenLock is true. Also, if there is an error while retrieving the paths, we cannot proceed with creating the temp file and checking FS health, so I added the failure metric in the exception handler as well. Let me know your thoughts otherwise.
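
For context, a simplified sketch of the getHealth() behavior being described; the real FsHealthService differs in detail, and the field names here are assumptions:

    public StatusInfo getHealth() {
        if (brokenLock) {
            // A failed nodeDataPaths() lookup ends up here, which is why the
            // exception handler treats it as a health check failure.
            return new StatusInfo(UNHEALTHY, "health check failed due to broken node lock");
        }
        if (unhealthyPaths != null && unhealthyPaths.isEmpty() == false) {
            return new StatusInfo(UNHEALTHY, "health check failed on some paths");
        }
        return new StatusInfo(HEALTHY, "health check passed");
    }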

@@ -423,7 +426,11 @@ public String executor() {
}

void failNode(String reason) {
clusterManagerMetrics.incrementCounter(clusterManagerMetrics.followerChecksFailureCounter, 1.0);
clusterManagerMetrics.incrementCounter(
Contributor

Let's keep both metrics, since replacing the existing one would be a breaking change for downstream consumers.

import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.Consumer;
import java.util.function.Predicate;

import static org.opensearch.common.util.concurrent.ConcurrentCollections.newConcurrentMap;
import static org.opensearch.monitor.StatusInfo.Status.UNHEALTHY;
import static org.opensearch.telemetry.tracing.AttributeNames.NODE_ID;
Contributor

Why are you using a tracing attribute here?

Author

I wanted to use the key node_id for the tag attribute, and I found this constant already in the codebase, so I reused it.

Contributor

This is for traces, not metrics.
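
A hedged sketch of the alternative the reviewer is pointing at: a metrics-local tag key instead of the tracing AttributeNames constant. The constant's name and placement are assumptions:

    // Defined alongside the other metrics constants, rather than importing
    // org.opensearch.telemetry.tracing.AttributeNames.NODE_ID:
    public static final String NODE_ID = "node_id";

    // The call site stays the same:
    fsHealthFailCounter.add(1.0, Tags.create().addTag(NODE_ID, nodeEnv.nodeId()));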

@@ -423,7 +426,11 @@ public String executor() {
}

void failNode(String reason) {
clusterManagerMetrics.incrementCounter(clusterManagerMetrics.followerChecksFailureCounter, 1.0);
clusterManagerMetrics.incrementCounter(
Contributor

This is a breaking change. Can we add a new metric instead?

Author

OK, I will add a new metric for both.
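
A hedged sketch of the non-breaking shape agreed on here: keep the existing untagged counter and emit a separately named, node-tagged counter next to it (the new counter's field name is hypothetical):

    void failNode(String reason) {
        // existing metric, unchanged for downstream consumers
        clusterManagerMetrics.incrementCounter(clusterManagerMetrics.followerChecksFailureCounter, 1.0);
        // new node-tagged metric emitted alongside it
        clusterManagerMetrics.incrementCounter(
            clusterManagerMetrics.followerChecksFailureNodeCounter, // hypothetical new field
            1.0,
            Optional.of(Tags.create().addTag(NODE_ID, discoveryNode.getId()))
        );
        // ... rest of the existing failNode logic
    }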

@Gaganjuneja
Contributor

@Bukhtawar @rajiv-kv Can you please take a look at this PR?

clusterManagerMetrics.incrementCounter(
clusterManagerMetrics.followerChecksFailureCounter,
1.0,
Optional.ofNullable(Tags.create().addTag(NODE_ID, discoveryNode.getId()))
Contributor

Why do we need to tag the metric with node_id? Given that the attribute value will be unique per node, how would this be useful for aggregation/monitoring?

@nishchay21 · Apr 18, 2025
Contributor

NodeId will help us know on which node the check has failed, since this is captured on the cluster manager. Do let me know your thoughts @rajiv-kv.

Contributor

@nishchay21 node-ip is present in the cluster-manager logs. Would that information suffice?

If a tag is associated with the metric, I assume this translates to one follower-check time series for every node in the cluster. Wouldn't this incur additional memory overhead on the active cluster-manager, besides the cost of emitting a per-node metric?

Contributor

I agree, but this would only be populated if a node is lagging, and only for those nodes. We mainly want to capture that a node dropped and correlate the reason to lagging. We also looked at the node-left path as a place to emit the reason metric. Do you feel it is better to add it there and correlate from that side?

Comment on lines 279 to 282

if (currentUnhealthyPaths != null && !currentUnhealthyPaths.isEmpty()) {
fsHealthFailCounter.add(1.0, Tags.create().addTag(NODE_ID, nodeEnv.nodeId()));
}
Contributor

Would it be better to emit this metric from the caller on failures? That way we would also have the context of the leader/follower/preVoteCollector on which the FS check failed.

Contributor

If the caller is not able to get a response back from the health check, I'm not sure we would be able to return this to the caller either. Wouldn't it be more helpful to build this correlation outside? Emitting from the caller might still be useful for the leader/follower checks, though; thoughts?

Author

Additionally, the monitorFSHealth() method is called periodically, so the metric is published as soon as the issue is detected, independently of any caller.
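
For reference, a sketch of the periodic scheduling being described; it mirrors how the service starts its monitor, though the exact names here are assumptions:

    // In doStart(): run the monitor every refreshInterval on the generic
    // thread pool, so a failure increments the counter within one interval
    // of being detected, with no caller involved.
    scheduledFuture = threadPool.scheduleWithFixedDelay(new FsHealthMonitor(), refreshInterval, ThreadPool.Names.GENERIC);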

Contributor

I think in that case FsHealthMonitor might be the correct place to emit this metric, since the periodicity of the metric will be the same across all the nodes in the cluster.

The FollowerChecker and LeaderChecker frequency depends on the settings configuration and the type of node (follower/leader).

Contributor

+1, I agree the instrumentation should be done independently. Consider the case where the node has been kicked out of the cluster (so there is no follower checker): join requests to the cluster get blocked if the node isn't healthy. Instrumenting at the monitor level will give us actual metrics on FS health.

Contributor

❌ Gradle check result for c4c65b3: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

❌ Gradle check result for 3522d20: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

long differenceInVersion = version - appliedVersion;
clusterManagerMetrics.incrementCounter(
clusterManagerMetrics.lagCounter,
(double) differenceInVersion,
Contributor

We shouldn't publish the version diff.
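
A hedged sketch of the suggested change: record that a lag event occurred instead of publishing the raw version difference as the increment (tagging shown as in the surrounding diff):

    // Count the lagging event; the magnitude of the version gap is not emitted.
    clusterManagerMetrics.incrementCounter(
        clusterManagerMetrics.lagCounter,
        1.0,
        Optional.ofNullable(Tags.create().addTag(NODE_ID, discoveryNode.getId()))
    );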

@@ -83,7 +96,11 @@ public ClusterManagerMetrics(MetricsRegistry metricsRegistry) {
"Counter for number of successful async fetches",
COUNTER_METRICS_UNIT
);

lagCounter = metricsRegistry.createCounter(
Contributor

What is the use case for emitting this metric? We would have to consider term and version together to determine the cluster-state version.

@Bukhtawar
Contributor

To ensure consistency, should we add a method emitMetrics at the end of the periodic run that invokes the getHealth() method to get a consistent view of the health state and emits the metric accordingly?
I feel that would simplify the code and make it more maintainable.
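
A hedged sketch of that suggestion; the method name comes from the comment above, while the field names and placement are assumptions:

    // Called once at the end of each periodic FsHealthMonitor run.
    private void emitMetrics() {
        StatusInfo health = getHealth(); // one consistent snapshot of the health state
        if (health.getStatus() == UNHEALTHY) {
            fsHealthFailCounter.add(1.0, Tags.create().addTag(NODE_ID, nodeEnv.nodeId()));
        }
    }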

Signed-off-by: Arpit Patawat <[email protected]>
Signed-off-by: Arpit Patawat <[email protected]>
Signed-off-by: Arpit Patawat <[email protected]>
Contributor

❌ Gradle check result for 518c343: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@@ -127,6 +129,11 @@ public class Coordinator extends AbstractLifecycleComponent implements Discovery
public static final long ZEN1_BWC_TERM = 0;

private static final Logger logger = LogManager.getLogger(Coordinator.class);
public static final String NODE_LEFT_REASON_LAGGING = "lagging";
public static final String NODE_LEFT_REASON_DISCONNECTED = "disconnected";
public static final String NODE_LEFT_REASON_HEALTHCHECK_FAIL = "health check failed";
Contributor

Can we please make this a single string and follow telemetry standards, like health.check.failed?

Author

These are just the names for the actual reasons; regardless, the requested convention has been applied.
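
A hedged sketch of the constants after applying the requested convention (the exact merged values are assumptions):

    public static final String NODE_LEFT_REASON_LAGGING = "lagging";
    public static final String NODE_LEFT_REASON_DISCONNECTED = "disconnected";
    // a single dotted string per the telemetry convention, replacing "health check failed"
    public static final String NODE_LEFT_REASON_HEALTHCHECK_FAIL = "health.check.failed";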

Comment on lines 215 to 216

clusterManagerMetrics.incrementCounter(
Contributor

Let's move this logic outside.

Author

As per the requested changes, this metric is no longer required; I have removed these changes.

Signed-off-by: Arpit Patawat <[email protected]>
Contributor

❕ Gradle check result for ff5e7ed: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.


codecov bot commented Apr 29, 2025

Codecov Report

Attention: Patch coverage is 95.83333% with 1 line in your changes missing coverage. Please review.

Project coverage is 72.57%. Comparing base (54e02a7) to head (ff5e7ed).
Report is 25 commits behind head on main.

Files with missing lines Patch % Lines
...g/opensearch/cluster/coordination/Coordinator.java 90.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #17949      +/-   ##
============================================
+ Coverage     72.47%   72.57%   +0.10%     
- Complexity    67034    67150     +116     
============================================
  Files          5478     5473       -5     
  Lines        310132   310102      -30     
  Branches      45087    45058      -29     
============================================
+ Hits         224775   225069     +294     
+ Misses        66960    66697     -263     
+ Partials      18397    18336      -61     

