[telemetry] Detect service reachability issues #3426

muhamadazmy · 2025-06-19T14:07:10Z

[telemetry] Detect service reachability issues

Summary:
Introduce a counter for the number of "reachability" issues per service,
which helps detect when a service is down or unresponsive.

By visualizing `rate(restate.invoker.service_unreachable_errors.total)` and coupling it with alerts, operators
can tell when a service is facing connectivity problems.
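To illustrate the arithmetic behind a `rate(...)` view of a counter, here is a minimal sketch that computes the average per-second increase between two scrapes and compares it to a threshold. The `Sample` type, `per_second_rate`, and `should_alert` are hypothetical names used only for illustration; they are not part of this PR or of any monitoring system's API. Real query engines handle counter resets and extrapolation differently, whereas this sketch simply gives up on a reset.

```rust
use std::time::Duration;

/// One scrape of a monotonically increasing counter (hypothetical type).
#[derive(Debug, Clone, Copy)]
pub struct Sample {
    /// Time of the scrape, measured from an arbitrary fixed origin.
    pub at: Duration,
    /// Counter value observed at that time.
    pub value: u64,
}

/// Average per-second increase of the counter between two scrapes.
///
/// Returns `None` when the window is empty or reversed, or when the counter
/// went backwards (for example after a process restart).
pub fn per_second_rate(earlier: Sample, later: Sample) -> Option<f64> {
    let elapsed = later.at.checked_sub(earlier.at)?.as_secs_f64();
    if elapsed == 0.0 || later.value < earlier.value {
        return None;
    }
    Some((later.value - earlier.value) as f64 / elapsed)
}

/// True when a rate is known and exceeds the alerting threshold.
pub fn should_alert(rate: Option<f64>, threshold: f64) -> bool {
    matches!(rate, Some(r) if r > threshold)
}
```

For example, a counter that moves from 10 to 40 over 60 seconds gives a rate of 0.5 errors per second, which would trip a threshold of 0.1 but not a threshold of 1.0.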

Stack created with Sapling. Best reviewed with ReviewStack.

- Add time taken to receive the first response (headers) from the user service
- Add time taken to replay the journal

muhamadazmy · 2025-06-19T14:07:44Z

pcholakov

This is a great observability addition, thank you @muhamadazmy! Left one minor naming comment which you should feel free to ignore.

pcholakov · 2025-06-19T21:38:14Z

crates/invoker-impl/src/metric_definitions.rs

 pub const INVOKER_TASKS_IN_FLIGHT: &str = "restate.invoker.inflight_tasks";
+pub const INVOKER_JOURNAL_REPLAY_TIME: &str = "restate.invoker.journal_replay_time.seconds";
+pub const INVOKER_SERVICE_DOWN_ERRORS: &str = "restate.invoker.service_down_errors.total";


Nitpicky naming observation: Wondering if "unavailable" or "unreachable" might not be more accurate than "down" - since we can't tell authoritatively that it's really down, just that it's not available from our point of view.

Ah very nice. Thank you. Will apply :)
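The naming point above can be made concrete with a small sketch. The predicate below is hypothetical and is not the `is_service_down_error` from this PR, whose body is not shown here. It assumes the underlying failure surfaces as a `std::io::ErrorKind` and treats only connection-level failures as "unreachable", since from the caller's side these say nothing authoritative about whether the service is actually down.

```rust
use std::io::ErrorKind;

/// Hypothetical classifier: does this I/O error kind suggest that the remote
/// endpoint could not be reached from our point of view?
///
/// This reports reachability only. A refused or timed-out connection may be
/// caused by the network path just as well as by the service itself.
pub fn looks_unreachable(kind: ErrorKind) -> bool {
    matches!(
        kind,
        ErrorKind::ConnectionRefused
            | ErrorKind::ConnectionReset
            | ErrorKind::ConnectionAborted
            | ErrorKind::NotConnected
            | ErrorKind::TimedOut
    )
}
```

Errors such as `ErrorKind::InvalidData` fall outside the set, because the endpoint was reached and answered, just not usefully.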


AhmedSoliman · 2025-06-20T15:39:23Z

crates/invoker-impl/src/metric_definitions.rs

@@ -21,6 +21,8 @@ pub const INVOKER_TASK_DURATION: &str = "restate.invoker.task_duration.seconds";
 pub const INVOKER_SERVICE_RESPONSE_TIME: &str = "restate.invoker.service_response_time.seconds";
 pub const INVOKER_TASKS_IN_FLIGHT: &str = "restate.invoker.inflight_tasks";
 pub const INVOKER_JOURNAL_REPLAY_TIME: &str = "restate.invoker.journal_replay_time.seconds";
+pub const INVOKER_SERVICE_UNREACHABLE_ERRORS: &str =
+    "restate.invoker.service_unreachable_errors.total";


I guess you mean deployment

AhmedSoliman · 2025-06-20T15:40:08Z

crates/invoker-impl/src/lib.rs

@@ -1090,6 +1090,11 @@ where
            .remove_invocation_with_epoch(partition, &invocation_id, invocation_epoch)
        {
            debug_assert_eq!(invocation_epoch, ism.invocation_epoch);
+
+            if self.is_service_down_error(&error) {
+                counter!(INVOKER_SERVICE_UNREACHABLE_ERRORS, "service" => ism.invocation_target.service_name().to_string()).increment(1);


I think we need to have both the deployment id and the service name. The risk is that this will be a high cardinality metric.
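To make the cardinality trade-off concrete, here is a minimal std-only sketch that models a labelled counter as a map from label set to count. `Labels` and `UnreachableCounter` are hypothetical types for illustration and do not reflect how the metrics recorder used in this PR stores series. Each distinct (service, deployment id) pair becomes its own time series, so the series count grows with the number of combinations actually observed.

```rust
use std::collections::HashMap;

/// Label set identifying one time series (hypothetical).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Labels {
    pub service: String,
    pub deployment_id: String,
}

/// A labelled counter: one running total per distinct label set.
#[derive(Debug, Default)]
pub struct UnreachableCounter {
    series: HashMap<Labels, u64>,
}

impl UnreachableCounter {
    /// Record one unreachable error for the given service and deployment.
    pub fn increment(&mut self, service: &str, deployment_id: &str) {
        let key = Labels {
            service: service.to_string(),
            deployment_id: deployment_id.to_string(),
        };
        *self.series.entry(key).or_insert(0) += 1;
    }

    /// Number of distinct time series, i.e. the metric's cardinality.
    pub fn series_count(&self) -> usize {
        self.series.len()
    }

    /// Sum over all deployments of one service.
    pub fn total_for_service(&self, service: &str) -> u64 {
        self.series
            .iter()
            .filter(|(labels, _)| labels.service == service)
            .map(|(_, count)| *count)
            .sum()
    }
}
```

With errors recorded for ("Greeter", "dp_1"), ("Greeter", "dp_2"), and ("Cart", "dp_2"), the counter holds three series, while a service-only label would hold two. The per-service total can still be recovered by summing across deployments.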

muhamadazmy added 2 commits June 19, 2025 12:49

Set histogram quantiles to [50, 90, 99, and 100%]

1f2b4fa

[telemetry] Improve invoker observability

3bfdcc9

- Add time taken to receive the first response (headers) from the user service
- Add time taken to replay the journal

This was referenced Jun 19, 2025

Set histogram quantiles to [50, 90, 99, and 100%] #3425

Merged

[telemetry] Improve invoker observability #3424

Open

muhamadazmy requested review from AhmedSoliman and pcholakov June 19, 2025 14:07

pcholakov approved these changes Jun 19, 2025


muhamadazmy force-pushed the pr3426 branch from 22d67c4 to d4be614 Compare June 20, 2025 07:59

AhmedSoliman reviewed Jun 20, 2025

