Skip to content

[telemetry] Detect service rechability issues #3426

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

muhamadazmy
Copy link
Contributor

@muhamadazmy muhamadazmy commented Jun 19, 2025

[telemetry] Detect service rechability issues

Summary:
Introducing a counter for number of "rechability" issues for a service
that can detect a service is down or un-responsive

by visualizing rate(restate.invoker.service_unreachable_errors.total) coupled with alerts, operator
can know when a service is facing connectivity problems


Stack created with Sapling. Best reviewed with ReviewStack.

- Add time taken to receive the first response (headers) from the user service
- Add Time taken to replay the journal
@muhamadazmy
Copy link
Contributor Author

image

Copy link
Contributor

@pcholakov pcholakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great observability addition, thank you @muhamadazmy! Left one minor naming comment which you should feel free to ignore.

pub const INVOKER_TASKS_IN_FLIGHT: &str = "restate.invoker.inflight_tasks";
pub const INVOKER_JOURNAL_REPLAY_TIME: &str = "restate.invoker.journal_replay_time.seconds";
pub const INVOKER_SERVICE_DOWN_ERRORS: &str = "restate.invoker.service_down_errors.total";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicky naming observation: Wondering if "unavailable" or "unreachable" might not be more accurate than "down" - since we can't tell authoritatively that it's really down, just that it's not available from our point of view.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah very nice. Thank you. Will apply :)

Summary:
Introducing a counter for number of "rechability" issues for a service
that can detect a service is down or un-responsive

by visualizing `rate(restate.invoker.service_unreachable_errors.total)` coupled with alerts, operator
can know when a service is facing connectivity problems
@@ -21,6 +21,8 @@ pub const INVOKER_TASK_DURATION: &str = "restate.invoker.task_duration.seconds";
pub const INVOKER_SERVICE_RESPONSE_TIME: &str = "restate.invoker.service_response_time.seconds";
pub const INVOKER_TASKS_IN_FLIGHT: &str = "restate.invoker.inflight_tasks";
pub const INVOKER_JOURNAL_REPLAY_TIME: &str = "restate.invoker.journal_replay_time.seconds";
pub const INVOKER_SERVICE_UNREACHABLE_ERRORS: &str =
"restate.invoker.service_unreachable_errors.total";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you mean deployment

@@ -1090,6 +1090,11 @@ where
.remove_invocation_with_epoch(partition, &invocation_id, invocation_epoch)
{
debug_assert_eq!(invocation_epoch, ism.invocation_epoch);

if self.is_service_down_error(&error) {
counter!(INVOKER_SERVICE_UNREACHABLE_ERRORS, "service" => ism.invocation_target.service_name().to_string()).increment(1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to have both the deployment id and the service name. The risk is that it this will be a high cardinality metric.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants