-
Notifications
You must be signed in to change notification settings - Fork 69
[telemetry] Detect service rechability issues #3426
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Add time taken to receive the first response (headers) from the user service - Add Time taken to replay the journal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great observability addition, thank you @muhamadazmy! Left one minor naming comment which you should feel free to ignore.
pub const INVOKER_TASKS_IN_FLIGHT: &str = "restate.invoker.inflight_tasks"; | ||
pub const INVOKER_JOURNAL_REPLAY_TIME: &str = "restate.invoker.journal_replay_time.seconds"; | ||
pub const INVOKER_SERVICE_DOWN_ERRORS: &str = "restate.invoker.service_down_errors.total"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpicky naming observation: Wondering if "unavailable" or "unreachable" might not be more accurate than "down" - since we can't tell authoritatively that it's really down, just that it's not available from our point of view.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah very nice. Thank you. Will apply :)
Summary: Introducing a counter for number of "rechability" issues for a service that can detect a service is down or un-responsive by visualizing `rate(restate.invoker.service_unreachable_errors.total)` coupled with alerts, operator can know when a service is facing connectivity problems
@@ -21,6 +21,8 @@ pub const INVOKER_TASK_DURATION: &str = "restate.invoker.task_duration.seconds"; | |||
pub const INVOKER_SERVICE_RESPONSE_TIME: &str = "restate.invoker.service_response_time.seconds"; | |||
pub const INVOKER_TASKS_IN_FLIGHT: &str = "restate.invoker.inflight_tasks"; | |||
pub const INVOKER_JOURNAL_REPLAY_TIME: &str = "restate.invoker.journal_replay_time.seconds"; | |||
pub const INVOKER_SERVICE_UNREACHABLE_ERRORS: &str = | |||
"restate.invoker.service_unreachable_errors.total"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you mean deployment
@@ -1090,6 +1090,11 @@ where | |||
.remove_invocation_with_epoch(partition, &invocation_id, invocation_epoch) | |||
{ | |||
debug_assert_eq!(invocation_epoch, ism.invocation_epoch); | |||
|
|||
if self.is_service_down_error(&error) { | |||
counter!(INVOKER_SERVICE_UNREACHABLE_ERRORS, "service" => ism.invocation_target.service_name().to_string()).increment(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to have both the deployment id and the service name. The risk is that it this will be a high cardinality metric.
[telemetry] Detect service rechability issues
Summary:
Introducing a counter for number of "rechability" issues for a service
that can detect a service is down or un-responsive
by visualizing
rate(restate.invoker.service_unreachable_errors.total)
coupled with alerts, operatorcan know when a service is facing connectivity problems
Stack created with Sapling. Best reviewed with ReviewStack.