[healthcheckv2] Log details in http/grpc handlers when components are not okay #39058
Conversation
for pipeline, pipelineStatus := range s.aggregator.FindNotOk(status.Verbosity(verbose)) {
	s.telemetry.Logger.Warn("pipeline not ok", common.HealthLogFields("pipeline", pipeline, pipelineStatus)...)
	for component, componentSt := range pipelineStatus.ComponentStatusMap {
		s.telemetry.Logger.Warn("component not ok", common.HealthLogFields("component", component, componentSt)...)
	}
}
instead of creating all those objects, could you make it walk through the statuses and log directly? Is there a reason to iterate this way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@atoulme this would only work if I created some sort of iterator function in the aggregator. I have to do the copy (and you'll notice I use helpers already present to properly deep-copy the objects) because they might be modified by a concurrent goroutine, creating a data race. All methods of the aggregator use a mutex to coordinate access to the aggregated statuses.
The downside of the iterator approach is that I might (not entirely sure) have to hold the read lock a little longer than with copying the statuses out for a very quick use.
Happy to go for the iterator if you think it's better, though.
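For illustration, here's a minimal sketch of the two approaches being weighed (copying under the lock versus an iterator that holds the lock while the caller logs); the aggregator, status shape, and method names below are simplified stand-ins, not the extension's actual API:

```go
package health

import "sync"

// pipelineStatus is a simplified stand-in for the extension's status type.
type pipelineStatus struct {
	Status             string
	ComponentStatusMap map[string]string
}

// aggregator is a simplified stand-in for the healthcheckv2 aggregator,
// which guards its state with a mutex.
type aggregator struct {
	mu        sync.RWMutex
	pipelines map[string]*pipelineStatus
}

// FindNotOK returns deep copies of the not-OK statuses, so callers can use
// them after the lock is released without racing concurrent status updates.
func (a *aggregator) FindNotOK() map[string]*pipelineStatus {
	a.mu.RLock()
	defer a.mu.RUnlock()
	out := make(map[string]*pipelineStatus)
	for name, st := range a.pipelines {
		if st.Status == "StatusOK" {
			continue
		}
		cp := &pipelineStatus{
			Status:             st.Status,
			ComponentStatusMap: make(map[string]string, len(st.ComponentStatusMap)),
		}
		for k, v := range st.ComponentStatusMap {
			cp.ComponentStatusMap[k] = v
		}
		out[name] = cp
	}
	return out
}

// WalkNotOK is the iterator alternative: no copies are made, but the read
// lock stays held for as long as the callback (e.g. logging) runs.
func (a *aggregator) WalkNotOK(fn func(pipeline string, st *pipelineStatus)) {
	a.mu.RLock()
	defer a.mu.RUnlock()
	for name, st := range a.pipelines {
		if st.Status == "StatusOK" {
			continue
		}
		fn(name, st)
	}
}
```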
I'll try to improve the logs anyway. I think I can make them a single line per pipeline that will include a field per component of the pipeline that isn't healthy.
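A rough sketch of how such a single line could be assembled with one zap field per component; the helper below is a hypothetical illustration, not the actual common.HealthLogFields implementation:

```go
package health

import "go.uber.org/zap"

// pipelineLogFields builds one zap field per component so a whole pipeline
// can be reported in a single warn-level log line. The parameter shapes are
// simplified stand-ins for the extension's status types.
func pipelineLogFields(pipelineID, pipelineSt string, componentStatuses map[string]string) []zap.Field {
	fields := make([]zap.Field, 0, len(componentStatuses)+2)
	fields = append(fields,
		zap.String("pipeline.instance.id", pipelineID),
		zap.String("pipeline.status", pipelineSt),
	)
	for component, st := range componentStatuses {
		// e.g. zap.String("processor:batch", "StatusStarting")
		fields = append(fields, zap.String(component, st))
	}
	return fields
}
```

Used as `logger.Warn("pipeline not ok", pipelineLogFields(id, status, components)...)`, this produces a single line per pipeline, like the examples shown below.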
@atoulme what do you think of this new, more succinct format for the "verbose" logs, which includes all components inside a pipeline? Besides the fields `pipeline.instance.id` and `pipeline.status` to identify the pipeline, it contains a field per component with its status, e.g. `"processor:batch":"StatusStarting"`.
Example of the simpler verbose logs
{"level":"warn","ts":"2025-04-07T14:12:25.734Z","caller":"http/handlers.go:26","msg":"pipeline not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","pipeline.instance.id":"pipeline:logs/resource_catalog","pipeline.status":"StatusStarting","processor:resourcedetection/resource_catalog":"StatusStarting","exporter:coralogix/resource_catalog":"StatusStarting","processor:batch":"StatusStarting","processor:resource/metadata":"StatusStarting","processor:filter/workflow":"StatusStarting","processor:transform/entity-event":"StatusStarting"}
{"level":"warn","ts":"2025-04-07T14:12:25.734Z","caller":"http/handlers.go:26","msg":"pipeline not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","pipeline.instance.id":"extensions","pipeline.status":"StatusStarting","extension:zpages":"StatusStarting","extension:k8s_observer":"StatusStarting","extension:opamp":"StatusStarting","extension:healthcheckv2":"StatusStarting","extension:pprof":"StatusStarting"}
{"level":"warn","ts":"2025-04-07T14:12:25.735Z","caller":"http/handlers.go:26","msg":"pipeline not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","pipeline.instance.id":"pipeline:metrics","pipeline.status":"StatusStarting","processor:memory_limiter":"StatusStarting","processor:metricstransform/k8s-dashboard":"StatusStarting","processor:transform/prometheus":"StatusStarting","processor:resource/metadata":"StatusStarting","processor:filter/k8s_extra_metrics":"StatusStarting","processor:transform/k8s_attributes":"StatusStarting","processor:resourcedetection/region":"StatusStarting","processor:batch":"StatusStarting","receiver:otlp":"StatusStarting","exporter:coralogix":"StatusStarting","processor:resourcedetection/env":"StatusStarting","processor:transform/k8s-dashboard":"StatusStarting","processor:k8sattributes":"StatusStarting"}
{"level":"warn","ts":"2025-04-07T14:12:25.735Z","caller":"http/handlers.go:26","msg":"pipeline not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","pipeline.instance.id":"pipeline:logs","pipeline.status":"StatusStarting","processor:memory_limiter":"StatusStarting","exporter:coralogix":"StatusStarting","processor:transform/k8s_attributes":"StatusStarting","processor:resource/kube-events":"StatusStarting","receiver:k8sobjects":"StatusStarting","processor:batch":"StatusStarting","processor:k8sattributes":"StatusStarting","processor:transform/kube-events":"StatusStarting","processor:resource/metadata":"StatusStarting"}
I can't really read this, tbh. I get that if I was on-call and seeing this log, I'd drop into a text editor, prettify the JSON and get to read the details.
Totally, @atoulme. At least the information is there now and can be extracted to be viewed in a rich editor with formatting and other features that help visualize the JSON. 👍
We have two options: one very long line with lots of information in it, or multiple smaller lines that convey the same information.
I don't really get it; you want to log those warnings from the HTTP and gRPC handlers. Those are side effects of calling the functions that report status. They can be triggered by HTTP requests and gRPC requests and can flood the logs. The logs only report components which are not "ok" - maybe you mean healthy. This seems to me like you're trying to interrogate the collector but see the information it reports via its logs instead of via the response of the handlers. Do you want to:
Additional thoughts:
Hey @atoulme, I'm happy to clarify some points for you:
The health of each component, respecting the verbosity parameter, is already returned in the response body of the HTTP/gRPC handlers. To clarify what these handlers are: they expose healthcheck information aggregated from all components and pipelines. Keep in mind that a very important use case here is using these endpoints as readiness/liveness probes on a pod.
Not exactly. "not ok" could mean "starting", or "stopping", or "stopped". I am only interested in anything that could be one of these 3 statuses plus the error ones (recoverable, permanent, or fatal). I didn't want to name things differently than what's already present in the healthcheckv2 extension.
The point of these changes is to allow anyone who uses healthcheckv2 as a readiness/liveness probe to know why and what is failing their checks. Today, you won't know what component or pipeline is having issues because the information is only present in the response body of the HTTP/gRPC request -- which is blissfully ignored by the k8s probes. An example: if you use readiness/liveness probes and a pipeline never becomes healthy because of a component that also doesn't log why it has issues, you won't know the reasons unless you manually send a request to the healthcheck endpoint yourself. Personally I've faced this scenario, which is why I wanted to improve it. It would be great if the k8s probes had a better way of detecting a "reason" for the probes' failure, but that is way out of scope.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
@evan-bradley is OoO until 05/05 (see #39285). Maybe someone else could have a look at this? @mwear?
Anyone available to have a look at this, please? 🥺
@douglascamata sorry for the delayed review. I see what problem you're trying to solve, but the UX feels a little off for "on-site" debugging (i.e. looking directly at a Collector instance for debugging), and the healthcheck extension doesn't seem like the right place if these logs are intended for viewing in a backend. For debugging the Collector directly, I think working with the information returned by the endpoint makes the most sense, either locally through cURL or by configuring the liveness probes. According to the k8s docs, there's an
For handling this more at a fleet level, where this information may end up in a backend, I would prefer:
@mwear do you have any thoughts as code owner?
It's effectively impossible to do any of what you propose, @evan-bradley :/ Setting up the liveness probes (which is what got me to propose these changes in the first place) will only tell you "things are failing, I'm killing this pod", but it won't tell you what was failing. You can't do anything locally through cURL. If we change the image to be based on something with
Hummm, I'm not sure option 2 is a good idea. What would you do if the pipeline that you emit to is the one that's not healthy? I believe these logs have to go through a "100% safe" log pipeline. I think the only way out might be option 1. It has been a while and I lost the contextual knowledge I had on this, so I'm not aware of how feasible or difficult this alternative would be.
Are you able to see logs on the probes? I would think you can write a script that will print out the response body from the health check.
Would it be possible to query the Collector from a local machine, or from another pod in the cluster?
From our discussion on the 2025-05-21 SIG call today, I agree option 1 is likely our best option. It would most likely involve just printing a debug log from someplace like here with the information in the status event. The SDK would then pick this up and emit it, or it would be available through stderr. There may be more involved, but that's my first impression.
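A rough sketch of what that could look like; the event type and field names here are illustrative assumptions, not the actual code at the linked location (a real implementation would hook into the Collector's own component status event types):

```go
package status

import "go.uber.org/zap"

// statusEvent is a hypothetical view of a component status event; real code
// would use the Collector's own status/event types instead.
type statusEvent interface {
	Status() string // e.g. "StatusRecoverableError"
	Err() error     // non-nil for error statuses
}

// logStatusEvent shows how a debug log could be emitted whenever a component
// reports a non-OK status, so the details reach stderr (or the logs SDK)
// without depending on the health endpoints at all.
func logStatusEvent(logger *zap.Logger, instanceID string, ev statusEvent) {
	if ev.Status() == "StatusOK" {
		return
	}
	fields := []zap.Field{
		zap.String("component.instance.id", instanceID),
		zap.String("component.status", ev.Status()),
	}
	if err := ev.Err(); err != nil {
		fields = append(fields, zap.Error(err))
	}
	logger.Debug("component status changed", fields...)
}
```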
Can't run a script because the container image is
Only if I expose the port (not normal for a healthcheck endpoint) or do a port-forward. Otherwise it's only accessible from other containers in the same pod. So you could use the new
Yep, agreed. I might have some time to do this, but I can't guarantee it. So I think the best way forward is to turn the need into an issue and then close this PR. Do you agree, @evan-bradley?
Created the issue #40325 to have a record of what we want to implement and included the suggestion from @evan-bradley there. Closing this for now as this is not the way that we want to move forward.
Description
This PR adds a small log at `warn` level when the `healthcheckv2` extension detects a pipeline in a not-ok state. It adapts the log according to the `verbosity` URL parameter of the HTTP healthcheck handler. The gRPC handler only gives the "concise" answer, since the request doesn't carry any information about verbosity (at least yet).

This should help debug situations where, for instance, the healthcheck endpoint is accessed by k8s probes to determine the readiness/liveness of the Collector and, for unknown reasons, a pipeline isn't healthy.
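For context, here's a minimal sketch of how a handler could adapt the warn logs to a verbosity query parameter; the wrapper, parameter name, and status lookup below are simplified assumptions rather than the extension's exact handler code:

```go
package health

import (
	"net/http"

	"go.uber.org/zap"
)

// findNotOK is a stand-in for the aggregator lookup used in the handlers: it
// returns the not-OK pipelines, each with a map of component statuses.
type findNotOK func() map[string]map[string]string

// withNotOKLogging wraps a health handler so that serving the check also logs
// any not-OK pipelines at warn level, with per-component detail only when the
// request asks for verbose output.
func withNotOKLogging(next http.Handler, logger *zap.Logger, notOK findNotOK) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		verbose := r.URL.Query().Has("verbose")
		for pipeline, components := range notOK() {
			fields := []zap.Field{zap.String("pipeline.instance.id", pipeline)}
			if verbose {
				for component, st := range components {
					fields = append(fields, zap.String(component, st))
				}
			}
			logger.Warn("pipeline not ok", fields...)
		}
		next.ServeHTTP(w, r)
	})
}
```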
Testing
Ran it locally to reproduce the scenario and capture some example logs.
Here are the "concise" logs:
Concise logs sample
And here are the verbose ones:
Verbose logs sample