
[healthcheckv2] Log details in http/grpc handlers when components are not okay #39058


Conversation

@douglascamata (Member) commented Mar 31, 2025

Description

This PR adds a small warn-level log when the healthcheckv2 extension detects the pipeline in a not-ok state.

The log adapts to the verbosity URL parameter of the HTTP healthcheck handler. The gRPC handler only gives the "concise" answer, since the request doesn't carry any verbosity information (at least not yet).

This should help debug situations where, for instance, the healthcheck endpoint is used by Kubernetes readiness/liveness probes to determine the health of the Collector and, for unknown reasons, the pipeline isn't healthy.
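
For illustration, here is a rough sketch of the intended behaviour with made-up types (the real handler lives in the extension's internal http package and is shown further down in the review thread); the "verbose" query parameter name is an assumption:

package handlers

import (
	"net/http"

	"go.uber.org/zap"
)

// pipelineStatus is an illustrative stand-in for the extension's real
// aggregated status type.
type pipelineStatus struct {
	status     string
	components map[string]string
}

// logNotOk emits one "pipeline not ok" line per unhealthy pipeline and,
// only when the request asks for verbose output, an extra
// "component not ok" line per unhealthy component.
func logNotOk(logger *zap.Logger, r *http.Request, notOk map[string]pipelineStatus) {
	verbose := r.URL.Query().Has("verbose")
	for pipeline, st := range notOk {
		logger.Warn("pipeline not ok",
			zap.String("pipeline.instance.id", pipeline),
			zap.String("pipeline.status", st.status))
		if !verbose {
			continue
		}
		for component, componentStatus := range st.components {
			logger.Warn("component not ok",
				zap.String("component.instance.id", component),
				zap.String("component.status", componentStatus))
		}
	}
}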

Testing

Ran it locally to reproduce the scenario and capture some example logs.

Here are the "concise" logs:

Concise logs sample

{"level":"warn","ts":"2025-03-28T17:29:33.077Z","caller":"http/handlers.go:26","msg":"pipeline not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","pipeline.instance.id":"extensions","pipeline.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:29:33.077Z","caller":"http/handlers.go:26","msg":"pipeline not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","pipeline.instance.id":"pipeline:metrics","pipeline.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:29:48.077Z","caller":"http/handlers.go:26","msg":"pipeline not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","pipeline.instance.id":"extensions","pipeline.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:29:48.077Z","caller":"http/handlers.go:26","msg":"pipeline not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","pipeline.instance.id":"pipeline:metrics","pipeline.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:29:48.077Z","caller":"http/handlers.go:26","msg":"pipeline not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","pipeline.instance.id":"pipeline:logs","pipeline.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:29:48.077Z","caller":"http/handlers.go:26","msg":"pipeline not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","pipeline.instance.id":"pipeline:logs/resource_catalog","pipeline.status":"StatusStarting"}

And here are the verbose ones:

Verbose logs sample

{"level":"warn","ts":"2025-03-28T17:31:54.069Z","caller":"http/handlers.go:26","msg":"pipeline not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","pipeline.instance.id":"extensions","pipeline.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:31:54.069Z","caller":"http/handlers.go:28","msg":"component not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","component.instance.id":"extension:healthcheckv2","component.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:31:54.069Z","caller":"http/handlers.go:28","msg":"component not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","component.instance.id":"extension:pprof","component.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:31:54.070Z","caller":"http/handlers.go:28","msg":"component not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","component.instance.id":"extension:zpages","component.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:31:54.070Z","caller":"http/handlers.go:28","msg":"component not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","component.instance.id":"extension:opamp","component.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:31:54.070Z","caller":"http/handlers.go:28","msg":"component not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","component.instance.id":"extension:k8s_observer","component.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:31:54.071Z","caller":"http/handlers.go:26","msg":"pipeline not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","pipeline.instance.id":"pipeline:logs/resource_catalog","pipeline.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:31:54.071Z","caller":"http/handlers.go:28","msg":"component not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","component.instance.id":"processor:transform/entity-event","component.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:31:54.071Z","caller":"http/handlers.go:28","msg":"component not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","component.instance.id":"processor:resourcedetection/resource_catalog","component.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:31:54.072Z","caller":"http/handlers.go:28","msg":"component not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","component.instance.id":"exporter:coralogix/resource_catalog","component.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:31:54.072Z","caller":"http/handlers.go:28","msg":"component not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","component.instance.id":"processor:batch","component.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:31:54.073Z","caller":"http/handlers.go:28","msg":"component not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","component.instance.id":"processor:resource/metadata","component.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:32:09.069Z","caller":"http/handlers.go:26","msg":"pipeline not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","pipeline.instance.id":"pipeline:logs/resource_catalog","pipeline.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:32:09.069Z","caller":"http/handlers.go:28","msg":"component not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","component.instance.id":"exporter:coralogix/resource_catalog","component.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:32:09.069Z","caller":"http/handlers.go:28","msg":"component not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","component.instance.id":"processor:batch","component.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:32:09.071Z","caller":"http/handlers.go:28","msg":"component not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","component.instance.id":"processor:resource/metadata","component.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:32:09.071Z","caller":"http/handlers.go:28","msg":"component not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","component.instance.id":"processor:filter/workflow","component.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:32:09.074Z","caller":"http/handlers.go:28","msg":"component not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","component.instance.id":"processor:transform/entity-event","component.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:32:09.077Z","caller":"http/handlers.go:28","msg":"component not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","component.instance.id":"processor:resourcedetection/resource_catalog","component.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:32:09.077Z","caller":"http/handlers.go:28","msg":"component not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","component.instance.id":"processor:memory_limiter","component.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:32:09.077Z","caller":"http/handlers.go:28","msg":"component not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","component.instance.id":"receiver:k8sobjects/resource_catalog","component.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:32:09.077Z","caller":"http/handlers.go:26","msg":"pipeline not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","pipeline.instance.id":"pipeline:logs","pipeline.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:32:09.077Z","caller":"http/handlers.go:28","msg":"component not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","component.instance.id":"exporter:coralogix","component.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:32:09.077Z","caller":"http/handlers.go:26","msg":"pipeline not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","pipeline.instance.id":"pipeline:metrics","pipeline.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:32:09.077Z","caller":"http/handlers.go:28","msg":"component not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","component.instance.id":"processor:transform/k8s_attributes","component.status":"StatusStarting"}
{"level":"warn","ts":"2025-03-28T17:32:09.077Z","caller":"http/handlers.go:26","msg":"pipeline not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","pipeline.instance.id":"extensions","pipeline.status":"StatusStarting"}

for pipeline, pipelineStatus := range s.aggregator.FindNotOk(status.Verbosity(verbose)) {
	s.telemetry.Logger.Warn("pipeline not ok", common.HealthLogFields("pipeline", pipeline, pipelineStatus)...)
	for component, componentSt := range pipelineStatus.ComponentStatusMap {
		s.telemetry.Logger.Warn("component not ok", common.HealthLogFields("component", component, componentSt)...)
	}
}
Contributor

Instead of creating all those objects, could you make it walk through the statuses and log directly? Is there a reason to iterate this way?

Member Author

@atoulme this would only work if I created some sort of iterator function in the aggregator. I have to do the copy (and you'll notice I use helpers already present to properly do a deep copy of the objects) because the objects might be modified by a concurrent goroutine, creating a data race. All methods of the aggregator use a mutex to coordinate access to the aggregated statuses.

The downside of the iterator approach is that I might (not totally sure) have to hold the read lock a little longer than when copying the statuses out for a very quick use.

Happy to go for the iterator if you think it's better though.
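
For reference, a rough sketch of what that iterator alternative could look like, with hypothetical type and field names (the real aggregator's types and API differ):

package aggregator

import "sync"

// AggregateStatus and Aggregator are simplified stand-ins, only here to
// illustrate the locking trade-off discussed above.
type AggregateStatus struct {
	Status             string
	ComponentStatusMap map[string]*AggregateStatus
}

type Aggregator struct {
	mu       sync.RWMutex
	statuses map[string]*AggregateStatus
}

// RangeNotOk calls fn for every pipeline whose status is not StatusOK.
// The read lock is held for the whole walk, including while fn runs --
// that's the "hold the lock a little longer" downside, but it avoids
// handing out deep copies of the statuses.
func (a *Aggregator) RangeNotOk(fn func(pipeline string, st *AggregateStatus)) {
	a.mu.RLock()
	defer a.mu.RUnlock()
	for pipeline, st := range a.statuses {
		if st.Status == "StatusOK" {
			continue
		}
		fn(pipeline, st)
	}
}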

Member Author

I'll try to improve the logs anyway. I think I can make them a single line per pipeline that will include a field per component of the pipeline that isn't healthy.
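
Something roughly like this sketch (the data is made up; only the zap usage is real):

package main

import "go.uber.org/zap"

func main() {
	logger, _ := zap.NewProduction()
	defer logger.Sync()

	// Hypothetical statuses for a single unhealthy pipeline.
	pipelineID := "pipeline:logs"
	pipelineStatus := "StatusStarting"
	components := map[string]string{
		"receiver:otlp":   "StatusStarting",
		"processor:batch": "StatusStarting",
		"exporter:otlp":   "StatusStarting",
	}

	// One warning per pipeline, carrying one field per unhealthy component.
	fields := []zap.Field{
		zap.String("pipeline.instance.id", pipelineID),
		zap.String("pipeline.status", pipelineStatus),
	}
	for id, componentStatus := range components {
		fields = append(fields, zap.String(id, componentStatus))
	}
	logger.Warn("pipeline not ok", fields...)
}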

Member Author

@atoulme what do you think of this new and more succinct format for the "verbose" logs, which includes all components inside a pipeline? Besides the fields pipeline.instance.id and pipeline.status to identify the pipeline, it contains a field per component with its status, e.g. "processor:batch":"StatusStarting".

Example simpler verbose logs

{"level":"warn","ts":"2025-04-07T14:12:25.734Z","caller":"http/handlers.go:26","msg":"pipeline not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","pipeline.instance.id":"pipeline:logs/resource_catalog","pipeline.status":"StatusStarting","processor:resourcedetection/resource_catalog":"StatusStarting","exporter:coralogix/resource_catalog":"StatusStarting","processor:batch":"StatusStarting","processor:resource/metadata":"StatusStarting","processor:filter/workflow":"StatusStarting","processor:transform/entity-event":"StatusStarting"}
{"level":"warn","ts":"2025-04-07T14:12:25.734Z","caller":"http/handlers.go:26","msg":"pipeline not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","pipeline.instance.id":"extensions","pipeline.status":"StatusStarting","extension:zpages":"StatusStarting","extension:k8s_observer":"StatusStarting","extension:opamp":"StatusStarting","extension:healthcheckv2":"StatusStarting","extension:pprof":"StatusStarting"}
{"level":"warn","ts":"2025-04-07T14:12:25.735Z","caller":"http/handlers.go:26","msg":"pipeline not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","pipeline.instance.id":"pipeline:metrics","pipeline.status":"StatusStarting","processor:memory_limiter":"StatusStarting","processor:metricstransform/k8s-dashboard":"StatusStarting","processor:transform/prometheus":"StatusStarting","processor:resource/metadata":"StatusStarting","processor:filter/k8s_extra_metrics":"StatusStarting","processor:transform/k8s_attributes":"StatusStarting","processor:resourcedetection/region":"StatusStarting","processor:batch":"StatusStarting","receiver:otlp":"StatusStarting","exporter:coralogix":"StatusStarting","processor:resourcedetection/env":"StatusStarting","processor:transform/k8s-dashboard":"StatusStarting","processor:k8sattributes":"StatusStarting"}
{"level":"warn","ts":"2025-04-07T14:12:25.735Z","caller":"http/handlers.go:26","msg":"pipeline not ok","otelcol.component.id":"healthcheckv2","otelcol.component.kind":"Extension","pipeline.instance.id":"pipeline:logs","pipeline.status":"StatusStarting","processor:memory_limiter":"StatusStarting","exporter:coralogix":"StatusStarting","processor:transform/k8s_attributes":"StatusStarting","processor:resource/kube-events":"StatusStarting","receiver:k8sobjects":"StatusStarting","processor:batch":"StatusStarting","processor:k8sattributes":"StatusStarting","processor:transform/kube-events":"StatusStarting","processor:resource/metadata":"StatusStarting"}

Contributor

I can't really read this, tbh. I get that if I were on call and saw this log, I'd drop it into a text editor, prettify the JSON, and read the details.

Member Author

Totally, @atoulme. At least the information is there now and can be extracted to be viewed in a rich editor with formatting and other features that help visualize JSON. 👍

Member Author

We have two options: a single very long line with lots of information in it, or multiple smaller lines conveying the same information.

@atoulme (Contributor) commented Apr 5, 2025

I don't really get it; you want to log those warnings from the HTTP and gRPC handlers. Those are side effects of calling the functions that report status. They can be triggered by HTTP and gRPC requests and can flood the logs.

The logs only report components that are not "ok"; maybe you mean healthy.

This seems to me like you're trying to interrogate the Collector, but see the information it reports via its logs instead of the handlers' responses.

Do you want to:

  • do this, but in a periodic task that runs every 30s?
  • do this, but read the response with all the information?

Additional thoughts:

  • you are copying and manipulating objects; how about just walking the objects and logging the info directly?
  • this needs tests.

@atoulme marked this pull request as draft on April 7, 2025 04:23
@douglascamata (Member Author) commented Apr 7, 2025

Hey @atoulme, I'm happy to clarify some points for you:

I don't really get it; you want to log those warnings from the HTTP and gRPC handlers. Those are side effects of calling the functions that report status. They can be triggered by HTTP and gRPC requests and can flood the logs.

The health of each component, respecting the verbosity parameter, is already returned in the response body of the HTTP/gRPC handlers.

Also clarifying what these handlers are: they expose healthcheck information aggregated from all the components and pipelines. So keep in mind that a very important use case here is using these endpoints as readiness/liveness probes on a pod.

The logs only report components that are not "ok"; maybe you mean healthy.

Not exactly. "Not ok" could mean "starting", "stopping", or "stopped". I'm only interested in anything that could be one of these three statuses plus the error ones (recoverable, permanent, or fatal). I didn't want to name things differently from what's already present in the healthcheckv2 extension.
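
To make that concrete, a small hypothetical helper reflecting that classification (the status names match the ones in the log samples above):

// isNotOk reports whether a status string should trigger the warning
// logs: anything still starting, shutting down, or in an error state.
func isNotOk(componentStatus string) bool {
	switch componentStatus {
	case "StatusStarting", "StatusStopping", "StatusStopped",
		"StatusRecoverableError", "StatusPermanentError", "StatusFatalError":
		return true
	default:
		return false
	}
}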

This seems to me like you're trying to interrogate the Collector, but see the information it reports via its logs instead of the handlers' responses.

Do you want to:

  • do this, but in a periodic task that runs every 30s?
  • do this, but read the response with all the information?

The point of these changes is to allow anyone who uses healthcheckv2 as a readiness/liveness probe to know what is failing their checks and why. Today you won't know which component or pipeline is having issues, because the information is only present in the response body of the HTTP/gRPC request -- which is blissfully ignored by the k8s probes.

An example: if you use readiness/liveness probes and a pipeline isn't becoming healthy because of a component that also isn't logging why it has issues, you won't know the reason unless you manually send a request to the healthcheck endpoint yourself. I've personally faced this scenario, which is why I wanted to improve it.

It would be great if the k8s probes had a better way of detecting a "reason" for the probes' failure, but this is way out of scope.


This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions bot added the Stale label Apr 22, 2025
@douglascamata (Member Author) commented:

@evan-bradley is OoO until 05/05 (see #39285). Maybe someone else could have a look at this? @mwear?

@douglascamata (Member Author) commented:

Anyone available to have a look at this, please? 🥺

@evan-bradley (Contributor) commented May 20, 2025

@douglascamata sorry for the delayed review.

I see what problem you're trying to solve, but the UX feels a little off for "on-site" debugging (i.e. looking directly at a Collector instance), and the healthcheck extension doesn't seem like the right place if these logs are intended for viewing in a backend.

For debugging the Collector directly, I think working with the information returned by the endpoint makes the most sense, either locally through cURL or by configuring the liveness probes. According to the k8s docs, there's an exec option for liveness and readiness probes; are logs sufficient to leverage that to print debugging information?

For handling this more at a fleet level where this information may end up in a backend, I would prefer:

  1. For the Collector service to be the one to collect and emit status events through an OTel logger/SDK. Users could then toggle and control this logging through the SDK configuration.
  2. For pipeline events to be aggregated inside a receiver then emitted to a Collector pipeline.

@mwear do you have any thoughts as code owner?

@douglascamata (Member Author) commented May 20, 2025

For debugging the Collector directly, I think working with the information returned by the endpoint makes the most sense, either locally through cURL or by configuring the liveness probes. According to the k8s docs, there's an exec option for liveness and readiness probes; are logs sufficient to leverage that to print debugging information?

It's effectively impossible to do any of what you propose, @evan-bradley :/

Setting up the liveness probes (which is what got me to propose these changes in the first place) will only tell you "things are failing, I'm killing this pod", but will not tell you what was failing.

You can't do anything locally through cURL because the official images are based on scratch. And even if you could cURL from inside the Collector pod, it might be killed before you get to read and understand the output (or even before you finish typing the cURL command). You end up racing against your own liveness probe configuration, debugging under the pressure that at any moment the pod will be gone.

If we change the image to be based on something with cURL inside it, then it becomes possible to do something!

  1. For the Collector service to be the one to collect and emit status events through an OTel logger/SDK. Users could then toggle and control this logging through the SDK configuration.
  2. For pipeline events to be aggregated inside a receiver then emitted to a Collector pipeline.

Hummm, I'm not sure if 2 is a good idea. What would you do if the pipeline that you emit to is the one that's not healthy? I believe these logs have to go through a "100% safe" log pipeline.

I think the only way out might be option 1. It has been a while and I lost the contextual knowledge I had on this, so I'm not aware of how feasible or difficult this alternative would be.

@evan-bradley (Contributor) commented:

Setting up the liveness probes (which is what got me to propose these changes in the first place) will only tell you "things are failing, I'm killing this pod", but will not tell you what was failing.

Are you able to see logs on the probes? I would think you can write a script that will print out the response body from the health check.

You can't do anything locally through cURL because the official images are based on scratch

Would it be possible to query the Collector from a local machine, or from another pod in the cluster?

I think the only way out might be option 1. It has been a while and I lost the contextual knowledge I had on this, so I'm not aware of how feasible or difficult this alternative would be.

From our discussion on the 2025-05-21 SIG call today, I agree option 1 is likely our best option. It would most likely involve just printing a debug log from someplace like here with the information in the status event. The SDK would then pick this up and emit it, or it would be available through stderr. There may be more involved, but that's my first impression.
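
As a very rough illustration of what that could look like (the hook point, type, and field names below are assumptions, not the Collector's actual status API):

package service

import "go.uber.org/zap"

// statusEvent is an illustrative stand-in for a component status event.
type statusEvent struct {
	status string
	err    error
}

// onStatusEvent sketches option 1: wherever the service already processes
// status events, emit a debug log so the information also flows out
// through the Collector's own logger/SDK (or stderr).
func onStatusEvent(logger *zap.Logger, instanceID string, ev statusEvent) {
	logger.Debug("component status changed",
		zap.String("component.instance.id", instanceID),
		zap.String("component.status", ev.status),
		zap.Error(ev.err),
	)
}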

@douglascamata (Member Author) commented:

Are you able to see logs on the probes? I would think you can write a script that will print out the response body from the health check.

Can't run a script because the container image is scratch, which doesn't contain a shell, cURL, wget, or even any kind of package manager.

Would it be possible to query the Collector from a local machine, or from another pod in the cluster?

Only if I expose the port (not normal for a healthcheck endpoint) or do a port-forward. Otherwise it's only accessible from other containers in the same pod. So you could use the new kubectl debug feature that takes advantage of ephemeral containers (a recent k8s feature), but when the liveness probe fails all of the containers get restarted, so the race against the clock stays.

From our discussion on the 2025-05-21 SIG call today, I agree option 1 is likely our best option. It would most likely involve just printing a debug log from someplace like here with the information in the status event. The SDK would then pick this up and emit it, or it would be available through stderr. There may be more involved, but that's my first impression.

Yep, agreed. I might have some time to do this, but I can't guarantee it. So I think the best way forward is to capture the need in an issue and then close this PR. Do you agree, @evan-bradley?

@douglascamata (Member Author) commented:

Created issue #40325 to have a record of what we want to implement and included the suggestion from @evan-bradley there. Closing this for now, as this is not the way we want to move forward.
