healthcheckextension "check_collector_pipeline" is always healthy, even when exporter is failing #11780

Open
@ItsLastDay

Description

Describe the bug
I'm trying to use the "check_collector_pipeline" feature of healthcheckextension. However, the health check always returns status code 200, even when my exporter is failing.
The "Additional context" section breaks down the technical cause of this behavior.

Steps to reproduce
I built a simple OTel collector:

Then I built a Docker image out of that agent (commands not shown here, because they are specific to my setup).

Then I ran that image, attached to it with docker attach, and issued curl requests to the health endpoint (shown below).

What did you expect to see?
I expected to see status code 500 coming from health checks, because the exporter I'm using is constantly failing (on purpose).

What did you see instead?
However, the health check returns status code 500 only for a limited amount of time. Then, it always returns status code 200: https://pastebin.com/uR5X6DAr.

What version did you use?
v0.54.0

What config did you use?
agent-conf.yaml

Environment
OS: Debian GNU/Linux rodete, kernel version 5.17.11-1rodete2-amd64
Compiler(if manually compiled): go 1.18.3

Additional context
I investigated the problem and found the technical cause. However, I'm not sure how to fix it properly.

I added more logging to the healthcheckextension: healthcheckextension.go, exporter.go.
Here are my OpenTelemetry logs coming from the run (beware, those are very verbose): https://pastebin.com/F2ZJQhAf.

Notice that the HealthCheckExtension's internal queue exporterFailureQueue grows at first (log messages HC queue size: 0, then 1, 2, ..., 6). Then it becomes empty (size 0) or contains a single element (log message exiting `rotate` function, queue size is 1).
The extension decides whether the exporter is healthy based on the size of this queue. Since the queue holds at most one element from then on, the health check returns 200 OK all the time.
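
To make the queue-based decision concrete, here is a minimal sketch in Go. It is not the extension's actual code: the names failureQueue, record, rotate, statusCode and simulate are hypothetical, and the 5-minute interval and threshold of 5 are assumptions chosen for illustration. It only shows the general idea of "drop old failures, then compare the queue length with a threshold".

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// failureQueue is a hypothetical stand-in for exporterFailureQueue:
// it stores one timestamp per recorded exporter failure.
type failureQueue struct {
	failures []time.Time
}

// record appends a failure observed at time t.
func (q *failureQueue) record(t time.Time) {
	q.failures = append(q.failures, t)
}

// rotate drops every failure older than interval, measured back from now.
func (q *failureQueue) rotate(now time.Time, interval time.Duration) {
	cutoff := now.Add(-interval)
	kept := q.failures[:0]
	for _, t := range q.failures {
		if !t.Before(cutoff) {
			kept = append(kept, t)
		}
	}
	q.failures = kept
}

// statusCode maps the queue length to an HTTP status: more failures
// than the threshold means unhealthy (500), otherwise healthy (200).
func statusCode(queueLen, threshold int) int {
	if queueLen > threshold {
		return http.StatusInternalServerError
	}
	return http.StatusOK
}

// simulate records n failures spaced 10 seconds apart, rotates the queue
// with an assumed 5-minute interval, and returns the resulting status.
func simulate(n, threshold int) int {
	start := time.Date(2022, 6, 28, 14, 33, 18, 0, time.UTC)
	q := &failureQueue{}
	now := start
	for i := 0; i < n; i++ {
		now = start.Add(time.Duration(i) * 10 * time.Second)
		q.record(now)
	}
	q.rotate(now, 5*time.Minute)
	return statusCode(len(q.failures), threshold)
}

func main() {
	fmt.Println("1 failure in queue: ", simulate(1, 5))
	fmt.Println("6 failures in queue:", simulate(6, 5))
}
```

Under these assumptions, a queue that never holds more than one element can never exceed the threshold, which matches the constant 200 OK observed above.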

The exporter itself was constantly failing (log messages containing Exporting failed.), but the queue was not growing. Why? Take for example the failure at 2022-06-28T14:35:48.345Z error exporterhelper/queued_retry.go:149 Exporting failed.. One element was added to the exporterFailureQueue, with the following timestamps:

 Start: (time.Time) 2022-06-28 14:33:18.342453932 +0000 UTC m=+0.039182067,
 End: (time.Time) 2022-06-28 14:35:48.306595188 +0000 UTC m=+150.003323321,

Then, the extension compared Start with the current time, and removed the element from the exporterFailureQueue (the current time at that point was 2022-06-28 14:35:48.344996512 +0000).

Overall, notice that the Start of every element of exporterFailureQueue is constant: 14:33:18. It corresponds to the start time of the collector.
However, exporter.go compares Start with the current time as if Start were the time of the exporter failure.
Thus, as long as the collector was started less than health_check::check_collector_pipeline::interval ago, the health check works fine. Once the collector has been running longer than the interval, every new failure is removed from the queue right away and the health check always returns 200 OK.
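
The effect can be reproduced with the timestamps quoted above. The following Go sketch is an illustration only: the failure type and the expired and issueExample helpers are hypothetical, and the 2-minute interval is an assumption (the actual configured value is not shown here). It checks whether an element would be dropped when its Start is compared with the current time and, for contrast only, when its End is compared instead.

```go
package main

import (
	"fmt"
	"time"
)

// failure mirrors the two timestamps logged for a queued element.
type failure struct {
	Start time.Time // in the logs above, always the collector start time
	End   time.Time // close to the moment the failure was recorded
}

// expired reports whether ts lies before now-interval, that is,
// whether an age-based rotation would drop an element stamped ts.
func expired(ts, now time.Time, interval time.Duration) bool {
	return ts.Before(now.Add(-interval))
}

// issueExample evaluates the element from the logs above against an
// assumed 2-minute interval, once by Start and once by End.
func issueExample() (byStart, byEnd bool) {
	f := failure{
		Start: time.Date(2022, 6, 28, 14, 33, 18, 342453932, time.UTC),
		End:   time.Date(2022, 6, 28, 14, 35, 48, 306595188, time.UTC),
	}
	now := time.Date(2022, 6, 28, 14, 35, 48, 344996512, time.UTC)
	interval := 2 * time.Minute // assumption, not taken from the real config
	return expired(f.Start, now, interval), expired(f.End, now, interval)
}

func main() {
	byStart, byEnd := issueExample()
	fmt.Println("dropped when comparing Start:", byStart)
	fmt.Println("dropped when comparing End:  ", byEnd)
}
```

With these values the element is about 150 seconds old when judged by Start, so it is dropped immediately, whereas judged by End it is only about 38 milliseconds old and would stay in the queue. This contrast is meant to illustrate the diagnosis, not to prescribe a particular fix.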

Solutions that I can imagine:

Labels: bug, extension/healthcheck, never stale, priority:p2
