Description
Describe the bug
I'm trying to use the "check_collector_pipeline" feature of healthcheckextension. However, the health check keeps returning status code 200 even when my exporter is failing.
In the "Additional context" section I break down the technical cause of this behavior.
Steps to reproduce
I built a simple OTel collector:
Then I build a Docker image out of that agent (not showing commands here, because they are specific to my setup).
Then I run that image, docker attach to it, and issue curl requests to the health endpoint (shown below).
What did you expect to see?
I expected to see status code 500 coming from health checks, because the exporter I'm using is constantly failing (on purpose).
What did you see instead?
However, the health check returns status code 500 only for a limited amount of time. Then, it always returns status code 200: https://pastebin.com/uR5X6DAr.
What version did you use?
v0.54.0
What config did you use?
agent-conf.yaml
Environment
OS: Debian GNU/Linux rodete, kernel version 5.17.11-1rodete2-amd64
Compiler (if manually compiled): go 1.18.3
Additional context
I investigated the problem and found the technical cause. However, I'm not sure how to fix it properly.
I added more logging to the healthcheckextension: healthcheckextension.go, exporter.go.
Here are my OpenTelemetry logs coming from the run (beware, those are very verbose): https://pastebin.com/F2ZJQhAf.
Notice that the HealthCheckExtension's internal queue exporterFailureQueue is growing at first (log message "HC queue size: 0", then 1, 2, ..., 6). Then it becomes empty (size 0) or contains one element (log message "exiting `rotate` function, queue size is 1").
The extension itself decides whether exporter is healthy based on the size of the queue. Since it has <= 1 elements all the time, the health check returns 200 OK all the time.
The exporter itself was constantly failing (log messages containing "Exporting failed."), but the queue was not growing. Why? Take for example the failure at "2022-06-28T14:35:48.345Z error exporterhelper/queued_retry.go:149 Exporting failed.". There was one element added to the exporterFailureQueue, with the following timestamps:
Start: (time.Time) 2022-06-28 14:33:18.342453932 +0000 UTC m=+0.039182067,
End: (time.Time) 2022-06-28 14:35:48.306595188 +0000 UTC m=+150.003323321,
Then, the extension compared Start with the current time, and removed the element from the exporterFailureQueue (the current time at that point was 2022-06-28 14:35:48.344996512 +0000).
Overall, notice that the Start of every element of exporterFailureQueue is constant: 14:33:18. It corresponds to the start time of the collector.
However, exporter.go compares Start with the current time as if Start were the time of the exporter failure.
Thus, the health check works fine only while the collector's start time is less than health_check::check_collector_pipeline::interval before the current time. Once the collector has been running for longer than the interval, every queued element is treated as expired and the health check always returns 200 OK.
Solutions that I can imagine:
- compare vd.End instead of vd.Start with current time here
- populate more accurate Start time in obsreportconfig.go