healthcheckextension "check_collector_pipeline" is always healthy, even when exporter is failing

**Describe the bug**
I'm trying to use the `check_collector_pipeline` feature of the [healthcheckextension](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/healthcheckextension). However, the health check ends up always returning status code 200, even when my exporter is failing.
The "Additional context" section below breaks down the technical cause of this behavior.

**Steps to reproduce**
I built a simple OTel collector:
- [agent-conf.yaml](https://pastebin.com/CKSMkCsM)
- [main.go](https://pastebin.com/tcV87xpY)
- [go.mod](https://pastebin.com/0dmEq4pF)

Then I build a Docker image from that agent (commands omitted because they are specific to my setup).

Then I run that image, `docker attach` to it, and issue `curl` requests to the health endpoint (output linked below).

**What did you expect to see?**
I expected to see status code 500 coming from health checks, because the exporter I'm using is constantly failing (on purpose).

**What did you see instead?**
However, the health check returns status code 500 only for a limited amount of time. Then, it always returns status code 200: https://pastebin.com/uR5X6DAr.

**What version did you use?**
v0.54.0

**What config did you use?**
[agent-conf.yaml](https://pastebin.com/CKSMkCsM)

**Environment**
OS: Debian GNU/Linux rodete, kernel version `5.17.11-1rodete2-amd64`
Compiler(if manually compiled): go 1.18.3

**Additional context**
I investigated the problem and found the technical cause. However, I'm not sure how to fix it properly.

I added more logging to the healthcheckextension: [healthcheckextension.go](https://www.diffchecker.com/SfeRYLL6), [exporter.go](https://www.diffchecker.com/jeKov0gu).
Here are my OpenTelemetry logs coming from the run (beware, those are very verbose): https://pastebin.com/F2ZJQhAf.

Notice that the HealthCheckExtension's internal queue `exporterFailureQueue` grows at first (log message `HC queue size: 0`, then 1, 2, ..., 6). Then it becomes empty (size 0) or contains one element (log message ```exiting `rotate` function, queue size is 1```).
The extension decides whether the exporter is healthy based on the size of this queue. Since the queue holds at most one element from then on, the health check returns 200 OK all the time.
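
To make the decision rule concrete, here is a minimal sketch of a health handler that maps the queue size to an HTTP status. It is not the extension's code: `pipelineStatus`, `healthHandler`, and the threshold value of 5 are hypothetical names and numbers chosen for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// pipelineStatus maps the number of queued exporter failures to an HTTP status.
// threshold is an assumed, configurable limit (hypothetical in this sketch).
func pipelineStatus(queuedFailures, threshold int) int {
	if queuedFailures > threshold {
		return http.StatusInternalServerError
	}
	return http.StatusOK
}

// healthHandler is a hypothetical handler; queueLen stands in for reading
// the length of the real failure queue.
func healthHandler(queueLen func() int, threshold int) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(pipelineStatus(queueLen(), threshold))
	}
}

func main() {
	// Queue sizes of 0 and 1 are what the logs show once the bug kicks in.
	for _, n := range []int{0, 1, 6} {
		rec := httptest.NewRecorder()
		req := httptest.NewRequest(http.MethodGet, "/", nil)
		healthHandler(func() int { return n }, 5)(rec, req)
		fmt.Println(n, rec.Code)
	}
}
```

With a queue that never holds more than one element, a rule of this shape can only ever answer 200, regardless of how often the exporter fails.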

The exporter itself was constantly failing (log messages containing `Exporting failed.`), but the queue was not growing. Why? Take, for example, the failure at `2022-06-28T14:35:48.345Z	error	exporterhelper/queued_retry.go:149	Exporting failed.`. One element was added to the `exporterFailureQueue`, with the following timestamps:
```
 Start: (time.Time) 2022-06-28 14:33:18.342453932 +0000 UTC m=+0.039182067,
 End: (time.Time) 2022-06-28 14:35:48.306595188 +0000 UTC m=+150.003323321,
```
Then, the extension [compared `Start` with the current time](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/extension/healthcheckextension/exporter.go#L64) and removed the element from the `exporterFailureQueue` (the current time at that point was `2022-06-28 14:35:48.344996512 +0000`).

Overall, notice that the `Start` of every element of `exporterFailureQueue` is constant: `14:33:18`. It corresponds to the **start time of the collector**.
However, `exporter.go` compares `Start` with the current time as if **`Start` were the time of the exporter failure**.
Thus, as long as the collector was started less than `health_check::check_collector_pipeline::interval` ago, the health check works fine. Once the collector has been running for longer than `interval`, the health check always returns 200 OK.

Solutions that I can imagine:
- compare `vd.End` instead of `vd.Start` with the current time [here](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/extension/healthcheckextension/exporter.go#L64)
- populate a more accurate `Start` time [in obsreportconfig.go](https://github.com/open-telemetry/opentelemetry-collector/blob/release/v0.54.x/internal/obsreportconfig/obsreportconfig.go#L94-L99)

healthcheckextension "check_collector_pipeline" is always healthy, even when exporter is failing #11780