Component(s)
exporter/prometheusremotewrite
Is your feature request related to a problem? Please describe.
The prometheusremotewrite exporter currently lacks detailed metrics and logs for export failures. When issues like timeouts or authorization errors occur, users often encounter generic error messages, making it challenging to diagnose and address the root causes effectively.
Problem:
- No clear metrics when the queue isn't full but sends fail
- Limited visibility into failure types
{"level":"error",
"timestamp":"2025-04-25T19:38:22.027Z",
"caller":"exporterhelper/queue_sender.go:90",
"message":"Exporting failed. Dropping data.",
"kind":"exporter",
"data_type":"metrics",
"name":"prometheusremotewrite",
"error":"Permanent error: Permanent error: context deadline exceeded",
"dropped_items":5,
"stack":"go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender ...
}
Describe the solution you'd like
We propose enhancing the prometheusremotewrite (PRW) exporter to provide more granular metrics and logs for export failures.
- Granular Failure Metrics:
  - Introduce a metric prw_export_failures_total with a reason label to categorize failure types:
    - HTTP status code families (4xx, 5xx)
    - Specific error types (e.g., "out of order sample", "timeout", "authorization")
  - Add prw_export_retries_total to count retry attempts.
- Improved Logging:
  - Structured error messages with clear failure categorization
  - Include relevant debugging details (status codes, error messages)
These enhancements would significantly improve observability and help users quickly identify and resolve issues in their data export pipeline, regardless of the specific remote write endpoint they're using (e.g., Prometheus, Cortex, Thanos, or cloud-based solutions).
Describe alternatives you've considered
No response
Additional context
No response