Negative Counts Exported for OTLP Summary Metrics due to Incorrect Delta Calculation
Component: github.com/DataDog/opentelemetry-mapping-go/pkg/otlp/metrics/metrics_translator.go (dependency of datadogexporter in opentelemetry-collector-contrib)
Observed Version: opentelemetry-collector-contrib v0.122.0 (containing a corresponding version of opentelemetry-mapping-go)
Symptoms:
Metrics exported to Datadog that originate from OpenTelemetry (OTLP) Summary types exhibit incorrect, large negative values for their .count component (e.g., envoy.cluster.upstream_rq_time.count). This typically occurs immediately following a reset of the underlying cumulative counter in the source application (e.g., an Envoy pod restart).
Diagnosis:
Enabled the debug exporter in the OpenTelemetry Collector alongside the datadog exporter.
Observed the OTLP metric stream using the debug exporter logs.
Confirmed that the OTLP SummaryDataPoint for the affected metric (e.g., envoy.cluster.upstream_rq_time) consistently shows a positive Count value before being processed by the exporter.
Traced the code path from datadogexporter/metrics_exporter.go into the opentelemetry-mapping-go library dependency.
Identified that the mapSummaryMetrics function within pkg/otlp/metrics/metrics_translator.go handles the conversion of OTLP Summary points (Link to function).
Found that this function incorrectly calls the ttlCache.Diff method to calculate the delta for the Summary's Count field (Link to line). The comments indicate Diff is intended for non-monotonic metrics.
The ttlCache.Diff method calculates newValue - oldValue but lacks the specific logic found in ttlCache.MonotonicDiff to correctly identify and handle counter resets (where newValue might be less than oldValue, resulting in a negative delta).
This incorrectly calculated negative delta is then passed to the exporter's consumer and ultimately sent to Datadog.
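The difference between the two delta strategies described in the steps above can be sketched in isolation. This is a minimal illustration, not the library's code: plainDiff and resetAwareDiff are hypothetical names, and the sample counts are made up to mimic a restart.

```go
package main

import "fmt"

// plainDiff is a hypothetical stand-in for a non-monotonic delta: it
// subtracts blindly, so a counter reset produces a negative value.
func plainDiff(prev, cur float64) float64 {
	return cur - prev
}

// resetAwareDiff is a hypothetical stand-in for a monotonic delta. It reports
// reset=true when the cumulative value went backwards, so the caller can skip
// the point instead of exporting a negative delta.
func resetAwareDiff(prev, cur float64) (dx float64, reset bool) {
	if cur < prev {
		return 0, true
	}
	return cur - prev, false
}

func main() {
	// Made-up cumulative counts around a restart: 100 -> 150 -> (restart) 5 -> 20.
	counts := []float64{100, 150, 5, 20}
	for i := 1; i < len(counts); i++ {
		prev, cur := counts[i-1], counts[i]
		dx, reset := resetAwareDiff(prev, cur)
		fmt.Printf("prev=%v cur=%v plain=%v resetAware=%v reset=%v\n",
			prev, cur, plainDiff(prev, cur), dx, reset)
	}
}
```

On the 150 -> 5 step the plain subtraction yields -145, which is the kind of negative spike reported here, while the reset-aware variant flags the point so it can be dropped.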
Root Cause:
The mapSummaryMetrics function in metrics_translator.go uses ttlCache.Diff (Link) instead of the appropriate ttlCache.MonotonicDiff for the cumulative Count field of OTLP Summary metrics. This leads to negative deltas being generated and exported when the source cumulative counter undergoes a reset.
Code Snippet (Problematic Area in metrics_translator.go):
// count and sum are increasing; we treat them as cumulative monotonic sums.
{
	countDims := pointDims.WithSuffix("count")
	// Incorrectly uses Diff for a monotonic cumulative value
	// Link: https://github.com/DataDog/opentelemetry-mapping-go/blob/main/pkg/otlp/metrics/metrics_translator.go#L584
	if dx, ok := t.prevPts.Diff(countDims, startTs, ts, float64(p.Count())); ok && !t.isSkippable(countDims.name, dx) {
		consumer.ConsumeTimeSeries(ctx, countDims, Count, ts, dx)
	}
}
Proposed Fix (in metrics_translator.go):
Replace the call to Diff with MonotonicDiff (compare with line 584) and adjust the consumption logic based on its return values:
// count and sum are increasing; we treat them as cumulative monotonic sums.
{
	countDims := pointDims.WithSuffix("count")
	// Use MonotonicDiff, which correctly handles resets for cumulative monotonic values.
	if dx, firstPoint, dropPoint := t.prevPts.MonotonicDiff(countDims, startTs, ts, float64(p.Count())); !dropPoint {
		// Only consume if it's not the first point after a start/reset and the value is valid.
		// 'firstPoint' is true on the actual first point OR after a reset (where dx might be < 0).
		// Datadog counts represent deltas, so we skip the value on the first point/reset.
		if !firstPoint && !t.isSkippable(countDims.name, dx) {
			consumer.ConsumeTimeSeries(ctx, countDims, Count, ts, dx)
		}
	}
	// No explicit logging needed for dropPoint, MonotonicDiff handles it internally.
}
Impact:
This bug affects users of the datadogexporter relying on the default translation for OTLP Summary metrics, causing misleading negative spikes in .count metrics within Datadog after source counter resets.