Negative Counts Exported for OTLP Summary Metrics due to Incorrect Delta Calculation #617

Open
WynnD opened this issue Apr 24, 2025 · 1 comment

WynnD commented Apr 24, 2025

Component: github.com/DataDog/opentelemetry-mapping-go/pkg/otlp/metrics/metrics_translator.go (dependency of datadogexporter in opentelemetry-collector-contrib)

Observed Version: opentelemetry-collector-contrib v0.122.0 (containing a corresponding version of opentelemetry-mapping-go)

Symptoms:

Metrics exported to Datadog that originate from OpenTelemetry (OTLP) Summary types exhibit incorrect, large negative values for their .count component (e.g., envoy.cluster.upstream_rq_time.count). This typically occurs immediately following a reset of the underlying cumulative counter in the source application (e.g., an Envoy pod restart).

Diagnosis:

  1. Enabled the debug exporter in the OpenTelemetry Collector alongside the datadog exporter.
  2. Observed the OTLP metric stream using the debug exporter logs.
  3. Confirmed that the OTLP SummaryDataPoint for the affected metric (e.g., envoy.cluster.upstream_rq_time) consistently shows a positive Count value before being processed by the exporter.
  4. Traced the code path from datadogexporter/metrics_exporter.go into the opentelemetry-mapping-go library dependency.
  5. Identified that the mapSummaryMetrics function within pkg/otlp/metrics/metrics_translator.go handles the conversion of OTLP Summary points.
  6. Found that this function calls the ttlCache.Diff method to calculate the delta for the Summary's Count field (metrics_translator.go#L584 on main), even though the code comments indicate that Diff is intended for non-monotonic metrics.
  7. The ttlCache.Diff method calculates newValue - oldValue but lacks the specific logic found in ttlCache.MonotonicDiff to correctly identify and handle counter resets (where newValue might be less than oldValue, resulting in a negative delta).
  8. This incorrectly calculated negative delta is then passed to the exporter's consumer and ultimately sent to Datadog.
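To make steps 6 to 8 concrete, here is a minimal sketch of the two delta strategies applied to a cumulative Count series that resets once. The helpers plainDiff and resetAwareDiff are hypothetical simplifications written for this illustration; they are not the library's ttlCache.Diff or ttlCache.MonotonicDiff, which also take dimensions and timestamps. The sample values are made up.

```go
package main

import "fmt"

// plainDiff mirrors the behavior described above for Diff: a bare
// subtraction with no reset detection. Hypothetical helper.
func plainDiff(prev, cur float64) float64 {
	return cur - prev
}

// resetAwareDiff sketches the reset handling attributed to MonotonicDiff:
// if the cumulative value went down, report a reset instead of a delta.
// Hypothetical helper.
func resetAwareDiff(prev, cur float64) (dx float64, reset bool) {
	if cur < prev {
		return 0, true
	}
	return cur - prev, false
}

func main() {
	// Made-up cumulative Count values; the drop from 1500 to 12 models
	// a source restart (for example, an Envoy pod restart).
	counts := []float64{1000, 1250, 1500, 12, 40}
	for i := 1; i < len(counts); i++ {
		prev, cur := counts[i-1], counts[i]
		dx, reset := resetAwareDiff(prev, cur)
		fmt.Printf("prev=%v cur=%v plain=%v resetAware=%v reset=%v\n",
			prev, cur, plainDiff(prev, cur), dx, reset)
	}
}
```

At the reset, the plain subtraction yields 12 - 1500 = -1488, which is the kind of large negative value reported in Symptoms, while the reset-aware variant flags the point instead of producing a delta.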

Root Cause:

The mapSummaryMetrics function in metrics_translator.go uses ttlCache.Diff instead of the appropriate ttlCache.MonotonicDiff for the cumulative Count field of OTLP Summary metrics. Because Diff performs no reset detection, it produces a negative delta whenever the source cumulative counter resets, and that delta is exported unchanged.

Code Snippet (Problematic Area in metrics_translator.go):

// count and sum are increasing; we treat them as cumulative monotonic sums.
{
    countDims := pointDims.WithSuffix("count")
    // Incorrectly uses Diff for a monotonic cumulative value
    // Link: https://github.com/DataDog/opentelemetry-mapping-go/blob/main/pkg/otlp/metrics/metrics_translator.go#L584
    if dx, ok := t.prevPts.Diff(countDims, startTs, ts, float64(p.Count())); ok && !t.isSkippable(countDims.name, dx) {
        consumer.ConsumeTimeSeries(ctx, countDims, Count, ts, dx)
    }
}

Proposed Fix (in metrics_translator.go):

Replace the call to Diff with MonotonicDiff (compare with line 584) and adjust the consumption logic based on its return values:

// count and sum are increasing; we treat them as cumulative monotonic sums.
{
    countDims := pointDims.WithSuffix("count")
    // Use MonotonicDiff, which correctly handles resets for cumulative monotonic values.
    if dx, firstPoint, dropPoint := t.prevPts.MonotonicDiff(countDims, startTs, ts, float64(p.Count())); !dropPoint {
        // Only consume if it's not the first point after a start/reset and the value is valid.
        // 'firstPoint' is true on the actual first point OR after a reset (where dx might be < 0).
        // Datadog counts represent deltas, so we skip the value on the first point/reset.
        if !firstPoint && !t.isSkippable(countDims.name, dx) {
            consumer.ConsumeTimeSeries(ctx, countDims, Count, ts, dx)
        }
    }
    // No explicit logging needed for dropPoint, MonotonicDiff handles it internally.
}
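The consumption logic in the proposed fix can be exercised in isolation with a small simulation. This is a sketch under stated assumptions: sketchCache is a hypothetical stand-in for the translator's point cache, its MonotonicDiff only imitates the three return values used above (treating a reset like a first point, as the comment in the fix describes), timestamps are not modeled so dropPoint is never set, and the isSkippable check is omitted.

```go
package main

import "fmt"

// sketchCache is a hypothetical stand-in for the translator's point cache.
// It keeps only the last cumulative value per series name.
type sketchCache struct {
	last map[string]float64
}

func newSketchCache() *sketchCache {
	return &sketchCache{last: make(map[string]float64)}
}

// MonotonicDiff imitates the three return values used in the proposed fix.
// Assumption: a reset is reported as firstPoint=true. dropPoint is always
// false here because this sketch does not model timestamps.
func (c *sketchCache) MonotonicDiff(name string, val float64) (dx float64, firstPoint, dropPoint bool) {
	prev, seen := c.last[name]
	c.last[name] = val
	if !seen || val < prev {
		return 0, true, false
	}
	return val - prev, false, false
}

// emitted applies the same nesting of checks as the proposed fix and
// returns the deltas that would be handed to the consumer.
func emitted(counts []float64) []float64 {
	cache := newSketchCache()
	var out []float64
	for _, v := range counts {
		if dx, firstPoint, dropPoint := cache.MonotonicDiff("upstream_rq_time.count", v); !dropPoint {
			if !firstPoint {
				out = append(out, dx)
			}
		}
	}
	return out
}

func main() {
	// Made-up cumulative counts with one reset (1500 -> 12).
	fmt.Println(emitted([]float64{1000, 1250, 1500, 12, 40}))
}
```

With this input the simulation emits three deltas (250, 250, 28): the very first point and the point at the reset are skipped, so no negative value reaches the consumer.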

Impact:

This bug affects users of the datadogexporter relying on the default translation for OTLP Summary metrics, causing misleading negative spikes in .count metrics within Datadog after source counter resets.

WynnD commented Apr 25, 2025

Datadog support ticket also filed: https://help.datadoghq.com/hc/en-us/requests/2116120
