
[processor/tailsampling] invert_match not given precedence when inside and policy #33656


Open
jamesrwhite opened this issue Jun 19, 2024 · 8 comments
Labels: bug (Something isn't working), processor/tailsampling (Tail sampling processor)

Comments

jamesrwhite (Contributor) commented Jun 19, 2024

Component(s)

processor/tailsampling

What happened?

Description

From my understanding of the tailsamplingprocessor README, whenever an "inverted not sample" decision is made, that trace should not be sampled, regardless of any other policies. Specifically it states:

Each policy will result in a decision, and the processor will evaluate them to make a final decision:

  • When there's an "inverted not sample" decision, the trace is not sampled;
  • When there's a "sample" decision, the trace is sampled;
  • When there's an "inverted sample" decision and no "not sample" decisions, the trace is sampled;
  • In all other cases, the trace is NOT sampled

The "inverted not sample" decision having the highest precedence seems to be confirmed by how this is actually implemented here.

Based upon that I have been trying to put together a policy similar to the following:

  • If a trace has a certain attribute AND it only contains 1 span → do not sample it
  • For all other traces → sample them

In practice our policy is more complicated than this but I've simplified it as much as possible for the purpose of explaining this issue.

The policy looks like this in config:

- name: ignore-single-span-traces-from-certain-jobs
  type: and
  and:
    and_sub_policy:
      - name: not-boring-job
        type: string_attribute
        string_attribute:
          key: job
          values: ["boring"]
          invert_match: true
      - name: at-least-two-spans
        type: span_count
        span_count:
          min_spans: 2
- name: always-sample
  type: always_sample

What I'm finding in practice, though, is that traces with the job attribute set to boring that contain only a single span are still being sampled. Unless I'm misunderstanding how the tail sampling processor works, this seems like a bug to me.

I debugged this a bit by adding several extra log statements into the processor to see what decisions it was making for each policy, and I believe the issue lies in how AND policies handle not-sampled decisions here.

As you can see, regardless of whether the decision is NotSampled or InvertNotSampled, it always returns NotSampled. This is problematic because a NotSampled decision won't take precedence over a Sampled decision, so any InvertNotSampled decision that occurs within an AND policy is effectively downgraded to a NotSampled decision.
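To make the downgrade concrete, here is a minimal, self-contained sketch of the two pieces of logic involved. The `Decision` constants and helper names are my own simplifications for illustration, not the processor's actual API:

```go
package main

import "fmt"

// Decision models the sampling decisions discussed in this issue.
type Decision int

const (
	NotSampled Decision = iota
	Sampled
	InvertNotSampled
	InvertSampled
)

// buggyAnd models the pre-fix AND evaluator: any non-sampled sub-decision
// collapses to NotSampled, losing the InvertNotSampled signal.
func buggyAnd(subs []Decision) Decision {
	for _, d := range subs {
		if d == NotSampled || d == InvertNotSampled {
			return NotSampled // downgrade: InvertNotSampled is lost here
		}
	}
	return Sampled
}

// fixedAnd returns the sub-decision itself, preserving InvertNotSampled.
func fixedAnd(subs []Decision) Decision {
	for _, d := range subs {
		if d == NotSampled || d == InvertNotSampled {
			return d
		}
	}
	return Sampled
}

// finalDecision applies the documented precedence across all policies.
func finalDecision(policies []Decision) Decision {
	seen := map[Decision]bool{}
	for _, d := range policies {
		seen[d] = true
	}
	switch {
	case seen[InvertNotSampled]:
		return NotSampled
	case seen[Sampled]:
		return Sampled
	case seen[InvertSampled] && !seen[NotSampled]:
		return Sampled
	}
	return NotSampled
}

func main() {
	// Sub-decisions for a single-span trace with job=boring:
	// string_attribute with invert_match=true -> InvertNotSampled,
	// span_count(min_spans: 2) -> NotSampled.
	subs := []Decision{InvertNotSampled, NotSampled}

	// With the buggy AND, the always_sample policy's Sampled decision wins.
	fmt.Println(finalDecision([]Decision{buggyAnd(subs), Sampled}) == Sampled) // true

	// With the fix, InvertNotSampled survives and takes precedence.
	fmt.Println(finalDecision([]Decision{fixedAnd(subs), Sampled}) == NotSampled) // true
}
```

Running this shows the buggy AND lets always_sample win, while the fixed AND lets the inverted decision take precedence as documented.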

If it's useful for confirming this behaviour, these were the extra log statements I added:

diff --git a/processor/tailsamplingprocessor/processor.go b/processor/tailsamplingprocessor/processor.go
index b0a5850400..edf36cf931 100644
--- a/processor/tailsamplingprocessor/processor.go
+++ b/processor/tailsamplingprocessor/processor.go
@@ -256,6 +256,10 @@ func (tsp *tailSamplingSpanProcessor) samplingPolicyOnTick() {
    trace.DecisionTime = time.Now()

    decision := tsp.makeDecision(id, trace, &metrics)
+   tsp.logger.Info("Tail Sampling Decision",
+     zap.Int64("spans", trace.SpanCount.Load()),
+     zap.Int32("sampled", int32(decision)),
+   )
    tsp.telemetry.ProcessorTailSamplingSamplingDecisionTimerLatency.Record(tsp.ctx, int64(time.Since(startTime)/time.Microsecond))
    tsp.telemetry.ProcessorTailSamplingSamplingTraceDroppedTooEarly.Add(tsp.ctx, metrics.idNotFoundOnMapCount)
    tsp.telemetry.ProcessorTailSamplingSamplingPolicyEvaluationError.Add(tsp.ctx, metrics.evaluateErrorCount)
@@ -309,6 +313,10 @@ func (tsp *tailSamplingSpanProcessor) makeDecision(id pcommon.TraceID, trace *sa
        tsp.telemetry.ProcessorTailSamplingCountSpansSampled.Add(ctx, trace.SpanCount.Load(), p.attribute, decisionToAttribute[decision])
      }

+     tsp.logger.Info("Sampling policy decision",
+       zap.Int32("decision", int32(decision)),
+       zap.String("policy", p.name),
+     )
      samplingDecision[decision] = true
    }
  }
@@ -317,10 +325,13 @@ func (tsp *tailSamplingSpanProcessor) makeDecision(id pcommon.TraceID, trace *sa
  switch {
  case samplingDecision[sampling.InvertNotSampled]:
    finalDecision = sampling.NotSampled
+   tsp.logger.Info("Setting finalDecision to NotSampled via InvertNotSampled")
  case samplingDecision[sampling.Sampled]:
    finalDecision = sampling.Sampled
+   tsp.logger.Info("Setting finalDecision to Sampled via Sampled")
  case samplingDecision[sampling.InvertSampled] && !samplingDecision[sampling.NotSampled]:
    finalDecision = sampling.Sampled
+   tsp.logger.Info("Setting finalDecision to Sampled via InvertSampled && !NotSampled")
  }

  return finalDecision

I believe the fix for this would be to return the decision itself, so that either NotSampled or InvertNotSampled is propagated as appropriate. I have tested this and it appears to fix the problem.

I've made those changes in a branch here: main...jamesrwhite:opentelemetry-collector-contrib:fix-and-policy-invert-not-sampled.

I'm happy to open a PR with the above changes if you agree that's the correct fix, also if you need any more information please let me know.

Thanks!

Steps to Reproduce

  • Run the collector with the attached config
  • Run the sample application:

from opentelemetry import trace

tracer = trace.get_tracer('worker')

print('Starting worker..')
with tracer.start_as_current_span('worker') as span:
    span.set_attribute('job', 'boring')
    print('Done')

$ OTEL_SERVICE_NAME=service.worker opentelemetry-instrument python app.py

  • Review the collector logs and see that the trace is sampled

Expected Result

Trace is not sampled.

Actual Result

Trace is sampled.

Collector version

0.102.0-dev

Environment information

Environment

OS: MacOS 14.5
Go: 1.21.0
Python: 3.12.4

❯ go env
GO111MODULE=''
GOARCH='arm64'
GOBIN='/Users/james.white/.local/share/mise/installs/go/1.21.0/bin'
GOCACHE='/Users/james.white/Library/Caches/go-build'
GOENV='/Users/james.white/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/james.white/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='darwin'
GOPATH='/Users/james.white/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/Users/james.white/.local/share/mise/installs/go/1.21.0/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/Users/james.white/.local/share/mise/installs/go/1.21.0/go/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.21.0'
GCCGO='gccgo'
AR='ar'
CC='clang'
CXX='clang++'
CGO_ENABLED='1'
GOMOD='/Users/james.white/Code/otel/opentelemetry-collector-contrib-fork/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/96/_bvls_n53qg78rl760qw21lm0000gq/T/go-build997096154=/tmp/go-build -gno-record-gcc-switches -fno-common'

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: :4317
      http:
        endpoint: :4318

exporters:
  debug:
    verbosity: detailed

processors:
  tail_sampling:
    policies:
      - name: ignore-single-span-traces-from-certain-jobs
        type: and
        and:
          and_sub_policy:
            - name: not-boring-job
              type: string_attribute
              string_attribute:
                key: job
                values: ["boring"]
                invert_match: true
            - name: at-least-two-spans
              type: span_count
              span_count:
                min_spans: 2
      - name: always-sample
        type: always_sample

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [debug]

Log output

$ make run
cd ./cmd/otelcontribcol && GO111MODULE=on go run --race . --config ../../local/config.yaml
# github.com/open-telemetry/opentelemetry-collector-contrib/cmd/otelcontribcol
ld: warning: ignoring duplicate libraries: '-lproc'
2024-06-19T11:59:29.405+0100	info	[email protected]/service.go:115	Setting up own telemetry...
2024-06-19T11:59:29.405+0100	info	[email protected]/telemetry.go:96	Serving metrics	{"address": ":8888", "level": "Normal"}
2024-06-19T11:59:29.405+0100	info	[email protected]/exporter.go:280	Development component. May change in the future.	{"kind": "exporter", "data_type": "traces", "name": "debug"}
2024-06-19T11:59:29.406+0100	info	[email protected]/service.go:182	Starting otelcontribcol...	{"Version": "0.102.0-dev", "NumCPU": 10}
2024-06-19T11:59:29.406+0100	info	extensions/extensions.go:34	Starting extensions...
2024-06-19T11:59:29.406+0100	warn	[email protected]/warning.go:42	Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks. Enable the feature gate to change the default and remove this warning.	{"kind": "receiver", "name": "otlp", "data_type": "traces", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks", "feature gate ID": "component.UseLocalHostAsDefaultHost"}
2024-06-19T11:59:29.407+0100	info	[email protected]/otlp.go:102	Starting GRPC server	{"kind": "receiver", "name": "otlp", "data_type": "traces", "endpoint": ":4317"}
2024-06-19T11:59:29.407+0100	warn	[email protected]/warning.go:42	Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks. Enable the feature gate to change the default and remove this warning.	{"kind": "receiver", "name": "otlp", "data_type": "traces", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks", "feature gate ID": "component.UseLocalHostAsDefaultHost"}
2024-06-19T11:59:29.407+0100	info	[email protected]/otlp.go:152	Starting HTTP server	{"kind": "receiver", "name": "otlp", "data_type": "traces", "endpoint": ":4318"}
2024-06-19T11:59:29.407+0100	info	[email protected]/service.go:208	Everything is ready. Begin running and processing data.
2024-06-19T11:59:29.407+0100	warn	localhostgate/featuregate.go:63	The default endpoints for all servers in components will change to use localhost instead of 0.0.0.0 in a future version. Use the feature gate to preview the new default.	{"feature gate ID": "component.UseLocalHostAsDefaultHost"}
2024-06-19T13:38:26.131+0100	info	TracesExporter	{"kind": "exporter", "data_type": "traces", "name": "debug", "resource spans": 1, "spans": 1}
2024-06-19T13:38:26.133+0100	info	ResourceSpans #0
Resource SchemaURL:
Resource attributes:
     -> telemetry.sdk.language: Str(python)
     -> telemetry.sdk.name: Str(opentelemetry)
     -> telemetry.sdk.version: Str(1.25.0)
     -> service.name: Str(service.worker)
     -> telemetry.auto.version: Str(0.46b0)
ScopeSpans #0
ScopeSpans SchemaURL:
InstrumentationScope worker
Span #0
    Trace ID       : 92776c0eb3c2a56cab8827066d03b4c3
    Parent ID      :
    ID             : e295a5bacda12b5a
    Name           : worker
    Kind           : Internal
    Start time     : 2024-06-19 12:37:55.499827 +0000 UTC
    End time       : 2024-06-19 12:37:55.499846 +0000 UTC
    Status code    : Unset
    Status message :
Attributes:
     -> job: Str(boring)
	{"kind": "exporter", "data_type": "traces", "name": "debug"}

Additional context

This is related to support being added for invert_match in AND policies in #9768

@jamesrwhite jamesrwhite added bug Something isn't working needs triage New item requiring triage labels Jun 19, 2024
@github-actions github-actions bot added the processor/tailsampling Tail sampling processor label Jun 19, 2024
Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

jpkrohling (Member) commented:

@jamesrwhite, thank you for the detailed report! I agree with your assessment: the and policy should definitely return the original decision there. I looked at the branch with the proposal and I would definitely like to review and merge a PR with that code.

@jpkrohling jpkrohling removed the needs triage New item requiring triage label Jun 20, 2024
jamesrwhite (Contributor, Author) commented:

@jpkrohling thanks for taking a look at this so soon, I'll open a PR shortly 👍

jpkrohling pushed a commit that referenced this issue Jun 20, 2024
…hen inside and sub policy (#33671)

**Description:**

This fixes the handling of AND policies that contain a sub-policy with
invert_match=true. Previously, if the decision from a policy evaluation
was `NotSampled` or `InvertNotSampled`, it would return a `NotSampled`
decision regardless, effectively downgrading the result.

This was breaking the documented behaviour that inverted decisions
should take precedence over all others.

This is related to the changes made in
#9768, which introduced
support for using invert_match within `and` sub-policies.

**Link to tracking Issue:**
#33656

**Testing:**

I tested manually that this fixes the issue described in
#33656
and also updated the tests. If you have any suggestions for more tests
we could add let me know.

**Documentation:**
crobert-1 (Member) commented:

It looks like this was resolved by #33671, but please correct me if I'm missing something.

jpkrohling (Member) commented:

I'm reopening this given what we learned in #34085.

rjtferreira commented:

I've recently started migrating our Honeycomb-based stack to Grafana Tempo, and I'm now leveraging the tailsamplingprocessor to filter/sample our traces, which is how I ran into this issue.

I've been following this trail of issues (#34085, #33671), and at least for me it seems we're running into this overcomplication because we're missing a drop policy. Have we ever considered adding something like that? The way I see it, this would be similar to what happens in Honeycomb Refinery, where we can specifically match attributes with an action to drop.


This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Mar 10, 2025

mbark commented Apr 10, 2025

We have almost exactly the same scenario as described in the original issue (a string attribute plus a span count should result in the trace not being sampled). Given that the proposed fix was reverted, is there some other way to work around this for now?

@github-actions github-actions bot removed the Stale label Apr 11, 2025
atoulme pushed a commit that referenced this issue Apr 29, 2025
#### Description

This pull request introduces a new policy type to explicitly drop traces
regardless of other policy decisions. The new `drop` policy type behaves
much like an `and`; however, it returns a `Dropped` decision when all of
its sub-policies return `Sampled`. A `Dropped` decision takes precedence
over all others. The `Dropped` decision is not new; however, it was
previously unused and therefore safe to use for this purpose. This new
policy type is very **approachable** and capable.

This new policy type can be used to replace the top-level (not within an
`and` policy) use of `invert_match` to control the final sampling
decision. We could deprecate the current `invert_match` top-level
decision logic. The string, numeric, and boolean filter policies still
support `invert_match`, which continues to flip the decision for the
individual policy. Let `invert_match` be simple.

Example:

```
{
    name: drop-it-like-its-hot,
    type: drop,
    drop: {
        drop_sub_policy: 
        [
            {
                name: do-not-sample,
                type: boolean_attribute,
                boolean_attribute: { key: app.do_not_sample, value: true },
            }
        ]
    }
}
```
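The flow-style example above can also be written in the block-style YAML used elsewhere in this thread. This is a sketch assuming the `drop` policy type behaves as described in this PR; the policy and attribute names (`drop-it-like-its-hot`, `app.do_not_sample`) are the PR author's illustrative examples:

```yaml
processors:
  tail_sampling:
    policies:
      - name: drop-it-like-its-hot
        type: drop
        drop:
          drop_sub_policy:
            - name: do-not-sample
              type: boolean_attribute
              boolean_attribute:
                key: app.do_not_sample
                value: true
```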

This is a reduced version of a previous PR
#37760

#### Related Issues

-
#36673
-
#33656
-
#36795
-
#34296
-
#34085
-
#29637
-
#27049
- Probably more 😅

---------

Signed-off-by: Sean Porter <[email protected]>

5 participants