Component(s)
connector/spanmetrics
What happened?
Description
As the spanmetrics & servicegraph connectors use the timestamp at which the collector receives the spans, the calls_total metric shows incorrect values when the collector pods are restarted. After a restart, more spans are delivered to the collector at once, so it sees an increase in the number of spans received; however, this is not a real increase in the number of calls to a particular service.
Steps to Reproduce
Restart the collector pods and observe a spike in the calls_total metric even though the calls to the backend service did not actually increase. The collector is deployed as a StatefulSet and receives the spans from another collector via the loadbalancing exporter with routing_key set to "service" (i.e. routing by service name).
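For reference, the restart was done roughly as follows (a sketch: the StatefulSet name and namespace are assumptions based on the config below, substitute the ones used in your cluster):

```sh
# Restart the collector layer that runs the spanmetrics/servicegraph connectors.
# StatefulSet name and namespace are assumed; adjust to your deployment.
kubectl -n processor-traces rollout restart statefulset spnsgrh-traces-otel-collector
kubectl -n processor-traces rollout status statefulset spnsgrh-traces-otel-collector
# Then watch rate(calls_total) for the affected services once the pods are back up.
```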
Expected Result
There shouldn't be any spike in the rate(calls_total) metric when the collector is restarted.
Actual Result
A spike appears in the rate(calls_total) metric when the collector pods are restarted.
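For reference, a query along these lines was used to observe the spike (a sketch: the vmselect read endpoint and tenant are assumptions derived from the vminsert URL in the config, and the exact metric/label names depend on the namespace: span.metrics setting and prometheusremotewrite name normalization):

```sh
# Query a Prometheus-compatible read endpoint (VictoriaMetrics vmselect assumed here,
# tenant 10 to match the /insert/10/prometheus write path from the config).
# With namespace: span.metrics the series may appear as span_metrics_calls_total instead.
curl -s 'http://spnsgrh-victoria-metrics-cluster-vmselect.metrics.svc.cluster.local:8481/select/10/prometheus/api/v1/query' \
  --data-urlencode 'query=sum by (service_name) (rate(calls_total[5m]))'
```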
Collector version
0.114.0 or earlier
Environment information
Environment
OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")
OpenTelemetry Collector configuration
```yaml
exporters:
  debug:
    verbosity: basic
  loadbalancing/processor-traces-spnsgrh:
    protocol:
      otlp:
        timeout: 30s
        tls:
          insecure: true
    resolver:
      k8s:
        ports:
          - 4317
        service: spnsgrh-traces-otel-collector.processor-traces
    routing_key: service
extensions:
  health_check:
    endpoint: ${env:MY_POD_IP}:13133
processors:
  batch: {}
  memory_limiter:
    check_interval: 5s
    limit_percentage: 80
    spike_limit_percentage: 25
receivers:
  otlp/loadbalancer-traces-spnsgrh:
    protocols:
      http:
        cors:
          allowed_origins:
            - http://*
            - https://*
        endpoint: ${env:MY_POD_IP}:4318
        include_metadata: true
        max_request_body_size: 10485760
service:
  extensions:
    - health_check
  pipelines:
    traces/spnsgrh:
      exporters:
        - loadbalancing/processor-traces-spnsgrh
      processors:
        - batch
      receivers:
        - otlp/loadbalancer-traces-spnsgrh
  telemetry:
    metrics:
      address: ${env:MY_POD_IP}:8888
```
```yaml
connectors:
  servicegraph:
    latency_histogram_buckets:
      - 100ms
      - 250ms
      - 500ms
      - 1s
      - 5s
      - 10s
    metrics_flush_interval: 30s
    store:
      max_items: 10
      ttl: 2s
  spanmetrics:
    aggregation_temporality: AGGREGATION_TEMPORALITY_CUMULATIVE
    dimensions:
      - name: http.method
      - name: http.status_code
    dimensions_cache_size: 1000
    events:
      dimensions:
        - name: exception.type
      enabled: true
    exclude_dimensions:
      - k8s.pod.uid
      - k8s.pod.name
      - k8s.container.name
      - k8s.deployment.name
      - k8s.deployment.uid
      - k8s.job.name
      - k8s.job.uid
      - k8s.namespace.name
      - k8s.node.name
      - k8s.pod.ip
      - k8s.pod.start_time
      - k8s.replicaset.name
      - k8s.replicaset.uid
      - azure.vm.scaleset.name
      - cloud.resource_id
      - host.id
      - host.type
      - instance
      - service.instance.id
      - host.name
      - job
      - dt.entity.host
      - dt.entity.process_group
      - dt.entity.process_group_instance
      - container.id
    exemplars:
      enabled: true
      max_per_data_point: 5
    histogram:
      explicit:
        buckets:
          - 1ms
          - 10ms
          - 20ms
          - 50ms
          - 100ms
          - 250ms
          - 500ms
          - 800ms
          - 1s
          - 2s
          - 5s
          - 10s
          - 15s
    metrics_expiration: 5m
    metrics_flush_interval: 1m
    namespace: span.metrics
    resource_metrics_key_attributes:
      - service.name
      - telemetry.sdk.language
      - telemetry.sdk.name
exporters:
  debug/servicegraph:
    verbosity: basic
  debug/spanmetrics:
    verbosity: basic
  otlphttp/vm-default-processor-servicegraph:
    compression: gzip
    encoding: proto
    endpoint: http://spnsgrh-victoria-metrics-cluster-vminsert.metrics.svc.cluster.local:8480/insert/20/opentelemetry
    timeout: 30s
    tls:
      insecure: true
  prometheusremotewrite/vm-default-processor-spanmetrics:
    compression: gzip
    endpoint: http://spnsgrh-victoria-metrics-cluster-vminsert.metrics.svc.cluster.local:8480/insert/10/prometheus
    resource_to_telemetry_conversion:
      enabled: true
    timeout: 60s
    tls:
      insecure_skip_verify: true
extensions:
  health_check:
    endpoint: ${env:MY_POD_IP}:13133
processors:
  batch: {}
  batch/servicegraph:
    send_batch_max_size: 5000
    send_batch_size: 4500
    timeout: 10s
  batch/spanmetrics:
    send_batch_max_size: 5000
    send_batch_size: 4500
    timeout: 10s
  memory_limiter:
    check_interval: 5s
    limit_percentage: 80
    spike_limit_percentage: 25
receivers:
  otlp/processor-traces-spansgrph:
    protocols:
      grpc:
        endpoint: ${env:MY_POD_IP}:4317
        max_recv_msg_size_mib: 12
      http:
        endpoint: ${env:MY_POD_IP}:4318
service:
  extensions:
    - health_check
  pipelines:
    metrics/servicegraph:
      exporters:
        - otlphttp/vm-default-processor-servicegraph
      processors:
        - batch/servicegraph
      receivers:
        - servicegraph
    metrics/spanmetrics:
      exporters:
        - prometheusremotewrite/vm-default-processor-spanmetrics
      processors:
        - batch/spanmetrics
      receivers:
        - spanmetrics
    traces/connector-pipeline:
      exporters:
        - spanmetrics
        - servicegraph
      processors:
        - batch
      receivers:
        - otlp/processor-traces-spansgrph
  telemetry:
    metrics:
      address: ${env:MY_POD_IP}:8888
```
Log output
No response
Additional context

