
[receiver/dockerstats] not generating per container metrics #33303


Open
schewara opened this issue May 29, 2024 · 12 comments

@schewara

Component(s)

exporter/prometheus, receiver/dockerstats, receiver/prometheus

What happened?

Description

We have a collector running (in docker), which is supposed to collect

  • docker container stats through receiver/dockerstats
  • app metrics, through receiver/prometheus and docker_sd_config
  • export the collected metrics through exporter/prometheusexporter

A similar issue was already reported but was closed without any real solution -> #21247

Steps to Reproduce

  1. have a container running which exposes metrics, like prometheus/node-exporter
  2. start an otel/opentelemetry-collector-contrib container
  3. observe the /metrics endpoint of the prometheus exporter
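
For reference, a hypothetical minimal compose file for such a setup (image tags, ports, and service names are assumptions; the prometheus.scrape and prometheus.port labels match the docker_sd_configs filter and relabel rules in the collector config further down):

```yaml
# Hypothetical reproduction setup; image tags and ports are assumptions.
services:
  nodeexporter:
    image: prom/node-exporter:latest
    labels:
      prometheus.scrape: "true"  # matched by the docker_sd_configs label filter
      prometheus.port: "9100"    # becomes __meta_docker_container_label_prometheus_port
  otelcol:
    image: otel/opentelemetry-collector-contrib:0.101.0
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock        # used by docker_stats and docker_sd
      - ./otelcol.yaml:/etc/otelcol-contrib/config.yaml  # collector config from below
```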

Expected Result

Individual metrics for each container running on the same host.

Actual Result

Only metrics that have data point attributes are shown, like the following, plus the metrics coming from the prometheus receiver.

container_network_io_usage_rx_errors_total{interface="eth1"} 0

Test scenarios and observations

exporter/prometheus - resource_to_telemetry_conversion - enabled

When enabling this config option, the following was observed:

  • receiver/dockerstats metrics are available as expected
  • receiver/prometheus metrics are gone
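
For clarity, a minimal sketch of the option that was toggled (the endpoint is an assumption, the full exporter config is not included in this issue):

```yaml
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"  # assumed listen address
    resource_to_telemetry_conversion:
      enabled: true  # documented to convert all resource attributes into metric labels
```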

I don't really know how the prometheus receiver converts the scraped metrics into an OTel object, but it looks like it creates individual metrics plus a target_info metric containing only data point attributes and no resource attributes.

This would explain why the metrics disappear, as it seems all existing metric labels are wiped and replaced with nothing.

Manually setting attribute labels

Trying to set static attributes manually through the attributes processor only added a new label to the single metrics, but did not produce individual container metrics.
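
Roughly what was tried, as a sketch (key and value are hypothetical); the attributes processor only touches data point attributes, which matches the single extra label observed:

```yaml
processors:
  attributes/static:
    actions:
      - key: environment   # hypothetical static label
        value: staging
        action: insert     # inserted as a data point attribute, not a resource attribute
```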

After going through all the logs and searching through all the documentation, I discovered the
Setting resource attributes as metric labels section of the prometheus exporter docs. When implemented (see the commented-out sections of the config), metrics from the dockerstats receiver showed up on the exporter's /metrics endpoint, but they are still missing some crucial labels, which might need to be added manually as well.

Findings

Based on all the observations during testing and trying things out, these are my takeaways on the current shortcomings of the 3 selected components and how they do not integrate well with each other.

receiver/dockerstats

  • The data received from docker should be properly set up as either a resource or a data point attribute
  • The config settings, or maybe just the documentation, for the container_labels_to_metric_labels and env_vars_to_metric_labels settings are incorrect, as the labels are not added as data point attributes and therefore never show up in any metric labels
  • For the metrics to work with prometheus, they should include a job and an instance label,
    by using the service.namespace, service.name, and service.instance.id resource attributes, which should then hopefully get picked up by the exporter and converted into the right labels (a possible workaround is sketched right after this list)
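
A possible workaround sketch using the transform processor (attribute names taken from the dockerstats resource attributes shown in the log output below; the static namespace value is my assumption), until the receiver sets these itself:

```yaml
processors:
  transform/service_attrs:
    metric_statements:
      - context: resource
        statements:
          # derive service.* attributes from the container metadata already present
          - set(attributes["service.name"], attributes["container.name"])
          - set(attributes["service.instance.id"], attributes["container.id"])
          - set(attributes["service.namespace"], "docker")  # assumed static value
```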

receiver/prometheus

  • I was under the impression that the labels from the docker_sd_configs are added as resource attributes to the scraped metrics.
    But as I can't find the link to the source right now, I am either mistaken or it simply is not the case, judging by the log outputs and the target_info metrics.

exporter/prometheusexporter

  • Looking at the documentation and the target_info metric, I am missing the resource attributes from the dockerstats metrics. Maybe this is due to the missing service attributes or some other reason, but I was unable to see any errors or warnings in the standard log.
  • The resource_to_telemetry_conversion functionality left me a bit speechless: it wipes all data point attributes, especially when there are no resource attributes available.
    Also, activating it would mean that I would lose (as an example) the interface information from the container.network.io.usage.rx_bytes metric, with no idea where the actual value is then taken or calculated from.
    A warning in the documentation would be really helpful, or a flag to adjust the behavior based on individual needs.

Right now I am torn between manually transforming all the labels of the dockerstats receiver and creating duplicate pipelines with a duplicated exporter, but either way there is some room for improvement to have everything working together smoothly.

Collector version

otel/opentelemetry-collector-contrib:0.101.0

Environment information

Environment

Docker

OpenTelemetry Collector configuration

receivers:
  docker_stats:
    api_version: '1.45'
    collection_interval: 10s
    container_labels_to_metric_labels:
      com.docker.compose.project: compose.project
      com.docker.compose.service: compose.service
    endpoint: "unix:///var/run/docker.sock"
    initial_delay: 1s
    metrics:
      container.restarts:
        enabled: true
      container.uptime:
        enabled: true
    timeout: 5s
  otlp:
    protocols:
      grpc: null
      http: null
  prometheus:
    config:
      global:
        scrape_interval: 30s
      scrape_configs:
      - job_name: otel-collector
        relabel_configs:
        - replacement: static.instance.name
          source_labels:
          - __address__
          target_label: instance
        scrape_interval: 30s
        static_configs:
        - targets:
          - localhost:8888
      - docker_sd_configs:
        - filters:
          - name: label
            values:
            - prometheus.scrape=true
          host: unix:///var/run/docker.sock
        job_name: docker-containers
        relabel_configs:
        - action: replace
          source_labels:
          - __meta_docker_container_label_prometheus_path
          target_label: __metrics_path__
        - action: replace
          regex: /(.*)
          source_labels:
          - __meta_docker_container_name
          target_label: container_name
        - action: replace
          separator: ':'
          source_labels:
          - container_name
          - __meta_docker_container_label_prometheus_port
          target_label: __address__
        - replacement: static.instance.name
          source_labels:
          - __address__
          target_label: instance
        - action: replace
          source_labels:
          - __meta_docker_container_id
          target_label: container_id
        - action: replace
          source_labels:
          - __meta_docker_container_id
          target_label: service_instance_id
        - action: replace
          source_labels:
          - __meta_docker_container_label_service_namespace
          target_label: service_namespace
        - action: replace
          source_labels:
          - container_name
          target_label: service_name
        - action: replace
          source_labels:
          - __meta_docker_container_label_deployment_environment
          target_label: deployment_environment
        - action: replace
          regex: (.+/)?/?(.+)
          replacement: $${1}$${2}
          separator: /
          source_labels:
          - service_namespace
          - service_name
          target_label: job
        scrape_interval: 30s

processors:
  batch: null
  resourcedetection/docker:
    detectors:
    - env
    - docker
    override: true
    timeout: 2s   
#  transform/dockerstats:
#    metric_statements:
#      - context: datapoint
#        statements:
#          - set(attributes["container.id"], resource.attributes["container.id"])
#          - set(attributes["container.name"], resource.attributes["container.name"])
#          - set(attributes["container.hostname"], resource.attributes["container.hostname"])
#          - set(attributes["host.name"], resource.attributes["host.name"])
#          - set(attributes["compose.project"], resource.attributes["compose.project"])
#          - set(attributes["compose.service"], resource.attributes["compose.service"])
#          - set(attributes["deployment.environment"], resource.attributes["deployment.environment"])
#          - set(attributes["service.namespace"], resource.attributes["service.namespace"])

service:
  pipelines:
    metrics:
      exporters:
      - prometheus
      - logging
      processors:
      # - transform/dockerstats
      - resourcedetection/docker
      - batch
      receivers:
      - otlp
      - docker_stats
      - prometheus

Log output

Some snippets from individual metric log entries:

`nodeexporter` metric through `receiver/prometheus`, which contains data point attributes but no resource attributes


Metric #9
Descriptor:
     -> Name: node_disk_flush_requests_total
     -> Description: The total number of flush requests completed successfully
     -> Unit: 
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> container_id: Str(47626dfb6da051ba7858bfa763297be5a20283b510a3faab46dd8a1f2f25210d)
     -> container_name: Str(nodeexporter)
     -> device: Str(sr0)
     -> service_instance_id: Str(47626dfb6da051ba7858bfa763297be5a20283b510a3faab46dd8a1f2f25210d)
     -> service_name: Str(nodeexporter)
StartTimestamp: 2024-05-29 19:03:04.033 +0000 UTC
Timestamp: 2024-05-29 19:03:04.033 +0000 UTC
Value: 0.000000
NumberDataPoints #1
Data point attributes:
     -> container_id: Str(47626dfb6da051ba7858bfa763297be5a20283b510a3faab46dd8a1f2f25210d)
     -> container_name: Str(nodeexporter)
     -> device: Str(vda)
     -> service_instance_id: Str(47626dfb6da051ba7858bfa763297be5a20283b510a3faab46dd8a1f2f25210d)
     -> service_name: Str(nodeexporter)
StartTimestamp: 2024-05-29 19:03:04.033 +0000 UTC
Timestamp: 2024-05-29 19:03:04.033 +0000 UTC
Value: 6907671.000000

receiver/dockerstats metric with a Data point attribute, but no Resource attribute

Descriptor:
     -> Name: container.network.io.usage.rx_bytes
     -> Description: Bytes received by the container.
     -> Unit: By
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> interface: Str(eth0)
StartTimestamp: 2024-05-29 19:02:58.560036953 +0000 UTC
Timestamp: 2024-05-29 19:03:01.664890776 +0000 UTC
Value: 425806
NumberDataPoints #1
Data point attributes:
     -> interface: Str(eth1)
StartTimestamp: 2024-05-29 19:02:58.560036953 +0000 UTC
Timestamp: 2024-05-29 19:03:01.664890776 +0000 UTC
Value: 1631176

receiver/dockerstats metric with no Data point attribute, but Resource attributes

Metric #18
Descriptor:
     -> Name: container.uptime
     -> Description: Time elapsed since container start time.
     -> Unit: s
     -> DataType: Gauge
NumberDataPoints #0
StartTimestamp: 2024-05-29 19:02:58.560036953 +0000 UTC
Timestamp: 2024-05-29 19:03:01.664890776 +0000 UTC
Value: 5070.807996
ResourceMetrics #5
Resource SchemaURL: https://opentelemetry.io/schemas/1.6.1
Resource attributes:
     -> container.runtime: Str(docker)
     -> container.hostname: Str(my-hostname)
     -> container.id: Str(cacdf88cadd7d8691efefbd0f0c49d256718830b89a3e47f6b65e8e7378534e6f)
     -> container.image.name: Str(my/test-container:0.0.9)
     -> container.name: Str(my-test-container)
     -> host.name: Str(my-hostname)
     -> os.type: Str(linux)
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope otelcol/dockerstatsreceiver 0.101.0


Additional context

No response

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jamesmoessis
Contributor

I'm not sure I fully understand your issue, but it seemingly has nothing to do with the docker stats receiver. It just seems like the prometheus exporter isn't exporting what you expect, or there is some misunderstanding about what it makes available.

I think this would be more helpful if you identified one component that wasn't operating as expected. The docker stats receiver and the prometheus exporter have nothing to do with each other.

If your problem is that the docker stats receiver isn't reporting a metric that it should be, then it's a problem with the docker stats receiver. If the prom exporter isn't doing what you think it should, that's a problem with the prom exporter (or a misconfiguration).

From what I can see, the docker stats receiver is producing all of the information it should, and then the prom exporter is stripping some of the information that you expect. You can verify this by replacing the prom exporter with the debug exporter and looking at the output straight in stdout. If it's what you expect, then you can narrow the issue down to the prom exporter.
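
For example, something like this (a minimal sketch; the pipeline name is arbitrary):

```yaml
exporters:
  debug:
    verbosity: detailed  # prints full resource and data point attributes to stdout

service:
  pipelines:
    metrics/debug:
      receivers: [docker_stats]
      exporters: [debug]
```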

@Mendes11

Mendes11 commented Jun 20, 2024

I'm experiencing the same issue, but my exporter is awsemf.

Using the debugger, I can see the labels in the Resource Attributes, but it seems awsemf is just using the Data Point Attributes when sending the metrics and ignoring what's in the Resource Attributes?

 ResourceMetrics #4
 Resource SchemaURL: https://opentelemetry.io/schemas/1.6.1
 Resource attributes:
      -> container.runtime: Str(docker)
      -> container.hostname: Str(c3e61d730cb6)
      -> container.id: Str(c3e61d730cb6c5936b5862844d6e4acf60a880821610a7af9f9a689cffb966db)
      -> container.image.name: Str(couchdb:2.3.1@sha256:5c83dab4f1994ee4bb9529e9b1d282406054a1f4ad957d80df9e1624bdfb35d7)
      -> container.name: Str(swarmpit_db.1.usj3zlnoxmwjhjc27tc3g5he0)
      -> swarm_service: Str(swarmpit_db)
      -> swarm_container_id: Str(usj3zlnoxmwjhjc27tc3g5he0)
      -> swarm_namespace: Str(swarmpit)
 ScopeMetrics #0
 ScopeMetrics SchemaURL:
 InstrumentationScope otelcol/dockerstatsreceiver 1.0.0
 Metric #0
 Descriptor:
      -> Name: container.blockio.io_service_bytes_recursive
      -> Description: Number of bytes transferred to/from the disk by the group and descendant groups.
      -> Unit: By
      -> DataType: Sum
      -> IsMonotonic: true
      -> AggregationTemporality: Cumulative
 NumberDataPoints #0
 Data point attributes:
      -> device_major: Str(259)
      -> device_minor: Str(0)
      -> operation: Str(read)
 StartTimestamp: 2024-06-20 19:10:25.725911895 +0000 UTC
 Timestamp: 2024-06-20 19:19:28.761889055 +0000 UTC
 Value: 4366336
 NumberDataPoints #1
 Data point attributes:
      -> device_major: Str(259)
      -> device_minor: Str(0)
      -> operation: Str(write)
 StartTimestamp: 2024-06-20 19:10:25.725911895 +0000 UTC
 Timestamp: 2024-06-20 19:19:28.761889055 +0000 UTC
 Value: 4096



github-actions bot commented Dec 3, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jamesmoessis
Contributor

@schewara #21247 was closed because it's not an issue with the docker stats receiver, and it's expected behaviour of the prometheus exporter.

The data received from docker should be properly set up as either a resource or a data point attribute
The config settings, or maybe just the documentation, for the container_labels_to_metric_labels and env_vars_to_metric_labels settings are incorrect, as the labels are not added as data point attributes and therefore never show up in any metric labels

I agree the documentation could use some work here, in particular defining what goes into resource attributes and what goes into metric data point attributes. It is working as expected. Resource attributes relate to the resource (the container metadata), so the container name, ID, and any env vars/labels on the container are resource attributes. Then the data point attributes relate to specific data points, like which network interface a specific data point was measuring.

For the metrics to work with prometheus, they should include a job and an instance label,
by using the service.namespace, service.name, and service.instance.id resource attributes, which should then hopefully get picked up by the exporter and converted into the right labels

I don't think it's the jurisdiction of the docker stats receiver to change its behaviour because a specific wire protocol works a certain way. IMO, this is working as expected. If you want resource attributes to be added into prom, you can use resource_to_telemetry_conversion as you mentioned.

It seemed the real problem was that you lost other metrics when you used resource_to_telemetry_conversion. I think you'd have more luck getting this issue fixed if you focused on that part rather than involving lots of other parts of the collector.

@dima-sh-papaya

Hi, I stumbled upon not having the docker_stats receiver available at all in the latest upstream collector, which I installed from the docs (on Ubuntu), getting an error out of the box:

error decoding 'receivers': unknown type: "docker_stats" for id: "docker_stats" (valid values: [hostmetrics jaeger kafka....

At first I thought maybe it got renamed or deprecated, but I saw this fresh thread and realized it still exists. Does it need to be enabled somehow, or what am I missing? (I used just the upstream docs and the default example when I got the error, very confused; I didn't even start customizing it much, just selected the metrics I want.)

Does it mean the current default collector supplied in the deb package shown in the docs' quickstart section doesn't include the docker receiver, so I should build a custom one and include the receiver in it? (Are there official images on Docker Hub that are better to use instead?)
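
For reference, building a custom distribution with the collector builder would need a manifest along these lines (a rough sketch; paths and names are assumptions, the module version is taken from this thread):

```yaml
# Rough sketch of an OpenTelemetry Collector Builder (ocb) manifest
dist:
  name: otelcol-custom
  output_path: ./otelcol-custom
receivers:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/dockerstatsreceiver v0.101.0
```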

@dima-sh-papaya

Nevermind, I saw that there's 'core' and 'otelcol-contrib'; the contrib version has the docker receiver 😁 I'll keep it here since this thread comes up in Google when searching for the error, in case others wonder why it happens right after following the quickstart in the docs. I might not be the only dummy out there 😆

@schewara
Author

schewara commented Feb 6, 2025

@jamesmoessis

I am aware that I threw in a couple of protocols, but mainly for comparison and to provide a broader, systems-level view of a common setup which has

  • multiple input sources from multiple protocols (excluding traces and logs to keep it simpler)
  • multiple outputs, logging/debug (otel native) and prometheus_exporter
  • some metadata processing and transformations based on individual needs

and to showcase that, in its current state, working with all these signals and sources is quite painful, as all these parts massively interfere with, instead of complementing, each other, with a good chance of potential data loss, which, at least by my standards, is not a good thing.


But let's break it down a bit further to hopefully make it more understandable, focusing on receiver/dockerstats and removing the Prometheus exporter and receiver from the discussion.

Using service.name, service.instance.id, service.namespace attributes

I don't think it's the jurisdiction of the docker stats receiver to change it's behaviour because a specific wire protocol works a certain way.

There seems to be a bit of a misunderstanding here. This has absolutely nothing to do with any wire protocol besides the OpenTelemetry semantic conventions for resources, with service.name being a MUST requirement.

see the following links for more details on it

As shown in the log outputs, no metric provides a service.* attribute, which in my opinion should be fixed.

Prometheus compatibility

As Prometheus is explicitly mentioned in the OTel specs for metrics exporters, I naively assume that all metric receivers should try to provide the minimum requirements to work with Prometheus out of the box, which is documented here.

This page also covers the previously mentioned resource attributes at the bottom of the document.

With the release of Prometheus 3.0 and native OTLP support things change a bit, but to work properly it still requires the OpenTelemetry-native service.name and service.instance.id to be present, which again is currently not the case.
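
For illustration, a rough sketch of what the Prometheus side looks like (flag and config keys as I understand the Prometheus 3.x docs; treat this as an assumption, not a verified config):

```yaml
# Prometheus 3.x, as an assumption: start with --web.enable-otlp-receiver
# job/instance are then derived from service.namespace/service.name and service.instance.id.
otlp:
  promote_resource_attributes:   # other resource attributes to keep as labels
    - container.name
    - container.id
```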

It is working as expected.

Resource attributes relate to the resource (the container metadata), so the container name, ID, and any env vars/labels on the container are resource attributes. Then the data point attributes relate to specific data points, like which network interface a specific data point was measuring.

I have to respectfully disagree with you on this point for the following reason.

The container.network.io.usage.rx_bytes metric has no resource attributes, which makes it impossible to know which container (container.name, container.id) on which host.name received how many bytes.
Some metrics do have attributes but others don't, so it would be great if you could explain this to me in more detail so we can reach a common understanding.
My opinion here is based on the logging/debug output, which is raw OTel with no other wire protocol conversion involved.


To sum it up in one sentence:

The receiver/dockerstats should include all the currently missing attributes so that a specific metric can be mapped to a unique container, making it possible to answer a simple question like:
"Which container has the highest network IO?"

I hope this helps to better understand where I am coming from and what should be fixed in my opinion.

@jamesmoessis
Contributor

Thanks @schewara for the extra clarification. I understand your issues better now.

The container.network.io.usage.rx_bytes metric has no resource attributes, which makes it impossible to know which container (container.name, container.id) on which host.name received how many bytes.

This isn't correct; all the metrics coming from the dockerstats receiver have those resource attributes. I would double-check how you are interpreting that log output. Remember, resource attributes are at the top level and many metric data points can be grouped under them. There is a test in the receiver that demonstrates this; for example, the expected metrics in this test are all grouped under the same resource: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/dockerstatsreceiver/testdata/mock/single_container/expected_metrics.yaml

If this turns out to not have resource attributes in some scenarios, that's a valid bug that needs to be fixed.

There seems to be a bit of a misunderstanding here. This has absolutely nothing to do with any wire protocol besides the OpenTelemetry semantic conventions for resources, with service.name being a MUST requirement.

I see your issue more clearly now. It's that the dockerstats receiver doesn't set service.name, which is a key field. Yes, that is a problem, I agree. In its current state the dockerstatsreceiver can't know the service name, since it would have to be specified by the user. You can use the resourceprocessor to add your own service.name, but I agree maybe there should be an optional config field on the dockerstatsreceiver to add this for convenience. I'd be happy to review a PR for that, or maybe get the opinion of a maintainer.
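
A minimal sketch of that resourceprocessor workaround (the values here are placeholders you would pick yourself):

```yaml
processors:
  resource/service:
    attributes:
      - key: service.name
        value: my-docker-host      # placeholder; whatever identifies the workload
        action: upsert
      - key: service.instance.id
        value: my-docker-host-01   # placeholder
        action: upsert
```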

I'd also like to add that this is technically a "MUST" for the semantic conventions, not a "MUST" to adhere to the OTLP spec. It's still valid OTLP without service.name, as far as I understand.

@schewara
Author

schewara commented Feb 10, 2025

@jamesmoessis I am glad that I was able to make it clearer.

This isn't correct; all the metrics coming from the dockerstats receiver have those resource attributes. I would double-check how you are interpreting that log output. Remember, resource attributes are at the top level and many metric data points can be grouped under them. ...

Thank you for the pointer to the test. I will check again and let you know if I missed something here on my end.

I'd also like to add that this is technically a "MUST" for the semantic conventions, not a "MUST" to adhere to the OTLP spec. It's still valid OTLP without service.name, as far as I understand.

I agree with you on this one and would understand it the same way.


This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.
