Skip to content

feat(dockerstats): add online cpu and memory failcnt metrics #31481

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Mar 21, 2024
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 27 additions & 0 deletions .chloggen/dockerstats-onlinecpus-failcnt.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
# Use this changelog template to create an entry for release notes.

# One of 'breaking', 'deprecation', 'new_component', 'enhancement', 'bug_fix'
change_type: enhancement

# The name of the component, or a single word describing the area of concern, (e.g. filelogreceiver)
component: dockerstatsreceiver

# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`).
note: add metrics for online CPU count and memory fails count

# Mandatory: One or more tracking issues related to the change. You can use the PR number here if no issue exists.
issues: [31366]

# (Optional) One or more lines of additional information to render under the primary note.
# These lines will be padded with 2 spaces and then inserted directly into the document.
# Use pipe (|) for multiline entries.
subtext:

# If your change doesn't affect end users or the exported elements of any package,
# you should instead start your pull request title with [chore] or use the "Skip Changelog" label.
# Optional: The change log or logs in which this entry should be included.
# e.g. '[user]' or '[user, api]'
# Include 'user' if the change is relevant to end users.
# Include 'api' if there is a change to a library API.
# Default: '[user]'
change_logs: [user]
16 changes: 16 additions & 0 deletions receiver/dockerstatsreceiver/documentation.md
Original file line number Diff line number Diff line change
Expand Up @@ -62,6 +62,14 @@ Percent of CPU used by the container.
| ---- | ----------- | ---------- |
| 1 | Gauge | Double |

### container.memory.fails

Number of times the memory limit was hit.

| Unit | Metric Type | Value Type | Aggregation Temporality | Monotonic |
| ---- | ----------- | ---------- | ----------------------- | --------- |
| {fails} | Sum | Int | Cumulative | true |

### container.memory.file

Amount of memory used to cache filesystem data, including tmpfs and shared memory (Only available with cgroups v2).
Expand Down Expand Up @@ -306,6 +314,14 @@ This metric is only reported if the container has limits set with -cpus, -cpuset
| ---- | ----------- | ---------- |
| {cpus} | Gauge | Double |

### container.cpu.online

Number of online CPUs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any other documentation available from docker? It's hard to understand what this metric is needed for from this line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find any documentation from Docker, only some references from the internet that says how people interpret this:
"The number of real used CPUs is given by cpu_stats.online_cpus (4 in our case)." - influxdata/telegraf#2964
"The online_cpus value represents the number of CPU core on the host machine." - https://cratedb.com/blog/analyzing-docker-container-performance-native-tools
Even the source code doesn't say much: https://github.com/moby/moby/blob/460b4aebdf81512a3ea37c2a65826b890427537d/api/types/stats.go#L50

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Digging deeper into the code, this is the "cpu_num" return value here: https://github.com/moby/moby/blob/460b4aebdf81512a3ea37c2a65826b890427537d/daemon/stats_unix.go#L321
If we go to the man page in the comment, we can see that the command lists cpus with a number like cpu0, cpu1 etc. and Docker just sums number of these different values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, it seems like just a number of cores available to the container. We have system.cpu.logical.count for this purpose on the host, see https://github.com/open-telemetry/semantic-conventions/blob/6c79a1d9107c1833d95177beda84947ce1902792/model/metrics/system-metrics.yaml#L90C18-L90C42. Should we call it container.cpu.logical.count? I think online here just adds confusion. cc @open-telemetry/semconv-container-approvers

Copy link
Member

@ChrsMark ChrsMark Mar 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would make sense to introduce it as container.cpu.logical_count container.cpu.logical.count to be aligned with system.*, as per our discussions at open-telemetry/semantic-conventions#282 (comment).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed the name to container.cpu.logical_count as the suggestion said.

Copy link
Member

@ChrsMark ChrsMark Mar 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah just checked again the system.* and it seems that in order to be aligned with that it should be container.cpu.logical.count (from system.cpu.logical.count), right? @dmitryax wdyt?

@aboguszewski-sumo my previous comment was wrong, you can ignore it. The correct name we should use here is the container.cpu.logical.count.

The container.cpu.logical_number should be sth different similar to the system.cpu.logical_number. It is the "core ID" from which the metrics are emitted. Sorry for the confusion here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed, thanks for responding quickly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dmitryax could you rereview? 🙏


| Unit | Metric Type | Value Type |
| ---- | ----------- | ---------- |
| {CPUs} | Gauge | Int |

### container.cpu.shares

CPU shares set for the container.
Expand Down

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

116 changes: 116 additions & 0 deletions receiver/dockerstatsreceiver/internal/metadata/generated_metrics.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading