[Blog] Built-in UI for monitoring basic GPU metrics #2470

Merged (1 commit) on Apr 3, 2025
@@ -1,16 +1,16 @@
---
title: "Monitoring basic GPU metrics via dstack stats"
title: "Monitoring basic GPU metrics via CLI"
date: 2024-10-22
description: "dstack introduces a new CLI command (and API) for monitoring container metrics, incl. GPU usage for NVIDIA, AMD, and other accelerators."
slug: dstack-stats
slug: dstack-metrics
image: https://github.com/dstackai/static-assets/blob/main/static-assets/images/dstack-stats-v2.png?raw=true
categories:
- AMD
- NVIDIA
- Monitoring
---

# Monitoring basic GPU metrics via dstack stats
# Monitoring basic GPU metrics via CLI

## How it works { style="display:none"}

@@ -22,6 +22,8 @@ for monitoring container metrics, including GPU usage for `NVIDIA`, `AMD`, and o

<!-- more -->

> Note, the `dstack stats` command has been renamed to `dstack metrics`. The old name is still supported but deprecated.

The command is similar to `kubectl top` (in terms of semantics) and `docker stats` (in terms of the CLI interface). The key
difference is that `dstack stats` includes GPU VRAM usage and GPU utilization percentage.

60 changes: 60 additions & 0 deletions docs/blog/posts/metrics-ui.md
@@ -0,0 +1,60 @@
---
title: "Built-in UI for monitoring basic GPU metrics"
date: 2025-04-03
description: "TBA"
slug: metrics-ui
image: https://github.com/dstackai/static-assets/blob/main/static-assets/images/dstack-metrics-ui-v2-min.png?raw=true
categories:
- Monitoring
- AMD
- NVIDIA
---

# Built-in UI for monitoring basic GPU metrics

AI workloads generate vast amounts of metrics, so efficient monitoring tools are essential. While our recent
update introduced the ability to export available metrics to Prometheus for maximum flexibility, there are times when
users need quick access to essential metrics without switching to an external tool.

<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/dstack-metrics-ui-v2-min.png?raw=true" width="630"/>

Previously, we introduced a [CLI command](dstack-metrics.md) that allows users to view basic GPU metrics for both NVIDIA
and AMD hardware. Now, with this latest update, we’re excited to announce the addition of a built-in dashboard within
the `dstack` control plane.

<!-- more -->

The new feature provides an easy-to-use interface for tracking the most essential GPU metrics
directly from the control plane, streamlining the real-time monitoring process without needing any additional tools.

<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/dstack-metrics-ui-dashboard.png?raw=true" width="800">

Additionally, we’ve renamed the CLI command previously known as `dstack stats` to `dstack metrics` for consistency.

<div class="termy">

```shell
$ dstack metrics nccl-tests -w
 NAME        CPU  MEMORY            GPU
 nccl-tests  81%  2754MB/1638400MB  #0 100740MB/144384MB 100% Util
                                    #1 100740MB/144384MB 100% Util
                                    #2 100740MB/144384MB 99% Util
                                    #3 100740MB/144384MB 99% Util
                                    #4 100740MB/144384MB 99% Util
                                    #5 100740MB/144384MB 99% Util
                                    #6 100740MB/144384MB 99% Util
                                    #7 100740MB/144384MB 100% Util
```

</div>
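
If you want to post-process this output in a script, the GPU entries can be pulled out with a small parser. The following is a minimal sketch, not part of `dstack`: it assumes the text layout shown in the sample above (`#<index> <used>MB/<total>MB <util>% Util`), and the names `GpuSample` and `parse_gpu_samples` are hypothetical.

```python
import re
from typing import NamedTuple


class GpuSample(NamedTuple):
    """One GPU entry from the sample output (hypothetical helper type)."""
    index: int
    used_mb: int
    total_mb: int
    util_pct: int


# Matches entries such as "#0 100740MB/144384MB 100% Util".
# The layout is assumed from the sample output; it is not a documented format.
GPU_RE = re.compile(r"#(\d+)\s+(\d+)MB/(\d+)MB\s+(\d+)%\s+Util")


def parse_gpu_samples(text: str) -> list[GpuSample]:
    """Extract every GPU entry found in the given CLI output text."""
    return [GpuSample(*map(int, m.groups())) for m in GPU_RE.finditer(text)]


sample = """\
NAME CPU MEMORY GPU
nccl-tests 81% 2754MB/1638400MB #0 100740MB/144384MB 100% Util
#1 100740MB/144384MB 100% Util
#2 100740MB/144384MB 99% Util
"""

gpus = parse_gpu_samples(sample)
for g in gpus:
    # Report VRAM usage as a fraction of total memory alongside utilization.
    print(f"GPU {g.index}: {g.used_mb / g.total_mb:.1%} VRAM, {g.util_pct}% util")
```

The host memory column (`2754MB/1638400MB`) is skipped because the pattern requires a leading `#<index>`. For anything beyond quick scripting, the API or the Prometheus export is likely a more stable source than parsed terminal text.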

By default, both the control plane and CLI show metrics from the last hour, which is particularly useful for debugging
workloads.

For persistent storage and long-term access to metrics, we still recommend setting up Prometheus to fetch
metrics from `dstack`.

!!! info "What's next?"
    1. See the [Monitoring](../../docs/guides/monitoring.md) guide
    2. Check [dev environments](../../docs/concepts/dev-environments.md), [tasks](../../docs/concepts/tasks.md), [services](../../docs/concepts/services.md), and [fleets](../../docs/concepts/fleets.md)
    3. Join [Discord :material-arrow-top-right-thin:{ .external }](https://discord.gg/u8SmfwPpMd){:target="_blank"}
4 changes: 2 additions & 2 deletions docs/blog/posts/prometheus.md
@@ -17,7 +17,7 @@ Effective AI infrastructure management requires full visibility into compute per
detailed insights into container- and GPU-level performance, while managers rely on cost metrics to track resource usage
across projects.

While `dstack` provides key metrics through its UI and [`dstack metrics`](dstack-stats.md) CLI, teams often need more granular data and prefer
While `dstack` provides key metrics through its UI and [`dstack metrics`](dstack-metrics.md) CLI, teams often need more granular data and prefer
using their own monitoring tools. To support this, we’ve introduced a new endpoint that allows real-time exporting all collected
metrics—covering fleets and runs—directly to Prometheus.

@@ -57,7 +57,7 @@ For a full list of available metrics and labels, check out the [Monitoring](../.

??? info "AMD"
    AMD device metrics are not yet collected for any backends. This support will be available soon. For now, AMD metrics are
    only accessible through the UI and the [`dstack metrics`](dstack-stats.md) CLI.
    only accessible through the UI and the [`dstack metrics`](dstack-metrics.md) CLI.

!!! info "What's next?"
    1. See the [Monitoring](../../docs/guides/monitoring.md) guide
2 changes: 1 addition & 1 deletion docs/docs/guides/protips.md
@@ -312,7 +312,7 @@ The GPU vendor is indicated by one of the following case-insensitive values:

While `dstack` allows the use of any third-party monitoring tools (e.g., Weights and Biases), you can also
monitor container metrics such as CPU, memory, and GPU usage using the [built-in
`dstack metrics` CLI command](../../blog/posts/dstack-stats.md) or the corresponding API.
`dstack metrics` CLI command](../../blog/posts/dstack-metrics.md) or the corresponding API.

## Service quotas

3 changes: 2 additions & 1 deletion mkdocs.yml
@@ -127,10 +127,11 @@ plugins:
        'backends.md': 'partners.md'
        'developers.md': 'community.md'
        'blog/ambassador-program.md': 'blog/archive/ambassador-program.md'
        'blog/monitoring-gpu-usage.md': 'blog/posts/dstack-stats.md'
        'blog/monitoring-gpu-usage.md': 'blog/posts/dstack-metrics.md'
        'blog/inactive-dev-environments-auto-shutdown.md': 'blog/posts/inactivity-duration.md'
        'blog/data-centers-and-private-clouds.md': 'blog/posts/gpu-blocks-and-proxy-jump.md'
        'blog/distributed-training-with-aws-efa.md': 'blog/posts/efa.md'
        'blog/dstack-stats.md': 'blog/posts/dstack-metrics.md'
  - typeset
  - gen-files:
      scripts: # always relative to mkdocs.yml
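
Because this change renames a post and repoints several redirects, a quick sanity check that every redirect target still exists can catch stale entries. The sketch below is illustrative only and is not part of this PR or of MkDocs: `dangling_redirects` is a hypothetical helper, and the `existing` set stands in for whatever file listing you would gather from the docs directory.

```python
def dangling_redirects(redirects: dict[str, str], existing: set[str]) -> dict[str, str]:
    """Return the redirect entries whose target is not among the existing doc paths."""
    return {old: new for old, new in redirects.items() if new not in existing}


# Redirect entries taken from the hunk above (new side of the diff).
redirects = {
    "blog/monitoring-gpu-usage.md": "blog/posts/dstack-metrics.md",
    "blog/dstack-stats.md": "blog/posts/dstack-metrics.md",
}

# Assumed file listing after the rename; in practice this would come from the docs tree.
existing = {"blog/posts/dstack-metrics.md", "blog/posts/metrics-ui.md"}

stale = dangling_redirects(redirects, existing)
print(stale or "all redirect targets exist")
```

With the old target (`blog/posts/dstack-stats.md`) left in place, the same check would report that entry as dangling.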