|
| 1 | +# Monitoring Kueue with Google Managed Prometheus and Cloud Monitoring |
| 2 | + |
| 3 | +This document describes how to monitor Kueue metrics using Google Managed Prometheus and Cloud Monitoring. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +You can configure Google Managed Prometheus to automatically collect Kueue metrics. The collected metrics are then exported and made available in Google Cloud's Monitoring service. |
| 8 | + |
| 9 | +## Viewing the Dashboard |
| 10 | + |
| 11 | +The Kueue dashboard is available in Google Cloud Monitoring. This dashboard provides a visual representation of key Kueue metrics, allowing you to quickly assess the health and performance of your Kueue system. |
| 12 | + |
| 13 | +<img src="../../../images/kueue_cloud_monitoring_1.png" width="800"> |
| 14 | +<img src="../../../images/kueue_cloud_monitoring_2.png" width="800"> |
| 15 | + |
| 16 | +## Configuring Managed Collection and Creating the Dashboard |
| 17 | + |
| 18 | +Run the following command to configure managed collection for Kueue and create the dashboard in Cloud Monitoring. |
| 19 | + |
| 20 | +```bash |
| 21 | +./install-gmp.sh |
| 22 | +``` |
| 23 | + |
| 24 | +## Querying Metrics |
| 25 | + |
| 26 | +You can also query Kueue metrics directly using the [Google Cloud Monitoring - Metrics explorer](https://console.cloud.google.com/monitoring/metrics-explorer) interface. Both PromQL and MQL are supported for querying. |
| 27 | + |
| 28 | +For more information, refer to the [Cloud Monitoring Documentation](https://cloud.google.com/monitoring/charts/metrics-explorer). |
| 29 | + |
| 30 | +### Example Queries |
| 31 | + |
| 32 | +Here are some sample PromQL queries to help you get started with monitoring your Kueue system: |
| 33 | + |
| 34 | +#### Job Throughput |
| 35 | + |
| 36 | +```promql |
| 37 | +sum(rate(kueue_admitted_workloads_total[5m])) by (cluster_queue) |
| 38 | +``` |
| 39 | + |
| 40 | +This query calculates the per-second rate of admitted workloads, averaged over the last 5 minutes, for each cluster queue. The per-queue breakdown helps pinpoint potential bottlenecks; dropping the `by (cluster_queue)` clause sums across all queues and gives the overall system throughput. |
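
As a sketch of the overall figure, the same expression without the `by (cluster_queue)` grouping aggregates across every queue. The 5-minute window is carried over from the query above and is only a starting point:

```promql
# Overall admission throughput across all cluster queues (workloads per second).
sum(rate(kueue_admitted_workloads_total[5m]))
```
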
| 41 | + |
| 42 | +#### Resource Utilization (requires `metrics.enableClusterQueueResources`) |
| 43 | + |
| 44 | +```promql |
| 45 | +sum(kueue_cluster_queue_resource_usage{resource="cpu"}) by (cluster_queue) / sum(kueue_cluster_queue_nominal_quota{resource="cpu"}) by (cluster_queue) |
| 46 | +``` |
| 47 | + |
| 48 | +This query calculates the ratio of current CPU usage to the nominal CPU quota for each queue. A value close to 1 indicates high CPU utilization. You can adapt this for memory or other resources by changing the resource label. |
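
For instance, a memory variant might look like the following sketch. It assumes the `resource` label carries the Kubernetes resource name `memory`, which is worth confirming against the label values shown in Metrics explorer:

```promql
# Ratio of memory usage to nominal memory quota, per cluster queue.
sum(kueue_cluster_queue_resource_usage{resource="memory"}) by (cluster_queue)
  / sum(kueue_cluster_queue_nominal_quota{resource="memory"}) by (cluster_queue)
```
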
| 49 | + |
| 50 | +>__Important__: This query requires the `metrics.enableClusterQueueResources` setting to be enabled in your Kueue manager's configuration. To enable it, follow the instructions in the [Kueue installation documentation](https://kueue.sigs.k8s.io/docs/installation/#install-a-custom-configured-released-version). |
| 51 | + |
| 52 | +#### Queue Wait Times |
| 53 | +```promql |
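| 54 | +histogram_quantile(0.9, sum(rate(kueue_admission_wait_time_seconds_bucket{cluster_queue="QUEUE_NAME"}[5m])) by (le)) |
| 55 | +``` |
| 56 | +This query estimates the 90th percentile admission wait time, over the last 5 minutes, for workloads in a specific queue. The bucket counters are wrapped in `rate()` and summed by the `le` label so that the quantile reflects recent activity rather than everything recorded since the counters started. You can modify the quantile value (for example, 0.5 for the median or 0.99 for the 99th percentile) to understand the wait time distribution. Replace `QUEUE_NAME` with the name of the queue you want to monitor. |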