
Commit ded2bcb

Adding GMP and Cloud Monitoring for gke-batch-refarch (GoogleCloudPlatform#856)
* Adding GMP and Cloud Monitoring for gke-batch-refarch

* Adding Images

* Apply suggestions from code review

Co-authored-by: Aldo Culquicondor <[email protected]>

* Updates to comments

---------

Co-authored-by: Aldo Culquicondor <[email protected]>
1 parent bfe7670 commit ded2bcb


6 files changed

+855
-0
lines changed

@@ -0,0 +1,56 @@
# Monitoring Kueue with Google Managed Prometheus and Cloud Monitoring

This document describes how to monitor Kueue metrics using Google Managed Prometheus and Cloud Monitoring.

## Overview

You can configure Google Managed Prometheus to collect Kueue metrics automatically. The collected metrics are exported to Cloud Monitoring, where you can chart and query them.

## Viewing the Dashboard

Once created, the Kueue dashboard is available in Google Cloud Monitoring. It charts key Kueue metrics so that you can quickly assess the health and performance of your Kueue system.

<img src="../../../images/kueue_cloud_monitoring_1.png" width="800">
<img src="../../../images/kueue_cloud_monitoring_2.png" width="800">

## Configuring Managed Collection and Creating the Dashboard

Run the following command to configure managed collection for Kueue and create the dashboard in Cloud Monitoring. The script expects the `PROJECT_ID` and `REGION` environment variables to be exported.

```bash
./install-gmp.sh
```

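The installation script checks that `PROJECT_ID` and `REGION` are exported before it applies anything. The following is a minimal preflight sketch, assuming bash. The `require_env` helper and the example values `my-project` and `us-central1` are hypothetical and are not part of `install-gmp.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical preflight helper; not part of install-gmp.sh.
# Returns 0 when every named variable is set and non-empty, 1 otherwise.
require_env() {
  local missing=0 name
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable called $name.
    if [[ -z "${!name:-}" ]]; then
      echo "Missing required variable: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Example values only; replace them with your own project and region.
export PROJECT_ID="my-project"
export REGION="us-central1"

if require_env PROJECT_ID REGION; then
  echo "Environment looks complete; run ./install-gmp.sh next."
fi
```

Running a check like this first surfaces a missing variable before any `kubectl` or `gcloud` command is attempted.
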
## Querying Metrics

You can also query Kueue metrics directly using the [Google Cloud Monitoring - Metrics explorer](https://console.cloud.google.com/monitoring/metrics-explorer) interface. Both PromQL and MQL are supported for querying.

For more information, refer to the [Cloud Monitoring Documentation](https://cloud.google.com/monitoring/charts/metrics-explorer).

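If you prefer the command line to Metrics explorer, the same PromQL can be sent to the Prometheus-compatible HTTP API that Cloud Monitoring exposes. This is a hedged sketch: the function names `gmp_query_url` and `gmp_query` are hypothetical, and the second function assumes `curl` and an authenticated `gcloud` are installed.

```bash
#!/usr/bin/env bash
# Build the Prometheus-compatible instant-query endpoint for a project.
gmp_query_url() {
  local project_id="$1"
  printf 'https://monitoring.googleapis.com/v1/projects/%s/location/global/prometheus/api/v1/query' "${project_id}"
}

# Run an instant PromQL query and print the JSON response.
# Assumes curl is available and gcloud can mint an access token.
gmp_query() {
  local project_id="$1" query="$2"
  curl -sS -G "$(gmp_query_url "${project_id}")" \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    --data-urlencode "query=${query}"
}

# Usage (not executed here):
#   gmp_query "${PROJECT_ID}" 'sum(rate(kueue_admitted_workloads_total[5m])) by (cluster_queue)'
```

The helpers are only defined, not called, so sourcing the file has no side effects.
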
### Example Queries

Here are some sample PromQL queries to help you get started with monitoring your Kueue system:

#### Job Throughput

```promql
sum(rate(kueue_admitted_workloads_total[5m])) by (cluster_queue)
```

This query calculates the per-second rate at which workloads were admitted, averaged over the last 5 minutes, for each cluster queue. Dropping the `by (cluster_queue)` clause gives the overall system throughput, while the per-queue breakdown helps pinpoint potential bottlenecks.

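To make the unit concrete, the arithmetic behind a per-second rate can be sketched as follows. This is a simplification: PromQL's `rate()` also handles counter resets and extrapolates to the window edges, which this hypothetical `counter_rate` helper ignores. The sample values are made up for illustration.

```bash
#!/usr/bin/env bash
# Approximate per-second rate from two counter samples.
# Simplified: ignores counter resets and the extrapolation that rate() performs.
counter_rate() {
  local first="$1" last="$2" window_seconds="$3"
  awk -v a="${first}" -v b="${last}" -v w="${window_seconds}" \
    'BEGIN { printf "%.3f\n", (b - a) / w }'
}

# A counter that grew from 120 to 180 admitted workloads over 300 seconds:
counter_rate 120 180 300   # prints 0.200
```

A result of 0.200 workloads per second corresponds to 12 admitted workloads per minute.
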
#### Resource Utilization (requires `metrics.enableClusterQueueResources`)

```promql
sum(kueue_cluster_queue_resource_usage{resource="cpu"}) by (cluster_queue) / sum(kueue_cluster_queue_nominal_quota{resource="cpu"}) by (cluster_queue)
```

This query calculates the ratio of current CPU usage to the nominal CPU quota for each queue. A value close to 1 indicates that the queue is using nearly all of its nominal CPU quota. You can adapt the query for memory or other resources by changing the value of the `resource` label.

>__Important__: This query requires the `metrics.enableClusterQueueResources` setting to be enabled in your Kueue manager's configuration. To enable this setting, follow the instructions in the Kueue installation documentation: [https://kueue.sigs.k8s.io/docs/installation/#install-a-custom-configured-released-version](https://kueue.sigs.k8s.io/docs/installation/#install-a-custom-configured-released-version)

#### Queue Wait Times

```promql
histogram_quantile(0.9, kueue_admission_wait_time_seconds_bucket{cluster_queue="QUEUE_NAME"})
```

This query provides the 90th percentile wait time for workloads in a specific queue. You can modify the quantile value (e.g., 0.5 for median, 0.99 for 99th percentile) to understand the wait time distribution. Replace `QUEUE_NAME` with the actual name of the queue you want to monitor.
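
The query above reads the raw cumulative buckets, so the quantile covers every observation recorded since the counters started. A common PromQL variation, offered here as an assumption rather than as the authors' recommendation, wraps the buckets in `rate(...[5m])` and sums by `le` so that the quantile reflects only the recent window. The sketch below generates that variant for any quantile and queue. The `wait_time_query` helper and the queue name `team-a` are hypothetical.

```bash
#!/usr/bin/env bash
# Print a windowed wait-time quantile query for one cluster queue.
# $1: quantile between 0 and 1, $2: cluster queue name.
wait_time_query() {
  local quantile="$1" queue="$2"
  printf 'histogram_quantile(%s, sum by (le) (rate(kueue_admission_wait_time_seconds_bucket{cluster_queue="%s"}[5m])))\n' \
    "${quantile}" "${queue}"
}

# Median wait time for a hypothetical queue named team-a:
wait_time_query 0.5 team-a
```

Paste the printed query into Metrics explorer in PromQL mode to chart it.
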
@@ -0,0 +1,77 @@
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kueue-metrics-reader
  namespace: kueue-system
automountServiceAccountToken: true
---
apiVersion: v1
kind: Secret
metadata:
  name: kueue-metrics-reader-token
  namespace: kueue-system
  annotations:
    kubernetes.io/service-account.name: kueue-metrics-reader
type: kubernetes.io/service-account-token
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kueue-secret-reader
  namespace: kueue-system
rules:
- resources:
  - secrets
  apiGroups: [""]
  verbs: ["get", "list", "watch"]
  resourceNames: ["kueue-metrics-reader-token"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kueue-metrics-reader
subjects:
- kind: ServiceAccount
  name: kueue-metrics-reader
  namespace: kueue-system
roleRef:
  kind: ClusterRole
  name: kueue-metrics-reader
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: kueue
  namespace: kueue-system
spec:
  selector:
    matchLabels:
      control-plane: controller-manager
  endpoints:
  - port: https
    interval: 30s
    path: /metrics
    scheme: https
    tls:
      insecureSkipVerify: true
    authorization:
      type: Bearer
      credentials:
        secret:
          name: kueue-metrics-reader-token
          key: token
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gmp-system:collector:kueue-secret-reader
  namespace: kueue-system
roleRef:
  name: kueue-secret-reader
  kind: Role
  apiGroup: rbac.authorization.k8s.io
subjects:
- name: collector
  namespace: gmp-system
  kind: ServiceAccount
@@ -0,0 +1,22 @@
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Exit with a non-zero status when a required variable is missing.
[[ ! "${PROJECT_ID}" ]] && echo -e "Please export PROJECT_ID variable (\e[95mexport PROJECT_ID=<YOUR PROJECT ID>\e[0m)\nExiting." && exit 1
echo -e "\e[95mPROJECT_ID is set to ${PROJECT_ID}\e[0m"

[[ ! "${REGION}" ]] && echo -e "Please export REGION variable (\e[95mexport REGION=<YOUR REGION, eg: us-central1>\e[0m)\nExiting." && exit 1
echo -e "\e[95mREGION is set to ${REGION}\e[0m"

# Apply the monitoring manifests, then create the dashboard only if that succeeds.
kubectl apply -f gmp-kueue-monitoring.yaml && \
  gcloud monitoring dashboards create --project="${PROJECT_ID}" --config-from-file=kueue-dashboard.json
