This repository was archived by the owner on Jun 23, 2025. It is now read-only.
Adding GMP and Cloud Monitoring for gke-batch-refarch #856
Merged
arueth merged 4 commits into GoogleCloudPlatform:main from JamesDuncanNz:gke-batch-refarch-gmp on Oct 28, 2024
56 changes: 56 additions & 0 deletions
best-practices/gke-batch-refarch/02_platform/monitoring/gmp/README.md
@@ -0,0 +1,56 @@
# Monitoring Kueue with Google Managed Prometheus and Cloud Monitoring

This document describes how to monitor Kueue metrics using Google Managed Prometheus and Cloud Monitoring.

## Overview

You can configure Google Managed Prometheus to collect Kueue metrics automatically. The collected metrics are exported to Cloud Monitoring, where you can chart and query them.

## Viewing the Dashboard

The Kueue dashboard is available in Cloud Monitoring. It charts key Kueue metrics so that you can quickly assess the health and performance of your Kueue system.

<img src="../../../images/kueue_cloud_monitoring_1.png" width="800">
<img src="../../../images/kueue_cloud_monitoring_2.png" width="800">

## Configuring Managed Collection and Creating the Dashboard

Export the `PROJECT_ID` and `REGION` environment variables, then run the following script to configure managed collection for Kueue and create the dashboard in Cloud Monitoring.

```bash
./install-gmp.sh
```

## Querying Metrics

You can also query Kueue metrics directly in the [Google Cloud Monitoring - Metrics explorer](https://console.cloud.google.com/monitoring/metrics-explorer) interface. Both PromQL and MQL are supported for querying.

For more information, refer to the [Cloud Monitoring Documentation](https://cloud.google.com/monitoring/charts/metrics-explorer).
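
Outside the console, PromQL can also be sent to the Prometheus-compatible HTTP API that Cloud Monitoring exposes. The following is a minimal sketch, assuming `curl` and an authenticated `gcloud` are installed; the helper names `gmp_query_url` and `gmp_query` are hypothetical and not part of this PR.

```bash
# Hypothetical helpers (not part of this PR) for querying Kueue metrics
# through the Prometheus-compatible HTTP API of Cloud Monitoring.

# Print the instant-query endpoint for the project given as $1.
gmp_query_url() {
  printf 'https://monitoring.googleapis.com/v1/projects/%s/location/global/prometheus/api/v1/query\n' "$1"
}

# Run the PromQL instant query $2 against project $1.
# Assumes `gcloud auth print-access-token` returns a valid token.
gmp_query() {
  curl --silent --fail \
    --header "Authorization: Bearer $(gcloud auth print-access-token)" \
    --data-urlencode "query=$2" \
    "$(gmp_query_url "$1")"
}

# Usage (requires network access and credentials):
#   gmp_query "${PROJECT_ID}" 'sum(rate(kueue_admitted_workloads_total[5m])) by (cluster_queue)'
```

The response is the standard Prometheus JSON envelope, so it can be piped into a tool such as `jq` for scripting or alert prototyping.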

### Example Queries

Here are some sample PromQL queries to help you get started with monitoring your Kueue system:

#### Job Throughput

```promql
sum(rate(kueue_admitted_workloads_total[5m])) by (cluster_queue)
```

This query calculates the per-second rate of admitted workloads, averaged over the last 5 minutes, for each cluster queue. Comparing queues helps pinpoint potential bottlenecks; to get the overall system throughput, drop the `by (cluster_queue)` clause.

#### Resource Utilization (requires `metrics.enableClusterQueueResources`)

```promql
sum(kueue_cluster_queue_resource_usage{resource="cpu"}) by (cluster_queue) / sum(kueue_cluster_queue_nominal_quota{resource="cpu"}) by (cluster_queue)
```

This query calculates the ratio of current CPU usage to the nominal CPU quota for each cluster queue. A value close to 1 indicates high CPU utilization. You can adapt it for memory or other resources by changing the `resource` label.

>__Important__: This query requires the `metrics.enableClusterQueueResources` setting to be enabled in your Kueue manager's configuration. To enable this setting, follow the instructions in the Kueue installation documentation: [https://kueue.sigs.k8s.io/docs/installation/#install-a-custom-configured-released-version](https://kueue.sigs.k8s.io/docs/installation/#install-a-custom-configured-released-version)
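
To adapt the utilization query to other resources without editing it by hand, a small generator can substitute the `resource` label. This is only a sketch: the function name `kueue_utilization_query` is hypothetical, and the resource value you pass must match a label value that Kueue actually reports in your cluster.

```bash
# Hypothetical helper (not part of this PR): print the usage-to-quota
# ratio query for one resource. Defaults to "cpu" when no argument is given.
kueue_utilization_query() {
  resource="${1:-cpu}"
  printf 'sum(kueue_cluster_queue_resource_usage{resource="%s"}) by (cluster_queue) / sum(kueue_cluster_queue_nominal_quota{resource="%s"}) by (cluster_queue)\n' \
    "${resource}" "${resource}"
}

# Usage: print the query for memory, ready to paste into Metrics explorer.
kueue_utilization_query memory
```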

#### Queue Wait Times
```promql
histogram_quantile(0.9, sum by (le) (rate(kueue_admission_wait_time_seconds_bucket{cluster_queue="QUEUE_NAME"}[5m])))
```
This query estimates the 90th percentile wait time for workloads in a specific queue, based on observations from the last 5 minutes. Taking the `rate` of the bucket counters and summing by `le` keeps the estimate tied to recent activity instead of the whole lifetime of the counters. You can modify the quantile value (e.g., 0.5 for median, 0.99 for 99th percentile) to understand the wait time distribution. Replace `QUEUE_NAME` with the actual name of the queue you want to monitor.
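
When you track several queues or quantiles, it can help to generate the wait-time query from parameters. The sketch below is illustrative only: the function name `kueue_wait_quantile_query`, the default quantile, and the default 5-minute window are assumptions, and it aggregates bucket rates by `le`, a common PromQL pattern for histograms.

```bash
# Hypothetical helper (not part of this PR): print a wait-time quantile
# query for one cluster queue.
#   $1 = cluster queue name (required)
#   $2 = quantile, default 0.9
#   $3 = rate window, default 5m
kueue_wait_quantile_query() {
  queue="$1"
  quantile="${2:-0.9}"
  window="${3:-5m}"
  printf 'histogram_quantile(%s, sum by (le) (rate(kueue_admission_wait_time_seconds_bucket{cluster_queue="%s"}[%s])))\n' \
    "${quantile}" "${queue}" "${window}"
}

# Usage: median wait time for a queue named "team-a" (placeholder name).
#   kueue_wait_quantile_query team-a 0.5
```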
77 changes: 77 additions & 0 deletions
best-practices/gke-batch-refarch/02_platform/monitoring/gmp/gmp-kueue-monitoring.yaml
@@ -0,0 +1,77 @@
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kueue-metrics-reader
  namespace: kueue-system
automountServiceAccountToken: true
---
apiVersion: v1
kind: Secret
metadata:
  name: kueue-metrics-reader-token
  namespace: kueue-system
  annotations:
    kubernetes.io/service-account.name: kueue-metrics-reader
type: kubernetes.io/service-account-token
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kueue-secret-reader
  namespace: kueue-system
rules:
- resources:
  - secrets
  apiGroups: [""]
  verbs: ["get", "list", "watch"]
  resourceNames: ["kueue-metrics-reader-token"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kueue-metrics-reader
subjects:
- kind: ServiceAccount
  name: kueue-metrics-reader
  namespace: kueue-system
roleRef:
  kind: ClusterRole
  name: kueue-metrics-reader
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: kueue
  namespace: kueue-system
spec:
  selector:
    matchLabels:
      control-plane: controller-manager
  endpoints:
  - port: https
    interval: 30s
    path: /metrics
    scheme: https
    tls:
      insecureSkipVerify: true
    authorization:
      type: Bearer
      credentials:
        secret:
          name: kueue-metrics-reader-token
          key: token
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gmp-system:collector:kueue-secret-reader
  namespace: kueue-system
roleRef:
  name: kueue-secret-reader
  kind: Role
  apiGroup: rbac.authorization.k8s.io
subjects:
- name: collector
  namespace: gmp-system
  kind: ServiceAccount
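
After applying the manifest, two quick checks can confirm that the pieces line up: the `PodMonitoring` resource exists, and the GMP collector service account is allowed to read the token Secret. The following is a sketch rather than part of this PR; the function name `verify_kueue_monitoring` is hypothetical, the resource names are taken from the manifest above, and the `--as` impersonation only works if your own account has impersonation rights in the cluster.

```bash
# Hypothetical post-install checks (not part of this PR).
# Returns non-zero as soon as one check fails.
verify_kueue_monitoring() {
  # The PodMonitoring resource should exist in the kueue-system namespace.
  kubectl get podmonitoring kueue -n kueue-system || return 1

  # The GMP collector should be allowed to read the bearer-token Secret.
  # `kubectl auth can-i` prints yes/no and exits non-zero on "no".
  kubectl auth can-i get secret/kueue-metrics-reader-token \
    -n kueue-system \
    --as=system:serviceaccount:gmp-system:collector || return 1
}

# Usage (requires access to the cluster):
#   verify_kueue_monitoring && echo "Kueue monitoring resources look consistent."
```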
8 changes: 8 additions & 0 deletions
best-practices/gke-batch-refarch/02_platform/monitoring/gmp/install-gmp.sh
@@ -0,0 +1,8 @@
[[ ! "${PROJECT_ID}" ]] && echo -e "Please export PROJECT_ID variable (\e[95mexport PROJECT_ID=<YOUR PROJECT ID>\e[0m)\nExiting." && exit 1

echo -e "\e[95mPROJECT_ID is set to ${PROJECT_ID}\e[0m"

[[ ! "${REGION}" ]] && echo -e "Please export REGION variable (\e[95mexport REGION=<YOUR REGION, e.g. us-central1>\e[0m)\nExiting." && exit 1
echo -e "\e[95mREGION is set to ${REGION}\e[0m"

kubectl apply -f gmp-kueue-monitoring.yaml && \
gcloud monitoring dashboards create --project="${PROJECT_ID}" --config-from-file=kueue-dashboard.json
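
The two environment checks in the script follow the same pattern, so they could be factored into one helper that reports a missing variable and signals failure with a non-zero status. This is a sketch under stated assumptions: the function name `require_env` is hypothetical and the message wording is simplified (no color codes) compared with the script.

```bash
# Hypothetical helper (not part of this PR): succeed and echo the value when
# the named variable is set and non-empty; otherwise print a hint to stderr
# and return 1. The variable name is expected to come from the script itself,
# not from user input, because it is passed through eval.
require_env() {
  name="$1"
  eval "value=\${${name}:-}"
  if [ -z "${value}" ]; then
    echo "Please export ${name} (export ${name}=<value>). Exiting." >&2
    return 1
  fi
  echo "${name} is set to ${value}"
}

# Usage at the top of an install script:
#   require_env PROJECT_ID || exit 1
#   require_env REGION || exit 1
```

Returning a non-zero status lets callers such as CI pipelines detect that the installation did not run.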