Adding GMP and Cloud Monitoring for gke-batch-refarch #856

Merged
merged 4 commits into from
Oct 28, 2024
@@ -0,0 +1,56 @@
# Monitoring Kueue with Google Managed Prometheus and Cloud Monitoring

This document describes how to monitor Kueue metrics using Google Managed Prometheus and Cloud Monitoring.

## Overview

You can configure Google Managed Prometheus to automatically collect Kueue metrics. The collected metrics are then exported and made available in Google Cloud's Monitoring service.

## Viewing the Dashboard

The Kueue dashboard is available in Google Cloud Monitoring. This dashboard provides a visual representation of key Kueue metrics, allowing you to quickly assess the health and performance of your Kueue system.

<img src="../../../images/kueue_cloud_monitoring_1.png" width="800">
<img src="../../../images/kueue_cloud_monitoring_2.png" width="800">

## Configuring Managed Collection and Creating the Dashboard

Run the following command to configure managed collection for Kueue and create the dashboard in Cloud Monitoring. The script requires the `PROJECT_ID` and `REGION` environment variables to be exported.

```bash
./install-gmp.sh
```
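
The install script in this change exits early unless `PROJECT_ID` and `REGION` are set. As a minimal pre-flight sketch, the check can be written as a reusable function; `require_env` is a hypothetical helper introduced here for illustration and is not part of this repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository): report which of the
# named environment variables are unset or empty.
require_env() {
  local missing=0 name
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable named by $name.
    if [[ -z "${!name:-}" ]]; then
      echo "Missing required variable: ${name}" >&2
      missing=1
    fi
  done
  # Non-zero return when at least one variable is missing.
  return "${missing}"
}
```

For example, `require_env PROJECT_ID REGION && ./install-gmp.sh` runs the installer only when both variables are set.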

## Querying Metrics

You can also query Kueue metrics directly using the [Google Cloud Monitoring - Metrics explorer](https://console.cloud.google.com/monitoring/metrics-explorer) interface. Both PromQL and MQL are supported for querying.

For more information, refer to the [Cloud Monitoring Documentation](https://cloud.google.com/monitoring/charts/metrics-explorer).
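
Outside the console, the same PromQL can be sent to the Prometheus-compatible HTTP API that Managed Service for Prometheus exposes. The sketch below is an assumption-laden illustration: the function names `gmp_query_url` and `gmp_query` are hypothetical, and the endpoint path follows the pattern in the Managed Service for Prometheus documentation and should be verified against it for your project.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the Prometheus-compatible instant-query URL
# for a project. The path is an assumption based on the Managed Service
# for Prometheus documentation; verify it before relying on it.
gmp_query_url() {
  local project_id="$1"
  echo "https://monitoring.googleapis.com/v1/projects/${project_id}/location/global/prometheus/api/v1/query"
}

# Hypothetical helper: run an instant PromQL query.
# Requires curl and an authenticated gcloud CLI.
gmp_query() {
  local project_id="$1" query="$2"
  curl -sS "$(gmp_query_url "${project_id}")" \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    --data-urlencode "query=${query}"
}
```

A call such as `gmp_query "${PROJECT_ID}" 'sum(rate(kueue_admitted_workloads_total[5m])) by (cluster_queue)'` should return the result as JSON in the standard Prometheus response format.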

### Example Queries

Here are some sample PromQL queries to help you get started with monitoring your Kueue system:

#### Job Throughput

```promql
sum(rate(kueue_admitted_workloads_total[5m])) by (cluster_queue)
```

This query calculates the per-second rate of admitted workloads over a 5-minute window for each cluster queue. Comparing the per-queue rates helps pinpoint potential bottlenecks. To get the overall system throughput instead, drop the `by (cluster_queue)` clause.

#### Resource Utilization (requires `metrics.enableClusterQueueResources`)

```promql
sum(kueue_cluster_queue_resource_usage{resource="cpu"}) by (cluster_queue) / sum(kueue_cluster_queue_nominal_quota{resource="cpu"}) by (cluster_queue)
```

This query calculates the ratio of current CPU usage to the nominal CPU quota for each queue. A value close to 1 indicates high CPU utilization. You can adapt this for memory or other resources by changing the resource label.
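
To adapt the query to another resource without editing it by hand, the resource label can be templated. This is a small sketch; `kueue_utilization_query` is a hypothetical helper name, and the metric names are taken unchanged from the query above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the utilization query for a given resource
# label (defaults to "cpu"). Metric names are copied from the query above.
kueue_utilization_query() {
  local resource="${1:-cpu}"
  printf 'sum(kueue_cluster_queue_resource_usage{resource="%s"}) by (cluster_queue) / sum(kueue_cluster_queue_nominal_quota{resource="%s"}) by (cluster_queue)\n' \
    "${resource}" "${resource}"
}
```

For example, `kueue_utilization_query memory` prints the same ratio for the `memory` resource, ready to paste into Metrics explorer.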

>__Important__: This query requires the `metrics.enableClusterQueueResources` setting to be enabled in your Kueue manager's configuration. To enable this setting, follow the instructions in the [Kueue installation documentation](https://kueue.sigs.k8s.io/docs/installation/#install-a-custom-configured-released-version).
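
One way to check whether the setting is already enabled is to inspect the manager configuration stored in the cluster. The sketch below assumes a default Kueue installation, where the configuration lives in the `kueue-manager-config` ConfigMap under the `controller_manager_config.yaml` key; both names are assumptions and may differ in a customized install. `cq_resource_metrics_enabled` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read a Kueue Configuration YAML on stdin and succeed
# only if it contains an uncommented "enableClusterQueueResources: true" line.
cq_resource_metrics_enabled() {
  grep -Eq '^[[:space:]]*enableClusterQueueResources:[[:space:]]*true[[:space:]]*$'
}

# Example usage (ConfigMap name and key are assumptions about a default install):
#   kubectl -n kueue-system get configmap kueue-manager-config \
#     -o jsonpath='{.data.controller_manager_config\.yaml}' \
#     | cq_resource_metrics_enabled && echo "enabled" || echo "not enabled"
```

This is a plain text match rather than a YAML parse, so it only confirms that the line is present, not which section of the configuration it sits in.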

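#### Queue Wait Times

```promql
histogram_quantile(0.9, sum by (le) (rate(kueue_admission_wait_time_seconds_bucket{cluster_queue="QUEUE_NAME"}[5m])))
```

This query provides the 90th percentile wait time for workloads admitted in a specific queue over the last 5 minutes. Wrapping the bucket counters in `rate()` limits the estimate to recent observations; without it, the quantile covers everything recorded since the counters were last reset. You can modify the quantile value (e.g., 0.5 for median, 0.99 for 99th percentile) to understand the wait time distribution. Replace `QUEUE_NAME` with the actual name of the queue you want to monitor.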
@@ -0,0 +1,77 @@
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kueue-metrics-reader
  namespace: kueue-system
automountServiceAccountToken: true
---
apiVersion: v1
kind: Secret
metadata:
  name: kueue-metrics-reader-token
  namespace: kueue-system
  annotations:
    kubernetes.io/service-account.name: kueue-metrics-reader
type: kubernetes.io/service-account-token
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kueue-secret-reader
  namespace: kueue-system
rules:
- resources:
  - secrets
  apiGroups: [""]
  verbs: ["get", "list", "watch"]
  resourceNames: ["kueue-metrics-reader-token"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kueue-metrics-reader
subjects:
- kind: ServiceAccount
  name: kueue-metrics-reader
  namespace: kueue-system
roleRef:
  kind: ClusterRole
  name: kueue-metrics-reader
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: kueue
  namespace: kueue-system
spec:
  selector:
    matchLabels:
      control-plane: controller-manager
  endpoints:
  - port: https
    interval: 30s
    path: /metrics
    scheme: https
    tls:
      insecureSkipVerify: true
    authorization:
      type: Bearer
      credentials:
        secret:
          name: kueue-metrics-reader-token
          key: token
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gmp-system:collector:kueue-secret-reader
  namespace: kueue-system
roleRef:
  name: kueue-secret-reader
  kind: Role
  apiGroup: rbac.authorization.k8s.io
subjects:
- name: collector
  namespace: gmp-system
  kind: ServiceAccount
@@ -0,0 +1,8 @@
[[ -z "${PROJECT_ID}" ]] && echo -e "Please export PROJECT_ID variable (\e[95mexport PROJECT_ID=<YOUR PROJECT ID>\e[0m)\nExiting." && exit 1
echo -e "\e[95mPROJECT_ID is set to ${PROJECT_ID}\e[0m"

[[ -z "${REGION}" ]] && echo -e "Please export REGION variable (\e[95mexport REGION=<YOUR REGION, eg: us-central1>\e[0m)\nExiting." && exit 1
echo -e "\e[95mREGION is set to ${REGION}\e[0m"

kubectl apply -f gmp-kueue-monitoring.yaml && \
gcloud monitoring dashboards create --project="${PROJECT_ID}" --config-from-file=kueue-dashboard.json