Commit 939b05a

Add server metrics promql scraping (#804)

* Add server metrics promql scraping
* Add flag and add metrics to json output
* Fix python bool flag logic

1 parent 0294ba3 commit 939b05a

File tree

9 files changed: +113 -13 lines changed


benchmarks/benchmark/tools/profile-generator/README.md (+21 -11)

@@ -1,17 +1,18 @@
 # AI on GKE Benchmark Latency Profile Generator
 
 <!-- TOC -->
-* [AI on GKE Benchmark Latency Profile Generator](#ai-on-gke-benchmark-latency-profile-generator)
-* [Overview](#overview)
-* [Instructions](#instructions)
-* [Step 1: create output bucket](#step-1--create-output-bucket)
-* [Step 2: create and give service account access to write to output gcs bucket](#step-2--create-and-give-service-account-access-to-write-to-output-gcs-bucket)
-* [Step 3: create artifact repository for automated Latency Profile Generator docker build](#step-3--create-artifact-repository-for-automated-latency-profile-generator-docker-build)
-* [Step 4: create and configure terraform.tfvars](#step-4--create-and-configure-terraformtfvars)
-* [[optional] set-up credentials config with kubeconfig](#optional-set-up-credentials-config-with-kubeconfig)
-* [[optional] set up secret token in Secret Manager](#optional-set-up-secret-token-in-secret-manager)
-* [Step 6: terraform initialize, plan and apply](#step-6--terraform-initialize-plan-and-apply)
-* [Inputs](#inputs)
+- [AI on GKE Benchmark Latency Profile Generator](#ai-on-gke-benchmark-latency-profile-generator)
+- [Overview](#overview)
+- [Instructions](#instructions)
+- [Step 1: create output bucket](#step-1-create-output-bucket)
+- [Step 2: create and give service account access to write to output gcs bucket](#step-2-create-and-give-service-account-access-to-write-to-output-gcs-bucket)
+- [\[optional\] give service account access to read Cloud Monitoring metrics](#optional-give-service-account-access-to-read-cloud-monitoring-metrics)
+- [Step 3: create artifact repository for automated Latency Profile Generator docker build](#step-3-create-artifact-repository-for-automated-latency-profile-generator-docker-build)
+- [Step 4: create and configure terraform.tfvars](#step-4-create-and-configure-terraformtfvars)
+- [\[optional\] set-up credentials config with kubeconfig](#optional-set-up-credentials-config-with-kubeconfig)
+- [\[optional\] set up secret token in Secret Manager](#optional-set-up-secret-token-in-secret-manager)
+- [Step 5: login to gcloud](#step-5-login-to-gcloud)
+- [Step 6: terraform initialize, plan and apply](#step-6-terraform-initialize-plan-and-apply)
 <!-- TOC -->
 
 ## Overview
@@ -62,6 +63,15 @@ Your kubernetes service account will inherit the reader permissions.
 You will set the `latency_profile_kubernetes_service_account` in your
 `terraform.tfvars` to the kubernetes service account name.
 
+#### [optional] give service account access to read Cloud Monitoring metrics
+
+If `scrape-server-metrics` is set to True, you will need to give the service account access to read
+the Cloud Monitoring metrics. You can do so with the following command:
+
+```
+gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:$GOOGLE_SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com --role=roles/monitoring.viewer
+```
+
 ### Step 3: create artifact repository for automated Latency Profile Generator docker build
 
 The latency profile generator rebuilds the docker file on each terraform apply

benchmarks/benchmark/tools/profile-generator/container/benchmark_serving.py (+67 -0)

@@ -10,9 +10,13 @@
 from datetime import datetime
 import json
 import random
+import requests
 import time
 from typing import AsyncGenerator, List, Tuple
 
+import google.auth
+import google.auth.transport.requests
+
 import aiohttp
 import numpy as np
 from transformers import AutoTokenizer
@@ -302,6 +306,60 @@ def save_json_results(args: argparse.Namespace, benchmark_result):
   with open(file_name, "w", encoding="utf-8") as outfile:
     json.dump(final_json, outfile)
 
+def metrics_to_scrape(backend: str) -> List[str]:
+  if backend == "vllm":
+    return ["vllm:gpu_cache_usage_perc", "vllm:num_requests_waiting"]
+  elif backend == "jetstream":
+    return ["jetstream_slots_used_percentage", "jetstream_prefill_backlog_size"]
+  else:
+    return []
+
+def print_metrics(metrics: List[str], duration: float, backend: str):
+  # Create a credentials object from the default service account file.
+  # Assumes the script has appropriate default credentials set up, ref:
+  # https://googleapis.dev/python/google-auth/latest/user-guide.html#application-default-credentials
+  credentials, project_id = google.auth.default()
+  # Prepare an authentication request - helps format the request auth token
+  auth_req = google.auth.transport.requests.Request()
+
+  all_metric_results = {}
+
+  for metric in metrics:
+    print("Metric Name: %s" % (metric))
+    metric_results = {}
+    # Each query aggregates the metric over the last $DURATION seconds, scraped via the
+    # backend's PodMonitoring spec, which is assumed to be named "$BACKEND-podmonitoring"
+    queries = {
+      "Mean": "avg_over_time(%s{job='%s-podmonitoring'}[%.0fs])" % (metric, backend, duration),
+      "Median": "quantile_over_time(0.5, %s{job='%s-podmonitoring'}[%.0fs])" % (metric, backend, duration),
+      "Min": "min_over_time(%s{job='%s-podmonitoring'}[%.0fs])" % (metric, backend, duration),
+      "Max": "max_over_time(%s{job='%s-podmonitoring'}[%.0fs])" % (metric, backend, duration),
+      "P90": "quantile_over_time(0.9, %s{job='%s-podmonitoring'}[%.0fs])" % (metric, backend, duration),
+      "P99": "quantile_over_time(0.99, %s{job='%s-podmonitoring'}[%.0fs])" % (metric, backend, duration),
+    }
+    for query_name, query in queries.items():
+      # Refresh the access token before each query
+      credentials.refresh(auth_req)
+
+      # Issue the query against the Cloud Monitoring PromQL endpoint
+      url = 'https://monitoring.googleapis.com/v1/projects/%s/location/global/prometheus/api/v1/query' % (project_id)
+      headers_api = {'Authorization': 'Bearer ' + credentials.token}
+      params = {'query': query}
+      request_post = requests.get(url=url, headers=headers_api, params=params)
+      response = request_post.json()
+
+      # Handle the response
+      if request_post.ok:
+        if response["status"] == "success":
+          metric_results[query_name] = response["data"]["result"][0]["value"][1]
+          print("%s: %s" % (query_name, response["data"]["result"][0]["value"][1]))
+        else:
+          print("Cloud Monitoring PromQL Error: %s" % (response["error"]))
+      else:
+        print("HTTP Error: %s" % (response))
+    all_metric_results[metric] = metric_results
+  return all_metric_results
+
 
 def main(args: argparse.Namespace):
   print(args)
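For reference, each entry in `queries` above becomes one HTTPS call to the Cloud Monitoring PromQL endpoint. A minimal standalone sketch of a single such call, assuming Application Default Credentials with `roles/monitoring.viewer` (per the README step above) and a vLLM PodMonitoring named `vllm-podmonitoring`; the 120s window is a hypothetical stand-in for the measured benchmark duration:

```python
import google.auth
import google.auth.transport.requests
import requests

# Application Default Credentials; project_id is inferred from the environment.
credentials, project_id = google.auth.default()
credentials.refresh(google.auth.transport.requests.Request())

# One of the queries print_metrics builds: mean GPU cache usage over 120s.
query = "avg_over_time(vllm:gpu_cache_usage_perc{job='vllm-podmonitoring'}[120s])"
url = ("https://monitoring.googleapis.com/v1/projects/%s"
       "/location/global/prometheus/api/v1/query" % project_id)
resp = requests.get(url,
                    headers={"Authorization": "Bearer " + credentials.token},
                    params={"query": query})
body = resp.json()
if resp.ok and body["status"] == "success":
    # Each matched series carries a [timestamp, "<value>"] pair; the sample
    # value is a string, which is exactly what print_metrics stores.
    print(body["data"]["result"][0]["value"][1])
```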
@@ -420,6 +478,10 @@ def main(args: argparse.Namespace):
   )
   benchmark_result['avg_output_len'] = avg_output_len
 
+  if args.scrape_server_metrics:
+    server_metrics = print_metrics(metrics_to_scrape(args.backend), benchmark_time, args.backend)
+    benchmark_result['server_metrics'] = server_metrics
+
   if args.save_json_results:
     save_json_results(args, benchmark_result)
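With the flag on, the summaries returned by `print_metrics` land in the saved JSON under a `server_metrics` key. An illustrative shape for the vllm backend (all values hypothetical; as in the Prometheus response, sample values are strings):

```python
# Hypothetical example of benchmark_result["server_metrics"] for backend=vllm.
server_metrics = {
    "vllm:gpu_cache_usage_perc": {
        "Mean": "0.37", "Median": "0.35", "Min": "0.02",
        "Max": "0.81", "P90": "0.66", "P99": "0.79",
    },
    "vllm:num_requests_waiting": {
        "Mean": "3.2", "Median": "3", "Min": "0",
        "Max": "11", "P90": "7", "P99": "10",
    },
}
```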
@@ -545,5 +607,10 @@ def main(args: argparse.Namespace):
       " the form of a string."
     ),
   )
+  parser.add_argument(
+    "--scrape-server-metrics",
+    action="store_true",
+    help="Whether to scrape server metrics.",
+  )
   cmd_args = parser.parse_args()
   main(cmd_args)
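The "Fix python bool flag logic" item in the commit message is reflected in the `action="store_true"` choice above: the flag defaults to False and flips to True only when the bare option is present. The tempting alternative, `type=bool`, misbehaves because argparse passes the raw string through `bool()`, and every non-empty string, including "False", is truthy. A small illustration (`--bool-typed-flag` is hypothetical, shown only for contrast):

```python
import argparse

parser = argparse.ArgumentParser()
# Pitfall: bool("False") is True, so this flag cannot be switched off
# with an explicit value on the command line.
parser.add_argument("--bool-typed-flag", type=bool, default=False)
# What the commit uses: absent -> False, present -> True.
parser.add_argument("--scrape-server-metrics", action="store_true")

args = parser.parse_args(["--bool-typed-flag", "False", "--scrape-server-metrics"])
print(args.bool_typed_flag)        # True, despite "False" on the command line
print(args.scrape_server_metrics)  # True
```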

benchmarks/benchmark/tools/profile-generator/container/latency_throughput_curve.sh (+7 -1)

@@ -19,11 +19,17 @@ export IP=$IP
 
 huggingface-cli login --token "$HF_TOKEN" --add-to-git-credential
 
+PYTHON="python3"
+PYTHON_OPTS="benchmark_serving.py "
 for request_rate in $(echo $REQUEST_RATES | tr ',' ' '); do
   # TODO: Check if profile already exists, if so then skip
   timestamp=$(date +"%Y-%m-%d_%H-%M-%S")
   output_file="latency-profile-${timestamp}.txt"
-  python3 benchmark_serving.py --host="$IP" --port="$PORT" --model="$TOKENIZER" --dataset=ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer="$TOKENIZER" --request-rate=$request_rate --backend="$BACKEND" --num-prompts=$((request_rate * 30)) --max-input-length=$INPUT_LENGTH --max-output-length=$OUTPUT_LENGTH > $output_file
+  PYTHON_OPTS="$PYTHON_OPTS --host=$IP --port=$PORT --model=$TOKENIZER --dataset=ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer=$TOKENIZER --request-rate=$request_rate --backend=$BACKEND --num-prompts=$((request_rate * 30)) --max-input-length=$INPUT_LENGTH --max-output-length=$OUTPUT_LENGTH"
+  if [[ "$SCRAPE_SERVER_METRICS" = "true" ]]; then
+    PYTHON_OPTS="$PYTHON_OPTS --scrape-server-metrics"
+  fi
+  $PYTHON $PYTHON_OPTS > $output_file
   cat $output_file
   sleep 5 # wait 5 seconds before next run
 done

benchmarks/benchmark/tools/profile-generator/container/requirements.txt (+2 -1)

@@ -34,4 +34,5 @@ pydantic >= 2.0 # Required for OpenAI server.
 aioprometheus[starlette]
 pynvml == 11.5.0
 accelerate
-aiohttp
+aiohttp
+google-auth

benchmarks/benchmark/tools/profile-generator/main.tf (+1 -0)

@@ -77,4 +77,5 @@ module "latency-profile" {
   k8s_hf_secret               = var.k8s_hf_secret
   hugging_face_secret         = var.hugging_face_secret
   hugging_face_secret_version = var.hugging_face_secret_version
+  scrape_server_metrics       = var.scrape_server_metrics
 }

benchmarks/benchmark/tools/profile-generator/modules/latency-profile/main.tf (+1 -0)

@@ -61,5 +61,6 @@ resource "kubernetes_manifest" "latency-profile-generator" {
     hugging_face_token_secret_list = local.hugging_face_token_secret == null ? [] : [local.hugging_face_token_secret]
     k8s_hf_secret_list             = var.k8s_hf_secret == null ? [] : [var.k8s_hf_secret]
     output_bucket                  = var.output_bucket
+    scrape_server_metrics          = var.scrape_server_metrics
   }))
 }

benchmarks/benchmark/tools/profile-generator/modules/latency-profile/manifest-templates/latency-profile-generator.yaml.tpl (+2 -0)

@@ -34,6 +34,8 @@ spec:
           value: ${request_rates}
         - name: OUTPUT_BUCKET
           value: ${output_bucket}
+        - name: SCRAPE_SERVER_METRICS
+          value: ${scrape_server_metrics}
 %{ for hugging_face_token_secret in hugging_face_token_secret_list ~}
         - name: HF_TOKEN
           valueFrom:

benchmarks/benchmark/tools/profile-generator/modules/latency-profile/variables.tf (+6 -0)

@@ -153,3 +153,9 @@ variable "hugging_face_secret_version" {
   nullable    = true
   default     = null
 }
+
+variable "scrape_server_metrics" {
+  description = "Whether to scrape server metrics."
+  type        = bool
+  default     = false
+}

benchmarks/benchmark/tools/profile-generator/variables.tf (+6 -0)

@@ -144,4 +144,10 @@ variable "targets" {
     tokenizer = string
   })
   })
+}
+
+variable "scrape_server_metrics" {
+  description = "Whether to scrape server metrics."
+  type        = bool
+  default     = false
 }
