
Commit eae9604

andyi2it and yuzisun authored

Document for autoscaling by using opentelemetry collector (#459)

* document for autoscaling by using opentelemetry collector
* Update autoscaling_keda.md

Signed-off-by: Andrews Arokiam <[email protected]>
Co-authored-by: Dan Sun <[email protected]>

1 parent afb6b16 commit eae9604

File tree

1 file changed: +108 −1

docs/modelserving/autoscaling/keda/autoscaling_keda.md

@@ -115,7 +115,15 @@ kubectl describe scaledobject sklearn-v2-iris-predictor

## Scale using Metrics

InferenceService scaling can be achieved in two ways:

- **Metrics via Prometheus**: scale based on Large Language Model (LLM) metrics collected in Prometheus.
- **Metrics via OpenTelemetry**: collect pod-level metrics (including LLM metrics) with OpenTelemetry, export them to the keda-otel-add-on gRPC endpoint, and use KEDA's external scaler for autoscaling.

## Autoscale based on metrics from Prometheus

Scale an InferenceService in Kubernetes using LLM (Large Language Model) metrics collected in Prometheus. The setup leverages KServe with KEDA for autoscaling based on custom [Prometheus metrics](../../../modelserving/observability/prometheus_metrics.md).
@@ -271,3 +279,102 @@ huggingface-fbopt-predictor-58f9c58b85-l69f7 1/1 Running

## Autoscale by using OpenTelemetry Collector

[KEDA (Kubernetes Event-driven Autoscaler)](https://keda.sh) traditionally polls its trigger sources, such as Prometheus, the Kubernetes API, and external event sources. While effective, polling can introduce latency and additional load on the cluster. The [otel-add-on](https://github.com/kedify/otel-add-on) enables OpenTelemetry-based push metrics for more efficient, real-time autoscaling, reducing the overhead associated with frequent polling.
### Prerequisites

1. Kubernetes cluster with KServe installed.

2. [KEDA](https://keda.sh/docs/2.9/deploy/#install) installed for event-driven autoscaling.

3. [OpenTelemetry Operator](https://github.com/open-telemetry/opentelemetry-operator) installed.

4. [kedify otel-add-on](https://github.com/kedify/otel-add-on) installed with the validation webhook disabled: certain metric names, such as the vLLM pattern `vllm:num_requests_running`, do not comply with the validation constraints the webhook enforces.

=== "kubectl"
    ```
    helm upgrade -i kedify-otel oci://ghcr.io/kedify/charts/otel-add-on --version=v0.0.6 --namespace keda --wait --set validatingAdmissionPolicy.enabled=false
    ```
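To see why such metric names trip a validation webhook, consider a typical identifier rule that allows only letters, digits, and underscores. The pattern below is an assumption for illustration; the otel-add-on webhook's actual constraint may differ, but the colon in Prometheus-style `vllm:*` names is the usual culprit:

```python
import re

# Assumed identifier rule: letters, digits, and underscores only.
# The real webhook constraint may differ; this only illustrates why
# colon-qualified vLLM metric names commonly fail validation.
IDENTIFIER = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")

def is_valid(name: str) -> bool:
    """Return True if the metric name matches the assumed pattern."""
    return IDENTIFIER.fullmatch(name) is not None

print(is_valid("num_requests_running"))       # plain name passes -> True
print(is_valid("vllm:num_requests_running"))  # colon fails the rule -> False
```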
### Create `InferenceService`

```yaml
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-fbopt
  annotations:
    serving.kserve.io/deploymentMode: "RawDeployment"
    serving.kserve.io/autoscalerClass: "keda"
    sidecar.opentelemetry.io/inject: "huggingface-fbopt-predictor"
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=fbopt
        - --model_id=facebook/opt-125m
      resources:
        limits:
          cpu: "1"
          memory: 4Gi
        requests:
          cpu: "1"
          memory: 4Gi
    minReplicas: 1
    maxReplicas: 5
    autoScaling:
      metrics:
        - type: PodMetric
          podmetric:
            metric:
              backend: "opentelemetry"
              metricNames:
                - vllm:num_requests_running
              query: "vllm:num_requests_running"
            target:
              type: Value
              value: "4"
EOF
```
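The `target` value of `"4"` feeds KEDA's HPA-driven scaling between `minReplicas` and `maxReplicas`. As a rough sketch of the standard HPA rule (desired = ceil(currentReplicas × currentMetric / target), clamped to the configured bounds; the function name here is illustrative, not part of KServe or KEDA):

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target: float, min_replicas: int, max_replicas: int) -> int:
    """Sketch of the HPA scaling rule: scale proportionally to
    metric/target, then clamp to the configured replica bounds."""
    desired = math.ceil(current_replicas * current_metric / target)
    return max(min_replicas, min(max_replicas, desired))

# With target value 4 and minReplicas=1, maxReplicas=5 as in the spec above:
print(desired_replicas(1, 10, 4, 1, 5))  # 10 running requests on 1 pod -> 3
print(desired_replicas(3, 2, 4, 1, 5))   # load drops -> 2
```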
The `sidecar.opentelemetry.io/inject` annotation ensures that an OpenTelemetry Collector runs as a sidecar container in the InferenceService pod. The collector gathers pod-level metrics and forwards them to the `otel-add-on` gRPC endpoint, which in turn lets KEDA's `ScaledObject` use these metrics for autoscaling decisions. The annotation value must follow the pattern `<inferenceservice-name>-predictor`.
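For orientation, a sidecar-mode `OpenTelemetryCollector` resource wired up this way might look roughly like the following sketch. The scrape port, job name, and the otel-add-on service address are assumptions for illustration, not values taken from this commit:

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: huggingface-fbopt-predictor
spec:
  mode: sidecar
  config:
    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: vllm                       # assumed job name
              scrape_interval: 5s
              static_configs:
                - targets: ["localhost:8080"]      # assumed metrics port
    exporters:
      otlp:
        endpoint: keda-otel-scaler.keda.svc:4317   # assumed otel-add-on gRPC service
        tls:
          insecure: true
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [otlp]
```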
!!! success "Expected Output"

    ```{ .bash .no-copy }
    $ inferenceservice.serving.kserve.io/huggingface-fbopt created
    ```
Check KEDA `ScaledObject`:

=== "kubectl"
    ```
    kubectl get scaledobjects huggingface-fbopt-predictor
    ```

!!! success "Expected Output"

    ```{ .bash .no-copy }
    NAME                          SCALETARGETKIND      SCALETARGETNAME               MIN   MAX   TRIGGERS     AUTHENTICATION   READY   ACTIVE   FALLBACK   PAUSED    AGE
    huggingface-fbopt-predictor   apps/v1.Deployment   huggingface-fbopt-predictor   1     5     prometheus                    True    False    False      Unknown   32m
    ```

Check `OpenTelemetryCollector`:

=== "kubectl"
    ```
    kubectl get opentelemetrycollector huggingface-fbopt-predictor
    ```

!!! success "Expected Output"

    ```{ .bash .no-copy }
    NAME                          MODE      VERSION   READY   AGE   IMAGE   MANAGEMENT
    huggingface-fbopt-predictor   sidecar   0.120.0           8h            managed
    ```
