InferenceService scaling can be achieved in two ways:
- **Using Metrics via Prometheus**: Scale based on Large Language Model (LLM) metrics collected in Prometheus.
- **Using Metrics via OpenTelemetry**: Collect pod-level metrics (including LLM metrics) using OpenTelemetry, export them to the keda-otel-add-on gRPC endpoint, and use KEDA's external scaler for autoscaling.
## Autoscale based on metrics from Prometheus
Scale an InferenceService in Kubernetes using LLM metrics collected in Prometheus.
The setup leverages KServe with KEDA for autoscaling based on custom [Prometheus metrics](../../../modelserving/observability/prometheus_metrics.md).
[KEDA (Kubernetes Event-driven Autoscaler)](https://keda.sh) traditionally uses a polling mechanism to monitor trigger sources like Prometheus, Kubernetes API, and external event sources. While effective, polling can introduce latency and additional load on the cluster. The [otel-add-on](https://github.com/kedify/otel-add-on) enables OpenTelemetry-based push metrics for more efficient and real-time autoscaling, reducing the overhead associated with frequent polling.
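As a sketch of the Prometheus-based approach, a KEDA `ScaledObject` can target the predictor deployment with a `prometheus` trigger. The Prometheus address, replica bounds, and threshold below are illustrative assumptions, not values prescribed by this guide; only the metric name and the `huggingface-fbopt-predictor` target match the example used later on this page:

```yaml
# Hypothetical example: scale the predictor on in-flight vLLM requests.
# serverAddress, replica counts, and threshold are assumptions; adjust them
# to your cluster and to the metrics your model server actually exposes.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: huggingface-fbopt-predictor
spec:
  scaleTargetRef:
    name: huggingface-fbopt-predictor   # predictor Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090  # assumed Prometheus endpoint
        query: vllm:num_requests_running                      # LLM metric from this guide
        threshold: "2"
```

KServe can generate a `ScaledObject` like this for you; the sketch is only meant to show which pieces (target, trigger, query, threshold) drive the scaling decision.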
### Prerequisites
1. Kubernetes cluster with KServe installed.
2. [KEDA installed](https://keda.sh/docs/2.9/deploy/#install) for event-driven autoscaling.
3. [kedify-otel-add-on](https://github.com/kedify/otel-add-on): Install the otel-add-on with the validation webhook disabled. Certain metrics, including those matching the vLLM pattern (e.g., `vllm:num_requests_running`), do not comply with the validation constraints enforced by the webhook.
The `sidecar.opentelemetry.io/inject` annotation ensures that an OpenTelemetry Collector runs as a sidecar container within the InferenceService pod. This collector gathers pod-level metrics and forwards them to the `otel-add-on` gRPC endpoint, which in turn enables KEDA's `ScaledObject` to use these metrics for autoscaling decisions. The annotation value must follow the pattern `<inferenceservice-name>-predictor`.
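For illustration, the annotation might be attached to the InferenceService like this. The service name `huggingface-fbopt` matches the example output used in this guide; the predictor spec is abbreviated and assumed:

```yaml
# Sketch of an InferenceService carrying the sidecar-injection annotation.
# Only the annotation pattern matters here; the predictor spec is a stub.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-fbopt
  annotations:
    # Value must follow the <inferenceservice-name>-predictor pattern:
    sidecar.opentelemetry.io/inject: huggingface-fbopt-predictor
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface   # assumed model format for this example
```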
!!! success "Expected Output"

    ```{ .bash .no-copy }
    $ inferenceservice.serving.kserve.io/huggingface-fbopt created
    ```

Check KEDA `ScaledObject`:

=== "kubectl"
    ```
    kubectl get scaledobjects huggingface-fbopt-predictor
    ```

!!! success "Expected Output"

    ```{ .bash .no-copy }
    NAME                          SCALETARGETKIND      SCALETARGETNAME               MIN   MAX   TRIGGERS   AUTHENTICATION   READY   ACTIVE   FALLBACK   PAUSED   AGE