Update readme.md with open-webui & envoy-ai-gateway usage #365

Merged · 5 commits · Apr 22, 2025
15 changes: 7 additions & 8 deletions README.md
@@ -32,13 +32,13 @@ Easy, advanced inference platform for large language models on Kubernetes

- **Ease of Use**: People can quickly deploy an LLM service with minimal configuration.
- **Broad Backends Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp). Find the full list of supported backends [here](./docs/support-backends.md).
- **Efficient Model Distribution (WIP)**: Out-of-the-box model cache system support with [Manta](https://github.com/InftyAI/Manta), still under development while the architecture is reframed.
- **Accelerator Fungibility**: llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
- **SOTA Inference**: llmaz supports the latest cutting-edge research like [Speculative Decoding](https://arxiv.org/abs/2211.17192) or [Splitwise](https://arxiv.org/abs/2311.18677) (WIP) on Kubernetes.
- **Various Model Providers**: llmaz supports a wide range of model providers, such as [HuggingFace](https://huggingface.co/), [ModelScope](https://www.modelscope.cn), ObjectStores. llmaz will automatically handle the model loading, requiring no effort from users.
- **Multi-Host Support**: llmaz supports both single-host and multi-host scenarios with [LWS](https://github.com/kubernetes-sigs/lws) from day 0.
- **AI Gateway Support**: Offers capabilities like token-based rate limiting and model routing through the integration of [Envoy AI Gateway](https://aigateway.envoyproxy.io/).
- **Built-in ChatUI**: Out-of-the-box chatbot support with the integration of [Open WebUI](https://github.com/open-webui/open-webui), offering capabilities like function calling, RAG, web search and more; see configurations [here](./docs/open-webui.md).
- **Scaling Efficiency**: llmaz supports horizontal scaling with [HPA](./docs/examples/hpa/README.md) by default and will integrate with autoscaling components like [Cluster-Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) or [Karpenter](https://github.com/kubernetes-sigs/karpenter) for smart scaling across different clouds.
- **Built-in ChatUI**: Out-of-the-box chatbot support with the integration of [Open WebUI](https://github.com/open-webui/open-webui), see configurations [here](./docs/open-webui.md).
- **Efficient Model Distribution (WIP)**: Out-of-the-box model cache system support with [Manta](https://github.com/InftyAI/Manta), still under development while the architecture is reframed.

## Quick Start

@@ -51,7 +51,7 @@ Read the [Installation](./docs/installation.md) for guidance.
Here's a toy example for deploying `facebook/opt-125m`; all you need to do
is apply a `Model` and a `Playground`.

If you're running on CPUs, you can refer to [llama.cpp](/docs/examples/llamacpp/README.md), or see more [examples](/docs/examples/README.md).
If you're running on CPUs, you can refer to [llama.cpp](/docs/examples/llamacpp/README.md).

> Note: if your model needs a Hugging Face token for weight downloads, please run `kubectl create secret generic modelhub-secret --from-literal=HF_TOKEN=<your token>` beforehand.

@@ -118,14 +118,13 @@ curl http://localhost:8080/v1/completions \

### More than quick-start

If you want to learn more about this project, please refer to [develop.md](./docs/develop.md).
Please refer to [examples](./docs/examples/README.md) for more tutorials or read [develop.md](./docs/develop.md) to learn more about the project.

## Roadmap

- Gateway support for traffic routing
- Metrics support
- Serverless support for cloud-agnostic users
- CLI tool support
- Prefill-Decode disaggregated serving
- KV cache offload support
- Model training, fine tuning in the long-term

## Community
10 changes: 8 additions & 2 deletions chart/Chart.lock
@@ -2,5 +2,11 @@ dependencies:
- name: open-webui
repository: https://helm.openwebui.com/
version: 6.4.0
digest: sha256:2520f6e26f2e6fd3e51c5f7f940eef94217c125a9828b0f59decedbecddcdb29
generated: "2025-04-21T00:50:06.532039+08:00"
- name: gateway-helm
repository: oci://registry-1.docker.io/envoyproxy/
version: 0.0.0-latest
- name: ai-gateway-helm
repository: oci://registry-1.docker.io/envoyproxy/
version: v0.0.0-latest
digest: sha256:c7b1aa22097a6a1a6f4dd04beed3287ab8ef2ae1aec8a9a4ec7a71251be23e4c
generated: "2025-04-22T20:15:43.343515+08:00"
12 changes: 6 additions & 6 deletions chart/Chart.yaml
@@ -25,11 +25,11 @@ dependencies:
version: "6.4.0"
repository: "https://helm.openwebui.com/"
condition: open-webui.enabled
- name: envoy-gateway
version: v1.3.2
repository: oci://docker.io/envoyproxy/gateway-helm
- name: gateway-helm
version: 0.0.0-latest
repository: "oci://registry-1.docker.io/envoyproxy/"
condition: envoy-gateway.enabled
- name: envoy-ai-gateway
version: v0.1.5
repository: oci://docker.io/envoyproxy/ai-gateway-helm
- name: ai-gateway-helm
version: v0.0.0-latest
repository: "oci://registry-1.docker.io/envoyproxy/"
condition: envoy-ai-gateway.enabled
2 changes: 1 addition & 1 deletion chart/values.global.yaml
@@ -34,7 +34,7 @@ prometheus:
enabled: true

open-webui:
enabled: false
enabled: true
persistence:
enabled: false
enableOpenaiApi: true
106 changes: 106 additions & 0 deletions docs/envoy-ai-gateway.md
@@ -0,0 +1,106 @@
# Envoy AI Gateway

[Envoy AI Gateway](https://aigateway.envoyproxy.io/) is an open source project for using Envoy Gateway
to handle request traffic from application clients to Generative AI services.

## How to use

### 1. Enable Envoy Gateway and Envoy AI Gateway

Both are enabled by default in `values.global.yaml` and will be deployed in the `llmaz-system` namespace.

```yaml
envoy-gateway:
enabled: true
envoy-ai-gateway:
enabled: true
```
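
If you change these toggles, re-apply the Helm chart. A minimal sketch, assuming a local checkout of this repository and a release named `llmaz` in the `llmaz-system` namespace (both are assumptions; follow the installation guide for the canonical command):

```bash
# Re-render the chart with the gateway components enabled.
# Release name, chart path, and namespace here are illustrative.
helm upgrade --install llmaz ./chart \
  -f ./chart/values.global.yaml \
  --namespace llmaz-system --create-namespace
```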

However, [Envoy Gateway](https://gateway.envoyproxy.io/latest/install/install-helm/) and [Envoy AI Gateway](https://aigateway.envoyproxy.io/docs/getting-started/) can also be deployed standalone in case you want to run them in other namespaces, as sketched below.
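
A rough sketch of a standalone install following the upstream guides; the release names and namespaces below are illustrative, and the versions should be taken from the linked docs:

```bash
# Envoy Gateway (release name, namespace, and version are illustrative).
helm upgrade --install eg oci://docker.io/envoyproxy/gateway-helm \
  --version v1.3.2 \
  --namespace envoy-gateway-system --create-namespace

# Envoy AI Gateway.
helm upgrade --install aieg oci://docker.io/envoyproxy/ai-gateway-helm \
  --version v0.0.0-latest \
  --namespace envoy-ai-gateway-system --create-namespace
```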

### 2. Basic AI Gateway Example

To expose your models via Envoy Gateway, you need to create a GatewayClass, Gateway, and AIGatewayRoute. The following example shows how to do this.

We'll deploy two models, `Qwen/Qwen2-0.5B-Instruct-GGUF` and `Qwen/Qwen2.5-Coder-0.5B-Instruct-GGUF`, with llama.cpp (CPU only) and expose them via Envoy AI Gateway.

The full example is [here](./examples/envoy-ai-gateway/basic.yaml); apply it to your cluster. A trimmed sketch of the gateway resources follows below.
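
For orientation, the gateway-facing part of such a manifest looks roughly like the sketch below. Resource names are illustrative and field details may differ between Envoy AI Gateway versions; treat the linked `basic.yaml` as the source of truth.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: ai-gateway                 # illustrative name
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: ai-gateway
  namespace: llmaz-system
spec:
  gatewayClassName: ai-gateway
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: models
  namespace: llmaz-system
spec:
  schema:
    name: OpenAI                   # clients speak the OpenAI API to the gateway
  targetRefs:
    - name: ai-gateway
      kind: Gateway
      group: gateway.networking.k8s.io
  rules:
    # Route by the model name extracted into the x-ai-eg-model header.
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: qwen2-0.5b
      backendRefs:
        - name: qwen2-0.5b         # an AIServiceBackend pointing at the llama.cpp service
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: qwen2.5-coder
      backendRefs:
        - name: qwen2.5-coder
```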

### 3. Check Envoy AI Gateway APIs

If Open WebUI is enabled, you can chat via the web UI (recommended); see the [documentation](./open-webui.md). Otherwise, follow the steps below to test the Envoy AI Gateway APIs.

I. Port-forward the `LoadBalancer` service in `llmaz-system` on port 8080.
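
A sketch of this step; the service name is generated by Envoy Gateway, so look it up first (the placeholder below is illustrative):

```bash
# Find the LoadBalancer service provisioned by Envoy Gateway for the Gateway.
kubectl get svc -n llmaz-system

# Forward local port 8080 to it; replace the placeholder with the name found above
# and adjust the target port to your Gateway listener port if it is not 80.
kubectl port-forward -n llmaz-system svc/<envoy-gateway-service> 8080:80
```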

II. Query `http://localhost:8080/v1/models`, and the available models will be listed.
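
For example, assuming the port-forward from step I is running:

```bash
# List the models registered with the AI gateway (jq is only for pretty-printing).
curl http://localhost:8080/v1/models | jq .
```

The expected response will look like this: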

```json
{
"data": [
{
"id": "qwen2-0.5b",
"created": 1745327294,
"object": "model",
"owned_by": "Envoy AI Gateway"
},
{
"id": "qwen2.5-coder",
"created": 1745327294,
"object": "model",
"owned_by": "Envoy AI Gateway"
}
],
"object": "list"
}
```

III. Query `http://localhost:8080/v1/chat/completions` to chat with the model. Here we ask the `qwen2-0.5b` model; the request looks like:

```bash
curl -H "Content-Type: application/json" -d '{
"model": "qwen2-0.5b",
"messages": [
{
"role": "system",
"content": "Hi."
}
]
}' http://localhost:8080/v1/chat/completions | jq .
```

Expected response will look like this:

```json
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I assist you today?"
}
}
],
"created": 1745327371,
"model": "qwen2-0.5b",
"system_fingerprint": "b5124-bc091a4d",
"object": "chat.completion",
"usage": {
"completion_tokens": 10,
"prompt_tokens": 10,
"total_tokens": 20
},
"id": "chatcmpl-AODlT8xnf4OjJwpQH31XD4yehHLnurr0",
"timings": {
"prompt_n": 1,
"prompt_ms": 319.876,
"prompt_per_token_ms": 319.876,
"prompt_per_second": 3.1262114069201816,
"predicted_n": 10,
"predicted_ms": 1309.393,
"predicted_per_token_ms": 130.9393,
"predicted_per_second": 7.63712651587415
}
}
```
101 changes: 0 additions & 101 deletions docs/examples/envoy-ai-gateway/README.md

This file was deleted.
