diff --git a/README.md b/README.md
index e2f8e439..ee99974c 100644
--- a/README.md
+++ b/README.md
@@ -32,13 +32,13 @@ Easy, advanced inference platform for large language models on Kubernetes

- **Ease of Use**: People can quickly deploy an LLM service with minimal configurations.
- **Broad Backend Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp). Find the full list of supported backends [here](./docs/support-backends.md).
-- **Efficient Model Distribution (WIP)**: Out-of-the-box model cache system support with [Manta](https://github.com/InftyAI/Manta), still under development right now with architecture reframing.
- **Accelerator Fungibility**: llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
-- **SOTA Inference**: llmaz supports the latest cutting-edge researches like [Speculative Decoding](https://arxiv.org/abs/2211.17192) or [Splitwise](https://arxiv.org/abs/2311.18677)(WIP) to run on Kubernetes.
- **Various Model Providers**: llmaz supports a wide range of model providers, such as [HuggingFace](https://huggingface.co/), [ModelScope](https://www.modelscope.cn), ObjectStores. llmaz will automatically handle the model loading, requiring no effort from users.
- **Multi-Host Support**: llmaz supports both single-host and multi-host scenarios with [LWS](https://github.com/kubernetes-sigs/lws) from day 0.
+- **AI Gateway Support**: Offering capabilities like token-based rate limiting and model routing through the integration of [Envoy AI Gateway](https://aigateway.envoyproxy.io/).
+- **Built-in ChatUI**: Out-of-the-box chatbot support with the integration of [Open WebUI](https://github.com/open-webui/open-webui), offering capabilities like function calling, RAG, web search and more; see configurations [here](./docs/open-webui.md).
- **Scaling Efficiency**: llmaz supports horizontal scaling with [HPA](./docs/examples/hpa/README.md) by default and will integrate with autoscaling components like [Cluster-Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) or [Karpenter](https://github.com/kubernetes-sigs/karpenter) for smart scaling across different clouds.
-- **Build-in ChatUI**: Out-of-the-box chatbot support with the integration of [Open WebUI](https://github.com/open-webui/open-webui), see configurations [here](./docs/open-webui.md).
+- **Efficient Model Distribution (WIP)**: Out-of-the-box model cache system support with [Manta](https://github.com/InftyAI/Manta), currently under development while the architecture is reframed.

## Quick Start

@@ -51,7 +51,7 @@ Read the [Installation](./docs/installation.md) for guidance.

Here's a toy example for deploying `facebook/opt-125m`; all you need to do is apply a `Model` and a `Playground`.

-If you're running on CPUs, you can refer to [llama.cpp](/docs/examples/llamacpp/README.md), or more [examples](/docs/examples/README.md) here.
+If you're running on CPUs, you can refer to [llama.cpp](/docs/examples/llamacpp/README.md).

> Note: if your model needs a Hugging Face token for weight downloads, please run `kubectl create secret generic modelhub-secret --from-literal=HF_TOKEN=<YOUR_HF_TOKEN>` beforehand.
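+For example, a minimal sketch of the secret creation — the token value below is a placeholder, substitute your own Hugging Face token:
+
+```bash
+# The HF_TOKEN value is illustrative only; use your real token
+# from https://huggingface.co/settings/tokens.
+kubectl create secret generic modelhub-secret \
+  --from-literal=HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
+```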
@@ -118,14 +118,13 @@ curl http://localhost:8080/v1/completions \

### More than quick-start

-If you want to learn more about this project, please refer to [develop.md](./docs/develop.md).
+Please refer to [examples](./docs/examples/README.md) for more tutorials, or read [develop.md](./docs/develop.md) to learn more about the project.

## Roadmap

-- Gateway support for traffic routing
-- Metrics support
- Serverless support for cloud-agnostic users
-- CLI tool support
+- Prefill-Decode disaggregated serving
+- KV cache offload support
- Model training, fine-tuning in the long-term

## Community

diff --git a/chart/Chart.lock b/chart/Chart.lock
index a0da65ee..b4edee7f 100644
--- a/chart/Chart.lock
+++ b/chart/Chart.lock
@@ -2,5 +2,11 @@ dependencies:
- name: open-webui
  repository: https://helm.openwebui.com/
  version: 6.4.0
-digest: sha256:2520f6e26f2e6fd3e51c5f7f940eef94217c125a9828b0f59decedbecddcdb29
-generated: "2025-04-21T00:50:06.532039+08:00"
+- name: gateway-helm
+  repository: oci://registry-1.docker.io/envoyproxy/
+  version: 0.0.0-latest
+- name: ai-gateway-helm
+  repository: oci://registry-1.docker.io/envoyproxy/
+  version: v0.0.0-latest
+digest: sha256:c7b1aa22097a6a1a6f4dd04beed3287ab8ef2ae1aec8a9a4ec7a71251be23e4c
+generated: "2025-04-22T20:15:43.343515+08:00"
diff --git a/chart/Chart.yaml b/chart/Chart.yaml
index f452fc8e..56eaad2e 100644
--- a/chart/Chart.yaml
+++ b/chart/Chart.yaml
@@ -25,11 +25,11 @@ dependencies:
    version: "6.4.0"
    repository: "https://helm.openwebui.com/"
    condition: open-webui.enabled
-  - name: envoy-gateway
-    version: v1.3.2
-    repository: oci://docker.io/envoyproxy/gateway-helm
+  - name: gateway-helm
+    version: 0.0.0-latest
+    repository: "oci://registry-1.docker.io/envoyproxy/"
    condition: envoy-gateway.enabled
-  - name: envoy-ai-gateway
-    version: v0.1.5
-    repository: oci://docker.io/envoyproxy/ai-gateway-helm
+  - name: ai-gateway-helm
+    version: v0.0.0-latest
+    repository: "oci://registry-1.docker.io/envoyproxy/"
    condition: envoy-ai-gateway.enabled
diff --git a/chart/values.global.yaml b/chart/values.global.yaml
index ad9de873..04f2f5e2 100644
--- a/chart/values.global.yaml
+++ b/chart/values.global.yaml
@@ -34,7 +34,7 @@ prometheus:
  enabled: true

open-webui:
-  enabled: false
+  enabled: true
  persistence:
    enabled: false
  enableOpenaiApi: true
diff --git a/docs/envoy-ai-gateway.md b/docs/envoy-ai-gateway.md
new file mode 100644
index 00000000..69d6d920
--- /dev/null
+++ b/docs/envoy-ai-gateway.md
@@ -0,0 +1,106 @@
+# Envoy AI Gateway
+
+[Envoy AI Gateway](https://aigateway.envoyproxy.io/) is an open source project for using Envoy Gateway
+to handle request traffic from application clients to Generative AI services.
+
+## How to use
+
+### 1. Enable Envoy Gateway and Envoy AI Gateway
+
+Both of them are enabled by default in `values.global.yaml` and will be deployed in the llmaz-system namespace.
+
+```yaml
+envoy-gateway:
+  enabled: true
+envoy-ai-gateway:
+  enabled: true
+```
+
+However, [Envoy Gateway](https://gateway.envoyproxy.io/latest/install/install-helm/) and [Envoy AI Gateway](https://aigateway.envoyproxy.io/docs/getting-started/) can also be deployed standalone in case you want to run them in other namespaces.
+
+### 2. Basic AI Gateway Example
+
+To expose your models via Envoy Gateway, you need to create a GatewayClass, Gateway, and AIGatewayRoute. The following example shows how to do this.
+
+We'll deploy two models, `Qwen/Qwen2-0.5B-Instruct-GGUF` and `Qwen/Qwen2.5-Coder-0.5B-Instruct-GGUF`, with llama.cpp (CPU only) and expose them via Envoy AI Gateway.
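+The whole setup can be applied in one step; a minimal sketch, assuming `kubectl` points at your cluster and you run it from the repository root:
+
+```bash
+# Creates the two OpenModels and Playgrounds plus the GatewayClass, Gateway,
+# AIGatewayRoute, and AIServiceBackend resources shown in the manifest.
+kubectl apply -f docs/examples/envoy-ai-gateway/basic.yaml
+```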
+The full example manifest is [here](./examples/envoy-ai-gateway/basic.yaml).
+
+### 3. Check Envoy AI Gateway APIs
+
+If Open WebUI is enabled, you can chat via the web UI (recommended); see the [documentation](./open-webui.md). Otherwise, follow the steps below to test the Envoy AI Gateway APIs.
+
+I. Port-forward the `LoadBalancer` service in llmaz-system to local port 8080.
+
+II. Query the models endpoint, e.g. `curl http://localhost:8080/v1/models | jq .`; the available models will be listed. The expected response will look like this:
+
+```json
+{
+  "data": [
+    {
+      "id": "qwen2-0.5b",
+      "created": 1745327294,
+      "object": "model",
+      "owned_by": "Envoy AI Gateway"
+    },
+    {
+      "id": "qwen2.5-coder",
+      "created": 1745327294,
+      "object": "model",
+      "owned_by": "Envoy AI Gateway"
+    }
+  ],
+  "object": "list"
+}
+```
+
+III. Query `http://localhost:8080/v1/chat/completions` to chat with the model. Here we ask the `qwen2-0.5b` model; the query will look like:
+
+```bash
+curl -H "Content-Type: application/json" -d '{
+    "model": "qwen2-0.5b",
+    "messages": [
+      {
+        "role": "system",
+        "content": "Hi."
+      }
+    ]
+  }' http://localhost:8080/v1/chat/completions | jq .
+```
+
+The expected response will look like this:
+
+```json
+{
+  "choices": [
+    {
+      "finish_reason": "stop",
+      "index": 0,
+      "message": {
+        "role": "assistant",
+        "content": "Hello! How can I assist you today?"
+      }
+    }
+  ],
+  "created": 1745327371,
+  "model": "qwen2-0.5b",
+  "system_fingerprint": "b5124-bc091a4d",
+  "object": "chat.completion",
+  "usage": {
+    "completion_tokens": 10,
+    "prompt_tokens": 10,
+    "total_tokens": 20
+  },
+  "id": "chatcmpl-AODlT8xnf4OjJwpQH31XD4yehHLnurr0",
+  "timings": {
+    "prompt_n": 1,
+    "prompt_ms": 319.876,
+    "prompt_per_token_ms": 319.876,
+    "prompt_per_second": 3.1262114069201816,
+    "predicted_n": 10,
+    "predicted_ms": 1309.393,
+    "predicted_per_token_ms": 130.9393,
+    "predicted_per_second": 7.63712651587415
+  }
+}
+```
diff --git a/docs/examples/envoy-ai-gateway/README.md b/docs/examples/envoy-ai-gateway/README.md
deleted file mode 100644
index 1222dacd..00000000
--- a/docs/examples/envoy-ai-gateway/README.md
+++ /dev/null
@@ -1,101 +0,0 @@
-# Envoy AI Gateway
-
-[Envoy AI Gateway](https://aigateway.envoyproxy.io/) is an open source project for using Envoy Gateway
-to handle request traffic from application clients to Generative AI services.
-
-## How to use
-
-### 1. Enable Envoy Gateway and Envoy AI Gateway in llmaz Helm
-
-Enable Envoy Gateway and Envoy AI Gateway in the `values.global.yaml` file, envoy gateway and envoy ai gateway are disabled by default.
-
-```yaml
-envoy-gateway:
-  enabled: true
-envoy-ai-gateway:
-  enabled: true
-```
-
-Note: [Envoy Gateway installation](https://gateway.envoyproxy.io/latest/install/install-helm/) and [Envoy AI Gateway installation](https://aigateway.envoyproxy.io/docs/getting-started/) can be done standalone.
-
-### 2. Check Envoy Gateway and Envoy AI Gateway
-
-Run `kubectl wait --timeout=5m -n envoy-gateway-system deployment/envoy-gateway --for=condition=Available` to wait for the envoy gateway to be ready.
-
-Run `kubectl wait --timeout=2m -n envoy-ai-gateway-system deployment/ai-gateway-controller --for=condition=Available` to wait for the envoy ai gateway to be ready.
-
-### 3. Basic AI Gateway example
-
-To expose your model(Playground) to Envoy Gateway, you need to create a GatewayClass, Gateway, and AIGatewayRoute. The following example shows how to do this.
-
-Example [qwen playground](docs/examples/llamacpp/playground.yaml) configuration for a basic AI Gateway.
-The model name is `qwen2-0.5b`, so the backend ref name is `qwen2-0--5b`, and the model lb service: `qwen2-0--5b-lb` -- Playground in [docs/examples/llamacpp/playground.yaml](docs/examples/llamacpp/playground.yaml) -- GatewayClass in [docs/examples/envoy-ai-gateway/basic.yaml](docs/examples/envoy-ai-gateway/basic.yaml) - -Check if the gateway pod to be ready: - -```bash -kubectl wait pods --timeout=2m \ - -l gateway.envoyproxy.io/owning-gateway-name=envoy-ai-gateway-basic \ - -n envoy-gateway-system \ - --for=condition=Ready -``` - -### 4. Check Envoy AI Gateway APIs - -- For local test with port forwarding, use `export GATEWAY_URL="http://localhost:8080"`. -- Using external IP, use `export GATEWAY_URL=$(kubectl get gateway/envoy-ai-gateway-basic -o jsonpath='{.status.addresses[0].value}')` - -See https://aigateway.envoyproxy.io/docs/getting-started/basic-usage for more details. - -`$GATEWAY_URL/v1/models` will show the models that are available in the Envoy AI Gateway. The response will look like this: - -```json -{ - "data": [ - { - "id": "some-cool-self-hosted-model", - "created": 1744880950, - "object": "model", - "owned_by": "Envoy AI Gateway" - }, - { - "id": "qwen2-0.5b", - "created": 1744880950, - "object": "model", - "owned_by": "Envoy AI Gateway" - } - ], - "object": "list" -} -``` - -`$GATEWAY_URL/v1/chat/completions` will show the chat completions for the model. The request will look like this: - -```bash -curl -H "Content-Type: application/json" -d '{ - "model": "qwen2-0.5b", - "messages": [ - { - "role": "system", - "content": "Hi." - } - ] - }' $GATEWAY_URL/v1/chat/completions -``` - -Expected response will look like this: - -```json -{ - "choices": [ - { - "message": { - "content": "I'll be back." - } - } - ] -} -``` - diff --git a/docs/examples/envoy-ai-gateway/basic.yaml b/docs/examples/envoy-ai-gateway/basic.yaml index 2e2f79e1..0e5094b5 100644 --- a/docs/examples/envoy-ai-gateway/basic.yaml +++ b/docs/examples/envoy-ai-gateway/basic.yaml @@ -1,17 +1,67 @@ +apiVersion: llmaz.io/v1alpha1 +kind: OpenModel +metadata: + name: qwen2-0--5b +spec: + familyName: qwen2 + source: + modelHub: + modelID: Qwen/Qwen2-0.5B-Instruct-GGUF + filename: qwen2-0_5b-instruct-q5_k_m.gguf +--- +apiVersion: inference.llmaz.io/v1alpha1 +kind: Playground +metadata: + name: qwen2-0--5b +spec: + replicas: 1 + modelClaim: + modelName: qwen2-0--5b + backendRuntimeConfig: + backendName: llamacpp + configName: default + args: + - -fa # use flash attention +--- +apiVersion: llmaz.io/v1alpha1 +kind: OpenModel +metadata: + name: qwen2--5-coder +spec: + familyName: qwen2 + source: + modelHub: + modelID: Qwen/Qwen2.5-Coder-0.5B-Instruct-GGUF + filename: qwen2.5-coder-0.5b-instruct-q2_k.gguf +--- +apiVersion: inference.llmaz.io/v1alpha1 +kind: Playground +metadata: + name: qwen2--5-coder +spec: + replicas: 1 + modelClaim: + modelName: qwen2--5-coder + backendRuntimeConfig: + backendName: llamacpp + configName: default + args: + - -fa # use flash attention +--- apiVersion: gateway.networking.k8s.io/v1 kind: GatewayClass metadata: - name: envoy-ai-gateway-basic + name: default-envoy-ai-gateway spec: controllerName: gateway.envoyproxy.io/gatewayclass-controller --- apiVersion: gateway.networking.k8s.io/v1 kind: Gateway metadata: - name: envoy-ai-gateway-basic + name: default-envoy-ai-gateway namespace: default spec: - gatewayClassName: envoy-ai-gateway-basic + gatewayClassName: default-envoy-ai-gateway listeners: - name: http protocol: HTTP @@ -20,35 +70,57 @@ spec: apiVersion: aigateway.envoyproxy.io/v1alpha1 
kind: AIGatewayRoute metadata: - name: envoy-ai-gateway-basic + name: default-envoy-ai-gateway namespace: default spec: schema: name: OpenAI targetRefs: - - name: envoy-ai-gateway-basic + - name: default-envoy-ai-gateway kind: Gateway group: gateway.networking.k8s.io rules: - -# Above are basic config for envoy ai gateway -# Below is example for qwen2-0.5b: a matched backend ref and the AIServiceBackend - matches: - headers: - type: Exact name: x-ai-eg-model value: qwen2-0.5b backendRefs: - - name: envoy-ai-gateway-llmaz-model-1 + - name: qwen2-0--5b + - matches: + - headers: + - type: Exact + name: x-ai-eg-model + value: qwen2.5-coder + backendRefs: + - name: qwen2--5-coder --- apiVersion: aigateway.envoyproxy.io/v1alpha1 kind: AIServiceBackend metadata: - name: envoy-ai-gateway-llmaz-model-1 + name: qwen2-0--5b namespace: default spec: + timeouts: + request: 3m schema: name: OpenAI backendRef: name: qwen2-0--5b-lb - kind: Service \ No newline at end of file + kind: Service + port: 8080 +--- +apiVersion: aigateway.envoyproxy.io/v1alpha1 +kind: AIServiceBackend +metadata: + name: qwen2--5-coder + namespace: default +spec: + timeouts: + request: 3m + schema: + name: OpenAI + backendRef: + name: qwen2--5-coder-lb + kind: Service + port: 8080 diff --git a/docs/examples/envoy-ai-gateway/envoy-ai-gateway.md b/docs/examples/envoy-ai-gateway/envoy-ai-gateway.md deleted file mode 100644 index 5681d61a..00000000 --- a/docs/examples/envoy-ai-gateway/envoy-ai-gateway.md +++ /dev/null @@ -1,102 +0,0 @@ -# Envoy AI Gateway - -[Envoy AI Gateway](https://aigateway.envoyproxy.io/) is an open source project for using Envoy Gateway -to handle request traffic from application clients to Generative AI services. - -## How to use - -### 1. Enable Envoy Gateway and Envoy AI Gateway in llmaz Helm - -Enable Envoy Gateway and Envoy AI Gateway in the `values.global.yaml` file, envoy gateway and envoy ai gateway are enabled by default. - -```yaml -envoy-gateway: - enabled: true -envoy-ai-gateway: - enabled: true -``` - -Note: [Envoy Gateway installation](https://gateway.envoyproxy.io/latest/install/install-helm/) and [Envoy AI Gateway installation](https://aigateway.envoyproxy.io/docs/getting-started/) can be done standalone. - -### 2. Check Envoy Gateway and Envoy AI Gateway - -Run `kubectl wait --timeout=5m -n envoy-gateway-system deployment/envoy-gateway --for=condition=Available` to wait for the envoy gateway to be ready. - -Run `kubectl wait --timeout=2m -n envoy-ai-gateway-system deployment/ai-gateway-controller --for=condition=Available` to wait for the envoy ai gateway to be ready. - -### 3. Basic AI Gateway example - -To expose your model(Playground) to Envoy Gateway, you need to create a GatewayClass, Gateway, and AIGatewayRoute. The following example shows how to do this. - -Example [qwen playground](docs/examples/llamacpp/playground.yaml) configuration for a basic AI Gateway. -The model name is `qwen2-0.5b`, so the backend ref name is `qwen2-0--5b`, and the model lb service: `qwen2-0--5b-lb` - -- Playground in [docs/examples/llamacpp/playground.yaml](docs/examples/llamacpp/playground.yaml) -- GatewayClass in [docs/examples/envoy-ai-gateway/basic.yaml](docs/examples/envoy-ai-gateway/basic.yaml) - -Check if the gateway pod to be ready: - -```bash -kubectl wait pods --timeout=2m \ - -l gateway.envoyproxy.io/owning-gateway-name=envoy-ai-gateway-basic \ - -n envoy-gateway-system \ - --for=condition=Ready -``` - -### 4. 
Check Envoy AI Gateway APIs
-
-- For local test with port forwarding, use `export GATEWAY_URL="http://localhost:8080"`.
-- Using external IP, use `export GATEWAY_URL=$(kubectl get gateway/envoy-ai-gateway-basic -o jsonpath='{.status.addresses[0].value}')`
-
-See https://aigateway.envoyproxy.io/docs/getting-started/basic-usage for more details.
-
-`$GATEWAY_URL/v1/models` will show the models that are available in the Envoy AI Gateway. The response will look like this:
-
-```json
-{
-  "data": [
-    {
-      "id": "some-cool-self-hosted-model",
-      "created": 1744880950,
-      "object": "model",
-      "owned_by": "Envoy AI Gateway"
-    },
-    {
-      "id": "qwen2-0.5b",
-      "created": 1744880950,
-      "object": "model",
-      "owned_by": "Envoy AI Gateway"
-    }
-  ],
-  "object": "list"
-}
-```
-
-`$GATEWAY_URL/v1/chat/completions` will show the chat completions for the model. The request will look like this:
-
-```bash
-curl -H "Content-Type: application/json" -d '{
-    "model": "qwen2-0.5b",
-    "messages": [
-      {
-        "role": "system",
-        "content": "Hi."
-      }
-    ]
-  }' $GATEWAY_URL/v1/chat/completions
-```
-
-Expected response will look like this:
-
-```json
-{
-  "choices": [
-    {
-      "message": {
-        "content": "I'll be back."
-      }
-    }
-  ]
-}
-```
-
diff --git a/docs/installation.md b/docs/installation.md
index e9265f44..a3914868 100644
--- a/docs/installation.md
+++ b/docs/installation.md
@@ -2,10 +2,16 @@

## Prerequisites

+**Requirements**:
+
- Kubernetes version >= 1.27
- Helm 3, see [installation](https://helm.sh/docs/intro/install/).
- Prometheus, see [installation](https://github.com/InftyAI/llmaz/tree/main/docs/prometheus-operator#install-the-prometheus-operator).

+Note: the llmaz helm chart will install the following by default:
+
+- [Envoy Gateway](https://github.com/envoyproxy/gateway) and [Envoy AI Gateway](https://github.com/envoyproxy/ai-gateway) as the traffic frontend in the llmaz-system namespace. If you *already installed these two components* or *want to deploy them in other namespaces*, append `--set envoy-gateway.enabled=false --set envoy-ai-gateway.enabled=false` to the command below.
+- [Open WebUI](https://github.com/open-webui/open-webui) as the default chatbot. If you want to disable it, append `--set open-webui.enabled=false` to the command below.
+
## Install a released version

### Install

@@ -35,6 +41,13 @@ kubectl delete crd \

## Install from source

+### Change configurations
+
+If you want to change the default configurations, please change the values in [values.global.yaml](../chart/values.global.yaml).
+
+**Do not change** the values in _values.yaml_ because it's auto-generated and will be overwritten.
+
+
### Install

```cmd
@@ -60,16 +73,6 @@ kubectl delete crd \
  services.inference.llmaz.io
```

-## Change configurations
-
-If you want to change the default configurations, please change the values in [values.global.yaml](../chart/values.global.yaml), then run
-
-```cmd
-make helm-install
-```
-
-**Do you change** the values in _values.yaml_ because it's auto-generated and will be overwritten.
-
## Upgrade

Once you've changed your code, run the command to upgrade the controller:

diff --git a/docs/open-webui.md b/docs/open-webui.md
index c673be08..d22f1534 100644
--- a/docs/open-webui.md
+++ b/docs/open-webui.md
@@ -5,11 +5,11 @@

## Prerequisites

- Make sure you're located in the **llmaz-system** namespace; it hasn't been tested with other namespaces.
-- Make sure [EnvoyGateway](https://github.com/envoyproxy/gateway) and [Envoy AI Gateway](https://github.com/envoyproxy/ai-gateway) are installed, both of them are installed by default in llmaz. See [Envoy AI Gateway](docs/envoy-ai-gateway.md) for more details.
+- Make sure [Envoy Gateway](https://github.com/envoyproxy/gateway) and [Envoy AI Gateway](https://github.com/envoyproxy/ai-gateway) are installed; both are installed by default in llmaz. See [AI Gateway](./envoy-ai-gateway.md) for more details.

## How to use

-1. Enable Open WebUI in the `values.global.yaml` file, open-webui is disabled by default.
+1. Enable Open WebUI in the `values.global.yaml` file; open-webui is enabled by default.

```yaml
open-webui:
@@ -18,7 +18,7 @@

> Optionally set `persistence=true` to persist the data; recommended for production.

-2. Run `kubectl get svc -n envoy-gateway-system` to list out the services, the output looks like:
+2. Run `kubectl get svc -n llmaz-system` to list the services; the output looks like:

```cmd
envoy-default-default-envoy-ai-gateway-dbec795a LoadBalancer 10.96.145.150 80:30548/TCP 132m
@@ -30,7 +30,7 @@

```yaml
open-webui:
  enabled: true
-  openaiBaseApiUrl: http://envoy-default-default-envoy-ai-gateway-dbec795a.envoy-gateway-system.svc.cluster.local/v1
+  openaiBaseApiUrl: http://envoy-default-default-envoy-ai-gateway-dbec795a.llmaz-system.svc.cluster.local/v1
```

4. Run `make install-chatbot` to install the chatbot.
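+5. To reach the UI locally, you can port-forward its service. A minimal sketch — the service name and port below are assumptions; verify them with `kubectl get svc -n llmaz-system`:
+
+```bash
+# The service name/port are illustrative; confirm them first via
+# `kubectl get svc -n llmaz-system`.
+kubectl port-forward -n llmaz-system svc/open-webui 8080:80
+# Then open http://localhost:8080 in your browser.
+```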