
Commit fe74a6d

feat: add TensorRT-LLM as backend (#392)

* feat: add TensorRT-LLM as backend
* update readme
* update readme
* remove example to resolve conflicts
* remove example to resolve conflicts
* fix
* add tersorrt-llm example
* fix folder name

1 parent 28c68bb commit fe74a6d

7 files changed (+146, -5 lines changed)

README.md (+1, -1)

@@ -39,7 +39,7 @@ Easy, advanced inference platform for large language models on Kubernetes
 ## Key Features
 
 - **Easy of Use**: People can quick deploy a LLM service with minimal configurations.
-- **Broad Backends Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp). Find the full list of supported backends [here](./site/content/en/docs/integrations/support-backends.md).
+- **Broad Backends Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). Find the full list of supported backends [here](./site/content/en/docs/integrations/support-backends.md).
 - **Accelerator Fungibility**: llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
 - **Various Model Providers**: llmaz supports a wide range of model providers, such as [HuggingFace](https://huggingface.co/), [ModelScope](https://www.modelscope.cn), ObjectStores. llmaz will automatically handle the model loading, requiring no effort from users.
 - **Multi-Host Support**: llmaz supports both single-host and multi-host scenarios with [LWS](https://github.com/kubernetes-sigs/lws) from day 0.

+52 (new file)

@@ -0,0 +1,52 @@
{{- if .Values.backendRuntime.enabled -}}
apiVersion: inference.llmaz.io/v1alpha1
kind: BackendRuntime
metadata:
  labels:
    app.kubernetes.io/name: backendruntime
    app.kubernetes.io/part-of: llmaz
    app.kubernetes.io/created-by: llmaz
  name: tensorrt-llm
spec:
  command:
    - trtllm-serve
  image: {{ .Values.backendRuntime.tensorrt_llm.image.repository }}
  version: {{ .Values.backendRuntime.tensorrt_llm.image.tag }}
  # Do not edit the preset argument name unless you know what you're doing.
  # Free to add more arguments with your requirements.
  recommendedConfigs:
    - name: default
      args:
        - "{{`{{ .ModelPath }}`}}"
        - --host
        - "0.0.0.0"
        - --port
        - "8080"
      resources:
        requests:
          cpu: 4
          memory: 16Gi
        limits:
          cpu: 4
          memory: 16Gi
  startupProbe:
    periodSeconds: 10
    failureThreshold: 30
    httpGet:
      path: /health
      port: 8080
  livenessProbe:
    initialDelaySeconds: 15
    periodSeconds: 10
    failureThreshold: 3
    httpGet:
      path: /health
      port: 8080
  readinessProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    failureThreshold: 3
    httpGet:
      path: /health
      port: 8080
{{- end }}

chart/values.global.yaml (+4)

@@ -14,6 +14,10 @@ backendRuntime:
     image:
       repository: lmsysorg/sglang
       tag: v0.4.5-cu121
+  tensorrt_llm:
+    image:
+      repository: nvcr.io/nvidia/tritonserver
+      tag: 25.03-trtllm-python-py3
   tgi:
     image:
       repository: ghcr.io/huggingface/text-generation-inference
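
Not part of the diff: a minimal sketch of a Helm values override built on the keys added above (`backendRuntime.enabled` and `backendRuntime.tensorrt_llm.image.*`). The file name `my-values.yaml` is a hypothetical example, and the repository/tag shown are simply the defaults introduced by this commit.

# my-values.yaml (hypothetical override file)
backendRuntime:
  enabled: true   # the tensorrt-llm BackendRuntime template above is gated on this flag
  tensorrt_llm:
    image:
      repository: nvcr.io/nvidia/tritonserver   # default from this commit; point at a mirror if needed
      tag: 25.03-trtllm-python-py3               # pin whichever trtllm-serve-capable tag you validate

Helm merges later values files over earlier ones, so an override passed with an extra -f flag only needs the keys being changed.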

docs/examples/README.md (+9, -4)

@@ -9,11 +9,12 @@ We provide a set of examples to help you serve large language models, by default
 - [Deploy models from ObjectStore](#deploy-models-from-objectstore)
 - [Deploy models via SGLang](#deploy-models-via-sglang)
 - [Deploy models via llama.cpp](#deploy-models-via-llamacpp)
-- [Deploy models via text-generation-inference](#deploy-models-via-tgi)
-- [Deploy models via ollama](#ollama)
+- [Deploy models via TensorRT-LLM](#deploy-models-via-tensorrt-llm)
+- [Deploy models via text-generation-inference](#deploy-models-via-text-generation-inference)
+- [Deploy models via ollama](#deploy-models-via-ollama)
 - [Speculative Decoding with vLLM](#speculative-decoding-with-vllm)
-- [Deploy multi-host inference](#multi-host-inference)
-- [Deploy host models](#deploy-host-models)
+- [Multi-Host Inference](#multi-host-inference)
+- [Deploy Host Models](#deploy-host-models)
 - [Envoy AI Gateway](#envoy-ai-gateway)
 
 ### Deploy models from Huggingface

@@ -46,6 +47,10 @@ By default, we use [vLLM](https://github.com/vllm-project/vllm) as the inference
 
 [llama.cpp](https://github.com/ggerganov/llama.cpp) can serve models on a wide variety of hardwares, such as CPU, see [example](./llamacpp/) here.
 
+### Deploy models via TensorRT-LLM
+
+[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) provides users with an easy-to-use Python API to define Large Language Models (LLMs) and support state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs, see [example](./tensorrt-llm/) here.
+
 ### Deploy models via text-generation-inference
 
 [text-generation-inference](https://github.com/huggingface/text-generation-inference) is used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoint. see [example](./tgi/) here.

+25 (new file)

@@ -0,0 +1,25 @@
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: qwen2-0--5b
spec:
  familyName: qwen2
  source:
    modelHub:
      modelID: Qwen/Qwen2-0.5B-Instruct
  inferenceConfig:
    flavors:
      - name: a10 # GPU type
        limits:
          nvidia.com/gpu: 1
---
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0--5b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0--5b
  backendRuntimeConfig:
    backendName: tensorrt-llm
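
Not part of the commit: the `flavors` list in the OpenModel above is where the README's accelerator fungibility claim becomes concrete. Below is a hedged sketch of the same model declaring two candidate GPU flavors; the second flavor's name is an arbitrary illustrative label, and how llmaz picks between flavors is left to its scheduling logic rather than asserted here.

apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: qwen2-0--5b
spec:
  familyName: qwen2
  source:
    modelHub:
      modelID: Qwen/Qwen2-0.5B-Instruct
  inferenceConfig:
    flavors:
      - name: a10   # GPU type used in the example above
        limits:
          nvidia.com/gpu: 1
      - name: l4    # illustrative second candidate GPU type
        limits:
          nvidia.com/gpu: 1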

site/content/en/docs/integrations/support-backends.md (+4)

@@ -13,6 +13,10 @@ If you want to integrate more backends into llmaz, please refer to this [PR](htt
 
 [SGLang](https://github.com/sgl-project/sglang) is yet another fast serving framework for large language models and vision language models.
 
+## TensorRT-LLM
+
+[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) provides users with an easy-to-use Python API to define Large Language Models (LLMs) and support state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in performant way.
+
 ## Text-Generation-Inference
 
 [text-generation-inference](https://github.com/huggingface/text-generation-inference) is a Rust, Python and gRPC server for text generation inference. Used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoint.

+51 (new file)

@@ -0,0 +1,51 @@
apiVersion: inference.llmaz.io/v1alpha1
kind: BackendRuntime
metadata:
  labels:
    app.kubernetes.io/name: backendruntime
    app.kubernetes.io/part-of: llmaz
    app.kubernetes.io/created-by: llmaz
  name: tensorrt-llm
spec:
  command:
    - trtllm-serve
  image: nvcr.io/nvidia/tritonserver
  version: 25.03-trtllm-python-py3
  # Do not edit the preset argument name unless you know what you're doing.
  # Free to add more arguments with your requirements.
  recommendedConfigs:
    - name: default
      args:
        - "{{`{{ .ModelPath }}`}}"
        - --host
        - "0.0.0.0"
        - --port
        - "8080"
      sharedMemorySize: 2Gi
      resources:
        requests:
          cpu: 4
          memory: 16Gi
        limits:
          cpu: 4
          memory: 16Gi
  startupProbe:
    periodSeconds: 10
    failureThreshold: 30
    httpGet:
      path: /health
      port: 8080
  livenessProbe:
    initialDelaySeconds: 15
    periodSeconds: 10
    failureThreshold: 3
    httpGet:
      path: /health
      port: 8080
  readinessProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    failureThreshold: 3
    httpGet:
      path: /health
      port: 8080
