
Commit fe74a6d

feat: add TensorRT-LLM as backend (#392)

* feat: add TensorRT-LLM as backend
* update readme
* update readme
* remove example to resolve conflicts
* remove example to resolve conflicts
* fix
* add tersorrt-llm example
* fix folder name

1 parent 28c68bb commit fe74a6d

7 files changed (+146, -5 lines changed)

README.md (+1, -1)

@@ -39,7 +39,7 @@ Easy, advanced inference platform for large language models on Kubernetes
 ## Key Features
 
 - **Easy of Use**: People can quick deploy a LLM service with minimal configurations.
-- **Broad Backends Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp). Find the full list of supported backends [here](./site/content/en/docs/integrations/support-backends.md).
+- **Broad Backends Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). Find the full list of supported backends [here](./site/content/en/docs/integrations/support-backends.md).
 - **Accelerator Fungibility**: llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
 - **Various Model Providers**: llmaz supports a wide range of model providers, such as [HuggingFace](https://huggingface.co/), [ModelScope](https://www.modelscope.cn), ObjectStores. llmaz will automatically handle the model loading, requiring no effort from users.
 - **Multi-Host Support**: llmaz supports both single-host and multi-host scenarios with [LWS](https://github.com/kubernetes-sigs/lws) from day 0.

+52 (new file)

@@ -0,0 +1,52 @@
{{- if .Values.backendRuntime.enabled -}}
apiVersion: inference.llmaz.io/v1alpha1
kind: BackendRuntime
metadata:
  labels:
    app.kubernetes.io/name: backendruntime
    app.kubernetes.io/part-of: llmaz
    app.kubernetes.io/created-by: llmaz
  name: tensorrt-llm
spec:
  command:
    - trtllm-serve
  image: {{ .Values.backendRuntime.tensorrt_llm.image.repository }}
  version: {{ .Values.backendRuntime.tensorrt_llm.image.tag }}
  # Do not edit the preset argument name unless you know what you're doing.
  # Free to add more arguments with your requirements.
  recommendedConfigs:
    - name: default
      args:
        - "{{`{{ .ModelPath }}`}}"
        - --host
        - "0.0.0.0"
        - --port
        - "8080"
      resources:
        requests:
          cpu: 4
          memory: 16Gi
        limits:
          cpu: 4
          memory: 16Gi
  startupProbe:
    periodSeconds: 10
    failureThreshold: 30
    httpGet:
      path: /health
      port: 8080
  livenessProbe:
    initialDelaySeconds: 15
    periodSeconds: 10
    failureThreshold: 3
    httpGet:
      path: /health
      port: 8080
  readinessProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    failureThreshold: 3
    httpGet:
      path: /health
      port: 8080
{{- end }}

chart/values.global.yaml (+4)

@@ -14,6 +14,10 @@ backendRuntime:
     image:
       repository: lmsysorg/sglang
       tag: v0.4.5-cu121
+  tensorrt_llm:
+    image:
+      repository: nvcr.io/nvidia/tritonserver
+      tag: 25.03-trtllm-python-py3
   tgi:
     image:
       repository: ghcr.io/huggingface/text-generation-inference
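
Not part of the diff: a minimal sketch of a Helm values override built on the keys added above (`backendRuntime.enabled` and `backendRuntime.tensorrt_llm.image.*`). The file name `my-values.yaml` is a hypothetical example, and the repository/tag shown are simply the defaults introduced by this commit.

# my-values.yaml (hypothetical override file)
backendRuntime:
  enabled: true   # the tensorrt-llm BackendRuntime template above is gated on this flag
  tensorrt_llm:
    image:
      repository: nvcr.io/nvidia/tritonserver   # default from this commit; point at a mirror if needed
      tag: 25.03-trtllm-python-py3               # pin whichever trtllm-serve-capable tag you validate

Helm merges later values files over earlier ones, so an override passed with an extra -f flag only needs the keys being changed.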

docs/examples/README.md (+9, -4)

@@ -9,11 +9,12 @@ We provide a set of examples to help you serve large language models, by default
 - [Deploy models from ObjectStore](#deploy-models-from-objectstore)
 - [Deploy models via SGLang](#deploy-models-via-sglang)
 - [Deploy models via llama.cpp](#deploy-models-via-llamacpp)
-- [Deploy models via text-generation-inference](#deploy-models-via-tgi)
-- [Deploy models via ollama](#ollama)
+- [Deploy models via TensorRT-LLM](#deploy-models-via-tensorrt-llm)
+- [Deploy models via text-generation-inference](#deploy-models-via-text-generation-inference)
+- [Deploy models via ollama](#deploy-models-via-ollama)
 - [Speculative Decoding with vLLM](#speculative-decoding-with-vllm)
-- [Deploy multi-host inference](#multi-host-inference)
-- [Deploy host models](#deploy-host-models)
+- [Multi-Host Inference](#multi-host-inference)
+- [Deploy Host Models](#deploy-host-models)
 - [Envoy AI Gateway](#envoy-ai-gateway)
 
 ### Deploy models from Huggingface

@@ -46,6 +47,10 @@ By default, we use [vLLM](https://github.com/vllm-project/vllm) as the inference
 
 [llama.cpp](https://github.com/ggerganov/llama.cpp) can serve models on a wide variety of hardwares, such as CPU, see [example](./llamacpp/) here.
 
+### Deploy models via TensorRT-LLM
+
+[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) provides users with an easy-to-use Python API to define Large Language Models (LLMs) and support state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs, see [example](./tensorrt-llm/) here.
+
 ### Deploy models via text-generation-inference
 
 [text-generation-inference](https://github.com/huggingface/text-generation-inference) is used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoint. see [example](./tgi/) here.

+25 (new file)

@@ -0,0 +1,25 @@
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: qwen2-0--5b
spec:
  familyName: qwen2
  source:
    modelHub:
      modelID: Qwen/Qwen2-0.5B-Instruct
  inferenceConfig:
    flavors:
      - name: a10 # GPU type
        limits:
          nvidia.com/gpu: 1
---
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0--5b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0--5b
  backendRuntimeConfig:
    backendName: tensorrt-llm
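
Not part of the commit: the `flavors` list in the OpenModel above is where the README's accelerator fungibility claim becomes concrete. Below is a hedged sketch of the same model declaring two candidate GPU flavors; the second flavor's name is an arbitrary illustrative label, and how llmaz picks between flavors is left to its scheduling logic rather than asserted here.

apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: qwen2-0--5b
spec:
  familyName: qwen2
  source:
    modelHub:
      modelID: Qwen/Qwen2-0.5B-Instruct
  inferenceConfig:
    flavors:
      - name: a10   # GPU type used in the example above
        limits:
          nvidia.com/gpu: 1
      - name: l4    # illustrative second candidate GPU type
        limits:
          nvidia.com/gpu: 1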

site/content/en/docs/integrations/support-backends.md (+4)

@@ -13,6 +13,10 @@ If you want to integrate more backends into llmaz, please refer to this [PR](htt
 
 [SGLang](https://github.com/sgl-project/sglang) is yet another fast serving framework for large language models and vision language models.
 
+## TensorRT-LLM
+
+[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) provides users with an easy-to-use Python API to define Large Language Models (LLMs) and support state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in performant way.
+
 ## Text-Generation-Inference
 
 [text-generation-inference](https://github.com/huggingface/text-generation-inference) is a Rust, Python and gRPC server for text generation inference. Used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoint.

+51 (new file)

@@ -0,0 +1,51 @@
apiVersion: inference.llmaz.io/v1alpha1
kind: BackendRuntime
metadata:
  labels:
    app.kubernetes.io/name: backendruntime
    app.kubernetes.io/part-of: llmaz
    app.kubernetes.io/created-by: llmaz
  name: tensorrt-llm
spec:
  command:
    - trtllm-serve
  image: nvcr.io/nvidia/tritonserver
  version: 25.03-trtllm-python-py3
  # Do not edit the preset argument name unless you know what you're doing.
  # Free to add more arguments with your requirements.
  recommendedConfigs:
    - name: default
      args:
        - "{{`{{ .ModelPath }}`}}"
        - --host
        - "0.0.0.0"
        - --port
        - "8080"
      sharedMemorySize: 2Gi
      resources:
        requests:
          cpu: 4
          memory: 16Gi
        limits:
          cpu: 4
          memory: 16Gi
  startupProbe:
    periodSeconds: 10
    failureThreshold: 30
    httpGet:
      path: /health
      port: 8080
  livenessProbe:
    initialDelaySeconds: 15
    periodSeconds: 10
    failureThreshold: 3
    httpGet:
      path: /health
      port: 8080
  readinessProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    failureThreshold: 3
    httpGet:
      path: /health
      port: 8080
