* feat: add TensorRT-LLM as backend
* update readme
* update readme
* remove example to resolve conflicts
* remove example to resolve conflicts
* fix
* add tensorrt-llm example
* fix folder name
**README.md** (+1 −1)
@@ -39,7 +39,7 @@ Easy, advanced inference platform for large language models on Kubernetes
 ## Key Features
 
 - **Easy of Use**: People can quick deploy a LLM service with minimal configurations.
-- **Broad Backends Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp). Find the full list of supported backends [here](./site/content/en/docs/integrations/support-backends.md).
+- **Broad Backends Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). Find the full list of supported backends [here](./site/content/en/docs/integrations/support-backends.md).
 - **Accelerator Fungibility**: llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
 - **Various Model Providers**: llmaz supports a wide range of model providers, such as [HuggingFace](https://huggingface.co/), [ModelScope](https://www.modelscope.cn), ObjectStores. llmaz will automatically handle the model loading, requiring no effort from users.
 - **Multi-Host Support**: llmaz supports both single-host and multi-host scenarios with [LWS](https://github.com/kubernetes-sigs/lws) from day 0.

@@ -46,6 +47,10 @@ By default, we use [vLLM](https://github.com/vllm-project/vllm) as the inference
 
 [llama.cpp](https://github.com/ggerganov/llama.cpp) can serve models on a wide variety of hardwares, such as CPU, see [example](./llamacpp/) here.
 
+### Deploy models via TensorRT-LLM
+
+[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) provides users with an easy-to-use Python API to define Large Language Models (LLMs) and support state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs, see [example](./tensorrt-llm/) here.
+
 ### Deploy models via text-generation-inference
 
 [text-generation-inference](https://github.com/huggingface/text-generation-inference) is used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoint. see [example](./tgi/) here.
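
For context on the backend this PR adds: below is a minimal sketch of the high-level Python `LLM` API that the new README section refers to. It assumes a working `tensorrt_llm` installation on an NVIDIA GPU; the model checkpoint and sampling settings are illustrative placeholders, not values taken from this PR.

```python
# Minimal sketch of TensorRT-LLM's high-level Python API (assumed installed).
# The model checkpoint below is an illustrative placeholder.
from tensorrt_llm import LLM, SamplingParams


def main():
    # Builds (or loads a cached) TensorRT engine for the given model.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    prompts = ["The capital of France is"]
    sampling = SamplingParams(temperature=0.8, max_tokens=32)

    # Batched generation on the compiled engine.
    for output in llm.generate(prompts, sampling):
        print(output.outputs[0].text)


if __name__ == "__main__":
    main()
```

Within llmaz itself, users would not call this API directly; the new backend wraps the TensorRT-LLM runtime behind the same Kubernetes deployment abstractions used by the other backends.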
**site/content/en/docs/integrations/support-backends.md** (+4)
@@ -13,6 +13,10 @@ If you want to integrate more backends into llmaz, please refer to this [PR](htt
 
 [SGLang](https://github.com/sgl-project/sglang) is yet another fast serving framework for large language models and vision language models.
 
+## TensorRT-LLM
+
+[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) provides users with an easy-to-use Python API to define Large Language Models (LLMs) and support state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in performant way.
+
 ## Text-Generation-Inference
 
 [text-generation-inference](https://github.com/huggingface/text-generation-inference) is a Rust, Python and gRPC server for text generation inference. Used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoint.
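
To round out the TensorRT-LLM entry above, here is a hedged usage sketch for querying such a backend once llmaz has deployed it, assuming the resulting service exposes an OpenAI-compatible HTTP endpoint as llmaz inference backends generally do. The host, port, and model name are illustrative assumptions, not values from this PR.

```python
# Hypothetical smoke test against a deployed TensorRT-LLM backend.
# Assumes the Kubernetes Service is port-forwarded to localhost:8080 and
# speaks the OpenAI completions protocol; all names here are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-used")

resp = client.completions.create(
    model="my-trtllm-model",  # placeholder model id
    prompt="The capital of France is",
    max_tokens=16,
)
print(resp.choices[0].text)
```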