Releases · runpod-workers/worker-vllm
Worker vLLM 0.2.3 - What's Changed
Various bug fixes
New Contributors
- @casper-hansen made their first contribution in #39
- @willsamu made their first contribution in #45
Worker vLLM 0.2.2 - What's New
- Custom Chat Templates: you may now specify a Jinja chat template with an environment variable (see the sketch below).
- Custom Tokenizer
Fixes:
- Tensor Parallel/Multi-GPU Deployment
- Baking the model into the image: previously, the worker would download the model every time, ignoring the baked-in model.
- Crashes caused by `MAX_PARALLEL_LOADING_WORKERS`
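As a minimal sketch of what a Jinja chat template does, the snippet below renders a message list into a single prompt string. The environment variable name and the template body here are illustrative assumptions, not the worker's confirmed interface:

```python
import os
from jinja2 import Template

# Assumed variable name for illustration: the worker reads a
# Jinja chat template from an environment variable.
os.environ["CUSTOM_CHAT_TEMPLATE"] = (
    "{% for message in messages %}"
    "<|{{ message['role'] }}|>: {{ message['content'] }}\n"
    "{% endfor %}"
    "<|assistant|>:"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is vLLM?"},
]

# Render the message list into one prompt string, as a chat
# template would before generation.
prompt = Template(os.environ["CUSTOM_CHAT_TEMPLATE"]).render(messages=messages)
print(prompt)
```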
Worker vLLM 0.2.1 - What's New
- Added OpenAI Chat Completions-formatted output for non-streaming use (previously only supported for streaming).
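A hedged sketch of a non-streaming request against a deployed endpoint follows. The endpoint ID and API key are placeholders, and the `use_openai_format` input field is an assumption for illustration, not a confirmed schema:

```python
import requests

ENDPOINT_ID = "your-endpoint-id"   # placeholder
API_KEY = "your-runpod-api-key"    # placeholder

# /runsync returns the full (non-streaming) result in one response.
url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"

payload = {
    "input": {
        "prompt": "Explain tensor parallelism in one sentence.",
        # Assumed flag name for illustration: request OpenAI
        # Chat Completions-formatted output instead of raw text.
        "use_openai_format": True,
    }
}

resp = requests.post(
    url,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=120,
)
print(resp.json())
```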
Worker vLLM 0.2.0 - What's New
- You no longer need a Linux-based machine or NVIDIA GPUs to build the worker.
- Over 3x smaller Docker image.
- OpenAI Chat Completion output format (optional to use).
- Faster image build time.
- Docker Secrets-protected Hugging Face token support for building the image with a model baked in without exposing your token.
- Support for `n` and `best_of` sampling parameters, which allow you to generate multiple responses from a single prompt (see the sketch after this list).
- New environment variables for various configuration options.
- vLLM Version: 0.2.7
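To illustrate what `n` and `best_of` do, here is a minimal sketch using vLLM directly, which the worker presumably forwards these parameters to via `SamplingParams`. The model name is a placeholder:

```python
from vllm import LLM, SamplingParams

# Placeholder model; any Hugging Face causal LM supported by vLLM works.
llm = LLM(model="facebook/opt-125m")

# best_of=4 generates four candidate completions per prompt;
# n=2 returns the two highest-scoring ones.
params = SamplingParams(n=2, best_of=4, temperature=0.8, max_tokens=64)

outputs = llm.generate(["Write a haiku about GPUs."], params)
for completion in outputs[0].outputs:
    print(completion.text)
```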
Worker vLLM 0.1.0 - What's Changed
- Fixed `STREAMING` environment variable not being interpreted as a boolean by @vladmihaisima in #4
- 10x Faster New Worker by @alpayariyak in #18
- Update runpod package version by @github-actions in #19
- fix: update badge by @justinmerrell in #20
- Chat Template Feature, Message List, Small Refactor by @alpayariyak in #27
New Contributors
- @vladmihaisima made their first contribution in #4
- @alpayariyak made their first contribution in #18
- @github-actions made their first contribution in #19
- @justinmerrell made their first contribution in #20
Full Changelog: https://github.com/runpod-workers/worker-vllm/commits/0.1.0