Highlights
- Use FlashAttention3 as the default attention backend for mainstream models (DeepSeek, Qwen, Llama, etc.); see the first sketch after this list. #4709
- PD disaggregation with Mooncake and NIXL transfer backends; see the second sketch after this list. #4880 #5477 #4655
- DeepSeek performance improvements: turn on DeepGEMM by default and add several kernel fusions. #5580 #5628
- Update PyTorch to 2.6.0 and fix the torch.compile cache. #5417 #5213
- Preliminary support for Blackwell. #5303
Thanks very much to the LinkedIn team, Alibaba Cloud, the Mooncake team, the NVIDIA team, the AMD team, the PyTorch team, Ant Group, the Baseten team, the Oracle team, the Meituan team, the iFlytek MaaS team, and the open source community users for their contributions!
We’re thrilled about these advancements and eager to hear your feedback! Join us on our Slack channel at slack.sglang.ai to connect and share your thoughts. Cheers!
Coming Soon
- Large scale expert parallelism + PD disaggregation #4734 #5524
- Pipeline Parallelism #5724
- MLA Cutlass Backend #5390
What's Changed
- [ci] fix llama4 ci error by @BBuf in #5126
- Refactor and Optimize FA3 Code by @hebiao064 in #5090
- Add Llama4 user guide by @ispobock in #5133
- [Misc] Use pytest.mark.skipif in sgl-kernel test by @yinfan98 in #5137
- feat: disable grammar restrictions within reasoning sections by @minleminzui in #4984
- [modelopt] automatically inspect if model is ModelOpt quantized and set quantization method by @yundai424 in #5145
- [AMD] Fix missing per_token_group_quant_fp8 for ROCm by @hubertlu-tw in #5140
- fix multimodal hash feature by @huangtingwei9988 in #5083
- Fix run time error in ROCm platform by @kkHuang-amd in #5147
- [FA3 Feature] Support multi modal Llama-3.2-11B-Vision-Instruct by @zcnrex in #5103
- Add unit test on page_size > 1 and mla and integration test for Flash Attention 3 by @yubofredwang in #4760
- Use public model for FA3 speculative decode testing by @yubofredwang in #5152
- Add dummy grok test to amd CI. by @saienduri in #5115
- fix empty_cache error in pt_weights_iterator by @dangkai4u in #5151
- Fix torch compile errors by @kkHuang-amd in #5158
- Fix loading KV quantization scale; Enable modelopt kv cache by @yundai424 in #4686
- [PD] Fix unclosed prefill connection warning of mini_lb by @ShangmingCai in #5155
- Add optimized native kernels in sgl-kernel by @mingfeima in #5150
- [PD] Simplify mini LB by @ByronHsu in #4911
- Small improvement of native api docs by @simveit in #5139
- [feat&refactor] Enhance multimodal input support with refactor io_struct by @JustinTong0323 in #4938
- Support 2x8xH100 for Llama 4 by @fzyzcjy in #5159
- FP4 weight loading and inference (2/2) by @trevor-m in #3972
- Fix multimodal hashing error by @fzyzcjy in #5174
- Tiny disable model that does not work by @fzyzcjy in #5175
- [Bugfix] Fix index out of bounds in local attention with large sequences by @CatherineSue in #5173
- [Fix] DeepEP Compatibility with Low Latency by @liz-badada in #5068
- docs: remove the use of Downward API for LWS_WORKER_INDEX by @yankay in #5110
- feat: add DeepGEMM build warning by @zhyncs in #5176
- fix: use DeepEPDispatcher on CUDA by @zhyncs in #5180
- [DeepEP] fix: import buffer error by @ch-wan in #5179
- Let `bench_one_batch` support `enable_dp_attention` by @fzyzcjy in #4058
- [Misc] clean up vllm in sgl-kernel test by @yinfan98 in #5189
- Fix ci test "test_eval_fp8_accuracy" failed by @kkHuang-amd in #5185
- Optimize topk operation in llama4 by @fzyzcjy in #5128
- Support Llama4 fp8 inference by @HandH1998 in #5194
- [ci] fix ci test fused_moe op by @BBuf in #5102
- model: support mllama4 by @mickqian in #5144
- Rework grok test. by @saienduri in #5171
- sgl-kernel use cutlass latest version for fp8 blockwise gemm by @yizhang2077 in #5207
- Add H20 dtype fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 by @Muuuchen in #5196
- fix: log warning when disable cuda graph by @zhyncs in #5209
- [metrics] Add in queue metrics by @hebiao064 in #4444
- Fix DeepSeek error when using DeepEP mode by @fzyzcjy in #5190
- reduce moe_align_block_size_kernel small batch mode overhead by @BBuf in #5086
- [PD] Support KV transfer with mooncake by @stmatengss in #4880
- [PD] Add get_contiguous_buf_infos interface for MLATokenToKVPool by @stmatengss in #5204
- Update deps for mllama4 by @ispobock in #5215
- Fix deepseek-v3 with torch.compile in PyTorch 2.6. by @zou3519 in #5213
- ROCm sgl-kernel: compatible to later torch by @HaiShaw in #5167
- [Misc] Clean sgl-kernel test by @yinfan98 in #5216
- Update Makefile / build script to avoid installing incompatible torch dependency by @elfiegg in #5245
- Fix torch.compile cacheing by @zou3519 in #5259
- ROCm/AITER CK_MoE: update 2-stage kernels & support both Activations by @HaiShaw in #5228
- Optimize attention in llama4 by @fzyzcjy in #5127
- Optimize GPU memory usage in FlashAttentionBackend's strided indexing by @CatherineSue in #5262
- Support `--enable-llama4-multimodal` by @ch-wan in #5254
- [fix] fix mrope positions not picked up by @mickqian in #5265
- doc: nested loop code for offline engine by @minleminzui in #5244
- fix: examples for token_in_token_out_vlm by @JustinTong0323 in #5193
- Fix a 404 link in send_request.ipynb by @windsonsea in #5280
- fix: enable fp4 compilation on cu128 by @zhyncs in #5286
- feat: add cu128 identifier for sgl-kernel by @zhyncs in #5287
- chore: relax the torch version restriction for sgl-kernel compilation by @zhyncs in #5288
- chore: bump sgl-kernel v0.0.8.post1 by @zhyncs in #5289
- [PD] fix: skip warmup request in disaggregation mode to prevent crash on timeout by @GaoYusong in #5292
- [Docs] Supported Model Docs - Major restructuring by @adarshxs in #5290
- fix: update update_wheel_index for cu128 by @zhyncs in #5300
- [Docs] Remove the older supported docs section by @adarshxs in #5301
- remove moe_align_block_size torch.zeros in small batch/expert mode by @BBuf in #5298
- feat: add blackwell Dockerfile by @zhyncs in #5302
- feat: add blackwell workflow by @zhyncs in #5303
- fix: use fa3 unit test on hopper only by @zhyncs in #5304
- misc: update blackwell Dockerfile by @zhyncs in #5306
- fix: remove cublas_grouped_gemm by @zhyncs in #5307
- fix: update flash attn by @zhyncs in #5308
- fix: use deepgemm only on hopper by @zhyncs in #5310
- [VLM] Adopt fast image processor by default by @mickqian in #5065
- Adjust ci test threshold by @ispobock in #5271
- Blackwell Cutlass MLA kernel by @trevor-m in #5142
- misc: cleanup 3rdparty by @zhyncs in #5311
- update variable naming and comments for rocm by @Lzy17 in #5299
- Fix w8a8_int8 model shared experts fusion load weights error by @lambert0312 in #5120
- Add flash_attn_varlen_func to sgl-kernel by @Fridge003 in #5315
- Fix fa3 window size setup by @qingquansong in #5316
- chore: bump sgl-kernel v0.0.8.post2 by @zhyncs in #5317
- feat: use fa3 mla by default on hopper by @zhyncs in #5210
- Fix: docs/backend/structured_outputs.ipynb by @thyecust in #4884
- Delete python/sglang/srt/layers/moe/fused_moe_triton/configs/E=257,N=… by @BBuf in #5321
- refine fused_moe tuning docs by @BBuf in #5294
- Support server based rollout in Verlengine by @yitianlian in #4848
- [Feat] Add sparse attn to sgl-kernel by @yinfan98 in #5327
- fix: solve cu118 issue for cutlass mla by @zhyncs in #5331
- chore: bump sgl-kernel v0.0.8.post3 by @zhyncs in #5332
- ci: update release node by @zhyncs in #5333
- fix: determine if flashinfer is installed by @zhyncs in #5336
- feat: adapt merge_state by @zhyncs in #5337
- misc: update sagemaker Dockerfile by @zhyncs in #5341
- Fix: ensure tensors used in dist.broadcast are created on the correct… by @minleminzui in #5322
- docs: update adoption and sponsorship list with Oracle by @zhyncs in #5343
- chore: upgrade sgl-kernel 0.0.8.post3 by @zhyncs in #5342
- Fix typo: infight -> inflight by @hnyls2002 in #5357
- [PD] Add transfer backend abstraction by @ByronHsu in #5328
- fix MLATokenToKVPoolHost get_size_per_token bug by @huangtingwei9988 in #5161
- fix #5322 by @zhyncs in #5359
- feat: update experiment_runner by @zhyncs in #5360
- [DeepEP] Reduce routed scaling overhead by @yuleil in #5277
- Free metadata_buffer_index after transfer finished by @jokerwyt in #5364
- Fix DeepSeek DP Attention + torch compile by @fzyzcjy in #5367
- Support for Qwen2.5-VL Model in bitsandbytes Format by @yhyang201 in #5003
- Fix PD disaggregation bugs by @hnyls2002 in #5326
- [PD Bug] fix MLA get_contiguous_buf_infos error by @whybeyoung in #5384
- [perf] experimental enhance fp8 per-tensor quant by @Alcanderian in #5370
- Apply deepseek cuda rope by @ispobock in #5385
- apply fused moe gate in ds v3/r1 by @BBuf in #5371
- fix: update test config by @zhyncs in #5392
- [Fix] Turn off DeepGEMM by default by @Fridge003 in #5263
- minor clean up of sgl-kernel/CMakeLists.txt by @merrymercy in #5393
- Add A800 shared experts fused MoE kernel tuning configs for DeepSeek V3/R1 by @lambert0312 in #5368
- Add H20 dtype fp8_w8a8 shared experts fused MoE kernel tuning configs for DeepSeek V3/R1 by @Ximingwang-09 in #5291
- [fix/misc] remove duplicate row in deepseek v2 model by @yyccli in #5279
- chore: upgrade DeepGEMM by @zhyncs in #5395
- fix: update pr-test-sgl-kernel by @zhyncs in #5399
- kernel: support slightly faster merge_state_v2 cuda kernel by @DefTruth in #5381
- chore: bump sgl-kernel 0.0.9 by @zhyncs in #5400
- chore: upgrade sgl-kernel 0.0.9 by @zhyncs in #5401
- Tiny fix DeepseekScalingRotaryEmbedding always use forward_native by @fzyzcjy in #5406
- Fix bench_serving with random-ids by @guoyuhong in #5214
- [misc] fix ci flaky case by @Alcanderian in #5352
- [FIX] Fix concatenation error in capture_bs when enabling --disable-cuda-graph-padding without MTP by @Muuuchen in #5412
- Support dynamic connection and TP 16 by @yuan-luo in #5351
- Fix broadcast use cuda device lead to memory capacity unbalanced by @lambert0312 in #5416
- [PD] Fix dynamic port support and MLA buffer for Mooncake by @ShangmingCai in #5415
- Distinguish bootstrap key only in decode server by @hnyls2002 in #5422
- [PD] Remove unused bootstrap param and fix port table type by @ShangmingCai in #5423
- [minor] cleanup cmakelists.txt by @merrymercy in #5420
- bugfix: fix merge_state_v2 cuda graph by @DefTruth in #5419
- chore: bump sgl-kernel v0.0.9.post1 by @zhyncs in #5430
- fix: solve release issue by @zhyncs in #5434
- Blackwell cutlass mla: Add check for bad page size/block num combinations by @trevor-m in #5431
- feat: update model_specific_adjustment by @zhyncs in #5344
- chore: upgrade sgl-kernel 0.0.9.post1 by @zhyncs in #5436
- Fix ignore_eos parameter when loading a chat template by @CatherineSue in #5264
- add attention backend supporting matrix in the doc by @mRSun15 in #5211
- Support BNB quantization for llama/mllama by @ryang-max in #5038
- [Docs] Update start/install.md by @windsonsea in #5398
- [Minor] Move torch.compile patch to a better place by @merrymercy in #5397
- [Bug fix] need record start time in pd mode by @whybeyoung in #5425
- Support MHA with chunked prefix cache for DeepSeek chunked prefill by @Fridge003 in #5113
- chore: bump v0.4.5.post1 by @zhyncs in #5445
- Fix several minor issues in PD disaggregation by @ch-wan in #5444
- [doc] Update benchmark_and_profiling.md by @BBuf in #5449
- Update cutlass dependency. by @elfiegg in #5447
- add multi-lora feature in README.md by @Ying1123 in #5463
- Clean up imports by @merrymercy in #5467
- [verl] Modify the update_weights func to align with verl's resharding by @BearBiscuit05 in #5345
- [Model Support] unsloth/Phi-4-mini bnb model by @yyihuang in #4982
- Update attention_backend.md: plural form by @didier-durand in #5489
- Add test for flash_attn_varlen_func kernel by @Fridge003 in #5484
- Deprecate disable-mla by @Fridge003 in #5481
- Deprecate enable-flashinfer-mla and enable-flashmla by @Fridge003 in #5480
- Feat/support encoder model (like bert) by @woodx9 in #4887
- Enable local attention during decode by @CatherineSue in #5479
- Refactor DeepSeek decoder layer branches by @fzyzcjy in #5205
- Fix a link in sgl-kernel/README.md by @windsonsea in #5493
- [Bug fix] use correct func path in deepseek by @XucSh in #5496
- Doc: fix problems of the 'Execute Notebooks / run-all-notebooks' ci caused by the instability of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B by @minleminzui in #5503
- [Feat] Update sgl-kernel flashinfer to latest main version by @yinfan98 in #5500
- Fix: Incorrect parameters passed to forward_batch_generation (#5506) by @u4lr451 in #5511
- Fix: fix the exception 'the memory capacity is unbalanced. Some GPUs … by @minleminzui in #5426
- [docs] Fix several consistency issues in sampling_params.md by @windsonsea in #5373
- Configuration qwen2_moe.py - qkv_bias now in transformers by @michaelfeil in #5512
- Introduce moe_dense_tp_size to fix dense layer errors in DeepSeek V3 + 4x8xH100 by @fzyzcjy in #4836
- Sgl kernel fused_moe_gate support n_shared_experts by @BBuf in #5440
- chore: bump sgl-kernel 0.0.9.post2 by @zhyncs in #5518
- use sglang_per_token_group_quant_fp8 from sgl-kernel instead of Triton kernel by @strgrb in #5473
- fix kimi vl running bug after rebase main by @BBuf in #5461
- fix bug of VLLM_AVAILABLE not defined by @liwenju0 in #5497
- Avoid computing lse in Ragged Prefill when there's no prefix. by @Edenzzzz in #5476
- [Model] Adding Qwen3 and Qwen3MoE by @yhyang201 in #4693
- fix util import by @zhyncs in #5542
- Revert "Avoid computing lse in Ragged Prefill when there's no prefix.… by @zhyncs in #5544
- chore: upgrade sgl-kernel 0.0.9.post2 by @zhyncs in #5540
- Fix DeepGEMM masked cannot be run on groups not being a multiple of 4 by @fzyzcjy in #5340
- Make profiler output file names consistent by @fzyzcjy in #5548
- [PD] Tiny fix timeout error when generate by @fzyzcjy in #5545
- [PD] Fix no cache connect for receiver by @whybeyoung in #5534
- feat: use flashinfer jit package by @zhyncs in #5547
- [PD] Remove the requirement of config file for mooncake backend by @ShangmingCai in #5460
- restruct compressed_tensors_w8a8_fp8 by @BBuf in #5475
- simplify the control logic for using shared experts fusion by @BBuf in #5504
- Remove one kernel in per_tensor_quant_mla_fp8 by @fzyzcjy in #5549
- Fix sampler nan check when calling top_k_top_p_sampling_from_probs by @yubofredwang in #5546
- [PD] Support page size > 1 by @ByronHsu in #5561
- fix hicache write back by @xiezhq-hermann in #5543
- Minor update for ROCm variable style by @Lzy17 in #5562
- Fix bench_one_batch producing unnatural results for expert parallel by @fzyzcjy in #5149
- [perf] introduce deep gemm group_gemm_masked as bmm by @Alcanderian in #5432
- [PD] Fix DeepSeek cannot be run on latest master by @fzyzcjy in #5568
- Fix BumpAllocator error when no input_ids by @fzyzcjy in #5564
- enable DeepSeek V3 shared_experts_fusion in sm90 by @BBuf in #5571
- [Fix] fix outlines and xgrammar by @Alcanderian in #4947
- [Doc]Add instruction for profiling with bench_one_batch by @Fridge003 in #5581
- Release v0.4.5.post2 by @merrymercy in #5582
- Fix bench_serving fail when zero warmup requests by @fzyzcjy in #5574
- Fix DeepEP cannot run on latest master by @fzyzcjy in #5567
- Fix torch memory saver not enabled in DP scenario by @fzyzcjy in #5560
- Super tiny fix typo by @fzyzcjy in #5559
- Add document for LoRA serving by @Fridge003 in #5521
- Tiny improve error message by @fzyzcjy in #5526
- [PD] Fix server crash when using batch requests by @fzyzcjy in #5531
- [Feat] upgrade pytorch2.6 by @sleepcoo in #5417
- Fix enable chunked prefill for Llama4 by @tarinkk in #5575
- fix: use fa3 for gemma2 by @zhyncs in #5586
- Fix ChatCompletionMessageGenericParam to allow for None content by @Amadeus-Winarto in #5452
- [PD] Fix large page size + chunk prefill by @ByronHsu in #5588
- Add test config yamls for Deepseek v3 by @Fridge003 in #5433
- [Feature] Prefill assistant response - add continue_final_message parameter by @adarshxs in #4226
- add function call parser for DeepSeek V3 by @finger92 in #5224
- smaller and non gated models for docs by @simveit in #5378
- Feat: Implement JSON Mode (response_format.type="json_object") by @kyle-pena-kuzco in #4733
- check marlin format before attempting conversion by @qeternity in #4675
- compressed_tensors: port w8a16 fp8 from vllm by @vhain in #4852
- Fix one more issue reported by torchfix by @b8zhong in #4859
- Add sanity check for max_running_requests by @fzyzcjy in #5016
- Correct grafana heatmap. by @mac0ne in #5019
- Perform Batch Tokenization. by @sundar24295s in #5141
- Speedup shared expert weight construction by avoid cloning by @fzyzcjy in #5188
- Tiny add Engine.flush_cache API by @fzyzcjy in #5241
- [misc] remove is_cuda_available by @Alcanderian in #5319
- Fix flush cache by @merrymercy in #5590
- Add Speculative Decoding Eagle3 topk > 1 by @qingquansong in #5318
- upstream hicache fixes by @xiezhq-hermann in #5570
- Tiny add warning when cannot recognize bool env var by @fzyzcjy in #5348
- Modify metrics service endpoint by @lambert0312 in #3443
- Update protocol.py to fix #4589 by @relic-yuexi in #4590
- [Feat.] Enable grafana to show metrics by @PopSoda2002 in #4718
- [Fix] Enhance DP Attention for IPv6 Compatibility by @Lucius-THU in #4937
- Support o1 model on Azure by @ChuyueSun in #4980
- Tiny remove duplicated code by @fzyzcjy in #5021
- Tiny update error hint by @fzyzcjy in #5037
- Support PD bootstrap fields on /v1/chat/completions endpoint by @jokerwyt in #5488
- [PD] Fix generate endpoint of min_lb for PD by @ShangmingCai in #5598
- [PD] Fix edge case and simplify large page size + chunked prefill by @ByronHsu in #5589
- [PD] Add NIXL transfer backend by @trevor-m in #5477
- [PD] Support decode overlap schedule by @ByronHsu in #5608
- [PD] Support prefill overlap + Ensure no race condition by @ByronHsu in #5609
- Enhance GPU memory settings by @hnyls2002 in #5604
- [feature] enable pre compile jit deep_gemm by @Alcanderian in #5580
- Clean up mem settings by @merrymercy in #5610
- Support aiter RMSNorm in AMD by @michael-amd in #5510
- chore: bump v0.4.5.post3 by @zhyncs in #5611
- Remove extra copy in deepseek forward absorb by @ispobock in #5578
- [Doc] Fix a 404 link to llama-405b by @windsonsea in #5615
- [fix] force use deepgemm in compile_deep_gemm by @Alcanderian in #5618
- [fix] fix compile_deep_gemm missing kv_b_proj by @Alcanderian in #5620
- fix: gemma 3 not use softcap by @zhyncs in #5622
- Fix FA3 DeepSeek prefill performance regression by @Alcanderian in #5624
- [NFC] Remove duplicate `compressed-tensors` by @c8ef in #5640
- Fix shared experts fusion error without quantization by @lambert0312 in #5632
- [feature] Add H20 fp8_w8a8 FusedMoE config for --n-share-experts-fusion=16 by @saltyfish66 in #5641
- fix flashmla bug by @sleepcoo in #5272
- [fix] reduce dp capture bs by @Alcanderian in #5634
- Remove q concat in FA3 backend for DeepSeek decode by @ispobock in #5638
- Revert "Support aiter RMSNorm in AMD" by @HaiShaw in #5646
- fix: update bench_speculative by @zhyncs in #5649
- Turn on DeepGemm By Default and Update Doc by @Fridge003 in #5628
- Fuse q_a_proj and kv_a_proj for DeepSeek models by @Fridge003 in #5619
- Remove unnecessary `torch.full` in DeepSeek by @fzyzcjy in #5601
- [1/2] Add FP8 Blockscale MoE CUTLASS kernel for Blackwell by @elfiegg in #5281
- fix sgl-kernel unit tests by @zhyncs in #5666
- fix awq_dequantize import by @zhyncs in #5669
- Integrating PD disaggregation with DP attention and DeepEP by @ch-wan in #5435
- fix gemma3 unit test by @zhyncs in #5670
- fix torchvision::nms not exist by @zhyncs in #5671
- [PD] Add support for dp attention with mooncake by @ShangmingCai in #5530
- tune the threshold of gemma-2-27b-it in test_nightly_gsm8k_eval.py by @merrymercy in #5677
- [Doc] Fix two 404 links caused by sglang typo by @windsonsea in #5667
- fix: update truss bench_serving by @zhyncs in #5683
- fix: only compile ApplyTokenBitmaskInplace cu124+ by @zhyncs in #5686
- chore: bump sgl-kernel 0.1.0 by @zhyncs in #5688
- vlm: enable radix cache for qwen-vl models by @mickqian in #5349
- [BugFix] Fix combination of MTP and `--n-share-experts-fusion` with R1 by @guoyuhong in #5707
- Fix weight loading bug for Deepseek v3+nextn by @Fridge003 in #5684
- Add example to use sgl engine with fastapi by @ravi03071991 in #5648
- [Doc] Fix a link to Weilin Zhao by @windsonsea in #5706
- Add MMMU benchmark results by @ravi03071991 in #4491
- [Model] Support `ArcticForCausalLM` architecture (Snowflake/snowflake-arctic-instruct) by @b8zhong in #5078
- [PD] Better logs by @hnyls2002 in #5715
- [PD] Add kvargs table and thread pool for kvcache sender of mooncake by @ShangmingCai in #5738
- [PD]: Support Multi Prefill in one node by @hcyz33 in #5704
- Fix: deepseek forward absorb by @michael-amd in #5723
- Pin torch audio to 2.6.0 by @merrymercy in #5750
- Revert "[Model] Support
ArcticForCausalLM
architecture (Snowflake/snowflake-arctic-instruct)" by @merrymercy in #5754 - Disable flaky eagle tests by @merrymercy in #5753
- update triton 3.2.0 h200 fused moe triton config and add warning about triton fused_moe_kernel performance degradation due to different Triton versions. by @BBuf in #5740
- [Docs] Update runtime/engine/readme.md by @windsonsea in #5737
- Reorder loop in shared expert weight loading by @ispobock in #5719
- fix: fix one more bug from merging mm_inputs by @mickqian in #5718
- [Fix]: support deepseek-vl2-tiny model by @bppps in #5552
- Bugfix for minicpmo vision test by @yizhang2077 in #5760
- [Minor] fix documentations by @merrymercy in #5756
- Add an assertion to enhance the robustness of the operator by @liwenju0 in #5736
- fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512 by @lkm2835 in #5733
- Use device_id in dist init to reduce NCCL communicator warmup & creation overhead by @Edenzzzz in #5728
- [fix] fix potential bumpy throughput with deepgemm by @Alcanderian in #5722
- Resolves the `404 Not Found` error when running `compile_deep_gemm.py` in multi-node setups by @guoyuhong in #5720
- perf: update H20 fused_moe_triton kernel config to get higher throughput during prefilling by @saltyfish66 in #5716
- Fix the nonexistent access of `decrypted_config_file` by @vincentzed in #5685
- CI: rewrite test_vision_chunked_prefill to speedup by @mickqian in #5682
- Fuse MLA set kv cache kernel by @ispobock in #5748
- Update amd docker image to `sglang:v0.4.5.post3-rocm630` by @saienduri in #5697
- [feature] support for roberta embedding models by @DavidBao03 in #5730
- [fix] fix bench_one_batch_server by @Alcanderian in #5607
- support for the DeepSeek model by enabling streaming response parsing by @Frank-Jie in #5592
- fix: Use `is not None` instead of `!= None` for None checks by @vincentzed in #5687
- Add Llama 4 to FA3 test by @hebiao064 in #5509
- [misc] more decode step log for bench_one_batch by @Alcanderian in #5565
- Handle JSONDecodeError while processing request data by @yan97ao in #5599
- fix(srt): check if sample_indices is not None before usage. by @aoshen524 in #5633
- update llguidance to 0.7.11; adds StructTag by @mmoskal in #4870
- Use sgl-kernel sgl_per_token_group_quant_int8 by @lambert0312 in #4971
- Add memory_saver check by @kebe7jun in #4986
- add switch to disable OpenAPI doc by @congcongke in #3744
- Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512" by @merrymercy in #5772
- Fix eagle test case by @merrymercy in #5776
- Split local attention test from fa3 test by @Fridge003 in #5774
- Revert "Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512"" by @merrymercy in #5777
- Simplify FA3 tests by @merrymercy in #5779
- Revert "[fix] fix bench_one_batch_server" by @merrymercy in #5785
- Revert "Use device_id in dist init to reduce NCCL communicator warmup & creation overhead" by @merrymercy in #5786
- [CI] Tune threshold by @merrymercy in #5787
- [CI] fix port conflicts by @merrymercy in #5789
- [CI] Fix ci tests by @merrymercy in #5769
- [PD] Reduce kv transfer threads by @hnyls2002 in #5791
- [CI] Fix test case by @merrymercy in #5790
- Add 8-GPU Test for Deepseek-V3 by @Fridge003 in #5691
- Release v0.4.6 by @Fridge003 in #5795
New Contributors
- @huangtingwei9988 made their first contribution in #5083
- @yubofredwang made their first contribution in #4760
- @dangkai4u made their first contribution in #5151
- @ShangmingCai made their first contribution in #5155
- @mingfeima made their first contribution in #5150
- @yankay made their first contribution in #5110
- @Muuuchen made their first contribution in #5196
- @stmatengss made their first contribution in #4880
- @zou3519 made their first contribution in #5213
- @GaoYusong made their first contribution in #5292
- @Lzy17 made their first contribution in #5299
- @thyecust made their first contribution in #4884
- @yitianlian made their first contribution in #4848
- @yuleil made their first contribution in #5277
- @jokerwyt made their first contribution in #5364
- @yhyang201 made their first contribution in #5003
- @yyccli made their first contribution in #5279
- @DefTruth made their first contribution in #5381
- @yuan-luo made their first contribution in #5351
- @mRSun15 made their first contribution in #5211
- @ryang-max made their first contribution in #5038
- @BearBiscuit05 made their first contribution in #5345
- @yyihuang made their first contribution in #4982
- @u4lr451 made their first contribution in #5511
- @liwenju0 made their first contribution in #5497
- @Amadeus-Winarto made their first contribution in #5452
- @finger92 made their first contribution in #5224
- @kyle-pena-kuzco made their first contribution in #4733
- @mac0ne made their first contribution in #5019
- @sundar24295s made their first contribution in #5141
- @relic-yuexi made their first contribution in #4590
- @PopSoda2002 made their first contribution in #4718
- @Lucius-THU made their first contribution in #4937
- @michael-amd made their first contribution in #5510
- @c8ef made their first contribution in #5640
- @bppps made their first contribution in #5552
- @vincentzed made their first contribution in #5685
- @DavidBao03 made their first contribution in #5730
- @Frank-Jie made their first contribution in #5592
- @yan97ao made their first contribution in #5599
- @mmoskal made their first contribution in #4870
- @congcongke made their first contribution in #3744
Full Changelog: v0.4.5...v0.4.6