Highlights
- Use FlashAttention3 as the default attention backend for mainstream models (DeepSeek, Qwen, Llama, etc.); see the first sketch after this list. #4709
- PD disaggregation with Mooncake and NIXL transfer backends; see the second sketch after this list. #4880 #5477 #4655
- DeepSeek performance improvements: turn on DeepGEMM by default and add several kernel fusions. #5580 #5628
- Update PyTorch to 2.6.0 and fix the torch.compile cache. #5417 #5213
- Preliminary support for Blackwell. #5303
Thanks very much to the LinkedIn team, Alibaba Cloud, the Mooncake team, the NVIDIA team, the AMD team, the PyTorch team, Ant Group, the Baseten team, the Oracle team, the Meituan team, the iFlytek MaaS team, and the open source community users for their contributions!
We’re thrilled about these advancements and eager to hear your feedback! Join us on our Slack channel at slack.sglang.ai to connect and share your thoughts. Cheers!
Coming Soon
- Large scale expert parallelism + PD disaggregation #4734 #5524
- Pipeline Parallelism #5724
- MLA Cutlass Backend #5390
What's Changed
- [ci] fix llama4 ci error by @BBuf in #5126
- Refactor and Optimize FA3 Code by @hebiao064 in #5090
- Add Llama4 user guide by @ispobock in #5133
- [Misc] Use pytest.mark.skipif in sgl-kernel test by @yinfan98 in #5137
- feat: disable grammar restrictions within reasoning sections by @minleminzui in #4984
- [modelopt] automatically inspect if model is ModelOpt quantized and set quantization method by @yundai424 in #5145
- [AMD] Fix missing per_token_group_quant_fp8 for ROCm by @hubertlu-tw in #5140
- fix multimodal hash feature by @huangtingwei9988 in #5083
- Fix run time error in ROCm platform by @kkHuang-amd in #5147
- [FA3 Feature] Support multi modal Llama-3.2-11B-Vision-Instruct by @zcnrex in #5103
- Add unit test on page_size > 1 and mla and integration test for Flash Attention 3 by @yubofredwang in #4760
- Use public model for FA3 speculative decode testing by @yubofredwang in #5152
- Add dummy grok test to amd CI. by @saienduri in #5115
- fix empty_cache error in pt_weights_iterator by @dangkai4u in #5151
- Fix torch compile errors by @kkHuang-amd in #5158
- Fix loading KV quantization scale; Enable modelopt kv cache by @yundai424 in #4686
- [PD] Fix unclosed prefill connection warning of mini_lb by @ShangmingCai in #5155
- Add optimized native kernels in sgl-kernel by @mingfeima in #5150
- [PD] Simplify mini LB by @ByronHsu in #4911
- Small improvement of native api docs by @simveit in #5139
- [feat&refactor] Enhance multimodal input support with refactor io_struct by @JustinTong0323 in #4938
- Support 2x8xH100 for Llama 4 by @fzyzcjy in #5159
- FP4 weight loading and inference (2/2) by @trevor-m in #3972
- Fix multimodal hashing error by @fzyzcjy in #5174
- Tiny disable model that does not work by @fzyzcjy in #5175
- [Bugfix] Fix index out of bounds in local attention with large sequences by @CatherineSue in #5173
- [Fix] DeepEP Compatibility with Low Latency by @liz-badada in #5068
- docs: remove the use of Downward API for LWS_WORKER_INDEX by @yankay in #5110
- feat: add DeepGEMM build warning by @zhyncs in #5176
- fix: use DeepEPDispatcher on CUDA by @zhyncs in #5180
- [DeepEP] fix: import buffer error by @ch-wan in #5179
- Let `bench_one_batch` support `enable_dp_attention` by @fzyzcjy in #4058
- [Misc] clean up vllm in sgl-kernel test by @yinfan98 in #5189
- Fix ci test "test_eval_fp8_accuracy" failed by @kkHuang-amd in #5185
- Optimize topk operation in llama4 by @fzyzcjy in #5128
- Support Llama4 fp8 inference by @HandH1998 in #5194
- [ci] fix ci test fused_moe op by @BBuf in #5102
- model: support mllama4 by @mickqian in #5144
- Rework grok test. by @saienduri in #5171
- sgl-kernel use cutlass latest version for fp8 blockwise gemm by @yizhang2077 in #5207
- Add H20 dtype fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 by @Muuuchen in #5196
- fix: log warning when disable cuda graph by @zhyncs in #5209
- [metrics] Add in queue metrics by @hebiao064 in #4444
- Fix DeepSeek error when using DeepEP mode by @fzyzcjy in #5190
- reduce moe_align_block_size_kernel small batch mode overhead by @BBuf in #5086
- [PD] Support KV transfer with mooncake by @stmatengss in #4880
- [PD] Add get_contiguous_buf_infos interface for MLATokenToKVPool by @stmatengss in #5204
- Update deps for mllama4 by @ispobock in #5215
- Fix deepseek-v3 with torch.compile in PyTorch 2.6. by @zou3519 in #5213
- ROCm sgl-kernel: compatible to later torch by @HaiShaw in #5167
- [Misc] Clean sgl-kernel test by @yinfan98 in #5216
- Update Makefile / build script to avoid installing incompatible torch dependency by @elfiegg in #5245
- Fix torch.compile cacheing by @zou3519 in #5259
- ROCm/AITER CK_MoE: update 2-stage kernels & support both Activations by @HaiShaw in #5228
- Optimize attention in llama4 by @fzyzcjy in #5127
- Optimize GPU memory usage in FlashAttentionBackend's strided indexing by @CatherineSue in #5262
- Support `--enable-llama4-multimodal` by @ch-wan in #5254
- [fix] fix mrope positions not picked up by @mickqian in #5265
- doc: nested loop code for offline engine by @minleminzui in #5244
- fix: examples for token_in_token_out_vlm by @JustinTong0323 in #5193
- Fix a 404 link in send_request.ipynb by @windsonsea in #5280
- fix: enable fp4 compilation on cu128 by @zhyncs in #5286
- feat: add cu128 identifier for sgl-kernel by @zhyncs in #5287
- chore: relax the torch version restriction for sgl-kernel compilation by @zhyncs in #5288
- chore: bump sgl-kernel v0.0.8.post1 by @zhyncs in #5289
- [PD] fix: skip warmup request in disaggregation mode to prevent crash on timeout by @GaoYusong in #5292
- [Docs] Supported Model Docs - Major restructuring by @adarshxs in #5290
- fix: update update_wheel_index for cu128 by @zhyncs in #5300
- [Docs] Remove the older supported docs section by @adarshxs in #5301
- remove moe_align_block_size torch.zeros in small batch/expert mode by @BBuf in #5298
- feat: add blackwell Dockerfile by @zhyncs in #5302
- feat: add blackwell workflow by @zhyncs in #5303
- fix: use fa3 unit test on hopper only by @zhyncs in #5304
- misc: update blackwell Dockerfile by @zhyncs in #5306
- fix: remove cublas_grouped_gemm by @zhyncs in #5307
- fix: update flash attn by @zhyncs in #5308
- fix: use deepgemm only on hopper by @zhyncs in #5310
- [VLM] Adopt fast image processor by default by @mickqian in #5065
- Adjust ci test threshold by @ispobock in #5271
- Blackwell Cutlass MLA kernel by @trevor-m in #5142
- misc: cleanup 3rdparty by @zhyncs in #5311
- update variable naming and comments for rocm by @Lzy17 in #5299
- Fix w8a8_int8 model shared experts fusion load weights error by @lambert0312 in #5120
- Add flash_attn_varlen_func to sgl-kernel by @Fridge003 in #5315
- Fix fa3 window size setup by @qingquansong in #5316
- chore: bump sgl-kernel v0.0.8.post2 by @zhyncs in #5317
- feat: use fa3 mla by default on hopper by @zhyncs in #5210
- Fix: docs/backend/structured_outputs.ipynb by @thyecust in #4884
- Delete python/sglang/srt/layers/moe/fused_moe_triton/configs/E=257,N=… by @BBuf in #5321
- refine fused_moe tuning docs by @BBuf in #5294
- Support server based rollout in Verlengine by @yitianlian in #4848
- [Feat] Add sparse attn to sgl-kernel by @yinfan98 in #5327
- fix: solve cu118 issue for cutlass mla by @zhyncs in #5331
- chore: bump sgl-kernel v0.0.8.post3 by @zhyncs in #5332
- ci: update release node by @zhyncs in #5333
- fix: determine if flashinfer is installed by @zhyncs in #5336
- feat: adapt merge_state by @zhyncs in #5337
- misc: update sagemaker Dockerfile by @zhyncs in #5341
- Fix: ensure tensors used in dist.broadcast are created on the correct… by @minleminzui in #5322
- docs: update adoption and sponsorship list with Oracle by @zhyncs in #5343
- chore: upgrade sgl-kernel 0.0.8.post3 by @zhyncs in #5342
- Fix typo: infight -> inflight by @hnyls2002 in #5357
- [PD] Add transfer backend abstraction by @ByronHsu in #5328
- fix MLATokenToKVPoolHost get_size_per_token bug by @huangtingwei9988 in #5161
- fix #5322 by @zhyncs in #5359
- feat: update experiment_runner by @zhyncs in #5360
- [DeepEP] Reduce routed scaling overhead by @yuleil in #5277
- Free metadata_buffer_index after transfer finished by @jokerwyt in #5364
- Fix DeepSeek DP Attention + torch compile by @fzyzcjy in #5367
- Support for Qwen2.5-VL Model in bitsandbytes Format by @yhyang201 in #5003
- Fix PD disaggregation bugs by @hnyls2002 in #5326
- [PD Bug] fix MLA get_contiguous_buf_infos error by @whybeyoung in #5384
- [perf] experimental enhance fp8 per-tensor quant by @Alcanderian in #5370
- Apply deepseek cuda rope by @ispobock in #5385
- apply fused moe gate in ds v3/r1 by @BBuf in #5371
- fix: update test config by @zhyncs in #5392
- [Fix] Turn off DeepGEMM by default by @Fridge003 in #5263
- minor clean up of sgl-kernel/CMakeLists.txt by @merrymercy in #5393
- Add A800 shared experts fused MoE kernel tuning configs for DeepSeek V3/R1 by @lambert0312 in #5368
- Add H20 dtype fp8_w8a8 shared experts fused MoE kernel tuning configs for DeepSeek V3/R1 by @Ximingwang-09 in #5291
- [fix/misc] remove duplicate row in deepseek v2 model by @yyccli in #5279
- chore: upgrade DeepGEMM by @zhyncs in #5395
- fix: update pr-test-sgl-kernel by @zhyncs in #5399
- kernel: support slightly faster merge_state_v2 cuda kernel by @DefTruth in #5381
- chore: bump sgl-kernel 0.0.9 by @zhyncs in #5400
- chore: upgrade sgl-kernel 0.0.9 by @zhyncs in #5401
- Tiny fix DeepseekScalingRotaryEmbedding always use forward_native by @fzyzcjy in #5406
- Fix bench_serving with random-ids by @guoyuhong in #5214
- [misc] fix ci flaky case by @Alcanderian in #5352
- [FIX] Fix concatenation error in capture_bs when enabling --disable-cuda-graph-padding without MTP by @Muuuchen in #5412
- Support dynamic connection and TP 16 by @yuan-luo in #5351
- Fix broadcast use cuda device lead to memory capacity unbalanced by @lambert0312 in #5416
- [PD] Fix dynamic port support and MLA buffer for Mooncake by @ShangmingCai in #5415
- Distinguish bootstrap key only in decode server by @hnyls2002 in #5422
- [PD] Remove unused bootstrap param and fix port table type by @ShangmingCai in #5423
- [minor] cleanup cmakelists.txt by @merrymercy in #5420
- bugfix: fix merge_state_v2 cuda graph by @DefTruth in #5419
- chore: bump sgl-kernel v0.0.9.post1 by @zhyncs in #5430
- fix: solve release issue by @zhyncs in #5434
- Blackwell cutlass mla: Add check for bad page size/block num combinations by @trevor-m in #5431
- feat: update model_specific_adjustment by @zhyncs in #5344
- chore: upgrade sgl-kernel 0.0.9.post1 by @zhyncs in #5436
- Fix ignore_eos parameter when loading a chat template by @CatherineSue in #5264
- add attention backend supporting matrix in the doc by @mRSun15 in #5211
- Support BNB quantization for llama/mllama by @ryang-max in #5038
- [Docs] Update start/install.md by @windsonsea in #5398
- [Minor] Move torch.compile patch to a better place by @merrymercy in #5397
- [Bug fix] need record start time in pd mode by @whybeyoung in #5425
- Support MHA with chunked prefix cache for DeepSeek chunked prefill by @Fridge003 in #5113
- chore: bump v0.4.5.post1 by @zhyncs in #5445
- Fix several minor issues in PD disaggregation by @ch-wan in #5444
- [doc] Update benchmark_and_profiling.md by @BBuf in #5449
- Update cutlass dependency. by @elfiegg in #5447
- add multi-lora feature in README.md by @Ying1123 in #5463
- Clean up imports by @merrymercy in #5467
- [verl] Modify the update_weights func to align with verl's resharding by @BearBiscuit05 in #5345
- [Model Support] unsloth/Phi-4-mini bnb model by @yyihuang in #4982
- Update attention_backend.md: plural form by @didier-durand in #5489
- Add test for flash_attn_varlen_func kernel by @Fridge003 in #5484
- Deprecate disable-mla by @Fridge003 in #5481
- Deprecate enable-flashinfer-mla and enable-flashmla by @Fridge003 in #5480
- Feat/support encoder model (like bert) by @woodx9 in #4887
- Enable local attention during decode by @CatherineSue in #5479
- Refactor DeepSeek decoder layer branches by @fzyzcjy in #5205
- Fix a link in sgl-kernel/README.md by @windsonsea in #5493
- [Bug fix] use correct func path in deepseek by @XucSh in #5496
- Doc: fix problems of the 'Execute Notebooks / run-all-notebooks' ci caused by the instability of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B by @minleminzui in #5503
- [Feat] Update sgl-kernel flashinfer to latest main version by @yinfan98 in #5500
- Fix: Incorrect parameters passed to forward_batch_generation (#5506) by @u4lr451 in #5511
- Fix: fix the exception 'the memory capacity is unbalanced. Some GPUs … by @minleminzui in #5426
- [docs] Fix several consistency issues in sampling_params.md by @windsonsea in #5373
- Configuration qwen2_moe.py - qkv_bias now in transformers by @michaelfeil in #5512
- Introduce moe_dense_tp_size to fix dense layer errors in DeepSeek V3 + 4x8xH100 by @fzyzcjy in #4836
- Sgl kernel fused_moe_gate support n_shared_experts by @BBuf in #5440
- chore: bump sgl-kernel 0.0.9.post2 by @zhyncs in #5518
- use sglang_per_token_group_quant_fp8 from sgl-kernel instead of Triton kernel by @strgrb in #5473
- fix kimi vl running bug after rebase main by @BBuf in #5461
- fix bug of VLLM_AVAILABLE not defined by @liwenju0 in #5497
- Avoid computing lse in Ragged Prefill when there's no prefix. by @Edenzzzz in #5476
- [Model] Adding Qwen3 and Qwen3MoE by @yhyang201 in #4693
- fix util import by @zhyncs in #5542
- Revert "Avoid computing lse in Ragged Prefill when there's no prefix.… by @zhyncs in #5544
- chore: upgrade sgl-kernel 0.0.9.post2 by @zhyncs in #5540
- Fix DeepGEMM masked cannot be run on groups not being a multiple of 4 by @fzyzcjy in #5340
- Make profiler output file names consistent by @fzyzcjy in #5548
- [PD] Tiny fix timeout error when generate by @fzyzcjy in #5545
- [PD] Fix no cache connect for receiver by @whybeyoung in #5534
- feat: use flashinfer jit package by @zhyncs in #5547
- [PD] Remove the requirement of config file for mooncake backend by @ShangmingCai in #5460
- restruct compressed_tensors_w8a8_fp8 by @BBuf in #5475
- simplify the control logic for using shared experts fusion by @BBuf in #5504
- Remove one kernel in per_tensor_quant_mla_fp8 by @fzyzcjy in #5549
- Fix sampler nan check when calling top_k_top_p_sampling_from_probs by @yubofredwang in #5546
- [PD] Support page size > 1 by @ByronHsu in #5561
- fix hicache write back by @xiezhq-hermann in #5543
- Minor update for ROCm variable style by @Lzy17 in #5562
- Fix bench_one_batch producing unnatural results for expert parallel by @fzyzcjy in #5149
- [perf] introduce deep gemm group_gemm_masked as bmm by @Alcanderian in #5432
- [PD] Fix DeepSeek cannot be run on latest master by @fzyzcjy in #5568
- Fix BumpAllocator error when no input_ids by @fzyzcjy in #5564
- enable DeepSeek V3 shared_experts_fusion in sm90 by @BBuf in #5571
- [Fix] fix outlines and xgrammar by @Alcanderian in #4947
- [Doc]Add instruction for profiling with bench_one_batch by @Fridge003 in #5581
- Release v0.4.5.post2 by @merrymercy in #5582
- Fix bench_serving fail when zero warmup requests by @fzyzcjy in #5574
- Fix DeepEP cannot run on latest master by @fzyzcjy in #5567
- Fix torch memory saver not enabled in DP scenario by @fzyzcjy in #5560
- Super tiny fix typo by @fzyzcjy in #5559
- Add document for LoRA serving by @Fridge003 in #5521
- Tiny improve error message by @fzyzcjy in #5526
- [PD] Fix server crash when using batch requests by @fzyzcjy in #5531
- [Feat] upgrade pytorch2.6 by @sleepcoo in #5417
- Fix enable chunked prefill for Llama4 by @tarinkk in #5575
- fix: use fa3 for gemma2 by @zhyncs in #5586
- Fix ChatCompletionMessageGenericParam to allow for None content by @Amadeus-Winarto in #5452
- [PD] Fix large page size + chunk prefill by @ByronHsu in #5588
- Add test config yamls for Deepseek v3 by @Fridge003 in #5433
- [Feature] Prefill assistant response - add continue_final_message parameter by @adarshxs in #4226
- add function call parser for DeepSeek V3 by @finger92 in #5224
- smaller and non gated models for docs by @simveit in #5378
- Feat: Implement JSON Mode (response_format.type="json_object") by @kyle-pena-kuzco in #4733
- check marlin format before attempting conversion by @qeternity in #4675
- compressed_tensors: port w8a16 fp8 from vllm by @vhain in #4852
- Fix one more issue reported by torchfix by @b8zhong in #4859
- Add sanity check for max_running_requests by @fzyzcjy in #5016
- Correct grafana heatmap. by @mac0ne in #5019
- Perform Batch Tokenization. by @sundar24295s in #5141
- Speedup shared expert weight construction by avoid cloning by @fzyzcjy in #5188
- Tiny add Engine.flush_cache API by @fzyzcjy in #5241
- [misc] remove is_cuda_available by @Alcanderian in #5319
- Fix flush cache by @merrymercy in #5590
- Add Speculative Decoding Eagle3 topk > 1 by @qingquansong in #5318
- upstream hicache fixes by @xiezhq-hermann in #5570
- Tiny add warning when cannot recognize bool env var by @fzyzcjy in #5348
- Modify metrics service endpoint by @lambert0312 in #3443
- Update protocol.py to fix #4589 by @relic-yuexi in #4590
- [Feat.] Enable grafana to show metrics by @PopSoda2002 in #4718
- [Fix] Enhance DP Attention for IPv6 Compatibility by @Lucius-THU in #4937
- Support o1 model on Azure by @ChuyueSun in #4980
- Tiny remove duplicated code by @fzyzcjy in #5021
- Tiny update error hint by @fzyzcjy in #5037
- Support PD bootstrap fields on /v1/chat/completions endpoint by @jokerwyt in #5488
- [PD] Fix generate endpoint of min_lb for PD by @ShangmingCai in #5598
- [PD] Fix edge case and simplify large page size + chunked prefill by @ByronHsu in #5589
- [PD] Add NIXL transfer backend by @trevor-m in #5477
- [PD] Support decode overlap schedule by @ByronHsu in #5608
- [PD] Support prefill overlap + Ensure no race condition by @ByronHsu in #5609
- Enhance GPU memory settings by @hnyls2002 in #5604
- [feature] enable pre compile jit deep_gemm by @Alcanderian in #5580
- Clean up mem settings by @merrymercy in #5610
- Support aiter RMSNorm in AMD by @michael-amd in #5510
- chore: bump v0.4.5.post3 by @zhyncs in #5611
- Remove extra copy in deepseek forward absorb by @ispobock in #5578
- [Doc] Fix a 404 link to llama-405b by @windsonsea in #5615
- [fix] force use deepgemm in compile_deep_gemm by @Alcanderian in #5618
- [fix] fix compile_deep_gemm missing kv_b_proj by @Alcanderian in #5620
- fix: gemma 3 not use softcap by @zhyncs in #5622
- Fix FA3 DeepSeek prefill performance regression by @Alcanderian in #5624
- [NFC] Remove duplicate `compressed-tensors` by @c8ef in #5640
- Fix shared experts fusion error without quantization by @lambert0312 in #5632
- [feature] Add H20 fp8_w8a8 FusedMoE config for --n-share-experts-fusion=16 by @saltyfish66 in #5641
- fix flashmla bug by @sleepcoo in #5272
- [fix] reduce dp capture bs by @Alcanderian in #5634
- Remove q concat in FA3 backend for DeepSeek decode by @ispobock in #5638
- Revert "Support aiter RMSNorm in AMD" by @HaiShaw in #5646
- fix: update bench_speculative by @zhyncs in #5649
- Turn on DeepGemm By Default and Update Doc by @Fridge003 in #5628
- Fuse q_a_proj and kv_a_proj for DeepSeek models by @Fridge003 in #5619
- Remove unnecessary `torch.full` in DeepSeek by @fzyzcjy in #5601
- [1/2] Add FP8 Blockscale MoE CUTLASS kernel for Blackwell by @elfiegg in #5281
- fix sgl-kernel unit tests by @zhyncs in #5666
- fix awq_dequantize import by @zhyncs in #5669
- Integrating PD disaggregation with DP attention and DeepEP by @ch-wan in #5435
- fix gemma3 unit test by @zhyncs in #5670
- fix torchvision::nms not exist by @zhyncs in #5671
- [PD] Add support for dp attention with mooncake by @ShangmingCai in #5530
- tune the threshold of gemma-2-27b-it in test_nightly_gsm8k_eval.py by @merrymercy in #5677
- [Doc] Fix two 404 links caused by sglang typo by @windsonsea in #5667
- fix: update truss bench_serving by @zhyncs in #5683
- fix: only compile ApplyTokenBitmaskInplace cu124+ by @zhyncs in #5686
- chore: bump sgl-kernel 0.1.0 by @zhyncs in #5688
- vlm: enable radix cache for qwen-vl models by @mickqian in #5349
- [BugFix] Fix combination of MTP and `--n-share-experts-fusion` with R1 by @guoyuhong in #5707
- Fix weight loading bug for Deepseek v3+nextn by @Fridge003 in #5684
- Add example to use sgl engine with fastapi by @ravi03071991 in #5648
- [Doc] Fix a link to Weilin Zhao by @windsonsea in #5706
- Add MMMU benchmark results by @ravi03071991 in #4491
- [Model] Support `ArcticForCausalLM` architecture (Snowflake/snowflake-arctic-instruct) by @b8zhong in #5078
- [PD] Better logs by @hnyls2002 in #5715
- [PD] Add kvargs table and thread pool for kvcache sender of mooncake by @ShangmingCai in #5738
- [PD]: Support Multi Prefill in one node by @hcyz33 in #5704
- Fix: deepseek forward absorb by @michael-amd in #5723
- Pin torch audio to 2.6.0 by @merrymercy in #5750
- Revert "[Model] Support
ArcticForCausalLM
architecture (Snowflake/snowflake-arctic-instruct)" by @merrymercy in #5754 - Disable flaky eagle tests by @merrymercy in #5753
- update triton 3.2.0 h200 fused moe triton config and add warning about triton fused_moe_kernel performance degradation due to different Triton versions. by @BBuf in #5740
- [Docs] Update runtime/engine/readme.md by @windsonsea in #5737
- Reorder loop in shared expert weight loading by @ispobock in #5719
- fix: fix one more bug from merging mm_inputs by @mickqian in #5718
- [Fix]: support deepseek-vl2-tiny model by @bppps in #5552
- Bugfix for minicpmo vision test by @yizhang2077 in #5760
- [Minor] fix documentations by @merrymercy in #5756
- Add an assertion to enhance the robustness of the operator by @liwenju0 in #5736
- fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512 by @lkm2835 in #5733
- Use device_id in dist init to reduce NCCL communicator warmup & creation overhead by @Edenzzzz in #5728
- [fix] fix potential bumpy throughput with deepgemm by @Alcanderian in #5722
- Resolves the `404 Not Found` error when running `compile_deep_gemm.py` in multi-node setups by @guoyuhong in #5720
- perf: update H20 fused_moe_triton kernel config to get higher throughput during prefilling by @saltyfish66 in #5716
- Fix the nonexistent access of `decrypted_config_file` by @vincentzed in #5685
- CI: rewrite test_vision_chunked_prefill to speedup by @mickqian in #5682
- Fuse MLA set kv cache kernel by @ispobock in #5748
- Update amd docker image to `sglang:v0.4.5.post3-rocm630` by @saienduri in #5697
- [feature] support for roberta embedding models by @DavidBao03 in #5730
- [fix] fix bench_one_batch_server by @Alcanderian in #5607
- support for the DeepSeek model by enabling streaming response parsing by @Frank-Jie in #5592
- fix: Use `is not None` instead of `!= None` for None checks by @vincentzed in #5687
- Add Llama 4 to FA3 test by @hebiao064 in #5509
- [misc] more decode step log for bench_one_batch by @Alcanderian in #5565
- Handle JSONDecodeError while processing request data by @yan97ao in #5599
- fix(srt): check if sample_indices is not None before usage. by @aoshen524 in #5633
- update llguidance to 0.7.11; adds StructTag by @mmoskal in #4870
- Use sgl-kernel sgl_per_token_group_quant_int8 by @lambert0312 in #4971
- Add memory_saver check by @kebe7jun in #4986
- add switch to disable OpenAPI doc by @congcongke in #3744
- Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512" by @merrymercy in #5772
- Fix eagle test case by @merrymercy in #5776
- Split local attention test from fa3 test by @Fridge003 in #5774
- Revert "Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512"" by @merrymercy in #5777
- Simplify FA3 tests by @merrymercy in #5779
- Revert "[fix] fix bench_one_batch_server" by @merrymercy in #5785
- Revert "Use device_id in dist init to reduce NCCL communicator warmup & creation overhead" by @merrymercy in #5786
- [CI] Tune threshold by @merrymercy in #5787
- [CI] fix port conflicts by @merrymercy in #5789
- [CI] Fix ci tests by @merrymercy in #5769
- [PD] Reduce kv transfer threads by @hnyls2002 in #5791
- [CI] Fix test case by @merrymercy in #5790
- Add 8-GPU Test for Deepseek-V3 by @Fridge003 in #5691
- Release v0.4.6 by @Fridge003 in #5795
New Contributors
- @huangtingwei9988 made their first contribution in #5083
- @yubofredwang made their first contribution in #4760
- @dangkai4u made their first contribution in #5151
- @ShangmingCai made their first contribution in #5155
- @mingfeima made their first contribution in #5150
- @yankay made their first contribution in #5110
- @Muuuchen made their first contribution in #5196
- @stmatengss made their first contribution in #4880
- @zou3519 made their first contribution in #5213
- @GaoYusong made their first contribution in #5292
- @Lzy17 made their first contribution in #5299
- @thyecust made their first contribution in #4884
- @yitianlian made their first contribution in #4848
- @yuleil made their first contribution in #5277
- @jokerwyt made their first contribution in #5364
- @yhyang201 made their first contribution in #5003
- @yyccli made their first contribution in #5279
- @DefTruth made their first contribution in #5381
- @yuan-luo made their first contribution in #5351
- @mRSun15 made their first contribution in #5211
- @ryang-max made their first contribution in #5038
- @BearBiscuit05 made their first contribution in #5345
- @yyihuang made their first contribution in #4982
- @u4lr451 made their first contribution in #5511
- @liwenju0 made their first contribution in #5497
- @Amadeus-Winarto made their first contribution in #5452
- @finger92 made their first contribution in #5224
- @kyle-pena-kuzco made their first contribution in #4733
- @mac0ne made their first contribution in #5019
- @sundar24295s made their first contribution in #5141
- @relic-yuexi made their first contribution in #4590
- @PopSoda2002 made their first contribution in #4718
- @Lucius-THU made their first contribution in #4937
- @michael-amd made their first contribution in #5510
- @c8ef made their first contribution in #5640
- @bppps made their first contribution in #5552
- @vincentzed made their first contribution in #5685
- @DavidBao03 made their first contribution in #5730
- @Frank-Jie made their first contribution in #5592
- @yan97ao made their first contribution in #5599
- @mmoskal made their first contribution in #4870
- @congcongke made their first contribution in #3744
Full Changelog: v0.4.5...v0.4.6