
Release v0.4.6

@Fridge003 released this on 27 Apr 21:47 · 84022c0

Highlights

  • Use FlashAttention3 as the default attention backend for mainstream models (DeepSeek, Qwen, Llama, etc.); a usage sketch follows this list. #4709 (comment)
  • PD (prefill/decode) disaggregation with Mooncake and NIXL transfer backends; a launch sketch follows this list. #4880 #5477 #4655
  • DeepSeek performance improvements: enable DeepGEMM by default and add several kernel fusions. #5580 #5628
  • Upgrade PyTorch to 2.6.0 and fix the torch.compile cache. #5417 #5213
  • Preliminary support for Blackwell. #5303
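
As a quick illustration of the new default backend, here is a minimal sketch (not taken from this release) of pinning the attention backend explicitly through SGLang's offline Engine API. Since FlashAttention3 ("fa3") is now the default on supported GPUs, passing it is only needed to be explicit or to switch to another backend; the model path is a placeholder, and the exact keyword and accepted values should be verified against your installed version.

```python
import sglang as sgl

if __name__ == "__main__":
    # attention_backend mirrors the server's --attention-backend flag;
    # "fa3" selects FlashAttention3 (now the default on supported GPUs).
    llm = sgl.Engine(
        model_path="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
        attention_backend="fa3",  # alternatives include "flashinfer" and "triton"
    )
    out = llm.generate(
        "The capital of France is",
        {"temperature": 0, "max_new_tokens": 8},
    )
    print(out["text"])
    llm.shutdown()
```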
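
For the PD disaggregation item, the sketch below launches one prefill server and one decode server that exchange KV cache over the Mooncake transfer backend. The flag names follow the PD-disaggregation PRs; treat them and the ports as assumptions to check against your build, and note that a load balancer normally routes requests across the two roles.

```python
import subprocess

MODEL = "deepseek-ai/DeepSeek-V3"  # placeholder model path

# Prefill server: computes the prompt's KV cache and ships it to the decode side.
prefill = subprocess.Popen([
    "python", "-m", "sglang.launch_server",
    "--model-path", MODEL,
    "--disaggregation-mode", "prefill",
    "--disaggregation-transfer-backend", "mooncake",  # "nixl" is the other option
    "--port", "30000",
])

# Decode server: receives the KV cache and generates tokens.
decode = subprocess.Popen([
    "python", "-m", "sglang.launch_server",
    "--model-path", MODEL,
    "--disaggregation-mode", "decode",
    "--disaggregation-transfer-backend", "mooncake",
    "--port", "30001",
])

# Both servers run until terminated.
prefill.wait()
decode.wait()
```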

Many thanks to the LinkedIn team, Alibaba Cloud, the Mooncake team, the NVIDIA team, the AMD team, the PyTorch team, Ant Group, the Baseten team, the Oracle team, the Meituan team, the iFlytek MaaS team, and the open-source community for their contributions!

We’re thrilled about these advancements and eager to hear your feedback! Join us on our Slack channel at slack.sglang.ai to connect and share your thoughts. Cheers!

Coming Soon

  • Large-scale expert parallelism + PD disaggregation #4734 #5524
  • Pipeline parallelism #5724
  • MLA CUTLASS backend #5390


Full Changelog: v0.4.5...v0.4.6