Development Roadmap (2025 H1) #4042

Open · 23 of 61 tasks
zhyncs opened this issue Mar 4, 2025 · 20 comments

Comments

@zhyncs (Member) commented Mar 4, 2025

Here is the development roadmap for 2025 H1. Contributions and feedback are welcome (join the Bi-weekly Development Meeting). The previous roadmap for 2024 Q4 can be found in #1487.

Focus

  • Throughput-oriented large-scale deployment similar to the DeepSeek inference system
  • Long context optimizations
  • Low latency speculative decoding
  • Reinforcement learning training framework integration
  • Kernel optimizations

Parallelism

Attention Backend

Caching

Kernel

Quantization

RL Framework integration

Core refactor

Speculative decoding

Multi-LoRA serving

Hardware

Model coverage

Function Calling

Others

@artetaout

Hi, regarding the "Integrate TransformerEngine layers" item, which kind of TE layers do you want to integrate?

@Swipe4057

As part of the long context optimizations, will the implementation of HiP attention (#3930) be considered?

@zhaochenyang20 (Collaborator)

@Swipe4057 Thanks. We will review this and merge it.

@Zhuohao-Li

> Hi, regarding the "Integrate TransformerEngine layers" item, which kind of TE layers do you want to integrate?

Hi @artetaout, currently it is layernorm_mlp; we also plan to borrow components from te.linear.

@SandroPats

Hi @zhyncs, could you elaborate a bit on your plans for Unsloth model support? Will you be supporting Unsloth's 1.58-bit dynamic quantization for DeepSeek-R1?

@zhyncs (Member, Author) commented Mar 11, 2025

> Hi @zhyncs, could you elaborate a bit on your plans for Unsloth model support? Will you be supporting Unsloth's 1.58-bit dynamic quantization for DeepSeek-R1?

Hi @SandroPats, please join https://slack.sglang.ai and discuss in the #quantization channel. Thanks!

@artetaout

> Hi, regarding the "Integrate TransformerEngine layers" item, which kind of TE layers do you want to integrate?

> Hi @artetaout, currently it is layernorm_mlp; we also plan to borrow components from te.linear.

Do we expect a performance improvement via te.layernorm_mlp or te.layernorm_linear? I've integrated them but didn't see an improvement in bf16.

@Zhuohao-Li

> Hi, regarding the "Integrate TransformerEngine layers" item, which kind of TE layers do you want to integrate?

> Hi @artetaout, currently it is layernorm_mlp; we also plan to borrow components from te.linear.

> Do we expect a performance improvement via te.layernorm_mlp or te.layernorm_linear? I've integrated them but didn't see an improvement in bf16.

In TE, if you need to enable TP overlap only in inference, you have to split the sequences manually (SP/TP). I guess that's perhaps why you did not see an improvement. You can join https://slack.sglang.ai/ and find me there to discuss further.
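
For readers unfamiliar with the manual splitting mentioned above, here is a minimal sketch of the idea, assuming an initialized torch.distributed process group of tp_size ranks and a [seq_len, batch, hidden] activation layout; the function names are illustrative, not TE or SGLang APIs.

```python
import torch
import torch.distributed as dist

def split_sequence(hidden: torch.Tensor, tp_size: int, tp_rank: int) -> torch.Tensor:
    """Keep only this rank's shard of the sequence dimension.

    hidden: [seq_len, batch, hidden_dim]; seq_len must divide evenly by tp_size.
    """
    chunk = hidden.shape[0] // tp_size
    return hidden[tp_rank * chunk:(tp_rank + 1) * chunk]

def gather_sequence(local: torch.Tensor, tp_size: int) -> torch.Tensor:
    """All-gather the sequence shards back into the full sequence."""
    out = [torch.empty_like(local) for _ in range(tp_size)]
    dist.all_gather(out, local.contiguous())
    return torch.cat(out, dim=0)

# Usage sketch: shard before the TE block and gather after it, so that the
# communication of one shard can overlap with the compute of another.
# local = split_sequence(hidden, tp_size, tp_rank)
# local = te_layernorm_mlp(local)   # hypothetical TE call on the shard
# hidden = gather_sequence(local, tp_size)
```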

@catqaq commented Apr 1, 2025

Reward server stability: In large-scale reinforcement learning systems, the reward server must maintain a high level of stability, including capabilities such as load balancing, rate limiting, and long-text processing. While these are not strictly algorithmic requirements, they are critically important.
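
To make the rate-limiting point concrete, below is a minimal token-bucket sketch of the kind of guard one might place in front of a reward server endpoint. The class and its parameters are hypothetical, not an existing SGLang component.

```python
import time
import threading

class TokenBucket:
    """Simple thread-safe token-bucket rate limiter."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False  # caller should reject or queue the request
```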

@zhaochenyang20 (Collaborator) commented Apr 1, 2025

> Reward server stability: In large-scale reinforcement learning systems, the reward server must maintain a high level of stability, including capabilities such as load balancing, rate limiting, and long-text processing. While these are not strictly algorithmic requirements, they are critically important.

@catqaq Currently, we do not have anyone working on this. Could you recommend someone to us for this? Also, the RL tracker is here:

zhaochenyang20/Awesome-ML-SYS-Tutorial#74

@sraj18-neubus

Hi team, is there any update on when pipeline parallelism will be integrated into SGLang?

@ykcai-daniel

I am interested in adding torchao support for more models. Which model should I start with?

@guoyejun

> RL Framework integration

Is there a basic document explaining the current RL support in SGLang? For example, a simple example of how a developer/user would use it, what the dependencies are, etc. Thanks.

@ykcai-daniel

We have created a new cuDNN backend that caches execution graphs (#5505). Its performance is close to the FlashInfer backend.
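
To illustrate the caching idea at a high level: building a cuDNN execution graph is expensive, so one compiles it once per input shape and reuses the compiled plan afterwards. This is only a schematic sketch; build_graph is a hypothetical stand-in for the actual plan construction in #5505.

```python
from typing import Callable, Dict, Tuple

class GraphCache:
    """Cache compiled execution graphs keyed by input shape."""

    def __init__(self, build_graph: Callable[[int, int], object]):
        self.build_graph = build_graph      # expensive one-time compilation
        self.cache: Dict[Tuple[int, int], object] = {}

    def get(self, batch_size: int, seq_len: int):
        key = (batch_size, seq_len)
        if key not in self.cache:
            # Compile the graph once for this shape and reuse it on later calls.
            self.cache[key] = self.build_graph(batch_size, seq_len)
        return self.cache[key]
```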

@shaoyuyoung

Currently, the SGLang version is v0.4.5; is there any plan for v1.0?

@zhaochenyang20 (Collaborator)

> Currently, the SGLang version is v0.4.5; is there any plan for v1.0?

You can join us to make this happen!

@kyle-pena-kuzco (Contributor)

Is "adaptive speculative decoding according to batch size" referring to this paper? https://arxiv.org/pdf/2412.18910

@Lyken17 commented Apr 30, 2025

As part of the VLM model coverage, @futrime and I have added NVILA to SGLang. We are now cleaning up the code and preparing the PR.

@artetaout

We've integrated a sparse attention mechanism and shown its speedup while maintaining accuracy; is this welcome? If so, we will raise a PR: #6513

@Swipe4057

> @Swipe4057 Thanks. We will review this and merge it.

@zhyncs @merrymercy
Please help with reviewing and merging the PR! #3930
