Development Roadmap (2025 H1) #4042

Open · 23 of 61 tasks
zhyncs opened this issue Mar 4, 2025 · 20 comments

Comments

@zhyncs (Member) commented Mar 4, 2025

Here is the development roadmap for 2025 H1. Contributions and feedback are welcome (join the Bi-weekly Development Meeting). The previous roadmap for 2024 Q4 can be found in #1487.

Focus

  • Throughput-oriented large-scale deployment similar to the DeepSeek inference system
  • Long context optimizations
  • Low latency speculative decoding
  • Reinforcement learning training framework integration
  • Kernel optimizations

Parallelism

Attention Backend

Caching

Kernel

Quantization

RL Framework integration

Core refactor

Speculative decoding

Multi-LoRA serving

Hardware

Model coverage

Function Calling

Others

@artetaout

Hi, regarding the "Integrate TransformerEngine layers" item, which kind of TE layers do you want to integrate?

@Swipe4057

As part of the long context optimizations, will the implementation of HiP attention (#3930) be considered?

@zhaochenyang20 (Collaborator)

@Swipe4057 Thanks. We will review this and merge it.

@Zhuohao-Li

> Hi, regarding the "Integrate TransformerEngine layers" item, which kind of TE layers do you want to integrate?

Hi @artetaout, currently it is layernorm_mlp; we also plan to borrow components from te.linear.

@SandroPats

Hi @zhyncs, could you elaborate a bit on your plans for Unsloth model support? Will you be supporting Unsloth's 1.58-bit dynamic quantization for DeepSeek-R1?

@zhyncs (Member, Author) commented Mar 11, 2025

> Hi @zhyncs, could you elaborate a bit on your plans for Unsloth model support? Will you be supporting Unsloth's 1.58-bit dynamic quantization for DeepSeek-R1?

Hi @SandroPats, please join https://slack.sglang.ai and discuss in the #quantization channel. Thanks!

@artetaout

> Hi, regarding the "Integrate TransformerEngine layers" item, which kind of TE layers do you want to integrate?

> Hi @artetaout, currently it is layernorm_mlp; we also plan to borrow components from te.linear.

Do we expect a performance improvement via te.layernorm_mlp or te.layernorm_linear? I've integrated them but didn't see an improvement in bf16.

@Zhuohao-Li

> Hi, regarding the "Integrate TransformerEngine layers" item, which kind of TE layers do you want to integrate?

> Hi @artetaout, currently it is layernorm_mlp; we also plan to borrow components from te.linear.

> Do we expect a performance improvement via te.layernorm_mlp or te.layernorm_linear? I've integrated them but didn't see an improvement in bf16.

In TE, if you need to enable TP overlap only in inference, you have to split the sequences manually (SP/TP). I guess that's perhaps why you did not see an improvement. You can join https://slack.sglang.ai/ and find me there to discuss further.
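
For readers unfamiliar with the manual splitting mentioned above, here is a minimal sketch of the idea, assuming an initialized torch.distributed process group of tp_size ranks and a [seq_len, batch, hidden] activation layout; the function names are illustrative, not TE or SGLang APIs.

```python
import torch
import torch.distributed as dist

def split_sequence(hidden: torch.Tensor, tp_size: int, tp_rank: int) -> torch.Tensor:
    """Keep only this rank's shard of the sequence dimension.

    hidden: [seq_len, batch, hidden_dim]; seq_len must divide evenly by tp_size.
    """
    chunk = hidden.shape[0] // tp_size
    return hidden[tp_rank * chunk:(tp_rank + 1) * chunk]

def gather_sequence(local: torch.Tensor, tp_size: int) -> torch.Tensor:
    """All-gather the sequence shards back into the full sequence."""
    out = [torch.empty_like(local) for _ in range(tp_size)]
    dist.all_gather(out, local.contiguous())
    return torch.cat(out, dim=0)

# Usage sketch: shard before the TE block and gather after it, so that the
# communication of one shard can overlap with the compute of another.
# local = split_sequence(hidden, tp_size, tp_rank)
# local = te_layernorm_mlp(local)   # hypothetical TE call on the shard
# hidden = gather_sequence(local, tp_size)
```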

@catqaq commented Apr 1, 2025

Reward server stability: In large-scale reinforcement learning systems, the reward server must maintain a high level of stability, including capabilities such as load balancing, rate limiting, and long-text processing. While these are not strictly algorithmic requirements, they are critically important.
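
To make the rate-limiting point concrete, below is a minimal token-bucket sketch of the kind of guard one might place in front of a reward server endpoint. The class and its parameters are hypothetical, not an existing SGLang component.

```python
import time
import threading

class TokenBucket:
    """Simple thread-safe token-bucket rate limiter."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False  # caller should reject or queue the request
```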

@zhaochenyang20 (Collaborator) commented Apr 1, 2025

> Reward server stability: In large-scale reinforcement learning systems, the reward server must maintain a high level of stability, including capabilities such as load balancing, rate limiting, and long-text processing. While these are not strictly algorithmic requirements, they are critically important.

@catqaq Currently, we do not have anyone working on this. Could you recommend someone to us for this? Also, the RL tracker is here:

zhaochenyang20/Awesome-ML-SYS-Tutorial#74

@sraj18-neubus

Hi team, is there any update on when pipeline parallelism will be integrated into SGLang?

@ykcai-daniel

I am interested in adding torchao support for more models. Which model should I start with?

@guoyejun

> RL Framework integration

Is there a basic document explaining the current RL support in SGLang? For example, a simple example of how a developer/user would use it, what the dependencies are, etc. Thanks.

@ykcai-daniel

We have created a new cuDNN backend that caches execution graphs (#5505). Its performance is close to the FlashInfer backend.
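
To illustrate the caching idea at a high level: building a cuDNN execution graph is expensive, so one compiles it once per input shape and reuses the compiled plan afterwards. This is only a schematic sketch; build_graph is a hypothetical stand-in for the actual plan construction in #5505.

```python
from typing import Callable, Dict, Tuple

class GraphCache:
    """Cache compiled execution graphs keyed by input shape."""

    def __init__(self, build_graph: Callable[[int, int], object]):
        self.build_graph = build_graph      # expensive one-time compilation
        self.cache: Dict[Tuple[int, int], object] = {}

    def get(self, batch_size: int, seq_len: int):
        key = (batch_size, seq_len)
        if key not in self.cache:
            # Compile the graph once for this shape and reuse it on later calls.
            self.cache[key] = self.build_graph(batch_size, seq_len)
        return self.cache[key]
```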

@shaoyuyoung

Currently, the SGLang version is v0.4.5; is there any plan for v1.0?

@zhaochenyang20 (Collaborator)

> Currently, the SGLang version is v0.4.5; is there any plan for v1.0?

You can join us to make this happen!

@kyle-pena-kuzco (Contributor)

Is "adaptive speculative decoding according to batch size" referring to this paper? https://arxiv.org/pdf/2412.18910

@Lyken17 commented Apr 30, 2025

As part of the VLM model coverage, @futrime and I have added NVILA to SGLang. We are now cleaning up the code and preparing the PR.

@artetaout

We've integrated a sparse attention mechanism and shown its speedup while maintaining accuracy; is this welcome? If so, we will raise a PR: #6513

@Swipe4057

> @Swipe4057 Thanks. We will review this and merge it.

@zhyncs @merrymercy
Please help with reviewing and merging the PR! #3930
