Development Roadmap (2025 H1) #4042
Comments
Hi, regarding long-context optimization: will the implementation of HiP attention (#3930) be considered?
@Swipe4057 Thanks. We will review this and merge it.
Hi @artetaout, now it is
Hi @zhyncs, could you please specify your plans for unsloth model support a bit? Will you be supporting unsloth's 1.58-bit dynamic quantization for DeepSeek-R1?
Hi @SandroPats, please join https://slack.sglang.ai and discuss in #quantization. Thanks!
Do we plan to get a performance improvement via te.layernorm_mlp or te.layernorm_linear? I've integrated them but didn't see an improvement in bf16.
In TE, if you need to enable TP overlap only for inference, you have to split sequences manually (SP/TP). I guess that is perhaps why you did not see an improvement. You can join https://slack.sglang.ai/ and find me to discuss further.
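For context, TransformerEngine's fused module can be dropped in roughly like this (a minimal sketch, assuming the transformer_engine package and its LayerNormMLP module; sizes and dtype handling are illustrative, not SGLang's actual integration):

```python
# Minimal sketch (not SGLang code): swap a LayerNorm + MLP pair for
# TransformerEngine's fused LayerNormMLP. Assumes transformer_engine is
# installed and a CUDA GPU is available; sizes here are illustrative.
import torch
import transformer_engine.pytorch as te

hidden_size, ffn_hidden_size = 4096, 11008
fused_mlp = te.LayerNormMLP(
    hidden_size,
    ffn_hidden_size,
    params_dtype=torch.bfloat16,  # run the fused module directly in bf16
).cuda()

x = torch.randn(8, 128, hidden_size, device="cuda", dtype=torch.bfloat16)
y = fused_mlp(x)  # LayerNorm -> Linear -> activation -> Linear in one module
print(y.shape)
```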
Reward server stability: In large-scale reinforcement learning systems, the reward server must maintain a high level of stability, including capabilities such as load balancing, rate limiting, and long-text processing. While these are not strictly algorithmic requirements, they are critically important.
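To make the rate-limiting requirement concrete, here is a hypothetical token-bucket limiter of the kind a reward server could sit behind; this is not SGLang code or any existing RL framework's API:

```python
# Hypothetical token-bucket rate limiter for a reward server endpoint.
# A minimal sketch of the kind of guardrail described above.
import time
import threading

class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        """Return True if one request may proceed, refilling tokens lazily."""
        with self.lock:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False

bucket = TokenBucket(rate=100.0, capacity=200)  # ~100 reward calls/s, bursts of 200
if not bucket.allow():
    pass  # respond with HTTP 429 and let the RL trainer retry with backoff
```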
@catqaq Currently, we do not have people working on this. Could you recommend anyone to us for this? Also, the RL tracker is here:
Hi team, is there any update on when pipeline parallelism will be integrated into SGLang?
I am interested in adding torchao support for more models. Which model should I start with?
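For context, torchao's post-training quantization is typically a one-call, in-place transform (a sketch assuming a recent torchao exposing quantize_ and int8_weight_only; the toy model is a stand-in for a real SGLang model):

```python
# Sketch: post-training weight-only int8 quantization with torchao.
# Assumes torch and torchao are installed and a CUDA GPU is available;
# the model below is a stand-in, not one of SGLang's model implementations.
import torch
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 11008),
    torch.nn.GELU(),
    torch.nn.Linear(11008, 4096),
).to(torch.bfloat16).cuda()

# quantize_ mutates the module in place, swapping Linear weights
# for int8 weight-only quantized tensors.
quantize_(model, int8_weight_only())

x = torch.randn(1, 4096, dtype=torch.bfloat16, device="cuda")
y = model(x)
print(y.dtype, y.shape)
```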
Is there some basic document explaining the current RL support in SGLang? For example, a simple example of how a developer or user would use it, what the dependencies are, etc. Thanks.
We have created a new cuDNN backend that caches execution graphs (#5505). Its performance is close to the flashinfer backend.
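The caching idea is roughly: pay the graph-construction cost once per tensor shape, then replay the prebuilt graph on later calls. Below is a hypothetical shape-keyed cache illustrating the pattern; build_attention_graph is a placeholder, not the actual code in #5505:

```python
# Hypothetical sketch of shape-keyed execution-graph caching, the general
# pattern behind reusing a prebuilt attention graph across calls.
from typing import Callable, Dict, Tuple

GraphKey = Tuple[int, int, int, int]  # (batch, seq_len, num_heads, head_dim)

class GraphCache:
    def __init__(self, builder: Callable[[GraphKey], Callable]):
        self.builder = builder
        self.cache: Dict[GraphKey, Callable] = {}

    def get(self, key: GraphKey) -> Callable:
        # Build once per unique shape, then replay the cached graph.
        if key not in self.cache:
            self.cache[key] = self.builder(key)
        return self.cache[key]

def build_attention_graph(key: GraphKey) -> Callable:
    # Placeholder for constructing and finalizing a cuDNN execution graph.
    return lambda q, k, v: ...  # would execute the prebuilt graph

cache = GraphCache(build_attention_graph)
run = cache.get((8, 1024, 32, 128))  # cache miss: builds the graph
run = cache.get((8, 1024, 32, 128))  # cache hit: replays it
```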
Currently, the SGLang version is v0.4.5; is there any plan for a v1.0?
You can join us to make this happen!
Is "adaptive speculative decoding according to batch size" referring to this paper? https://arxiv.org/pdf/2412.18910 |
We've integrated a sparse attention mechanism and shown that it improves performance while holding accuracy. Is this welcome? If so, we will raise a PR: #6513
@zhyncs @merrymercy |
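For readers unfamiliar with the technique: sparse attention restricts each query to a small subset of keys. Below is a generic top-k variant in PyTorch, illustrating the idea only and unrelated to the implementation in #6513:

```python
# Illustrative top-k sparse attention: each query attends only to its k
# highest-scoring keys. A generic sketch, not the code in #6513.
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, top_k: int = 64):
    # q, k, v: (batch, heads, seq, head_dim)
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale  # (b, h, sq, sk)
    top_k = min(top_k, scores.shape[-1])
    # Mask everything below each query's k-th largest score.
    thresh = scores.topk(top_k, dim=-1).values[..., -1:]
    scores = scores.masked_fill(scores < thresh, float("-inf"))
    return torch.matmul(F.softmax(scores, dim=-1), v)

q = k = v = torch.randn(1, 8, 512, 64)
out = topk_sparse_attention(q, k, v, top_k=32)
print(out.shape)  # torch.Size([1, 8, 512, 64])
```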
Here is the development roadmap for 2025 H1. Contributions and feedback are welcome (Join Bi-weekly Development Meeting). The previous 2024 Q4 roadmap can be found in #1487.
Focus
Parallelism
Attention Backend
Caching
Kernel
Quantization
RL Framework integration
Core refactor: refactor scheduler.py and model_runner.py to make them more modular
Speculative decoding
Multi-LoRA serving
Hardware
Model coverage
Function Calling
Others
See sglang/docs/references/faq.md (line 3 at commit 8912b76).