[Roadmap] Prefill and Decoding Disaggregation #4655
Comments
Good job, @ByronHsu! The Mooncake team will integrate the Mooncake transfer engine into PD disaggregation ASAP. A related PR will be available soon. Thanks.
Thanks @stmatengss! Please let me know when it's ready as it is our top priority to complete it. We will ensure the review and merge process goes smoothly!
+1. Will be on it ASAP. Cheers for the collaboration.
@ByronHsu @ShangmingCai AMD's support for this and Mooncake will be fully available soon. Thanks.
Hi @ByronHsu, I'll be working on the NVIDIA NIXL integration.
Hi @ByronHsu, I have a question about your design. Is the pre-allocated memory GPU memory or CPU memory? If it is GPU memory, it could use RDMA GPUDirect copies, but the drawback is that the decoder may allocate too much GPU memory before computation even starts.
@thesues I share your view, but I recently looked into Dynamo and found that its PD disaggregation implementation also retains a P2P transmission path. I haven't figured out how it relates to the multi-level cache. Perhaps different transmission paths are needed for different levels of the KV cache, which in turn requires a dedicated design in the upper-level scheduling.
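As a rough illustration of the per-tier routing idea mentioned above (the tier names and transport labels below are hypothetical, not SGLang's or Dynamo's actual API), a scheduler could pick a transport based on where a KV block lives:

```python
# Hypothetical sketch: choose a transfer path per KV-cache tier.
from enum import Enum, auto

class CacheTier(Enum):
    GPU_HBM = auto()   # hot KV blocks resident on the GPU
    CPU_DRAM = auto()  # blocks offloaded to host memory
    SSD = auto()       # cold blocks spilled to local disk

def choose_transfer_path(tier: CacheTier) -> str:
    """Map a KV-cache tier to a transport; the names are illustrative only."""
    if tier is CacheTier.GPU_HBM:
        return "p2p_gpudirect_rdma"   # direct GPU-to-GPU copy, lowest latency
    if tier is CacheTier.CPU_DRAM:
        return "rdma_host_memory"     # RDMA from pinned host buffers
    return "tcp_object_store"         # bulk path for cold data

if __name__ == "__main__":
    for tier in CacheTier:
        print(tier.name, "->", choose_transfer_path(tier))
```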
@ByronHsu I've started a WIP branch here for NIXL: trevor-m@6d862c5
I am currently heavily experimenting with Dynamo integration. Does anyone share the same interest?
Good point, @thesues! That might be the case if decode's memory is constrained. However, the existing design works well for us under reasonable QPS. I recently read this paper https://arxiv.org/html/2501.14743v1, and it suggests a pull-based model that might be worth a try, but it would need some modification to the current KV transfer interface.
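To make the push vs. pull distinction concrete, here is a minimal sketch under assumed, illustrative names (KVHandle, publish, schedule are hypothetical, not the current KV transfer interface): in a push model the prefill worker writes KV blocks into decode's pre-allocated memory, while in a pull model the decode worker fetches them only when it actually schedules the request, so decode-side allocation happens as late as possible.

```python
# Hypothetical pull-based sketch; class and method names are illustrative only.
from dataclasses import dataclass

@dataclass
class KVHandle:
    request_id: str  # small metadata sent to decode instead of the KV bytes themselves

class PrefillWorker:
    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}

    def publish(self, request_id: str, kv_bytes: bytes) -> KVHandle:
        self._store[request_id] = kv_bytes     # KV stays on the prefill side for now
        return KVHandle(request_id)

    def read(self, handle: KVHandle) -> bytes:
        return self._store.pop(handle.request_id)

class DecodeWorker:
    def __init__(self, prefill: PrefillWorker) -> None:
        self.prefill = prefill

    def schedule(self, handle: KVHandle) -> bytes:
        # Allocate and pull only when the request is actually scheduled for decode.
        return self.prefill.read(handle)

prefill = PrefillWorker()
decode = DecodeWorker(prefill)
handle = prefill.publish("req-1", b"\x00" * 16)
print(len(decode.schedule(handle)), "bytes pulled at decode time")
```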
@ByronHsu Whether using the pull or push model, this functionality could be hidden within the transfer engine. The inference framework would simply place tokens into the transfer engine and consume them from it, essentially treating the transfer engine as a queue. This could keep SGLang simpler, and we can offer an abstraction layer to hide the different transfer engines.
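A minimal sketch of this queue-style abstraction layer, assuming hypothetical names (KVTransferEngine, put/get) rather than SGLang's actual interface; a concrete backend such as Mooncake or NIXL would sit behind the same interface:

```python
# Hypothetical abstraction layer: the framework only sees a queue-like interface;
# push vs. pull and the concrete backend are hidden behind it.
from abc import ABC, abstractmethod
import queue

class KVTransferEngine(ABC):
    @abstractmethod
    def put(self, request_id: str, kv_blocks: bytes) -> None: ...
    @abstractmethod
    def get(self, request_id: str) -> bytes: ...

class InProcessEngine(KVTransferEngine):
    """Trivial single-process backend used only to show the queue semantics."""
    def __init__(self) -> None:
        self._queues: dict[str, queue.Queue] = {}

    def put(self, request_id: str, kv_blocks: bytes) -> None:
        self._queues.setdefault(request_id, queue.Queue()).put(kv_blocks)

    def get(self, request_id: str) -> bytes:
        return self._queues.setdefault(request_id, queue.Queue()).get()

engine: KVTransferEngine = InProcessEngine()
engine.put("req-42", b"kv-cache-bytes")   # prefill side
print(engine.get("req-42"))               # decode side
```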
The design doc is very carefully considered; maybe we can stand at a higher layer so we can move forward faster. For example, scatter-gather elements (SGE) in RDMA are useful, but the same can be done over common network transports, as is widely done in the Linux kernel.
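For reference, a tiny sketch of scatter-gather over an ordinary socket (POSIX-only; the KV chunk contents are made up): several non-contiguous buffers are handed to the kernel in a single call, which is the same idea SGE provides in RDMA, without first copying them into one contiguous buffer in user space.

```python
# Scatter-gather I/O over a plain socket (sendmsg/writev under the hood).
import socket

left, right = socket.socketpair()

kv_chunks = [b"layer0-keys", b"layer0-values", b"layer1-keys"]  # non-contiguous buffers
left.sendmsg(kv_chunks)          # one syscall, the kernel gathers the chunks

data = right.recv(4096)
print(data)                      # b'layer0-keyslayer0-valueslayer1-keys'

left.close()
right.close()
```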
I'm interested. Will reach out in the SGL Slack.
Hi @Venkat2811, I'm also working on a PR for the Rust PD load balancer. Maybe we can work on it together?
Hi @ByronHsu, currently P/D disaggregation seems to lack support for DP attention. For example, when enabling DP,
Can I join the PD task? @ByronHsu
Hi @ByronHsu, I'm interested in this part. Could I take it? Cc @stmatengss @ShangmingCai
New NIXL transfer engine PR: #5477
Hi @ByronHsu, I notice that in earlier designs, KV transfer was designed to be layer-by-layer and chunk-by-chunk. Was there any particular consideration behind removing this part of the design?
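For readers unfamiliar with the earlier idea, here is an illustrative sketch (not the current SGLang implementation; the queue and function names are made up): the KV cache of layer i is handed to a sender as soon as prefill for that layer finishes, so the transfer overlaps with the remaining layers' compute.

```python
# Layer-by-layer KV hand-off overlapping with prefill compute (illustrative only).
import threading
import queue

NUM_LAYERS = 4
send_q: queue.Queue = queue.Queue()

def prefill_forward():
    for layer in range(NUM_LAYERS):
        kv = f"kv-of-layer-{layer}".encode()   # stand-in for the real KV tensors
        send_q.put((layer, kv))                # hand off immediately, no wait for all layers
    send_q.put(None)                           # end-of-request marker

def kv_sender():
    while (item := send_q.get()) is not None:
        layer, kv = item
        print(f"transferring layer {layer}: {len(kv)} bytes")  # real code would do RDMA here

t = threading.Thread(target=kv_sender)
t.start()
prefill_forward()
t.join()
```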
I encountered this error while running in a container environment:
@CSEEduanyu Does your environment support GDR (GPUDirect RDMA)? If not, your environment cannot run PD with SGLang, and Mooncake will report failures when registering your GPU memory.
I can run gdrcopy_copybw:
@CSEEduanyu Please open an issue in the Mooncake repo, and provide a detailed log to help us identify the root cause of your problem.
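One quick sanity check worth running (a hedged sketch, Linux-only): gdrcopy_copybw working only exercises the GDRCopy path (gdrdrv), while RDMA NICs register GPU memory through the nvidia_peermem (or older nv_peer_mem) kernel module, so it can be useful to confirm that module is actually loaded.

```python
# Check whether a GPUDirect RDMA peer-memory kernel module is loaded (Linux).
from pathlib import Path

def gdr_module_loaded() -> bool:
    modules = Path("/proc/modules").read_text()
    return any(name in modules for name in ("nvidia_peermem", "nv_peer_mem"))

if __name__ == "__main__":
    print("GPUDirect RDMA peer-memory module loaded:", gdr_module_loaded())
```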
Is there a comprehensive benchmark to verify the improvement of PD disaggregation? @ByronHsu
I ran PD disaggregation on an 8*L20 server. I used Docker with image
and got the
My nvidia-smi topo is
I have tried to set
@wqlxx Try removing this config instead of setting it to False? Also, please make sure that the Docker container has sudo permission and is run in privileged mode.
Design:
SGLang PD Disaggregation (Open Source)
Progress
Release initial code @ByronHsu [PD] Release initial code #4654
Mooncake integration @ShangmingCai https://github.com/sgl-project/sglang/pulls?q=is%3Apr+mooncake+is%3Aopen
NIXL Integration @trevor-m [PD] Add NIXL transfer backend #5477
PD + overlap schedule @ByronHsu
PD + DP attention @ch-wan @ByronHsu
PD + fault tolerance [PD] Abort request if transfer fails #6504 [PD] Handle P/D failure and reconnect without affecting other instances #6263
PD + spec decode [PD] support spec decode #6507
PD + logprob [PD] Support logprob & Add failure test #6558
PD + Structured Output [PD] Support structured output #6560
PD + retract @Ying1123
PD + different TPs - call out for contribution [PD] Add support for different TP sizes per DP rank #5922
Rust PD Load Balancer @hnyls2002 Init PD Rust LB (PO2) #6437
PD + ROCm (Mooncake) @HaiShaw