Description
For extreme GPU memory savings, we currently use communication queues for the NVLink and RDMA buffers. This means tokens cyclically reuse a small fixed-size buffer: when the queue is full, no new tokens are transmitted, and transfers only occur once the queue has free slots.
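For concreteness, the core of such a queue looks roughly like the sketch below. This is a minimal hypothetical illustration, not our actual internode implementation: the `TokenQueue` layout, counters, and constants are all assumed names, and a real queue would additionally need per-slot ready flags and memory fences. It shows where the polling latency and the deadlock risk come from:

```cuda
#include <cuda_runtime.h>

constexpr int NUM_SLOTS = 8;     // small fixed-size ring; cheap on memory
constexpr int TOKEN_DIM = 1024;  // per-token payload size (illustrative)

struct TokenQueue {
    float slots[NUM_SLOTS][TOKEN_DIM];
    // Monotonic counters; the slot for a counter value c is c % NUM_SLOTS.
    unsigned long long head;  // next slot the receiver will consume
    unsigned long long tail;  // next slot the sender will fill
};

// Sender: spin until the ring has a free slot, then claim one. The busy-wait
// is the polling latency mentioned above, and if the receiver ever stalls
// while the ring is full, this loop spins forever (the deadlock hazard).
__device__ int acquire_slot(TokenQueue* q) {
    unsigned long long my_tail = atomicAdd(&q->tail, 1ULL);
    while (my_tail - *(volatile unsigned long long*)&q->head >= NUM_SLOTS) {
        // poll global memory until the consumer frees our slot
    }
    return (int)(my_tail % NUM_SLOTS);
}

// Receiver: after draining a slot, publish progress so senders can reuse it.
__device__ void release_slot(TokenQueue* q) {
    atomicAdd(&q->head, 1ULL);
}
```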
However, this approach has drawbacks: it can potentially cause deadlocks, repeatedly polling the queue introduces latency, and reaching peak performance requires a complex implementation. You can see this reflected in our internode code, where adding new features comes at a significant cost.
If you're referencing our code but want to design your own implementation, we also suggest a simpler overall design for your consideration:
- Allocate buffers directly based on the maximum possible number of tokens (which might be very large)
- This allows direct address calculation when sending, eliminating the need for a dynamic queue (see the first sketch after this list)
- Considering that MoE training tends to have a relatively uniform token distribution once it stabilizes:
    - Implement a dynamic buffer resizing strategy (see the second sketch after this list)
    - Expand the buffer whenever any rank's buffer is too small
    - Shrink the buffer when it has not been fully utilized for an extended period
- SM-free communication might even become achievable, since no kernel has to poll or manage queue state
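To make the direct-addressing idea concrete, the sketch below sizes each rank's receive buffer for the worst case and derives remote write offsets purely arithmetically. Everything here (`MAX_TOKENS_PER_RANK`, `TOKEN_BYTES`, the per-source-rank layout, `recv_offset`) is an illustrative assumption, not our API:

```cuda
#include <cstdint>

constexpr int MAX_TOKENS_PER_RANK = 32768;  // worst-case token count; may be large
constexpr int TOKEN_BYTES = 7168 * 2;       // e.g. bf16 hidden states (illustrative)

// Layout: one contiguous region per source rank inside each receive buffer,
// so writes from different senders can never collide, and a sender can
// compute the remote write address from (src_rank, token_idx) alone.
__host__ __device__ inline uint64_t recv_offset(int src_rank, int token_idx) {
    return (uint64_t)src_rank * MAX_TOKENS_PER_RANK * TOKEN_BYTES +
           (uint64_t)token_idx * TOKEN_BYTES;
}
```

The `token_idx` could come from a per-destination atomic counter or from a prefix sum over the dispatch layout computed ahead of time; either way, nothing polls queue state, which is also what makes the SM-free point above plausible (the NIC or copy engine can write to a precomputed address without a kernel mediating).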
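A possible resize policy for those buffers, again as a hedged host-side sketch: the grow/shrink factors, the patience window, and the assumption that the peak token count comes from an all-reduce of per-rank high-water marks are all illustrative choices to tune for your workload:

```cuda
#include <algorithm>
#include <cstddef>

struct ResizePolicy {
    size_t capacity;           // current per-rank token capacity
    int underused_steps = 0;   // consecutive steps well below capacity
    static constexpr int PATIENCE = 1000;  // quiet steps before shrinking
    static constexpr double GROW = 2.0, SHRINK = 0.5;

    // Call once per step with the maximum token count observed across all
    // ranks (e.g. from an all-reduce of per-rank high-water marks). A changed
    // return value means every rank must reallocate (and re-register RDMA
    // buffers) collectively, so the decision must be identical on all ranks.
    size_t step(size_t global_peak_tokens) {
        if (global_peak_tokens > capacity) {
            // Some rank overflowed its buffer: grow immediately.
            capacity = std::max(global_peak_tokens,
                                (size_t)((double)capacity * GROW));
            underused_steps = 0;
        } else if ((double)global_peak_tokens < (double)capacity * SHRINK) {
            // Persistently underused: shrink only after PATIENCE quiet steps.
            if (++underused_steps >= PATIENCE) {
                capacity = (size_t)((double)capacity * SHRINK);
                underused_steps = 0;
            }
        } else {
            underused_steps = 0;
        }
        return capacity;
    }
};
```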
Overall, this approach might use more GPU memory (the exact amount depends on the specific scenario), but the implementation would be much simpler, new features would be easier to add, and the performance ceiling might even be slightly higher.
Thanks to @KnowingNothing from ByteDance for discussing and suggesting this approach!