Remove all_reduce altogether and shard the optimizer (new WR) #102
Conversation
Thank you very much for the record submission. I'll aim to reproduce it within the next week. It will have priority in case any other submissions come in later.
Also, in my further testing, the loss on the master branch is greater than 3.28 with statistical significance (it's around 3.281). My guess is that it was caused by the change in constants in the "21st record with latest torch" change, as mentioned in 1. Here are the losses for the multiple runs. Both averages are greater than 3.28 with p < 0.05.
Good job!
I'm not sure if
btw, do you have any idea better than re-introducing grouping of parameters by size?
Autocast is not making any difference. I did try to compile the Adam step, but it didn't change the time at all, as the time is dominated by data movement across GPUs, and the computation happens while the communication is still in flight.
For Adam, we don't need any grouping, since parameters can be split arbitrarily and the implementation is simpler. In fact, this approach seems significantly better than ZeRO-1, as we don't need to gather the optimizer params at all. I can't find any other place that uses this idea. For Muon, I can't think of anything other than grouping params by size.
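Not code from this PR, but a minimal sketch of how I read the approach, assuming all parameters are flattened into one contiguous buffer and each rank owns an equal slice; every name here (`sharded_adam_step`, `flat_params`, etc.) is illustrative:

```python
import torch
import torch.distributed as dist

def sharded_adam_step(flat_params, flat_grads, exp_avg, exp_avg_sq, step,
                      lr=3e-4, betas=(0.9, 0.95), eps=1e-8):
    """flat_params/flat_grads: full flattened buffers, identical on every rank.
    exp_avg/exp_avg_sq: Adam state for this rank's shard only (never gathered)."""
    world, rank = dist.get_world_size(), dist.get_rank()
    shard = flat_params.numel() // world          # assume divisibility for brevity
    lo, hi = rank * shard, (rank + 1) * shard

    # reduce_scatter: each rank receives only its slice of the averaged gradient,
    # instead of an all_reduce handing every rank the full gradient.
    grad_shard = torch.empty(shard, dtype=flat_grads.dtype, device=flat_grads.device)
    dist.reduce_scatter_tensor(grad_shard, flat_grads, op=dist.ReduceOp.AVG)

    # Plain Adam math on the local shard; exp_avg/exp_avg_sq stay local forever.
    beta1, beta2 = betas
    exp_avg.mul_(beta1).add_(grad_shard, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad_shard, grad_shard, value=1 - beta2)
    denom = (exp_avg_sq / (1 - beta2 ** step)).sqrt_().add_(eps)
    param_shard = flat_params[lo:hi].clone()
    param_shard.addcdiv_(exp_avg / (1 - beta1 ** step), denom, value=-lr)

    # all_gather only the *parameters* so every rank has the full updated weights
    # for the next forward pass; unlike ZeRO-1, there is no optimizer-state gather.
    dist.all_gather_into_tensor(flat_params, param_shard)
```

Since the Adam update is element-wise, the shard boundary can cut straight through a tensor, which is why no grouping of parameters is needed.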
Yep, I don't expect the compiled Adam to be faster because of the overlapped communication, but it should save some memory haha.
wdym by "this approach seems significantly better than ZeRO-1 as we don't need to gather the optimizer params at all"? IIUC you perform the |
I said optimizer params( |
I have removed autocasts and added torch.compile to the Adam step, as per your comment.
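For concreteness, the compiled Adam math might look roughly like this (a sketch under the same flat-shard assumption as above, not the PR's actual code; the names are made up):

```python
import torch

@torch.compile
def adam_update(param_shard, grad_shard, exp_avg, exp_avg_sq, step,
                lr=3e-4, beta1=0.9, beta2=0.95, eps=1e-8):
    # Element-wise Adam update on this rank's shard; torch.compile can fuse these
    # ops into a few kernels. Passing `step` as a 0-dim tensor rather than a
    # Python int avoids retracing as it increments.
    exp_avg.mul_(beta1).add_(grad_shard, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad_shard, grad_shard, value=1 - beta2)
    denom = (exp_avg_sq / (1 - beta2 ** step)).sqrt() + eps
    param_shard.addcdiv_(exp_avg / (1 - beta1 ** step), denom, value=-lr)
```

As noted above, this shouldn't buy much time since the step is hidden behind communication, but fusing the temporaries can trim some memory.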
I bet you're right that the change in constants induced the extra .001 loss. There's no reason I didn't accept until now, other than that I was dreading/procrastinating figuring out what caused the increase in loss. But now there's a new record that came later which lowers the loss, so I can just accept both. Accepting record.
Not waiting for the optimizer to finish before the next step, as you mentioned, could potentially produce a new record. But it would require establishing statistical significance, because it could change the forward pass rather than being a pure systems win.
hm... a pure systems win that wouldn't mess with the forward pass would be waiting on both optimizers (the Adam and the Muon) at the same time, rather than doing each one sequentially
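A rough sketch of that idea, assuming both optimizers reduce their gradients with async collectives (all names here are hypothetical, not from the repo):

```python
import torch.distributed as dist

def launch_grad_comms(adam_flat_grads, adam_grad_shard,
                      muon_flat_grads, muon_grad_shard):
    # Kick off both optimizers' reduce_scatters without blocking in between,
    # so Adam's and Muon's communication can be in flight at the same time...
    h_adam = dist.reduce_scatter_tensor(adam_grad_shard, adam_flat_grads,
                                        op=dist.ReduceOp.AVG, async_op=True)
    h_muon = dist.reduce_scatter_tensor(muon_grad_shard, muon_flat_grads,
                                        op=dist.ReduceOp.AVG, async_op=True)
    # ...then wait on both together, instead of finishing one optimizer
    # end-to-end before starting the other. The local Adam/Muon math on the
    # shards would follow here.
    h_adam.wait()
    h_muon.wait()
```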
@vagrawal any accounts you want me to plug in the X.com announcement?
This change replaces `all_reduce` with `reduce_scatter` and shards the optimizer parameters correspondingly, saving 2-2.5 ms/batch in runtime over the current WR (100.5 ms vs ~103 ms on my machine). It also reduces memory for the Adam parameters, which were previously replicated on all nodes.

I also experimented with not waiting for the parameter update to finish before starting the next batch, which seems to work fine and saves another 1 ms. Just comment out the `TODO` section to test it.
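As a rough illustration of the "don't wait before the next batch" experiment (my reading of it, not the actual `TODO` code; `flat_params` is assumed to be a flat view aliasing the model's weights, and the other names are made up):

```python
import torch
import torch.distributed as dist

def deferred_weight_gather(model, flat_params, updated_param_shard, next_inputs):
    # Launch the all_gather of the updated weight shards but do not block on it.
    handle = dist.all_gather_into_tensor(flat_params, updated_param_shard,
                                         async_op=True)

    # Anything that does not read the weights (data loading, host-side
    # bookkeeping) can overlap with the gather.
    inputs = next_inputs.to("cuda", non_blocking=True)

    # Wait only right before the weights are actually needed; dropping this wait
    # entirely is the riskier variant discussed above, since the forward pass
    # could then run on partially updated weights.
    handle.wait()
    return model(inputs)
```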