
Remove all_reduce altogether and shard the optimizer (new WR) #102


Merged
merged 3 commits into KellerJordan:master on Jul 14, 2025

Conversation

vagrawal
Contributor

@vagrawal vagrawal commented May 30, 2025

This change replaces all_reduce with reduce_scatter and shards the optimizer parameters correspondingly, saving 2-2.5 ms/batch in runtime over the current WR (100.5 vs ~103 ms on my machine). It also reduces memory for the Adam parameters, which were previously replicated on all nodes.
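A minimal sketch of the pattern, assuming all parameters are viewed as a single flat buffer whose length divides evenly across ranks (names like `flat_params` are illustrative, not the exact submission code):

```python
import torch
import torch.distributed as dist

def sharded_adam_step(flat_params, flat_grads, exp_avg, exp_avg_sq,
                      lr, beta1, beta2, eps, step):
    """reduce_scatter the grads, update only this rank's shard, all_gather the params.

    exp_avg / exp_avg_sq hold Adam state for this rank's shard only, so the
    optimizer state is never replicated across nodes.
    """
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    shard_size = flat_params.numel() // world_size

    # Each rank receives the (averaged) gradients for its own shard only.
    grad_shard = torch.empty(shard_size, device=flat_grads.device, dtype=flat_grads.dtype)
    dist.reduce_scatter_tensor(grad_shard, flat_grads, op=dist.ReduceOp.AVG)

    # Local Adam update on the shard this rank owns.
    param_shard = flat_params[rank * shard_size:(rank + 1) * shard_size]
    exp_avg.mul_(beta1).add_(grad_shard, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad_shard, grad_shard, value=1 - beta2)
    bias1 = 1 - beta1 ** step
    bias2 = 1 - beta2 ** step
    param_shard.addcdiv_(exp_avg / bias1, (exp_avg_sq / bias2).sqrt().add_(eps), value=-lr)

    # Broadcast the updated shards so every rank holds the full parameters again.
    dist.all_gather_into_tensor(flat_params, param_shard)
```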

I also experimented with not waiting for the parameter update to finish before starting the next batch, which seems to work fine and saves another 1 ms. Just comment out the TODO section to test it.

@vagrawal vagrawal changed the title from "Remove all_reduce altogether and shard the optimizer" to "Remove all_reduce altogether and shard the optimizer (new WR)" on May 30, 2025
@KellerJordan
Owner

Thank you very much for the record submission. I'll aim to reproduce it within the next week. It will take priority over any other submissions that come in later.

@vagrawal
Contributor Author

vagrawal commented Jun 2, 2025

Also, in my further testing, the loss on the master branch is greater than 3.28 with statistical significance (it's around 3.281). My guess is that this was caused by the change in constants in the "21st record with latest torch" change, as mentioned in 1.

Here are the losses for multiple runs. Both averages are greater than 3.28 with p < 0.05.

losses_upstream = [3.2836, 3.2801, 3.2798, 3.2796, 3.2785, 3.2811, 3.2806, 3.2807, 3.2815, 3.2822, 3.2808, 3.2813, 3.2806, 3.2801, 3.2799, 3.2828, 3.2831, 3.2794, 3.2806, 3.2799, 3.2794]

losses_noallreduce = [3.28, 3.2817, 3.2805, 3.2796, 3.2772, 3.2807, 3.2841, 3.2829, 3.2818, 3.2819, 3.2817, 3.2822, 3.2806]
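For reference, the p < 0.05 claim can be reproduced with a one-sided one-sample t-test against 3.28 (a quick sanity check using scipy, not part of the submission):

```python
from scipy import stats  # requires scipy >= 1.6 for the `alternative` argument

losses_upstream = [3.2836, 3.2801, 3.2798, 3.2796, 3.2785, 3.2811, 3.2806,
                   3.2807, 3.2815, 3.2822, 3.2808, 3.2813, 3.2806, 3.2801,
                   3.2799, 3.2828, 3.2831, 3.2794, 3.2806, 3.2799, 3.2794]
losses_noallreduce = [3.28, 3.2817, 3.2805, 3.2796, 3.2772, 3.2807, 3.2841,
                      3.2829, 3.2818, 3.2819, 3.2817, 3.2822, 3.2806]

for name, losses in [("upstream", losses_upstream), ("noallreduce", losses_noallreduce)]:
    # H1: the mean validation loss exceeds 3.28.
    t, p = stats.ttest_1samp(losses, 3.28, alternative="greater")
    print(f"{name}: mean={sum(losses) / len(losses):.4f}  t={t:.2f}  p={p:.4g}")
```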

@YouJiacheng
Contributor

Good job!
Btw, it seems that you didn't compile the DistAdam? (So it's not fused.)
In addition, is autocast necessary?

@YouJiacheng
Contributor

YouJiacheng commented Jun 3, 2025

I'm not sure whether autocast will make p.grad bf16, so you might need to use the custom mixed-precision implementation from the 2.92 track to further reduce communication?

@YouJiacheng
Contributor

YouJiacheng commented Jun 3, 2025

Btw, did you have any idea better than re-introducing grouping parameters by size?
I hesitated to implement reduce_scatter because it feels ugly to group parameters by size.
In addition, we should be able to use all_to_all + reduce if we group parameters by size -- that way we can achieve FP32 accumulation precision with BF16 traffic. And all_to_all can be done by the copy engine without using SMs.
see: pytorch/pytorch#130583
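Roughly, the all_to_all + local reduce idea would look like the sketch below (assuming each rank has already packed its gradients into world_size equal-size BF16 chunks, which is exactly where the grouping-by-size question comes in):

```python
import torch
import torch.distributed as dist

def all_to_all_then_reduce(grad_chunks):
    """BF16 on the wire, FP32 accumulation on the receiving rank.

    grad_chunks: list of world_size equal-size BF16 tensors, where chunk i is
    this rank's contribution to the shard owned by rank i.
    Returns this rank's fully reduced shard in FP32.
    """
    recv_chunks = [torch.empty_like(c) for c in grad_chunks]
    # all_to_all only moves data, so NCCL can service it from the copy engine
    # without occupying SMs (see pytorch/pytorch#130583).
    dist.all_to_all(recv_chunks, grad_chunks)
    # The reduction itself happens locally in FP32, so accumulation precision
    # is not limited by the BF16 traffic.
    return torch.stack([c.float() for c in recv_chunks]).sum(dim=0)
```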

@vagrawal
Contributor Author

vagrawal commented Jun 4, 2025

Autocast is not making p.grad bf16. It just lets us remove the type_as from F.linear(x, self.weight.type_as(x)), which to my eyes is a bit uglier-looking than autocast.
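To make the comparison concrete, the two formulations look roughly like this (illustrative shapes; the point is only where the cast lives):

```python
import torch
import torch.nn.functional as F

w = torch.randn(1024, 1024, device="cuda")  # FP32 master weight
x = torch.randn(8, 1024, device="cuda", dtype=torch.bfloat16)

# Option 1: cast explicitly at every call site.
y1 = F.linear(x, w.type_as(x))

# Option 2: let autocast insert the cast; the FP32 parameter (and its .grad)
# stays FP32, since autocast only affects the op's compute dtype.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y2 = F.linear(x, w)
```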

I did try compiling the Adam step, but it didn't change the time at all, since the runtime is dominated by data movement across GPUs and the computation happens in parallel with the reduce_scatter.

@vagrawal
Contributor Author

vagrawal commented Jun 4, 2025

For Adam, we don't need any grouping, since the parameters can be split arbitrarily and the implementation is simpler. In fact, this approach seems significantly better than ZeRO-1, since we don't need to gather the optimizer params at all. I can't find any other place that uses this idea.

For Muon, I can't think of anything other than grouping params by size.
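Concretely, the Adam update is elementwise, so all parameters can be flattened into one buffer and cut at arbitrary offsets, even through the middle of a tensor; Muon's orthogonalization needs whole matrices, which is what forces the grouping. A toy illustration with made-up shapes:

```python
import torch

# Elementwise updates don't care about tensor boundaries, so no grouping by
# size is needed: just flatten everything and split into near-equal shards.
params = [torch.randn(1024, 768), torch.randn(768), torch.randn(3072, 768)]
flat = torch.cat([p.reshape(-1) for p in params])
world_size = 8
shards = flat.chunk(world_size)  # shard boundaries may fall inside a tensor
print([s.numel() for s in shards])
```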

@YouJiacheng
Contributor

Yep, I don't expect the compiled Adam to be faster because of the overlapped communication, but it should save some memory haha.

@YouJiacheng
Contributor

wdym by "this approach seems significantly better than ZeRO-1 as we don't need to gather the optimizer params at all"? IIUC you perform the all_gather in DistAdam.
Oh, I guess you mean you flatten all parameters into one flat tensor?

@vagrawal
Contributor Author

vagrawal commented Jun 5, 2025

I said optimizer params (exp_avg and exp_avg_sq), not the model params. ZeRO-1 only partitions the optimizer params.

@vagrawal
Contributor Author

vagrawal commented Jun 5, 2025

I have removed the autocasts and added torch.compile to the Adam step, as per your comment.
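For reference, compiling the shard update amounts to wrapping the elementwise math in torch.compile (a sketch; the function name and signature are illustrative):

```python
import torch

@torch.compile
def adam_shard_update(param, grad, exp_avg, exp_avg_sq, lr, beta1, beta2, eps, bias1, bias2):
    # Compilation fuses these elementwise ops and avoids materializing the
    # intermediates, saving memory even when the wall-clock time is hidden
    # behind the overlapped reduce_scatter.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    param.addcdiv_(exp_avg / bias1, (exp_avg_sq / bias2).sqrt().add_(eps), value=-lr)
```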

@KellerJordan
Owner

I bet you're right that the change in constants would induce the extra 0.001 loss.

There's no reason I didn't accept this until now other than that I was dreading/procrastinating figuring out what caused the increase in loss. But now there's a new record that came later which lowers the loss, so I can just accept both.

Accepting record

@KellerJordan KellerJordan merged commit 3e121a6 into KellerJordan:master Jul 14, 2025
@KellerJordan
Owner

Not waiting for the optimizer to finish before the next step, as you mentioned, could potentially produce a new record. But it would require gathering statistical significance, because it could change the forward pass rather than being a pure systems win.

@KellerJordan
Owner

Hm... a pure systems win that wouldn't mess with the forward pass would be waiting on both optimizers (the Adam and the Muon) at the same time, rather than doing each one sequentially.
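Sketched out, that would mean launching both optimizers' collectives up front and waiting on all the work handles at a single point (the launch/finish split below is a hypothetical API, not the repo's actual optimizer interface):

```python
def overlapped_step(adam_opt, muon_opt):
    """Kick off both optimizers' collectives, then synchronize once.

    Assumes each optimizer exposes launch(), which enqueues its collectives
    with async_op=True and returns the work handles, and finish(), which
    applies any work that depends on the gathered results.
    """
    adam_handles = adam_opt.launch()
    muon_handles = muon_opt.launch()
    # One synchronization point instead of Adam-then-Muon sequentially;
    # the forward pass is unchanged, so this stays a pure systems win.
    for handle in adam_handles + muon_handles:
        handle.wait()
    adam_opt.finish()
    muon_opt.finish()
```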

@KellerJordan
Owner

@vagrawal any accounts you want me to plug in the X.com announcement?
