
when delaying optimizer creation only prepare the model #39152

Merged: 1 commit, Jul 3, 2025

Conversation

@winglian (Contributor) commented Jul 1, 2025

What does this PR do?

Axolotl's CI caught a regression when we tried to upgrade to the latest transformers: https://github.com/axolotl-ai-cloud/axolotl/actions/runs/15962262932/job/45016550543

PR #36132 introduced a regression breaking FSDP with Llama:

stderr: [rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward
stderr: [rank0]:     inputs_embeds = self.embed_tokens(input_ids)
stderr: [rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
stderr: [rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
stderr: [rank0]:     return self._call_impl(*args, **kwargs)
stderr: [rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
stderr: [rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
stderr: [rank0]:     return forward_call(*args, **kwargs)
stderr: [rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
stderr: [rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/nn/modules/sparse.py", line 190, in forward
stderr: [rank0]:     return F.embedding(
stderr: [rank0]:            ^^^^^^^^^^^^
stderr: [rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/nn/functional.py", line 2551, in embedding
stderr: [rank0]:     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
stderr: [rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
stderr: [rank0]: RuntimeError: Output 0 of ViewBackward0 is a view and its base or another view of its base has been modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.

and FSDP+DPO with Qwen:

stderr: [rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward
stderr: [rank0]:     inputs_embeds = self.embed_tokens(input_ids)
stderr: [rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
stderr: [rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
stderr: [rank0]:     return self._call_impl(*args, **kwargs)
stderr: [rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
stderr: [rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
stderr: [rank0]:     return forward_call(*args, **kwargs)
stderr: [rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
stderr: [rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/nn/modules/sparse.py", line 190, in forward
stderr: [rank0]:     return F.embedding(
stderr: [rank0]:            ^^^^^^^^^^^^
stderr: [rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/nn/functional.py", line 2551, in embedding
stderr: [rank0]:     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
stderr: [rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
stderr: [rank0]: RuntimeError: Output 0 of ViewBackward0 is a view and its base or another view of its base has been modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@kashif added the for patch label (should be included in the next patch) on Jul 1, 2025.
@Cyrilvallez (Member) commented:

cc @SunMarc

@@ -2357,7 +2357,7 @@ def _inner_training_loop(
                    model = self.accelerator.prepare(self.model)
                else:
                    if delay_optimizer_creation:
A collaborator commented on the diff:

cc @IlyasMoutawwakil as you wanted to remove this 👀

@IlyasMoutawwakil (Member) commented Jul 1, 2025:

Yeah, this fixes it too! I honestly don't understand delay_optimizer_creation: delay until when, and why? 😅 It might make sense to explain it somewhere in the trainer.

@IlyasMoutawwakil (Member) commented Jul 1, 2025:

You see, the reason I removed it is that we currently do create the optimizer here, and we need to prepare the FSDP model as well (otherwise FSDP fails), so the two branches of the if statement become the same.
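
For context, here is a minimal, hypothetical sketch of the two code paths being discussed, using standalone accelerate rather than the actual _inner_training_loop; the prepare_for_training helper and the AdamW choice are illustrative assumptions, not Trainer internals:

# Minimal hypothetical sketch of the two paths discussed above; not the real Trainer code.
import torch
from accelerate import Accelerator

def prepare_for_training(model: torch.nn.Module, delay_optimizer_creation: bool):
    accelerator = Accelerator()
    if delay_optimizer_creation:
        # Wrap the model first (FSDP wrapping happens inside prepare), then build the
        # optimizer from the wrapped parameters and prepare only the optimizer.
        model = accelerator.prepare(model)
        optimizer = torch.optim.AdamW(model.parameters())
        optimizer = accelerator.prepare(optimizer)
    else:
        # The optimizer exists up front, so model and optimizer are prepared together.
        optimizer = torch.optim.AdamW(model.parameters())
        model, optimizer = accelerator.prepare(model, optimizer)
    # Either way the model has to go through accelerator.prepare, which is why the
    # two branches end up looking so similar.
    return model, optimizer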

@winglian mentioned this pull request on Jul 1, 2025.
@ArthurZucker merged commit 8178c43 into huggingface:main on Jul 3, 2025. 18 checks passed.
@SunMarc (Member) commented Jul 3, 2025:

cc @kmehant, if you can explain the change you tried to make, that would be helpful!

@kmehant (Contributor) commented Jul 3, 2025:

Hi @SunMarc thanks for looping me in! Appreciate it.

Ideally, this block of code

if delay_optimizer_creation:
    if use_accelerator_prepare:
        # configure fsdp plugin for qlora if any
        self._fsdp_qlora_plugin_updates()
        if self.accelerator.mixed_precision != "fp8":
            self.model = self.accelerator.prepare(self.model)
    self.create_optimizer_and_scheduler(num_training_steps=max_steps)

should be doing model preparation through accelerate even for the FSDP and TP cases (if you remember, in the older version we TP-ized the model inside accelerate's prepare, which is no longer the case, so we are good), so that the model is wrapped first and the block then creates the optimizer from the accelerate-prepared model's parameters. After that, execution reaches the current block
if use_accelerator_prepare:
    self.model.train()
    if hasattr(self.lr_scheduler, "step"):
        if self.use_apex:
            model = self.accelerator.prepare(self.model)
        else:
            if delay_optimizer_creation:
                model = self.accelerator.prepare(self.model)
            else:
                model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)

which is what is being modified here; a second prepare of the model is not needed, rather only of the optimizer, since the previously created optimizer didn't undergo accelerate's prepare. That was the rationale behind this change. Ideally, instead of the change made in this PR, I think we should have simply modified

self.model = self.accelerator.prepare(self.model)

to

model = self.accelerator.prepare(self.model)

here:

self.model = self.accelerator.prepare(self.model)

OR

We could also go back to the older code, since TP-izing the model has been removed from the accelerate prepare step, which works as well:

model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)

for the FSDP case. I can help with a contribution if needed.
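
As a reading aid, here is a minimal sketch of the ordering described above (prepare the model, build the optimizer from the wrapped parameters, then prepare only the optimizer); the standalone Accelerator and the toy nn.Linear model are illustrative assumptions, not the actual Trainer code:

# Hypothetical sketch of the ordering described above; not the actual Trainer source.
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(8, 8)  # stand-in for the real model

# 1. With optimizer creation delayed, wrap the model first (this is where FSDP wrapping happens).
model = accelerator.prepare(model)

# 2. Build the optimizer from the wrapped model's parameters.
optimizer = torch.optim.AdamW(model.parameters())

# 3. Later, only the optimizer still needs preparing; preparing the model a second time
#    is the step argued against above.
optimizer = accelerator.prepare(optimizer)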

Nonetheless, I +1 @IlyasMoutawwakil's suggestion to remove this altogether, since it has always been a confusing parameter to me :)

cc: @ArthurZucker @winglian

@kmehant (Contributor) commented Jul 3, 2025:

@SunMarc @IlyasMoutawwakil @ArthurZucker

PR #39177 is a more correct fix for this bug. The current PR breaks TP training (since prepare is not needed for TP, and enforcing prepare leads to a DDP setup, which fails). That PR fixes both the FSDP and TP cases.
