Skip to content

set backend correctly for CUDA+FSDP2+cpu-offload #3574

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 15, 2025
Merged

set backend correctly for CUDA+FSDP2+cpu-offload #3574

merged 3 commits into from
May 15, 2025

Conversation

SunMarc
Copy link
Member

@SunMarc SunMarc commented May 15, 2025

What does this PR do?

Supersedes #3544

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@SunMarc SunMarc merged commit cd37bbb into main May 15, 2025
28 of 29 checks passed
@SunMarc SunMarc deleted the offload-fsdp branch May 15, 2025 09:38
@universuen
Copy link
Contributor

@SunMarc Hi, it is really a nice patch! However, I found a corner case when setting fsdp using kwargs like this:

Accelerator(
    gradient_accumulation_steps=1,
    mixed_precision='bf16',
    fsdp_plugin=FullyShardedDataParallelPlugin(
        fsdp_version=2,
        cpu_offload=True,
    ),
)

Currently, I have to set the backend explicitly to avoid the error, but didn't have time to find a final solution to this.

Accelerator(
    gradient_accumulation_steps=1,
    mixed_precision='bf16',
    fsdp_plugin=FullyShardedDataParallelPlugin(
        fsdp_version=2,
        cpu_offload=True,
    ),
    kwargs_handlers=[
        InitProcessGroupKwargs(
            backend="cuda:nccl,cpu:gloo"
        ),
    ]
)

@SunMarc
Copy link
Member Author

SunMarc commented Aug 5, 2025

Indeed that's an edge case that we might need to fix if we want to allow users to depend only on the plugin in the future. cc @S1ro1

@SunMarc
Copy link
Member Author

SunMarc commented Aug 5, 2025

I guess the easiest way for now is to update kwargs that is passed in partial state depending on fsdp_plugin

@S1ro1
Copy link
Member

S1ro1 commented Aug 5, 2025

I guess the easiest way for now is to update kwargs that is passed in partial state depending on fsdp_plugin

I think we should just set gloo by default together with nccl if fsdp2 is happening, i.e. async checkpointing I work on also requires gloo so I feel like defaulting to both is sensible, even costing a little overhead in launch

@SunMarc
Copy link
Member Author

SunMarc commented Aug 5, 2025

Okay, then we can do that in the async checkpoint pr

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants