set backend correctly for CUDA+FSDP2+cpu-offload #3544
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```diff
@@ -214,6 +214,8 @@ def __init__(self, cpu: bool = False, **kwargs):
     if self.backend == "tccl":
         local_rank = os.environ.get("LOCAL_RANK", -1)
         torch.sdaa.set_device(f"sdaa:{local_rank}")
+    if self.backend == "nccl" and os.environ.get("ACCELERATE_USE_FSDP", "false") == "true" and os.environ.get("FSDP_VERSION", "1") == "2":
```
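A minimal sketch of the backend-selection logic this hunk adds, pulled out into a standalone helper for illustration. The function name `select_backend` is hypothetical; the environment variables are the ones the diff reads, plus `FSDP_OFFLOAD_PARAMS`, which is discussed below as an additional check:

```python
import os

def select_backend(default: str = "nccl") -> str:
    """Illustrative sketch: pick the torch.distributed backend string.

    When FSDP2 offloads parameters to CPU, collectives over CPU tensors
    need a CPU-capable backend (GLOO) alongside NCCL for CUDA tensors.
    PyTorch accepts a combined "device:backend" string for this.
    """
    use_fsdp = os.environ.get("ACCELERATE_USE_FSDP", "false") == "true"
    fsdp_v2 = os.environ.get("FSDP_VERSION", "1") == "2"
    offload = os.environ.get("FSDP_OFFLOAD_PARAMS", "false") == "true"
    if default == "nccl" and use_fsdp and fsdp_v2 and offload:
        # Register GLOO for CPU tensors and NCCL for CUDA tensors.
        return "cuda:nccl,cpu:gloo"
    return default
```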
Where does this check for CPU offload? If you aren't offloading, you don't need to initialize a GLOO backend as well as the NCCL one.
@joecummings does it hurt anything to leave cpu:gloo in if not offloading?
@S1ro1 I can also add another check for FSDP_OFFLOAD_PARAMS == "true", lmk
Can do that; it won't cover all the cases, but at least most of them. Though if there's little to no overhead in having both initialised, I'd prefer that.
Heard from Tristan (resident GLOO expert) that you can probably expect some time overhead when initializing a GLOO backend for CPU alongside an NCCL backend. I don't have hard numbers here, however.
FWIW, I believe the only two operations you need to check for are 1) FSDP CPU offload and 2) async checkpointing, b/c these are the only two that use a separate process group in order to block comms as little as possible. You could have a single flag that checks for features that run CPU operations in the background, and then consult it when setting the backend. It would be similar to how we do it in torchtune: https://github.com/pytorch/torchtune/blob/5d51c25cedfb6ba7b00e03cb2fef4f9cdb7baebd/recipes/full_finetune_distributed.py#L142
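The flag-based approach described above could be sketched like this. `FSDP_OFFLOAD_PARAMS` appears earlier in the thread; the `ACCELERATE_ASYNC_CHECKPOINTING` variable and both function names are assumptions for illustration, not real Accelerate config:

```python
import os

def offloads_to_cpu_in_background() -> bool:
    """True when a separate CPU (GLOO) process group is worth creating.

    Per the discussion, only two features push work onto CPU in the
    background: FSDP CPU offload and async checkpointing.
    """
    return (
        os.environ.get("FSDP_OFFLOAD_PARAMS", "false") == "true"
        # Hypothetical env var, standing in for an async-checkpointing flag.
        or os.environ.get("ACCELERATE_ASYNC_CHECKPOINTING", "false") == "true"
    )

def pick_backend() -> str:
    # Add a CPU-capable GLOO backend only when background CPU work exists,
    # avoiding the init-time overhead of an unused GLOO group.
    return "cuda:nccl,cpu:gloo" if offloads_to_cpu_in_background() else "nccl"
```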
Thanks for the explanation @joecummings ! Let's just check for offload then and add a comment for async checkpointing for when we will add this.
Superseded by #3574
What does this PR do?
With FSDP2, `fsdp_offload_params: true` doesn't work and gives this error: https://gist.github.com/winglian/700e5ab9021911f6d1c095b4dbdec5e5
Fixes # (issue)
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- Did you write any new necessary tests?
Who can review?
@S1ro1
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.