Fix FP8 tests, enable FP8 to be used without direct `Accelerator()` configuring (#3677)
Conversation
Thank you, very happy to iterate on this when I get back from vacation; until then, cc @SunMarc for general design.
Yes, that test was broken on yesterday's CI because of a BC in
```diff
@@ -7,7 +7,7 @@ RUN pip install transformers evaluate datasets
 RUN git clone https://github.com/huggingface/accelerate.git

 RUN cd accelerate && \
-    pip install -e . && \
+    pip install -e .[deepspeed] && \
```
This is the container used for the fp8 CI tests, which include deepspeed. So even though we don't use deepspeed in these benchmark scripts (which we should double-check are still functional 😄), this allows the `requires_deepspeed` tests to run in the fp8 tests.
```diff
@@ -11,8 +11,8 @@ fp8_config:
   fp8_format: E4M3
   interval: 1
   margin: 0
-  override_linear_precision: (false, false, false)
+  override_linear_precision: [false, false, false]
```
This isn't exercised in CI anywhere, but I caught this bug while using this config to debug locally 🤷. The `(false, false, false)` expression just gets evaluated to a string rather than a list.
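To illustrate the YAML behavior behind this fix: parentheses are not sequence syntax in YAML, so the old value parses as one plain string scalar, while square brackets form a flow sequence. A minimal check (assumes PyYAML, which accelerate uses for its config files):

```python
import yaml

# Parentheses are not YAML sequence syntax, so the whole value
# comes back as a single plain string scalar.
as_parens = yaml.safe_load("override_linear_precision: (false, false, false)")
print(type(as_parens["override_linear_precision"]))  # <class 'str'>

# Square brackets are a YAML flow sequence, so the value is a
# real list of booleans.
as_brackets = yaml.safe_load("override_linear_precision: [false, false, false]")
print(as_brackets["override_linear_precision"])  # [False, False, False]
```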
nice catch
```diff
@@ -33,6 +33,8 @@
 import torch.utils.hooks as hooks
 from huggingface_hub import split_torch_state_dict_into_shards

+from accelerate.utils.dataclasses import FP8BackendType
```
We already had this enum, so I figured it was worth using it here instead of the string comparisons.
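For context, the usual pattern here is a `str`-mixin enum: existing string-valued configs keep working, but comparisons become typo-safe and discoverable. This is a hypothetical sketch of that pattern, not accelerate's actual `FP8BackendType` definition:

```python
from enum import Enum

# Hypothetical sketch: a str-valued enum accepts raw config strings
# via value lookup and still compares equal to them.
class FP8BackendType(str, Enum):
    TE = "TE"
    MSAMP = "MSAMP"
    AO = "AO"

backend = FP8BackendType("AO")       # parse from a config string
print(backend == FP8BackendType.AO)  # True: enum-to-enum comparison
print(backend == "AO")               # True: str mixin keeps raw strings working
```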
```python
# Prioritize AO -> TE -> MSAMP
if is_torchao_available():
    logger.info("Found `torchao` installed, using it for FP8 training.")
if self.fp8_backend == FP8BackendType.AO:
```
Here we first defer to the `fp8_backend` specified in the yaml, and only if one wasn't specified do we fall back to the AO -> TE -> MSAMP preference.
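The dispatch order described above can be sketched as follows (hypothetical helper and parameter names, not accelerate's actual code):

```python
# Hypothetical sketch of the selection order: an explicitly configured
# backend always wins; otherwise fall back to availability in
# AO -> TE -> MSAMP order.
def choose_fp8_backend(configured, ao_available, te_available, msamp_available):
    if configured is not None:
        return configured  # the yaml's fp8_backend takes priority
    if ao_available:
        return "AO"
    if te_available:
        return "TE"
    if msamp_available:
        return "MSAMP"
    raise RuntimeError("No FP8 backend available")

print(choose_fp8_backend("TE", True, True, False))  # explicit config wins: TE
print(choose_fp8_backend(None, False, True, True))  # fallback order picks TE
```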
```python
if not args.test_te and not args.test_ao:
    raise ValueError("Must specify at least one of --test_te or --test_ao")

if args.test_te:
```
Rather than checking whether TE is available twice (once when dispatching the test and once inside the test), we just check once when deciding which tests to run. This way we don't end up running the same tests twice when both TE and AO are installed, and the failures are more fine-grained.
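The idea of checking availability once at dispatch time can be sketched like this (hypothetical function and flag names, not the test suite's actual code):

```python
# Hypothetical sketch: build the list of test suites to run up front,
# failing fast per backend, instead of re-checking availability inside
# each individual test.
def collect_fp8_tests(test_te, test_ao, te_available, ao_available):
    suites = []
    if test_te:
        if not te_available:
            raise RuntimeError("--test_te requested but TransformerEngine is not installed")
        suites.append("te")
    if test_ao:
        if not ao_available:
            raise RuntimeError("--test_ao requested but torchao is not installed")
        suites.append("ao")
    return suites

print(collect_fp8_tests(True, True, True, True))  # each suite listed exactly once
```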
```python
if (
    self.fp8_backend == FP8BackendType.AO
    and self.state.distributed_type == DistributedType.FSDP
    and self.state.fsdp_plugin.cpu_ram_efficient_loading
):
```
I was about to submit a PR that made this exact change :)
Related: huggingface/transformers#39370
yeah IIRC this was where some of the existing fp8 tests were failing
Nice job fixing these! Really appreciate it!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
LGTM! Thanks for the new tests!
What does this PR do?
It looks like the current fp8 tests are not passing, this PR slightly refactors these tests and makes a few fixes to get them to pass.
It also adds (and fixes) new tests that ensure FP8 functionality can be configured entirely from an accelerate config yaml in cases where the user doesn't have the ability to change how the `Accelerator()` object is created. This currently seems broken, since you need to be able to pass `kwargs_handlers` to choose the FP8 backend. With the `transformers.Trainer` class, for instance, the `Accelerator()` object is created under the hood, so this should enable FP8 training with that class simply by changing the accelerate config yaml.
Who can review?
@muellerzr
@zach-huggingface
All tests pass locally with the `huggingface/accelerate:gpu-fp8-transformerengine-nightly` container after installing deepspeed, with the exception of one test; but that test fails in the same way for me on `main`, so I don't think it's related to this change.