Fix FP8 tests, enable FP8 to be used without direct Accelerator() configuring #3677


Merged: 3 commits merged into huggingface:main on Jul 15, 2025

Conversation

@pstjohn (Contributor) commented on Jul 10, 2025

What does this PR do?

It looks like the current FP8 tests are not passing; this PR slightly refactors those tests and makes a few fixes to get them passing.

It also adds (and fixes) new tests that ensure FP8 functionality can be configured entirely from an accelerate config yaml, for cases where the user has no control over how the Accelerator() object is created. This currently appears to be broken, since choosing the FP8 backend requires passing kwargs_handlers to Accelerator(). With the transformers.Trainer class, for instance, the Accelerator() object is created under the hood, so this change should enable FP8 training with that class simply by editing the accelerate config yaml.
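For reference, a minimal sketch of what such a config could look like. The fp8_config keys mirror the yaml diff further down in this PR; mixed_precision and the backend key are assumptions about the full schema, and the values are only illustrative:

```yaml
# accelerate config yaml (sketch): pick the FP8 backend without touching Accelerator()
mixed_precision: fp8
fp8_config:
  backend: TE                                      # assumed key for selecting the FP8 backend
  fp8_format: E4M3
  interval: 1
  margin: 0
  override_linear_precision: [false, false, false]
```

With something like this in place, code that builds Accelerator() internally (such as transformers.Trainer) would pick up the FP8 settings from the config file alone.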

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@muellerzr
@zach-huggingface

All tests pass locally with the huggingface/accelerate:gpu-fp8-transformerengine-nightly container after installing deepspeed, with the exception of

FAILED tests/test_metrics.py::MetricTester::test_metric_accelerator_multi - RuntimeError: 'accelerate launch --num_processes=2 --monitor_interval=0.1 /workspaces/accelerate/src/accelerate/test_utils/scripts/external_deps/test_metrics.py' failed with returncode 1

stderr: [rank1]:   File "/root/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--glue/05234ba7acc44554edcca0978db5fa3bc600eeee66229abe79ff9887eacaf3ed/glue.py", line 84, in simple_accuracy
stderr: [rank1]:     return float((preds == labels).mean())
stderr: [rank1]:                  ^^^^^^^^^^^^^^^^^^^^^^
stderr: [rank1]: AttributeError: 'bool' object has no attribute 'mean'

but that test fails in the same way for me on main, so I don't think it's related to this change.

@pstjohn marked this pull request as ready for review on July 11, 2025 01:12
@S1ro1 (Member) commented on Jul 11, 2025

Thank you! Very happy to iterate on this when I get back from vacation; until then, cc @SunMarc for general design.

All tests pass locally with the huggingface/accelerate:gpu-fp8-transformerengine-nightly container after installing deepspeed, with the exception of

Yes, that test was broken in yesterday's CI because of a backward-compatibility (BC) break in datasets; it should work now.

@@ -7,7 +7,7 @@ RUN pip install transformers evaluate datasets
 RUN git clone https://github.com/huggingface/accelerate.git

 RUN cd accelerate && \
-    pip install -e . && \
+    pip install -e .[deepspeed] && \
pstjohn (Contributor, Author) commented:

This is the container used for the FP8 CI tests, which include DeepSpeed. So even though we don't use DeepSpeed in these benchmark scripts (which we should double-check are still functional 😄), this allows the requires_deepspeed tests to run as part of the FP8 test suite.

@@ -11,8 +11,8 @@ fp8_config:
   fp8_format: E4M3
   interval: 1
   margin: 0
-  override_linear_precision: (false, false, false)
+  override_linear_precision: [false, false, false]
pstjohn (Contributor, Author) commented:

This isn't exercised anywhere in CI, but I caught the bug while using this config to debug locally 🤷. The parenthesized () expression just gets parsed as a string rather than a list.
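A quick sketch of the parsing difference (using PyYAML directly; accelerate's actual config loading may differ in details, but the YAML semantics are the same):

```python
import yaml

# YAML has no tuple literal, so the parenthesized form is read as one plain string,
# while the bracketed form is a flow-style sequence of booleans.
broken = yaml.safe_load("override_linear_precision: (false, false, false)")
fixed = yaml.safe_load("override_linear_precision: [false, false, false]")

print(broken["override_linear_precision"])  # '(false, false, false)'  (a str)
print(fixed["override_linear_precision"])   # [False, False, False]    (a list of bool)
```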

A reviewer (Member) commented:

nice catch

@@ -33,6 +33,8 @@
 import torch.utils.hooks as hooks
 from huggingface_hub import split_torch_state_dict_into_shards

+from accelerate.utils.dataclasses import FP8BackendType
pstjohn (Contributor, Author) commented:

We already had this enum, so I figured it was worth using it here instead of the string comparisons.
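A tiny illustration of why the enum is preferable to raw strings (member names other than AO are assumptions based on the backend names discussed in this PR):

```python
from accelerate.utils.dataclasses import FP8BackendType

backend = "TE"
# A string comparison with a typo silently evaluates to False and is easy to miss...
print(backend == "TRANSFORMER_ENGINE")   # False, no error raised

# ...whereas a mistyped enum member fails immediately at the point of use.
print(FP8BackendType.TE == FP8BackendType.TE)  # True
# FP8BackendType.TRANSFORMER_ENGINE           # would raise AttributeError
```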

# Prioritize AO -> TE -> MSAMP
if is_torchao_available():
    logger.info("Found `torchao` installed, using it for FP8 training.")
if self.fp8_backend == FP8BackendType.AO:
pstjohn (Contributor, Author) commented:

Here we first defer to the fp8_backend specified in the yaml, and only fall back to the AO -> TE -> MSAMP preference order if none was specified.
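A rough sketch of that selection order (not the literal accelerator.py code; is_torchao_available appears in the snippet above, while the other availability helpers and the non-AO enum members are assumptions):

```python
from accelerate.utils import (
    is_msamp_available,
    is_torchao_available,
    is_transformer_engine_available,
)
from accelerate.utils.dataclasses import FP8BackendType


def resolve_fp8_backend(configured: "FP8BackendType | None") -> "FP8BackendType":
    # A backend named explicitly in the accelerate config yaml wins.
    if configured is not None:
        return configured
    # Otherwise fall back to the AO -> TE -> MSAMP preference order.
    if is_torchao_available():
        return FP8BackendType.AO
    if is_transformer_engine_available():
        return FP8BackendType.TE
    if is_msamp_available():
        return FP8BackendType.MSAMP
    raise ImportError("FP8 requested but no FP8 backend (torchao, transformer_engine, msamp) is installed.")
```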

if not args.test_te and not args.test_ao:
    raise ValueError("Must specify at least one of --test_te or --test_ao")

if args.test_te:
pstjohn (Contributor, Author) commented:

Rather than checking whether TE is available twice (once when dispatching the test and once inside the test), we just check once when deciding which tests to run. This way we don't end up running the same tests twice when both TE and AO are installed, and the failures are more fine-grained.
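A sketch of what that dispatch looks like (the flag names come from the snippet above; the availability helpers are assumed to be importable from accelerate.utils):

```python
import argparse

from accelerate.utils import is_torchao_available, is_transformer_engine_available

parser = argparse.ArgumentParser()
parser.add_argument("--test_te", action="store_true")
parser.add_argument("--test_ao", action="store_true")
args = parser.parse_args()

if not args.test_te and not args.test_ao:
    raise ValueError("Must specify at least one of --test_te or --test_ao")

# Availability is checked once, at dispatch time, so a missing dependency fails
# the specific flag that needs it and the TE/AO suites never both run by accident.
if args.test_te:
    assert is_transformer_engine_available(), "--test_te requested but transformer_engine is not installed"
    # run the TransformerEngine FP8 tests here
if args.test_ao:
    assert is_torchao_available(), "--test_ao requested but torchao is not installed"
    # run the torchao FP8 tests here
```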

Comment on lines +501 to +505
if (
    self.fp8_backend == FP8BackendType.AO
    and self.state.distributed_type == DistributedType.FSDP
    and self.state.fsdp_plugin.cpu_ram_efficient_loading
):

A reviewer commented:

I was about to submit a PR that made this exact change :)

pstjohn (Contributor, Author) replied:

Yeah, IIRC this was where some of the existing fp8 tests were failing.

@SunMarc (Member) left a comment:

Nice job fixing these! Really appreciate it!

@SunMarc requested a review from IlyasMoutawwakil on July 15, 2025 12:54
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@IlyasMoutawwakil (Member) left a comment:

LGTM! Thanks for the new tests!

@SunMarc merged commit 847ae58 into huggingface:main on Jul 15, 2025
25 checks passed