Fix FP8 tests, enable FP8 to be used without direct Accelerator() configuring #3677


Merged: 3 commits merged into huggingface:main on Jul 15, 2025

Conversation

@pstjohn (Contributor) commented on Jul 10, 2025

What does this PR do?

It looks like the current FP8 tests are not passing; this PR slightly refactors those tests and makes a few fixes to get them passing.

It also adds (and fixes) new tests that ensure FP8 functionality can be configured entirely from an accelerate config yaml, for cases where the user has no control over how the Accelerator() object is created. This currently appears to be broken, since choosing the FP8 backend requires passing kwargs_handlers to Accelerator(). With the transformers.Trainer class, for instance, the Accelerator() object is created under the hood, so this change should enable FP8 training with that class simply by editing the accelerate config yaml.
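For reference, a minimal sketch of what such a config could look like. The fp8_config keys mirror the yaml diff further down in this PR; mixed_precision and the backend key are assumptions about the full schema, and the values are only illustrative:

```yaml
# accelerate config yaml (sketch): pick the FP8 backend without touching Accelerator()
mixed_precision: fp8
fp8_config:
  backend: TE                                      # assumed key for selecting the FP8 backend
  fp8_format: E4M3
  interval: 1
  margin: 0
  override_linear_precision: [false, false, false]
```

With something like this in place, code that builds Accelerator() internally (such as transformers.Trainer) would pick up the FP8 settings from the config file alone.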

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@muellerzr
@zach-huggingface

All tests pass locally with the huggingface/accelerate:gpu-fp8-transformerengine-nightly container after installing deepspeed, with the exception of

FAILED tests/test_metrics.py::MetricTester::test_metric_accelerator_multi - RuntimeError: 'accelerate launch --num_processes=2 --monitor_interval=0.1 /workspaces/accelerate/src/accelerate/test_utils/scripts/external_deps/test_metrics.py' failed with returncode 1

stderr: [rank1]:   File "/root/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--glue/05234ba7acc44554edcca0978db5fa3bc600eeee66229abe79ff9887eacaf3ed/glue.py", line 84, in simple_accuracy
stderr: [rank1]:     return float((preds == labels).mean())
stderr: [rank1]:                  ^^^^^^^^^^^^^^^^^^^^^^
stderr: [rank1]: AttributeError: 'bool' object has no attribute 'mean'

but that test fails in the same way for me on main, so I don't think it's related to this change.

@pstjohn marked this pull request as ready for review on July 11, 2025 01:12
@S1ro1 (Member) commented on Jul 11, 2025

Thank you! Very happy to iterate on this when I get back from vacation; until then, cc @SunMarc for general design.

All tests pass locally with the huggingface/accelerate:gpu-fp8-transformerengine-nightly container after installing deepspeed, with the exception of

Yes, that test was broken in yesterday's CI because of a backward-compatibility (BC) break in datasets; it should work now.

@@ -7,7 +7,7 @@ RUN pip install transformers evaluate datasets
 RUN git clone https://github.com/huggingface/accelerate.git

 RUN cd accelerate && \
-    pip install -e . && \
+    pip install -e .[deepspeed] && \
pstjohn (Contributor, Author) commented:

This is the container used for the FP8 CI tests, which include DeepSpeed. So even though we don't use DeepSpeed in these benchmark scripts (which we should double-check are still functional 😄), this allows the requires_deepspeed tests to run as part of the FP8 test suite.

@@ -11,8 +11,8 @@ fp8_config:
   fp8_format: E4M3
   interval: 1
   margin: 0
-  override_linear_precision: (false, false, false)
+  override_linear_precision: [false, false, false]
pstjohn (Contributor, Author) commented:

This isn't exercised anywhere in CI, but I caught the bug while using this config to debug locally 🤷. The parenthesized () expression just gets parsed as a string rather than a list.
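A quick sketch of the parsing difference (using PyYAML directly; accelerate's actual config loading may differ in details, but the YAML semantics are the same):

```python
import yaml

# YAML has no tuple literal, so the parenthesized form is read as one plain string,
# while the bracketed form is a flow-style sequence of booleans.
broken = yaml.safe_load("override_linear_precision: (false, false, false)")
fixed = yaml.safe_load("override_linear_precision: [false, false, false]")

print(broken["override_linear_precision"])  # '(false, false, false)'  (a str)
print(fixed["override_linear_precision"])   # [False, False, False]    (a list of bool)
```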

A reviewer (Member) commented:

nice catch

@@ -33,6 +33,8 @@
 import torch.utils.hooks as hooks
 from huggingface_hub import split_torch_state_dict_into_shards

+from accelerate.utils.dataclasses import FP8BackendType
pstjohn (Contributor, Author) commented:

We already had this enum, so I figured it was worth using it here instead of the string comparisons.
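A tiny illustration of why the enum is preferable to raw strings (member names other than AO are assumptions based on the backend names discussed in this PR):

```python
from accelerate.utils.dataclasses import FP8BackendType

backend = "TE"
# A string comparison with a typo silently evaluates to False and is easy to miss...
print(backend == "TRANSFORMER_ENGINE")   # False, no error raised

# ...whereas a mistyped enum member fails immediately at the point of use.
print(FP8BackendType.TE == FP8BackendType.TE)  # True
# FP8BackendType.TRANSFORMER_ENGINE           # would raise AttributeError
```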

# Prioritize AO -> TE -> MSAMP
if is_torchao_available():
    logger.info("Found `torchao` installed, using it for FP8 training.")
if self.fp8_backend == FP8BackendType.AO:
pstjohn (Contributor, Author) commented:

Here we first defer to the fp8_backend specified in the yaml, and only fall back to the AO -> TE -> MSAMP preference order if none was specified.
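A rough sketch of that selection order (not the literal accelerator.py code; is_torchao_available appears in the snippet above, while the other availability helpers and the non-AO enum members are assumptions):

```python
from accelerate.utils import (
    is_msamp_available,
    is_torchao_available,
    is_transformer_engine_available,
)
from accelerate.utils.dataclasses import FP8BackendType


def resolve_fp8_backend(configured: "FP8BackendType | None") -> "FP8BackendType":
    # A backend named explicitly in the accelerate config yaml wins.
    if configured is not None:
        return configured
    # Otherwise fall back to the AO -> TE -> MSAMP preference order.
    if is_torchao_available():
        return FP8BackendType.AO
    if is_transformer_engine_available():
        return FP8BackendType.TE
    if is_msamp_available():
        return FP8BackendType.MSAMP
    raise ImportError("FP8 requested but no FP8 backend (torchao, transformer_engine, msamp) is installed.")
```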

if not args.test_te and not args.test_ao:
    raise ValueError("Must specify at least one of --test_te or --test_ao")

if args.test_te:
pstjohn (Contributor, Author) commented:

Rather than checking whether TE is available twice (once when dispatching the test and once inside the test), we just check once when deciding which tests to run. This way we don't end up running the same tests twice when both TE and AO are installed, and the failures are more fine-grained.
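A sketch of what that dispatch looks like (the flag names come from the snippet above; the availability helpers are assumed to be importable from accelerate.utils):

```python
import argparse

from accelerate.utils import is_torchao_available, is_transformer_engine_available

parser = argparse.ArgumentParser()
parser.add_argument("--test_te", action="store_true")
parser.add_argument("--test_ao", action="store_true")
args = parser.parse_args()

if not args.test_te and not args.test_ao:
    raise ValueError("Must specify at least one of --test_te or --test_ao")

# Availability is checked once, at dispatch time, so a missing dependency fails
# the specific flag that needs it and the TE/AO suites never both run by accident.
if args.test_te:
    assert is_transformer_engine_available(), "--test_te requested but transformer_engine is not installed"
    # run the TransformerEngine FP8 tests here
if args.test_ao:
    assert is_torchao_available(), "--test_ao requested but torchao is not installed"
    # run the torchao FP8 tests here
```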

Comment on lines +501 to +505
if (
    self.fp8_backend == FP8BackendType.AO
    and self.state.distributed_type == DistributedType.FSDP
    and self.state.fsdp_plugin.cpu_ram_efficient_loading
):

A reviewer commented:

I was about to submit a PR that made this exact change :)

pstjohn (Contributor, Author) replied:

Yeah, IIRC this was where some of the existing fp8 tests were failing.

@SunMarc (Member) left a comment:

Nice job fixing these! Really appreciate it!

@SunMarc requested a review from IlyasMoutawwakil on July 15, 2025 12:54
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@IlyasMoutawwakil (Member) left a comment:

LGTM! Thanks for the new tests!

@SunMarc merged commit 847ae58 into huggingface:main on Jul 15, 2025
25 checks passed