
Parallelism config + TP + HSDP + BYODM (Bring Your Own Device Mesh) #3682


Merged
merged 76 commits into huggingface:main on Jul 30, 2025

Conversation

@SalmanMohammadi (Contributor) commented on Jul 15, 2025

What does this PR do?

Building on #3651

Dependencies:

  • Fix TP logic in Transformers
  • Support ParallelismConfig in HFTrainer

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

cc @S1ro1
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Salman Mohammadi added 2 commits July 21, 2025 12:13
@SalmanMohammadi changed the title from "[WIP] Parallelism config + BYODM (Bring Your Own Device Mesh)" to "[WIP] Parallelism config + TP + HSDP + BYODM (Bring Your Own Device Mesh)" on Jul 21, 2025
@winglian force-pushed the device_mesh_parallelism_config branch from d9aec5c to a402faf on July 26, 2025 13:48
Comment on lines 2775 to 2777
    clip_context_manager = implicit_replication
else:
    clip_context_manager = contextlib.nullcontext
Member
Nice! We'll be able to clean up some of the trainer code after this.
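For readers following along, the quoted lines pick the context manager that gradient clipping will run under. A minimal sketch of that pattern, assuming `implicit_replication` is importable from `torch.distributed.tensor.experimental`; `clip_gradients` and `using_tp` are hypothetical names, not the PR's actual trainer code:

# Sketch only: mirrors the quoted diff's choice of context manager, so that
# clipping plain tensors alongside DTensors works when TP is active.
import contextlib

import torch
from torch.distributed.tensor.experimental import implicit_replication  # assumed import path


def clip_gradients(model: torch.nn.Module, max_norm: float, using_tp: bool):
    # Choose the context manager the same way the quoted diff does.
    clip_context_manager = implicit_replication if using_tp else contextlib.nullcontext
    with clip_context_manager():
        return torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)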

def build_device_mesh(self, device_type: str):
    mesh = self.get_mesh()
    if not len(list(mesh)):
        return
Member

Yeah, we should probably raise an error, but to be honest we don't really need to handle this case.
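A hypothetical sketch of that suggestion: raise instead of silently returning when the configuration yields an empty mesh. The names mirror the quoted snippet; this is not the code that was merged.

def build_device_mesh(self, device_type: str):
    mesh = self.get_mesh()
    if not len(list(mesh)):
        # Fail loudly instead of silently skipping mesh construction.
        raise ValueError(
            "Parallelism configuration produced an empty device mesh; "
            "enable at least one parallelism dimension."
        )
    ...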

@@ -2,6 +2,23 @@

This folder contains examples of using FSDP2 with Accelerate, utilizing extra methods to improve training speed, performance or accuracy.

### FSDP2 + ND Parallelism

With `AccelerateDistributedConfig`, you can use 🤗 accelerate to train with n-dimensional parallelism. Script `nd_parallel.py` showcases just how you can do it. We enable you to configure 3 different parallel dimensions:
Member

You mean `ParallelConfig`, no?

Suggested change
With `AccelerateDistributedConfig`, you can use 🤗 accelerate to train with n-dimensional parallelism. Script `nd_parallel.py` showcases just how you can do it. We enable you to configure 3 different parallel dimensions:
With `ParallelConfig`, you can use 🤗 accelerate to train with n-dimensional parallelism. Script `nd_parallel.py` showcases just how you can do it. We enable you to configure 3 different parallel dimensions:

Contributor Author

I think we have both because of the duplicate config upstream in transformers - but it would be good to clarify which to use.

Member

It would be better to use the one from accelerate.
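For context, a minimal sketch of using the accelerate-side config to set the three parallel dimensions the README describes, assuming the `ParallelismConfig` API this PR series adds; the exact import path and argument names here are assumptions:

# Sketch only: configure n-dimensional parallelism through accelerate.
from accelerate import Accelerator
from accelerate.parallelism_config import ParallelismConfig  # assumed import path

# 2-way replicated DP x 2-way sharded DP (FSDP) x 2-way TP = 8 processes total.
pc = ParallelismConfig(dp_replicate_size=2, dp_shard_size=2, tp_size=2)
accelerator = Accelerator(parallelism_config=pc)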

@SunMarc (Member) left a comment

LGTM! Just a few nits.

<Tip>
Only use TP intra-node; therefore the maximum TP size you should need is 8. You can also lower this, as FSDP (`--dp-shard-size`) can be faster on smaller models with shorter sequence lengths. If you still cannot fit into memory, increase `--dp-shard-size` as much as you can. Then, to scale up and utilize all of your GPUs, fill the rest with `--dp-replicate-size`. This is only a general guideline; you can (and should) experiment with different parallelism configurations to find the best one for your model and hardware. You can learn more about general strategies for parallelism in our [blog](TODO), or if you want to dive deep, read the [Ultra-Scale Playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook).
Member

remove TODO

Member

Well, the blog isn't ready, so we kind of need to keep the TODO there, haha (we'll finish it before release).
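To make the sizing guideline in the quoted tip concrete, a small illustration, assuming the three dimensions must multiply to the world size; all numbers are examples, not defaults:

# Sketch only: example sizing for 2 nodes x 8 GPUs each.
world_size = 16
tp_size = 8           # keep TP intra-node, so at most 8
dp_shard_size = 2     # shard (FSDP) until the model fits in memory
dp_replicate_size = world_size // (tp_size * dp_shard_size)  # fill the rest

assert dp_replicate_size * dp_shard_size * tp_size == world_size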

Comment on lines 752 to 753
def parallelism_config(self) -> ParallelismConfig | None:
    return self.state.parallelism_config
Member

The `|` union syntax only works on py3.10+, but we still need to support py3.9.

Member

We will drop py3.9 in October, by the way!
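A minimal py3.9-compatible sketch of the quoted property, using `typing.Optional` instead of the PEP 604 `|` syntax (evaluating `ParallelismConfig | None` at runtime requires Python 3.10 unless `from __future__ import annotations` defers evaluation):

# Sketch only: same behavior as the quoted lines, but the annotation works on py3.9.
from typing import Optional


def parallelism_config(self) -> Optional["ParallelismConfig"]:
    # A string annotation also avoids evaluating the name at definition time.
    return self.state.parallelism_config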

@SunMarc mentioned this pull request on Jul 30, 2025
@S1ro1 merged commit 9359a01 into huggingface:main on Jul 30, 2025
25 checks passed