Feat: context parallel v2.0 #3700

Open · wants to merge 13 commits into main

Conversation

@S1ro1 (Member) commented on Jul 31, 2025

Integrates CP seamlessly with the previously merged `ParallelismConfig`. Builds on top of #3604 (which it supersedes).

S1ro1 force-pushed the feat/context-parallel branch from 35c24ff to 314ccc7 on July 31, 2025 at 19:12
@S1ro1 (Member, Author) commented on Jul 31, 2025

Supersedes #3604, as I can't be bothered with fixing git on that branch.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

S1ro1 changed the title from "[WIP] Feat/context parallel v2.0" to "Feat: context parallel v2.0" on Aug 4, 2025
@SunMarc (Member) left a comment


Nice job! Thanks for integrating CP using parallelism_config! Left a couple of comments.

Comment on lines 103 to 104
- `no_restore_buffers`: The implementation of context parallelism modifies the buffers in-place, converting them to `torch.distributed.tensor.Dtensor`s. After the context manager is exited, a communication kernel would need to be launched to restore the buffers to their original state (usually all-gather). This takes some time, so it is recommended to pass the same tensors as in the `buffers` argument, to avoid unnecessary communication, unless you are sure that you need to use the buffers after the context manager is exited.

Member:

Since it is recommended to pass the same tensors, can't we change the default to that?

S1ro1 (Member, Author):

IMO this should be opt-in instead of opt-out as by default we want to leave the buffers unchanged.
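For illustration, a minimal sketch of the opt-in pattern being discussed, using PyTorch's experimental `context_parallel` primitive directly rather than the wrapper added in this PR; `cp_mesh`, `model`, `input_ids`, and `labels` are placeholder names, not code from this branch:

from torch.distributed.tensor.experimental import context_parallel

buffers = [input_ids, labels]   # placeholder tensors of shape (batch, seq_len)
buffer_seq_dims = [1, 1]        # sequence dimension of each buffer, in order

with context_parallel(
    cp_mesh,                    # placeholder 1-D DeviceMesh over the CP ranks
    buffers=buffers,
    buffer_seq_dims=buffer_seq_dims,
    # Opting in: skip the restore (all-gather) for buffers we won't read after exit.
    no_restore_buffers=set(buffers),
):
    loss = model(input_ids=input_ids, labels=labels).loss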

The context manager takes a few arguments, that are used to configure the context parallelism.

- `buffers`: This is a list of tensors that are to be sharded across the sequence dimension. These tensors are usually input ids, labels and attention mask.
- `buffer_seq_dims`: This is a list of integers that specify the sequence dimension of each buffer, in the same order as the `buffers` list.
Member:

Can we add more details to that somewhere? Some people might not understand what you mean by the sequence dimension of the buffers. Maybe in the example?
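As an illustration of what "sequence dimension" means here, a tiny sketch with made-up shapes (not taken from the PR's example):

import torch

# Each buffer below has shape (batch, seq_len); the axis that grows with the
# sequence length is dimension 1, so that is its "sequence dimension".
input_ids      = torch.randint(0, 32_000, (8, 4096))
labels         = input_ids.clone()
attention_mask = torch.ones(8, 4096, dtype=torch.bool)

buffers = [input_ids, labels, attention_mask]
buffer_seq_dims = [1, 1, 1]   # one entry per buffer, same order as `buffers`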

Comment on lines 61 to 66
# To get the proper loss value, we need to average across devices that are participating in data parallel/context parallel training
loss_reduce_grp = (
accelerator.torch_device_mesh["dp_cp"].get_group()
if accelerator.parallelism_config.dp_cp_dim_names
else None
)
Member:

nice use of the device mesh
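For reference, a rough sketch of how such a group might be used to average a loss across the dp/cp ranks (assuming `loss` is a scalar tensor on each rank and a backend that supports `ReduceOp.AVG`, e.g. NCCL):

import torch.distributed as dist

if loss_reduce_grp is not None:
    # Average in place across every rank participating in data/context parallelism.
    dist.all_reduce(loss, op=dist.ReduceOp.AVG, group=loss_reduce_grp)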

Comment on lines 92 to 93
state_dict_type="SHARDED_STATE_DICT",
activation_checkpointing=True,
Member:

Why add that?

S1ro1 (Member, Author):

My bad, this was supposed to be removed. It's needed for the 128k seq-len example I was testing, so I accidentally left it in.

Comment on lines 452 to 457
# TODO: Siro - figure out a better place where this can go (needs to be above AcceleratorState init)
if parallelism_config and parallelism_config.cp_enabled and fsdp_plugin is None:
raise ValueError(
"`cp_enabled` is set to `True` in the `parallelism_config`, but no `fsdp_plugin` was provided. We need a `fsdp_plugin` to use `cp_enabled=True`, as we also shard the model across the device mesh to save more memory"
)

Member:

Why? Can't we put that in `_validate_accelerator`?

@S1ro1 (Member, Author) commented on Aug 4, 2025:

In AcceleratorState we already need to set things based on CP being enabled, and by that point we would get a default fsdp_plugin out, so there would be no way to detect that it was not passed in.

EDIT: moved this into AcceleratorState (still the more viable option imo).

Comment on lines +985 to +986
self.parallelism_config is not None and self.parallelism_config.cp_enabled
):
Member:

Not sure this is needed, no? In that case FSDP is not used, but we still set self.distributed_type = DistributedType.FSDP.

SunMarc requested a review from winglian on August 4, 2025 at 11:32
@SunMarc (Member) left a comment

Nice! LGTM!

Comment on lines +984 to +987
if self.parallelism_config and self.parallelism_config.cp_enabled and fsdp_plugin is None:
raise ValueError(
"`cp_size > 1` in the `parallelism_config`, but no `fsdp_plugin` was provided. We need a `fsdp_plugin` to use `cp_enabled=True`, as we also shard the model across the device mesh to save more memory"
)
Member:

This might not be optimal for downstream libs like Axolotl. cc @winglian

Collaborator:

Yeah, we definitely support CP without FSDP, so this would break that feature. Maybe some other sort of explicit setting to indicate that a user is letting accelerate handle CP for them? @djsaunde

@djsaunde commented on Aug 4, 2025:

I'm curious why this is gated in the first place, can we not use CP in accelerate sans FSDP? They should be independent.

@S1ro1 (Member, Author) commented on Aug 4, 2025:

> I'm curious why this is gated in the first place, can we not use CP in accelerate sans FSDP? They should be independent.

They are (sort of), but FSDP is a free lunch with CP, so imo it should be the default. While we're computing the ring-attention we can prefetch the next FSDP layer for free, giving us 1/(cp_size * fsdp_size) savings on model/optimizer/grads. I have some profiling for this in the concept guide.

TL;DR: it can be independent, but there's (almost) no world where it's worth it to not do FSDP on top.
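A rough back-of-the-envelope illustration of that factor, with made-up numbers rather than anything measured in this PR:

params = 7e9                 # hypothetical 7B-parameter model
bytes_per_param = 2          # bf16 weights
cp_size, fsdp_size = 4, 2    # hypothetical mesh sizes

unsharded_gb = params * bytes_per_param / 1e9      # ~14 GB of weights per rank without sharding
sharded_gb = unsharded_gb / (cp_size * fsdp_size)  # ~1.75 GB per rank when FSDP shards over dp_shard x cp
print(f"{unsharded_gb:.2f} GB -> {sharded_gb:.2f} GB per rank")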

@@ -1497,6 +1502,9 @@ def prepare(self, *args, device_placement=None):
if self.parallelism_config and self.parallelism_config.tp_enabled:
args = self._prepare_tp(*args)

if self.parallelism_config and self.parallelism_config.cp_enabled:
args = self._prepare_cp(*args)
Collaborator:

I am worried that automatically handling this might break the existing context-parallel handling we have in Axolotl. @djsaunde

Comment:

Yeah, I think it will. ring-flash-attn (what we use) supports non-causal masks, so the mask deletion / replacement with causal=True is not good for us. We could maybe patch over this? I like the way pre-hooks are handled here in the accelerator, so we could swap to that instead of our context manager.

@S1ro1 (Member, Author) commented on Aug 4, 2025:

You can probably just delete the hook? It's always going to be the first one, as I use prepend=True, so it should be pretty simple! The hook is the only thing we add for CP to work; beyond that you'd only need to use the context manager (which you won't).
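A rough sketch of what "just delete the hook" could look like on the Axolotl side; this is hypothetical, relies on the private `_forward_pre_hooks` attribute, and assumes the CP hook is the first (prepended) entry on every self_attn module:

for name, module in model.named_modules():
    if name.endswith("self_attn") and module._forward_pre_hooks:
        # The CP hook was registered with prepend=True, so it is the first entry.
        first_handle_id = next(iter(module._forward_pre_hooks))
        del module._forward_pre_hooks[first_handle_id]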

Comment on lines +214 to +220
self.dp_replicate_size = int(os.environ.get("PARALLELISM_CONFIG_DP_REPLICATE_SIZE", "1"))
if self.dp_shard_size is None:
self.dp_shard_size = int(os.environ.get("PARALLELISM_CONFIG_DP_SHARD_SIZE", "1"))
if self.tp_size is None:
self.tp_size = int(os.environ.get("PARALLELISM_CONFIG_TP_SIZE", "1"))
if self.cp_size is None:
self.cp_size = int(os.environ.get("PARALLELISM_CONFIG_CP_SIZE", "1"))
Collaborator:

aren't most accelerate env vars prefixed with ACCELERATE_?

Member:

Oh yes, good catch. Can you fix that, @S1ro1?

S1ro1 (Member, Author):

Not entirely true actually: paradigm variables are prefixed with ACCELERATE_ (such as use_fsdp, use_deepspeed), while variables configuring the underlying implementations are prefixed with the impl name (i.e. FSDP_, MEGATRON_LM_), etc. Such as here or here.

Afaik only DeepSpeed is special and prefixes with ACCELERATE_DEEPSPEED_, making the env variables insanely long.
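A few concrete examples of that convention as I understand it (the variable names below are ones Accelerate already reads; treat the list as illustrative, not exhaustive):

import os

os.environ["ACCELERATE_USE_FSDP"] = "true"                      # paradigm toggle, ACCELERATE_-prefixed
os.environ["FSDP_AUTO_WRAP_POLICY"] = "TRANSFORMER_BASED_WRAP"  # FSDP-specific knob, FSDP_-prefixed
os.environ["ACCELERATE_DEEPSPEED_ZERO_STAGE"] = "3"             # the DeepSpeed exception, doubled-up prefix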

Comment on lines +758 to +761
This function attaches forward_pre_hooks to each self_attn module of the model, where each hook checks the
args/kwargs, if they contain an attention mask, if it does, it will remove this mask, check if it is a causal mask,
if yes, will add a kwarg `is_causal=True`, otherwise will raise an error. This is because context parallelism does
not support attention masks. This function modifies the model in place.
Comment:

Nice, I was wondering how to handle this myself.

Comment:

Oh, I see you don't actually check whether the mask is causal yet.

S1ro1 (Member, Author):

Yes, we should do that eventually and will revisit; currently we only warn in the docs.
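For illustration, a minimal sketch (not the actual implementation in this PR) of a forward pre-hook along the lines of that docstring; `model`, the module-name matching, and the kwarg names are assumptions:

def _cp_attention_pre_hook(module, args, kwargs):
    # Drop the attention mask and signal causal attention instead, since the
    # CP ring-attention path does not support arbitrary masks. (The "is it
    # actually causal?" check discussed above is still missing here too.)
    if kwargs.get("attention_mask") is not None:
        kwargs["attention_mask"] = None
        kwargs["is_causal"] = True
    return args, kwargs

for name, module in model.named_modules():
    if name.endswith("self_attn"):
        # prepend=True keeps this hook first, matching the detail discussed earlier.
        module.register_forward_pre_hook(_cp_attention_pre_hook, with_kwargs=True, prepend=True)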

if not args.use_parallelism_config:
return current_env

prefix = "PARALLELISM_CONFIG_"
Collaborator:

Suggested change:
-prefix = "PARALLELISM_CONFIG_"
+prefix = "ACCELERATE_PARALLELISM_CONFIG_"

Similar to comment above about accelerate using namespaced env vars
