
SymmetricMemory-based, low contention intra-node all-gather and reduce-scatter #130583


Closed
wants to merge 6 commits

Conversation

yifuwang
Collaborator

@yifuwang yifuwang commented Jul 11, 2024

Stack from ghstack (oldest at bottom):

# NOTE [low-contention collectives]
# When a collective is overlapped with abundant compute, it makes sense to
# prioritize reducing the contention between the collective and the overlapped
# compute, even at the cost of a slightly slower collective.
#
# Common collective implementations (e.g., NCCL without user buffer
# registration) optimize for throughput with no ambient compute. However, such
# implementations may not be optimal when they are overlapped with compute:
# - These implementations typically fuse the entire collective into a single
#   kernel and reserve SM resources based on the most demanding portion of the
#   collective, even when a large portion of the collective does not require
#   this many resources.
# - These implementations often use SM-based P2P copy as opposed to copy
#   engine-based P2P copy. Copy engine-based P2P copy may not have a significant
#   advantage when there's no ambient compute. However, it may significantly
#   improve overall resource utilization in the presence of ambient compute.
#
# When overlapped with intensive compute (e.g., persistent matmul kernels), the
# SM usage of a collective can lead to inefficient overlapping.
#
# Low-contention collectives achieve their goals with the following strategies:
# - Use copy engine-based copy whenever possible.
# - Break down portions of a collective with different resource requirements
#   into multiple kernels. This improves overlapping efficiency at the cost of
#   additional launch overhead.
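
To make the copy-engine point concrete, here is a minimal sketch (not this PR's implementation; the two-GPU setup, tensor names, and sizes are assumptions) of how a P2P copy issued on a dedicated stream can be serviced by the copy engine and overlap with ambient compute on the SMs:

```python
import torch

# Assumes at least two peer-accessible GPUs on the node.
torch.cuda.set_device(0)
src = torch.randn(1 << 20, device="cuda:0")
dst = torch.empty_like(src, device="cuda:1")

comm_stream = torch.cuda.Stream(device=0)
with torch.cuda.stream(comm_stream):
    # A device-to-device copy enqueued on its own stream can be serviced by
    # the copy engine rather than SMs, leaving the SMs free for the matmul
    # issued below on the default stream.
    dst.copy_(src, non_blocking=True)

# Ambient compute proceeds concurrently on the default stream.
a = torch.randn(4096, 4096, device="cuda:0")
b = torch.randn(4096, 4096, device="cuda:0")
c = a @ b

# Re-synchronize before consuming dst together with c.
torch.cuda.current_stream().wait_stream(comm_stream)
```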

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o


pytorch-bot bot commented Jul 11, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/130583

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit a766c55 with merge base 1624798:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Jul 11, 2024
yifuwang pushed a commit that referenced this pull request Jul 11, 2024
…e-scatter

ghstack-source-id: d817675
Pull Request resolved: #130583
Contributor

@weifengpy weifengpy left a comment


Exciting direction! I find it really helpful to document the motivation inside the code.

Noob questions:

  • Is "copy engine-based P2P copy" for inter-node or intra-node? The unit test seems to cover only the intra-node case.
  • For 1D FSDP, you previously suggested splitting the NCCL kernels into inter-node and intra-node parts. I guess this is the intra-node part, using "GPUDirect P2P" (aka "copy engine-based P2P copy")?
  • Is it easy to create SM contention in a unit test by overlapping a matmul with NCCL collectives, e.g., along the lines of the sketch below? (Not a blocker for this PR.)
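
A rough sketch of the kind of micro-test the last question describes (assumes an NCCL process group is already initialized, e.g. via torchrun; names and sizes are illustrative):

```python
import torch
import torch.distributed as dist

def timed_overlap(dim: int = 4096) -> float:
    # Time an async all_gather overlapped with a matmul on the same GPU.
    device = torch.device("cuda", torch.cuda.current_device())
    inp = torch.randn(dim, dim, device=device)
    outs = [torch.empty_like(inp) for _ in range(dist.get_world_size())]
    a = torch.randn(dim, dim, device=device)
    b = torch.randn(dim, dim, device=device)

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    handle = dist.all_gather(outs, inp, async_op=True)  # NCCL kernel occupies SMs
    c = a @ b                                           # contends for the same SMs
    handle.wait()
    end.record()
    torch.cuda.synchronize()
    # Comparing against running the two steps serially exposes the slowdown
    # caused by SM contention.
    return start.elapsed_time(end)
```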

Comment on lines +539 to +540
- An extra SM-based copy is performed to copy the input data into the
symmetric memory workspace.
Collaborator

For FSDP2, could we directly allocate AG input/output and RS input/output in symmetric memory?

all_gather_output = torch.empty(
    (all_gather_input_numel * world_size,), dtype=dtype, device=device
)
all_gather_input = all_gather_output.narrow(
    0, all_gather_input_numel * rank, all_gather_input_numel
)

reduce_scatter_input = torch.empty(
    (reduce_scatter_input_numel,), dtype=reduce_dtype, device=device
)
foreach_reduce_scatter_copy_in(unsharded_grads, reduce_scatter_input, world_size)

reduce_output = reduce_scatter_input.new_empty((reduce_scatter_output_numel,))

It looks doable because we pre-allocate these as torch.empty or similar.

The part I am not clear on is how the caching allocator works with symmetric memory.

  • AG input/output is allocated in a separate stream (all_gather_copy_in_stream) in the normal code path.
  • RS input is allocated in default/current stream.
  • RS output is allocated in a separate stream (reduce_scatter_stream).
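
For reference, a heavily hedged sketch of what allocating one of these buffers directly in symmetric memory could look like. The helper names (`symm_mem.empty`, `symm_mem.rendezvous`) are assumptions based on the later `torch.distributed._symmetric_memory` Python API and may not match what was available when this PR landed:

```python
import torch
import torch.distributed._symmetric_memory as symm_mem

def alloc_symm_reduce_scatter_input(numel, reduce_dtype, group):
    # Allocate from the symmetric-memory pool instead of the regular caching
    # allocator (helper name assumed; see the note above).
    buf = symm_mem.empty(numel, dtype=reduce_dtype, device="cuda")
    # Rendezvous exchanges handles so peers can map the buffer for copy
    # engine-based P2P access (API shape assumed).
    symm_mem.rendezvous(buf, group)
    return buf
```

How such an allocation interacts with the per-stream caching-allocator behavior described in the bullets above is exactly the open question here.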

Collaborator

Sorry, I missed that these symmetric-memory-based collectives follow the functional collective (funcol) signature. I think that should not be a problem, since the stream synchronization should ensure correctness regardless.

The only thing to watch out for is that in our current implementation, the AG input is a view into the AG output.
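
A quick illustrative check (made-up sizes) of the aliasing relationship described above, i.e. the AG input is a view into the AG output, which the symmetric-memory path would need to preserve:

```python
import torch

world_size, rank, shard_numel = 4, 1, 8
all_gather_output = torch.empty(shard_numel * world_size)
all_gather_input = all_gather_output.narrow(0, shard_numel * rank, shard_numel)

# The input shares storage with the output; writes to it land in the
# corresponding shard of the output.
assert all_gather_input.untyped_storage().data_ptr() == all_gather_output.untyped_storage().data_ptr()
all_gather_input.fill_(1.0)
assert all_gather_output[shard_numel * rank : shard_numel * (rank + 1)].eq(1.0).all()
```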

…r and reduce-scatter"

[ghstack-poisoned]
yifuwang pushed a commit that referenced this pull request Jul 15, 2024
…e-scatter

ghstack-source-id: ee344db
Pull Request resolved: #130583
…r and reduce-scatter"

[ghstack-poisoned]
yifuwang pushed a commit that referenced this pull request Jul 17, 2024
…e-scatter

ghstack-source-id: fd336b9
Pull Request resolved: #130583
@yifuwang
Collaborator Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Jul 17, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information, see the pytorch-bot wiki.

…r and reduce-scatter"

[ghstack-poisoned]
yifuwang pushed a commit that referenced this pull request Jul 19, 2024
…e-scatter

ghstack-source-id: a28a292
Pull Request resolved: #130583
@yifuwang
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 job has failed: trunk / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 2, 5, linux.g5.4xlarge.nvidia.gpu)

Details for Dev Infra team. Raised by workflow job.

…r and reduce-scatter"

[ghstack-poisoned]
…r and reduce-scatter"

[ghstack-poisoned]
yifuwang pushed a commit that referenced this pull request Jul 23, 2024
…e-scatter

ghstack-source-id: d46ebfd
Pull Request resolved: #130583
@yifuwang
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@yifuwang
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information, see the pytorch-bot wiki.

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

bigfootjon pushed a commit that referenced this pull request Jul 24, 2024
…e-scatter (#130583)


Pull Request resolved: #130583
Approved by: https://github.com/weifengpy

(cherry picked from commit 161c18e)
yifuwang pushed a commit to yifuwang/pytorch that referenced this pull request Jul 24, 2024
xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Jul 25, 2024
…e-scatter (pytorch#130583)


Pull Request resolved: pytorch#130583
Approved by: https://github.com/weifengpy
@github-actions github-actions bot deleted the gh/yifuwang/105/head branch August 23, 2024 02:02
Labels
ciflow/trunk · Merged · oncall: distributed · release notes: distributed (c10d)