Perf scripts updates #14005

guyueh1 · 2025-06-24T18:51:23Z

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do?

Performance scripts updates

Collection: [Note which collection this PR will affect]

Changelog

Use TORCH_NCCL_HIGH_PRIORITY=1 in perf scripts
MXFP8: Enable TP comm overlap
MXFP8: Enable sharing RS buffer for param AG
Use full layer spec for GPT

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you have read and followed the Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

Related to # (issue)


malay-nagda · 2025-06-25T06:41:42Z

scripts/performance/helpers.py

+    # because it is not supported with reuse_grad_buf_for_mxfp8_param_ag
+    if compute_dtype.lower() == "fp8" and fp8_recipe.lower() == "mxfp8":
+        recipe.trainer.strategy.ddp.reuse_grad_buf_for_mxfp8_param_ag = True
+        recipe.trainer.strategy.ddp.overlap_param_gather = False


I don't think we can control/override this config...

It will be set if DP > 1 in MegatronCommOverlapCallback: https://github.com/NVIDIA/NeMo/blob/main/nemo/lightning/pytorch/callbacks/megatron_comm_overlap.py#L188

cc: @JimmyZhang12

Updated the code to use MegatronCommOverlapCallback to set it; I think that will override the inferred config.
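The concern in this thread can be illustrated with a minimal sketch. It uses hypothetical stand-in classes (`DDPConfig`, `CommOverlapCallback`, `configure_mxfp8`), not the real NeMo objects, and it assumes the config in question is `overlap_param_gather` and that the callback infers it as "enabled when DP > 1". Under those assumptions, a value written directly by a script is overwritten at setup time, while a value passed to the callback explicitly is respected.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DDPConfig:
    """Hypothetical stand-in for the DDP settings discussed above."""

    overlap_param_gather: bool = False
    reuse_grad_buf_for_mxfp8_param_ag: bool = False


@dataclass
class CommOverlapCallback:
    """Hypothetical stand-in for a callback that infers comm-overlap settings.

    A field left as None is inferred at setup time; an explicit value wins.
    """

    overlap_param_gather: Optional[bool] = None

    def setup(self, ddp: DDPConfig, data_parallel_size: int) -> None:
        # Assumed inference rule, taken from the review comment: enable when DP > 1.
        inferred = data_parallel_size > 1
        if self.overlap_param_gather is None:
            ddp.overlap_param_gather = inferred
        else:
            ddp.overlap_param_gather = self.overlap_param_gather


def configure_mxfp8(
    compute_dtype: str,
    fp8_recipe: str,
    ddp: DDPConfig,
    callback: CommOverlapCallback,
) -> None:
    """Route the MXFP8 settings through the callback so they survive setup()."""
    if compute_dtype.lower() == "fp8" and fp8_recipe.lower() == "mxfp8":
        ddp.reuse_grad_buf_for_mxfp8_param_ag = True
        callback.overlap_param_gather = False


# Setting the field directly is overwritten by the inferred value.
direct = DDPConfig(overlap_param_gather=False)
CommOverlapCallback().setup(direct, data_parallel_size=8)

# Setting it through the callback is preserved.
routed, cb = DDPConfig(), CommOverlapCallback()
configure_mxfp8("fp8", "mxfp8", routed, cb)
cb.setup(routed, data_parallel_size=8)
```

After this runs, `direct.overlap_param_gather` has been flipped back to the inferred value, while `routed.overlap_param_gather` stays disabled. Whether the real callback behaves this way should be checked against the linked source.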


erhoo82 · 2025-07-03T15:58:47Z

scripts/performance/executors.py

@@ -66,6 +66,7 @@ def slurm_executor(
        "NVTE_FLASH_ATTN": "1",  # Enable Flash Attention, which is needed to enable cuDNN fused attention
        "NVTE_FUSED_ATTN": "1",  # Enable cuDNN fused attention
        "NEMO_LOG_MEMORY_USAGE": "1",  # Print memory allocation
+        "TORCH_NCCL_HIGH_PRIORITY": "1",  # Enable high priority for NCCL communication in pytorch


@guyueh1 Can we move this to here?
https://github.com/NVIDIA/NeMo/blob/main/nemo/lightning/run/plugins.py#L387-L406

Moved it there. Do we need any condition for applying that env var, or should it always be applied (as I do now)?

I think always applying should be fine. Had it enabled for all recent workloads I ran...
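As a rough sketch of "always applying" the variable, the snippet below merges a set of default environment variables into whatever the user supplies. The names `DEFAULT_PERF_ENV_VARS` and `build_env_vars` are hypothetical and are not taken from `nemo/lightning/run/plugins.py`; letting a user-supplied value override the default is also an assumption made here for illustration.

```python
from typing import Dict, Optional

# Defaults applied unconditionally (hypothetical name, for illustration only).
DEFAULT_PERF_ENV_VARS: Dict[str, str] = {
    "TORCH_NCCL_HIGH_PRIORITY": "1",  # High priority for NCCL communication in PyTorch
}


def build_env_vars(user_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return the defaults merged with user-supplied variables.

    Assumption: a user-supplied value takes precedence over the default.
    """
    env = dict(DEFAULT_PERF_ENV_VARS)  # copy so the defaults are never mutated
    env.update(user_env or {})
    return env


env = build_env_vars({"NVTE_FUSED_ATTN": "1"})
```

Here `env` contains both `NVTE_FUSED_ATTN` and `TORCH_NCCL_HIGH_PRIORITY`, and a caller can still opt out by passing `{"TORCH_NCCL_HIGH_PRIORITY": "0"}`.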


guyueh1 added 5 commits June 24, 2025 11:37

Use TORCH_NCCL_HIGH_PRIORITY=1 in perf scripts

74b79e7

Signed-off-by: Guyue Huang <[email protected]>

Enable TP overlap for MXFP8

79c5a9f

Signed-off-by: Guyue Huang <[email protected]>

Enable use_transformer_engine_full_layer_spec for GPT again

79960db

Signed-off-by: Guyue Huang <[email protected]>

Enable reuse_grad_buf_for_mxfp8_param_ag and disable overlap_param_gather for MXFP8

548737c

Signed-off-by: Guyue Huang <[email protected]>

Formatting

0025561

Signed-off-by: Guyue Huang <[email protected]>

guyueh1 requested a review from erhoo82 June 24, 2025 18:51

guyueh1 added Run CICD and r2.4.0 labels Jun 24, 2025

Merge branch 'main' into perf_scripts_updates

5b16091

ko3n1g added Run CICD and removed Run CICD labels Jun 24, 2025

Apply isort and black reformatting

b47719f

Signed-off-by: guyueh1 <[email protected]>

ko3n1g added Run CICD and removed Run CICD labels Jun 24, 2025

malay-nagda reviewed Jun 25, 2025


guyueh1 added 4 commits June 25, 2025 08:42

Rewrite disabling overlap_param_gather

c50a3f4

Signed-off-by: Guyue Huang <[email protected]>

Enabling TP overlap with fsdp back, don't seem to cause an issue

02a0e6a

Signed-off-by: Guyue Huang <[email protected]>

Formatting

6f250ba

Signed-off-by: Guyue Huang <[email protected]>

Merge branch 'perf_scripts_updates' of github.com:guyueh1/NeMo into perf_scripts_updates

98c662b

ko3n1g added Run CICD and removed Run CICD labels Jun 25, 2025

Set reuse_grad_buf_for_mxfp8_param_ag in OptimConfig

6d14d1c

Signed-off-by: guyueh1 <[email protected]>

ko3n1g added Run CICD and removed Run CICD labels Jun 25, 2025

malay-nagda previously approved these changes Jun 26, 2025


guyueh1 added 3 commits June 26, 2025 08:08

Add subchannel scaling option to perf script

089aa1f

Signed-off-by: Guyue Huang <[email protected]>

Use fp8 param gather for mxfp8 precision

4a1dc2e

Signed-off-by: Guyue Huang <[email protected]>

Merge branch 'perf_scripts_updates' of github.com:guyueh1/NeMo into perf_scripts_updates

74aa3eb

guyueh1 dismissed malay-nagda’s stale review via 74aa3eb June 26, 2025 15:11

ko3n1g removed the Run CICD label Jun 26, 2025

github-actions bot added the NLP label Jul 1, 2025

ko3n1g added Run CICD and removed Run CICD labels Jul 1, 2025

Enable average_in_collective for etp!=tp in mixtral8x22 perf script

475da36

Signed-off-by: Guyue Huang <[email protected]>

ko3n1g added Run CICD and removed Run CICD labels Jul 1, 2025

Pylint

54b5de2

Signed-off-by: Guyue Huang <[email protected]>

ko3n1g added Run CICD and removed Run CICD labels Jul 1, 2025

ko3n1g temporarily deployed to test July 1, 2025 22:03 — with GitHub Actions Inactive

Fix the get_layer_offset logic in full te layer

e5f0eb5

Signed-off-by: Guyue Huang <[email protected]>

ko3n1g added Run CICD and removed Run CICD labels Jul 2, 2025

ko3n1g temporarily deployed to test July 2, 2025 16:07 — with GitHub Actions Inactive

malay-nagda previously approved these changes Jul 3, 2025


malay-nagda added RC4 Run CICD and removed Run CICD labels Jul 3, 2025

malay-nagda had a problem deploying to test July 3, 2025 12:20 — with GitHub Actions Error

guyueh1 enabled auto-merge (squash) July 3, 2025 15:39

erhoo82 reviewed Jul 3, 2025


Move TORCH_NCCL_HIGH_PRIORITY to nemo/lightning/run/plugins.py

1830c01

Signed-off-by: Guyue Huang <[email protected]>

guyueh1 dismissed malay-nagda’s stale review via 1830c01 July 3, 2025 16:21

ko3n1g added Run CICD and removed Run CICD labels Jul 3, 2025

ko3n1g had a problem deploying to test July 3, 2025 16:23 — with GitHub Actions Error

erhoo82 approved these changes Jul 3, 2025


erhoo82 added Run CICD and removed Run CICD labels Jul 3, 2025

erhoo82 requested a deployment to test July 3, 2025 17:31 — with GitHub Actions Waiting
