Add support for SPMD PP for deepseek decoder block #1687
Conversation
Quick question to help my understanding while reviewing - why can we only pipeline 56 of the 58 sparse layers for deepseek?
We can pipeline all 58, but the PP degree (number of stages) must divide the number of pipelined layers, and 58 isn't a particularly divisor-friendly number. I updated this in the PR description.
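A small illustrative sketch of that constraint (not code from the PR; the layer counts are just the ones discussed in this thread):

```python
# The number of pipelined layers must be divisible by the PP degree (number of stages).
def valid_pp_degrees(num_pipelined_layers):
  """Return all PP degrees that evenly divide the pipelined layer count."""
  return [pp for pp in range(1, num_pipelined_layers + 1) if num_pipelined_layers % pp == 0]

print(valid_pp_degrees(58))  # [1, 2, 29, 58] -- not divisor friendly
print(valid_pp_degrees(56))  # [1, 2, 4, 7, 8, 14, 28, 56] -- e.g. PP=8 works
```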
LGTM, just one general question for my understanding
y, _ = self.scan_decoder_layers(cfg, dense_layer, cfg.first_num_dense_layers, "dense_layers", mesh)(
    y,
    decoder_segment_ids,
    decoder_positions,
    deterministic,
    model_mode,
)
if num_moe_layers_outside_pp > 0:
  y, _ = self.scan_decoder_layers(cfg, moe_layer, num_moe_layers_outside_pp, "moe_layers", mesh)(
      y,
      decoder_segment_ids,
      decoder_positions,
      deterministic,
      model_mode,
  )
y = self.pipeline_module(y, decoder_segment_ids, decoder_positions, deterministic, model_mode, partition_spec=partition_spec)
Just for my understanding, how are the 56 layers that need to be pipelined being specified here? Is it just because they aren't being used in self.scan_decoder_layers?
56 was just an example; the number of pipelined layers can be set via pipeline_parallel_layers, which was added in a previous PR.
LGTM pending the test failures
Thanks Matt! One more comment: can we add a unit test in moe_test.py like test_megablox_expert_parallelism (maxtext/MaxText/tests/moe_test.py, line 423 in 0add45d)?
remaining_layers = self.config.num_decoder_layers - self.config.pipeline_parallel_layers
if remaining_layers > 0:
  logical_axis_rules_pp_as_dp = maxtext_utils.logical_axis_rules_pp_act_as_dp(self.config.logical_axis_rules)
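For reviewers unfamiliar with that helper, here is a rough sketch of the idea — treating the pipeline mesh axis as if it were data parallelism for the non-pipelined layers. The axis names ("stage", "data") and the rule structure are assumptions for illustration; this is not the actual maxtext_utils implementation:

```python
# Hypothetical sketch only: append the pipeline axis wherever the data axis
# already shards activations, so layers outside the pipeline reuse PP devices as DP.
PIPELINE_AXIS = "stage"  # assumed mesh-axis name
DATA_AXIS = "data"       # assumed mesh-axis name

def logical_axis_rules_pp_act_as_dp_sketch(rules):
  new_rules = []
  for logical_axis, mesh_axes in rules:
    axes = tuple(mesh_axes) if isinstance(mesh_axes, (tuple, list)) else (mesh_axes,)
    if DATA_AXIS in axes and PIPELINE_AXIS not in axes:
      axes = axes + (PIPELINE_AXIS,)
    new_rules.append((logical_axis, axes))
  return new_rules

example_rules = [("activation_batch", ("data", "fsdp")), ("activation_embed", ("tensor",))]
print(logical_axis_rules_pp_act_as_dp_sketch(example_rules))
# [('activation_batch', ('data', 'fsdp', 'stage')), ('activation_embed', ('tensor',))]
```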
Do you think we could add some flexibility here? Instead of having PP act as DP, could it act based on a passed-in config (not sure if that's doable)? DP may not be helpful. For instance, for DS-v3 (61 layers with 58 MoE layers) with pipeline_parallel_layers=56, could we run FSDP or EP from the config below?
pipeline_parallel_layers=56 && ici_fsdp=-1/ici_ep=-1?
With pipeline_parallel_layers=56 there are only 2 MoE layers for which PP is replaced with DP. These two layers will still be sharded by the other sharding strategies - e.g. if the config was EP_ICI=16, FSDP_ICI=16, PP_DCN=16, these two layers will be sharded as EP_ICI=16, FSDP_ICI=16, DP_DCN=16, i.e. the weights are still sharded 256 ways.
I do like the idea of flexibility, but perhaps we can save this as potential future work?
these two layers will be sharded as EP_ICI=16, FSDP_ICI=16, DP_DCN=16, i.e. the weights are still sharded 256 ways
SG! I thought they would only have DP sharding.
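A quick arithmetic check of that point, using the mesh sizes quoted in the comment above (illustrative numbers, not an actual config):

```python
# Mesh sizes from the example above.
ep_ici, fsdp_ici, dp_dcn = 16, 16, 16

# For the MoE layers outside the pipeline, PP is treated as DP. DP replicates
# weights, so weight sharding still comes from EP x FSDP within each slice.
weight_shards = ep_ici * fsdp_ici
print(weight_shards)  # 256 -- the weights are sharded 256 ways, not merely DP-replicated
```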
Description
Add support for using PP with deepseek, including the new feature pipeline_parallel_layers, which pipelines only a subset of layers. This helps with SPMD pipelining, since the PP degree must divide the number of pipelined layers, and deepseek has 58 sparse layers, which does not have many friendly divisors. With this PR we can pipeline just a subset of the sparse layers, e.g. 56 of them with PP=8. The remaining layers are sharded as if PP were DP.
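A sketch of the resulting layer split for the example above (DS-v3-style counts as discussed in this thread; all variable names here are illustrative, apart from pipeline_parallel_layers):

```python
# DeepSeek-v3-style decoder: 61 layers total, 3 dense + 58 sparse (MoE).
num_dense_layers = 3
num_sparse_layers = 58
pipeline_parallel_layers = 56  # config option referenced in this PR
pp_degree = 8                  # example PP degree from the description

assert pipeline_parallel_layers % pp_degree == 0
layers_per_stage = pipeline_parallel_layers // pp_degree               # 7 MoE layers per stage
moe_layers_outside_pp = num_sparse_layers - pipeline_parallel_layers   # 2 layers sharded like DP
print(layers_per_stage, moe_layers_outside_pp)  # 7 2
```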
Tests
Ran some tests locally on a smaller v2-16B model and added an AOT test - I will paste xprofs in a bit.
Checklist
Before submitting this PR, please make sure (put X in square brackets):