[DONT MERGE] Debug branch for the Qwen3 + full model compile #2831


Open · wants to merge 1 commit into main

Conversation

@anijain2305 commented Jun 16, 2025

Take this PR. It just turns on compile and full model compilation, and then sets up the config values so that the program finishes quickly. Use the instructions at the beginning of this file: https://github.com/pytorch/torchtune/blob/main/recipes/configs/qwen3/8B_full_single_device.yaml
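
For concreteness, a hypothetical sketch of the kind of overrides this branch makes in that config. The keys are standard torchtune recipe options, but the specific values here are assumptions for illustration, not the branch's actual diff:

```yaml
# Hypothetical overrides (assumed values, not the actual commit)
compile: True            # turn on torch.compile
epochs: 1                # assumed: a single epoch
max_steps_per_epoch: 5   # assumed: few steps so the run finishes quickly
```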

With full-model torch.compile, fine-tuning takes 198 seconds (including warmup). Eager takes only 48 seconds. With regional compile (compiling only the transformer layers), it takes 35 seconds. So something really bad is going on with full-model compilation.
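
For context, here is a minimal sketch contrasting the two modes. This is not torchtune's actual code (which lives in `torchtune/training/_compile.py`, the file flagged by the lint failure below); `model.layers` is an assumed attribute name for the stack of transformer blocks:

```python
import torch
import torch.nn as nn

def compile_full(model: nn.Module) -> nn.Module:
    # Full-model compile: Dynamo traces the entire model as one graph,
    # which is where the long warmup is being spent.
    return torch.compile(model)

def compile_regional(model: nn.Module) -> nn.Module:
    # Regional compile: compile each transformer layer in place. Since the
    # layers share the same forward code, the compiled artifact can be
    # reused across them, so warmup cost is roughly one layer rather than
    # the whole model.
    for layer in model.layers:  # assumed attribute holding the layer stack
        layer.compile()  # in-place torch.compile (PyTorch >= 2.2)
    return model
```

The 35 s regional number amortizing warmup as expected is consistent with the regression being specific to tracing the full model.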


pytorch-bot bot commented Jun 16, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2831

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 2 Cancelled Jobs

As of commit 39b9193 with merge base 2344509:

NEW FAILURES - The following jobs have failed:

  • GPU tests / gpu_test (3.11, stable) (gh)
    tests/recipes/test_full_finetune_distributed.py::TestFullFinetuneDistributedRecipe::test_training_state_on_resume_from_distributed_checkpoint_multi_rank[llama3/8B_full-llama3_hf_138m-1-4-False]
  • Lint / lint (3.10) (gh)
    torchtune/training/_compile.py:13:1: F401 'torchtune.modules.TransformerSelfAttentionLayer' imported but unused

CANCELLED JOBS - The following jobs were cancelled. Please retry:

  • GPU tests / gpu_test (3.10, stable) (gh)
    tests/recipes/test_full_finetune_distributed.py::TestFullFinetuneDistributedRecipe::test_training_state_on_resume_from_distributed_checkpoint_multi_rank[llama3/8B_full-llama3_hf_138m-1-4-False]
  • GPU tests / gpu_test (3.9, stable) (gh)
    tests/recipes/test_full_finetune_distributed.py::TestFullFinetuneDistributedRecipe::test_training_state_on_resume_from_distributed_checkpoint_multi_rank[llama3/8B_full-llama3_hf_138m-1-4-False]

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Jun 16, 2025