Split tutorials to 3 groups #4220

Merged
merged 4 commits into from
May 16, 2025
@pbchekin pbchekin commented May 15, 2025

Fixes #3820.

The run time is reduced from 35m to 23m. "minicore" is now on the critical path.

04-low-memory-dropout
05-layer-norm
07-extern-functions
09-persistent-matmul

Contributor

Looking at the CI time, do we want to move 09-persistent-matmul to mxfp? The "rest" group is still the bottleneck.

Contributor

Looking at the new CI result, it is hard to balance. It looks like 09-persistent-matmul takes a long time; maybe move 06-fused-attention to "rest"?

Contributor Author

Yes, 09 is the slowest. I will try to re-balance.

571.62 09-persistent-matmul
425.86 06-fused-attention
188.46 08-grouped-gemm
143.12 10-experimental-block-pointer
80.68  10i-experimental-block-pointer
76.78  03-matrix-multiplication
62.03  03i-matrix-multiplication
47.10  05-layer-norm
33.16  02-fused-softmax
11.78  04-low-memory-dropout
8.64   01-vector-add
7.09   07-extern-functions
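
For reference, a minimal sketch of how a re-balance could be explored from the timings above. It uses a greedy "longest first" heuristic and assumes three groups that contain only these tutorials; the real CI jobs also run other tests, so the totals are illustrative only. The names `TIMINGS` and `balance` are hypothetical and not part of this PR.

```python
import heapq

# Timings copied from the list above (units as reported there).
TIMINGS = {
    "09-persistent-matmul": 571.62,
    "06-fused-attention": 425.86,
    "08-grouped-gemm": 188.46,
    "10-experimental-block-pointer": 143.12,
    "10i-experimental-block-pointer": 80.68,
    "03-matrix-multiplication": 76.78,
    "03i-matrix-multiplication": 62.03,
    "05-layer-norm": 47.10,
    "02-fused-softmax": 33.16,
    "04-low-memory-dropout": 11.78,
    "01-vector-add": 8.64,
    "07-extern-functions": 7.09,
}


def balance(timings, n_groups):
    """Greedy partition: assign the slowest remaining item to the lightest group."""
    heap = [(0.0, idx) for idx in range(n_groups)]  # (current total, group index)
    heapq.heapify(heap)
    groups = [[] for _ in range(n_groups)]
    for name, cost in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
        total, idx = heapq.heappop(heap)
        groups[idx].append(name)
        heapq.heappush(heap, (total + cost, idx))
    totals = [0.0] * n_groups
    for total, idx in heap:
        totals[idx] = total
    return groups, totals


groups, totals = balance(TIMINGS, 3)
for names, total in zip(groups, totals):
    print(f"{total:8.2f}  {', '.join(names)}")
```

Under these assumptions 09-persistent-matmul ends up alone in its group, and its own run time is the lower bound for the slowest group, which is consistent with re-balancing the tutorials alone not helping much.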

@pbchekin pbchekin May 16, 2025

The last run is under 25 minutes; minicore (lts and rolling) is the slowest part now. We can reduce the run time further by adding a new parallel job, splitting minicore, and balancing the workload among the jobs (not in this PR).

UPD: minicore (lts) and scaled_dot (rolling) are both ~17m and are on the critical path now. I could not make it faster by rebalancing tutorials, so the conclusion is the same.

pbchekin added 4 commits May 16, 2025 09:54
Signed-off-by: Pavel Chekin <[email protected]>
Signed-off-by: Pavel Chekin <[email protected]>
Signed-off-by: Pavel Chekin <[email protected]>
@pbchekin pbchekin merged commit 986459a into main May 16, 2025
15 checks passed
@pbchekin pbchekin deleted the split-tutorials branch May 16, 2025 19:37
Development

Successfully merging this pull request may close these issues.

[CI] Ideas to reduce PR build and test time
3 participants