Split tutorials to 3 groups #4220
04-low-memory-dropout
05-layer-norm
07-extern-functions
09-persistent-matmul
Looking at the CI time, do we want to move 09-persistent-matmul to mxfp? rest is still the bottleneck.
Looking at the new CI result, it is hard to balance. It looks like 09-persistent-matmul takes a long time; maybe move 06-fused-attention to rest?
Yes, 09 is the slowest. I will try to re-balance.
571.62 09-persistent-matmul
425.86 06-fused-attention
188.46 08-grouped-gemm
143.12 10-experimental-block-pointer
80.68 10i-experimental-block-pointer
76.78 03-matrix-multiplication
62.03 03i-matrix-multiplication
47.10 05-layer-norm
33.16 02-fused-softmax
11.78 04-low-memory-dropout
8.64 01-vector-add
7.09 07-extern-functions
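For illustration only, the re-balancing problem can be sketched as a greedy "longest job first" assignment over the timings above. This is not the method used in this PR. It assumes the numbers are seconds, that there are exactly 3 groups, and that every group starts empty (the real CI jobs also run other test suites, which this ignores). The `balance` helper and the `TIMINGS` dictionary name are hypothetical.

```python
import heapq

# Timings copied from the comment above; the unit is assumed to be seconds.
TIMINGS = {
    "09-persistent-matmul": 571.62,
    "06-fused-attention": 425.86,
    "08-grouped-gemm": 188.46,
    "10-experimental-block-pointer": 143.12,
    "10i-experimental-block-pointer": 80.68,
    "03-matrix-multiplication": 76.78,
    "03i-matrix-multiplication": 62.03,
    "05-layer-norm": 47.10,
    "02-fused-softmax": 33.16,
    "04-low-memory-dropout": 11.78,
    "01-vector-add": 8.64,
    "07-extern-functions": 7.09,
}


def balance(timings, n_groups=3):
    """Greedy partition: give the slowest remaining item to the lightest group.

    Returns a list of (total_time, [names]) tuples, one per group.
    """
    # Heap entries are (load, group_index, names); the unique index breaks ties
    # so the name lists are never compared.
    heap = [(0.0, i, []) for i in range(n_groups)]
    heapq.heapify(heap)
    for name, cost in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
        load, idx, names = heapq.heappop(heap)
        names.append(name)
        heapq.heappush(heap, (load + cost, idx, names))
    return [(load, names) for load, _, names in sorted(heap, key=lambda g: g[1])]


if __name__ == "__main__":
    for load, names in balance(TIMINGS):
        print(f"{load:8.2f}  {', '.join(names)}")
```

Under these assumptions 09-persistent-matmul ends up alone in its group and still sets the longest group time, which is consistent with the observation that the split is hard to balance by moving tutorials around.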
The last run is under 25 minutes; minicore (lts and rolling) is the slowest part now. We can reduce the run time further by adding a new parallel job, splitting minicore, and balancing the workload among the jobs (not in this PR).
UPD: minicore (lts) and scaled_dot (rolling) are both ~17m and are on the critical path now. I could not make it faster by rebalancing tutorials, so the conclusion is the same.
Signed-off-by: Pavel Chekin <[email protected]>
Fixes #3820.
The run time is reduced from 35m to 23m. Now "minicore" is on the critical path.