-
Notifications
You must be signed in to change notification settings - Fork 32
[AMD] Added bufferOps refinement #776
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: refine-ops-pass
Are you sure you want to change the base?
Conversation
9139083
to
dffd13e
Compare
Thanks Ravil! I see that refining buffer loads is a new function from traditional loads. We should think of a way to be able to combine many of these refinement function together to consodidate this code as much as possible. We can discuss offline. Maybe there's a way of keeping the common code in the function, but passing in a functor which does the unique operations. |
@guacamoleo, I tested correctness of the FA kernel with refined buffer ops; numerics is correct |
Great. How about consolidating code? Is there any way to merge the loads into a single refinement function? I'm concerned about all the duplication which has been inherant in our support. If there's no way to do it right now, we'll want to address consolidating refinement code before we try to upstream this. |
1279786
to
fb98a6a
Compare
We discussed this offline; commit looks good after rebasing with branch. |
fb98a6a
to
e857d6c
Compare
dffd13e
to
c75746f
Compare
Just a reminder to look into the issue with 16-bit memory ops resulting https://github.com/ROCm/triton-internal/issues/699#issuecomment-2835306030 in conjunction with this commit. |
8590ec0
to
89c1f5f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we're just copying axisInfo from operand[0] of extract_slice to the extract slice op itself. Is it implicit in doing this that these values get re-calculated for the smaller sizes? It seems like there are cases where we do change contiguity and divisibility. For example, if an operand was 4 vgprs which need to be allocated contiguously, then we split that in half, the new operands are only 2 vgprs which don't need to be 4-contiguous.
Am I understand or mis-understanding this?
Yes, you are correct. The scenario that you describe may happen. Let me think about a solution |
d9ea8ba
to
240b651
Compare
Hi @guacamoleo, I added some to recompute the triton/lib/Analysis/AxisInfo.cpp Lines 1033 to 1090 in 91ac21d
|
240b651
to
91ac21d
Compare
@@ -626,6 +626,100 @@ struct LoadOpPattern : public RefineRewritePattern<triton::LoadOp> { | |||
} | |||
}; | |||
|
|||
struct AMDGCNBufferLoadOp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add an explanatory comment regarding how refinement of buffer_load is more complex than that of global_load; it looks like we need to examine the refinement of masks, otherTensor and offsets and bring it all together. This'll make the function more understandable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
test/Analysis/test-alignment.mlir
Outdated
|
||
// ----- | ||
|
||
#mma = #ttg.amd_mfma<{versionMajor = 3, versionMinor = 0, warpsPerCTA = [4, 1], instrShape = [32, 32], isTransposed = true}> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding a test!
I recall us determining that contiguity, divisitbility and constancy can all change from extract slice; can you add a test where all 3 change and we correctly test that behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We decided to simply propagate the AxisInfo from extract_slice
. it is a part of the upstream
3070646
to
0341d75
Compare
d44e4d9
to
4951510
Compare
4951510
to
62a8438
Compare
Hi @guacamoleo, I rebased this PR using the latest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Ravil. I had looked at this more closely and seen that there are multiple differences from global_loads so it doesn't make sense to try and merge this with global_load.
This looks good if tests passing and merge conflict fixed.
@guacamoleo, I added the buffer ops refinement. I tested it with our GEMM kernel. Numerics is ok but performance dropped by a factor 2. @giuseros, what do you think?