[AMD] Added bufferOps refinement #776


Open · wants to merge 1 commit into refine-ops-pass from ravil/refine-buffer-ops

Conversation

ravil-mobile

@guacamoleo, I added the buffer ops refinement. I tested it with our GEMM kernel. The numerics are OK, but performance dropped by a factor of 2. @giuseros, what do you think?

```
❯ AMDGCN_USE_BUFFER_OPS=1 TRITON_HIP_STREAM_MAX_DEPTH=1 MLIR_ENABLE_DUMP=0 TRITON_PRINT_AUTOTUNING=0 python3 ./gemm-ex.py -f ./exp-config.yaml --dump-ir ttgir
MASKING load/store: disabled
MATRIX B TRANSPOSED: false
use_bias=False
matmul_kernel
perf: 215.41678697824958 TFLOP/s
✅ Triton and Torch match

❯ AMDGCN_USE_BUFFER_OPS=1 TRITON_HIP_STREAM_MAX_DEPTH=1 MLIR_ENABLE_DUMP=1 TRITON_PRINT_AUTOTUNING=0 python3 ./gemm-ex.py -f ./exp-config.yaml --dump-ir ttgir --trans-b
MASKING load/store: disabled
MATRIX B TRANSPOSED: true
use_bias=False
matmul_kernel
perf: 213.28888766541104 TFLOP/s
✅ Triton and Torch match

❯ AMDGCN_USE_BUFFER_OPS=1 TRITON_HIP_STREAM_MAX_DEPTH=1 MLIR_ENABLE_DUMP=0 TRITON_PRINT_AUTOTUNING=0 python3 ./gemm-ex.py -f ./exp-config.yaml --dump-ir ttgir --trans-a
MASKING load/store: disabled
MATRIX B TRANSPOSED: false
use_bias=False
matmul_kernel
perf: 230.4643251881426 TFLOP/s
✅ Triton and Torch match

❯ AMDGCN_USE_BUFFER_OPS=1 TRITON_HIP_STREAM_MAX_DEPTH=1 MLIR_ENABLE_DUMP=0 TRITON_PRINT_AUTOTUNING=0 python3 ./gemm-ex.py -f ./exp-config.yaml --dump-ir ttgir --trans-a --trans-b
MASKING load/store: disabled
MATRIX B TRANSPOSED: true
use_bias=False
matmul_kernel
perf: 207.3770854286465 TFLOP/s
✅ Triton and Torch match
```

@guacamoleo

Thanks Ravil!
I'm not worried about performance at this point, because we aren't really doing scheduling yet.

I see that refining buffer loads is a new function, separate from the one for traditional loads. We should think of a way to combine many of these refinement functions so that we consolidate this code as much as possible. We can discuss offline. Maybe there's a way of keeping the common code in one function but passing in a functor which does the unique operations.
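The functor idea could look something like the following minimal C++ sketch (all names here are hypothetical, not from this PR): the shared slicing loop is written once, and each op kind supplies a callable for its unique rewrite.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch (names are not from this PR): the common refinement
// skeleton lives in one template, and each op kind (global_load,
// buffer_load, ...) passes a functor that builds its own refined sub-op
// for a given slice.
struct Slice {
  int64_t offset;
  int64_t size;
};

template <typename MakeSubOp>
std::vector<std::string> refineOp(int64_t totalSize, int64_t numSlices,
                                  MakeSubOp makeSubOp) {
  assert(numSlices > 0 && totalSize % numSlices == 0);
  const int64_t sliceSize = totalSize / numSlices;
  std::vector<std::string> subOps;
  // common part: compute the slices once, identically for every op kind
  for (int64_t i = 0; i < numSlices; ++i)
    subOps.push_back(makeSubOp(Slice{i * sliceSize, sliceSize}));
  return subOps;
}
```

For instance, `refineOp(128, 4, [](Slice s) { return "buffer_load@" + std::to_string(s.offset); })` would produce four sub-ops at offsets 0, 32, 64, and 96; a global_load variant would differ only in the functor it passes.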

@ravil-mobile (Author)

@guacamoleo, I tested the correctness of the FA kernel with refined buffer ops; the numerics are correct.

@guacamoleo

> @guacamoleo, I tested the correctness of the FA kernel with refined buffer ops; the numerics are correct.

Great. How about consolidating code? Is there any way to merge the loads into a single refinement function? I'm concerned about all the duplication that has been inherent in our support. If there's no way to do it right now, we'll want to address consolidating the refinement code before we try to upstream this.

@guacamoleo

We discussed this offline; the commit looks good after rebasing onto the base branch.

@guacamoleo

Just a reminder to look into the issue with 16-bit memory ops (https://github.com/ROCm/triton-internal/issues/699#issuecomment-2835306030) that arises in conjunction with this commit.

@ravil-mobile ravil-mobile force-pushed the ravil/refine-buffer-ops branch 2 times, most recently from 8590ec0 to 89c1f5f Compare April 30, 2025 13:46
@guacamoleo left a comment


It looks like we're just copying axisInfo from operand[0] of extract_slice to the extract_slice op itself. Is it implicit in doing this that these values get re-calculated for the smaller sizes? It seems like there are cases where we do change contiguity and divisibility. For example, if an operand was 4 vgprs which need to be allocated contiguously, and then we split that in half, the new operands are only 2 vgprs which don't need to be 4-contiguous.
Am I understanding or misunderstanding this?

@ravil-mobile (Author)

> It looks like we're just copying axisInfo from operand[0] of extract_slice to the extract_slice op itself. Is it implicit in doing this that these values get re-calculated for the smaller sizes? It seems like there are cases where we do change contiguity and divisibility. For example, if an operand was 4 vgprs which need to be allocated contiguously, and then we split that in half, the new operands are only 2 vgprs which don't need to be 4-contiguous. Am I understanding or misunderstanding this?

Yes, you are correct. The scenario that you describe may happen. Let me think about a solution.

@ravil-mobile ravil-mobile force-pushed the ravil/refine-buffer-ops branch 3 times, most recently from d9ea8ba to 240b651 Compare May 6, 2025 15:27
@ravil-mobile (Author)

ravil-mobile commented May 6, 2025

Hi @guacamoleo, I added some code to recompute the AxisInfo. Please verify:

```cpp
auto srcType = cast<RankedTensorType>(op.getOperand().getType());
auto srcShape = srcType.getShape();
auto dstType = cast<RankedTensorType>(op.getResult().getType());
auto dstShape = dstType.getShape();
auto offsets = op.getStaticOffsets();
AxisInfo opInfo = operands[0]->getValue();
auto origContiguity = opInfo.getContiguity();
auto origDivisibility = opInfo.getDivisibility();
auto origConstancy = opInfo.getConstancy();
auto recompute = [](ArrayRef<int64_t> vec, int64_t c) {
  auto result = std::numeric_limits<int64_t>::max();
  for (auto &v : vec) {
    // compute the upper bound of `v` based on `contiguity`
    auto newC = ((v + c - 1) / c) * c - v;
    // make sure that the new value is not broken because
    // of the sliced boundaries
    newC = newC == 0 ? c : newC;
    // consider the minimal value along each dimension
    result = result > newC ? newC : result;
  }
  assert(vec.size() == 2);
  const auto dimSize = vec[1] - vec[0];
  // make sure that the value doesn't exceed the dimension size
  return result > dimSize ? dimSize : result;
};
SmallVector<int64_t> contiguity(origContiguity.size());
SmallVector<int64_t> divisibility(opInfo.getDivisibility().size());
SmallVector<int64_t> constancy(opInfo.getConstancy().size());
for (size_t dim = 0; dim < opInfo.getRank(); ++dim) {
  auto start = offsets[dim];
  auto end = start + dstShape[dim];
  contiguity[dim] = recompute({start, end}, origContiguity[dim]);
  // note: contiguity cannot increase while slicing a tensor
  assert(contiguity[dim] <= origContiguity[dim]);
  constancy[dim] = recompute({start, end}, origConstancy[dim]);
  divisibility[dim] = origDivisibility[dim];
  if (contiguity[dim] != origContiguity[dim]) {
    // note: assume n is the largest power of two that divides `x` and `x + c`:
    //   1. x % n = 0 and 2. (x + c) % n = 0
    // the remainder of a sum can be calculated as:
    //   3. (x + c) % n = (x % n + c % n) % n = 0
    // because of 1. one can write 4. (c % n) % n = 0, i.e., 5. c % n = 0
    divisibility[dim] = std::min(
        origDivisibility[dim],
        int64_t(log2Int(highestPowOf2Divisor<int64_t>(contiguity[dim]))));
  }
}
```
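For intuition, the `recompute` helper can be exercised standalone. The following is a sketch mirroring the lambda above with plain standard types instead of `ArrayRef` (for illustration only; the in-tree version operates on MLIR types):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

// Standalone mirror of the `recompute` lambda: given the slice boundaries
// [start, end) of one dimension and the original contiguity (or constancy)
// `c`, return the largest value that still holds for the slice. Runs of
// length `c` begin at multiples of `c`, so each boundary caps the result
// by its distance to the next multiple of `c`.
int64_t recompute(int64_t start, int64_t end, int64_t c) {
  int64_t result = std::numeric_limits<int64_t>::max();
  for (int64_t v : {start, end}) {
    int64_t newC = ((v + c - 1) / c) * c - v; // distance to next multiple
    newC = newC == 0 ? c : newC;              // a multiple keeps a full run
    result = std::min(result, newC);
  }
  return std::min(result, end - start); // cannot exceed the slice size
}
```

For example, slicing offsets [2, 4) out of a dimension with contiguity 4 gives contiguity 2 (the 4-vgpr/2-vgpr case from the review discussion), while the aligned slice [0, 8) of a contiguity-8 dimension keeps contiguity 8.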

@ravil-mobile ravil-mobile force-pushed the ravil/refine-buffer-ops branch from 240b651 to 91ac21d Compare May 6, 2025 15:53
@@ -626,6 +626,100 @@ struct LoadOpPattern : public RefineRewritePattern<triton::LoadOp> {
}
};

struct AMDGCNBufferLoadOp


Can you add an explanatory comment about how refinement of buffer_load is more complex than that of global_load? It looks like we need to examine the refinement of masks, otherTensor, and offsets and bring it all together. This'll make the function more understandable.

@ravil-mobile (Author)


Done


// -----

#mma = #ttg.amd_mfma<{versionMajor = 3, versionMinor = 0, warpsPerCTA = [4, 1], instrShape = [32, 32], isTransposed = true}>


Thanks for adding a test!
I recall us determining that contiguity, divisibility, and constancy can all change from extract_slice; can you add a test where all 3 change so we correctly test that behavior?

@ravil-mobile (Author)


We decided to simply propagate the AxisInfo from extract_slice; it is part of the upstream code.

@ravil-mobile ravil-mobile force-pushed the refine-ops-pass branch 2 times, most recently from 3070646 to 0341d75 Compare June 23, 2025 14:48
@ravil-mobile ravil-mobile force-pushed the ravil/refine-buffer-ops branch from d44e4d9 to 4951510 Compare June 23, 2025 15:09
@ravil-mobile ravil-mobile force-pushed the ravil/refine-buffer-ops branch from 4951510 to 62a8438 Compare June 23, 2025 15:25
@ravil-mobile (Author)

Hi @guacamoleo, I rebased this PR onto the latest refine-ops-pass branch. Could you please re-review?

@ravil-mobile ravil-mobile requested a review from guacamoleo June 23, 2025 15:30
@guacamoleo left a comment


Thanks, Ravil. I looked at this more closely and saw that there are multiple differences from global_loads, so it doesn't make sense to try to merge this with global_load.
This looks good once the tests pass and the merge conflict is fixed.
