Use separate layout encodings for block io operations #4192

Open
alexbaden opened this issue May 14, 2025 · 0 comments

Currently, a Triton load operation is identified as a candidate for 2D block load lowering via the triton_intel_gpu.block_io attribute. The load itself produces a result with the dot_op layout, e.g.:

#mma = #triton_intel_gpu.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 2, threadsPerWarp = 16, warpsPerCTA = [8, 4], repCluster = [4, 2], A = [32, 16], B = [16, 32], C = [32, 32]}>
      %21 = tt.load %arg5 {boundaryCheck = array<i32: 0, 1>, triton_intel_gpu.block_io = "row_major"} : !tt.ptr<tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 1}>>> loc(#loc15)

However, the 2D block load typically does not produce a result that exactly matches the DPAS layout used by the tt.dot operand; some register shuffles are required. These register shuffles are currently generated while lowering the load to LLVM (in LoadStoreOpToLLVM.cpp). This effectively hides the layout conversion, so every op appears to have the same layout.
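The hidden conversion shows up only as vector shuffles emitted next to the load during LLVM lowering. As a hand-written illustration (the vector widths and shuffle mask below are made up, not taken from the actual lowering), the emitted IR is of roughly this shape:

    // Sketch: the block load returns rows in load order, and the lowering
    // re-packs them into the register order the DPAS (dot_op) layout expects.
    llvm.func @repack_for_dpas(%row0: vector<8xf16>, %row1: vector<8xf16>) -> vector<8xf16> {
      // Interleave elements of the two loaded rows into one dot-operand slice.
      %packed = llvm.shufflevector %row0, %row1 [0, 8, 1, 9, 2, 10, 3, 11] : vector<8xf16>
      llvm.return %packed : vector<8xf16>
    }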

This approach has several downsides:

  1. It is not easy to introspect the differences between the load layout and the DPAS layout; we have to dump the shuffle vectors or scroll through a large amount of LLVM IR.
  2. It is also difficult to add more complex mappings (e.g. for the A^T load) or to support 2D block load operations for non-DPAS layouts.
  3. We cannot make high-level optimization decisions based on the parameters of the block load, because by the time the block load parameters are set we are quite low in the lowering pipeline. We currently attempt to emit the largest load possible, since the parameters are dictated by the DPAS layout, but we have seen examples where this gives bad performance.

We can introduce a new layout for the load operations to resolve these issues. The new layout would describe the subgroup 2D block IO operation itself. Layout attributes can be introspected using triton-tensor-layout, can be unit tested independently, and are relatively cheap to add and maintain. We can generate the layout and attach it to the load operations in the same place we currently add the block_io tag. Eventually we can insert a layout conversion between the subgroup 2D block IO layout and the DPAS layout, removing the shuffle vector generation from the load's LLVM lowering. The layout conversion will still be a simple register shuffle.
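A rough sketch of what the IR could look like, assuming a hypothetical #triton_intel_gpu.subgroup_2d_block encoding (its name and parameters are placeholders, not a finalized design):

    #mma = #triton_intel_gpu.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 2, threadsPerWarp = 16, warpsPerCTA = [8, 4], repCluster = [4, 2], A = [32, 16], B = [16, 32], C = [32, 32]}>
    // Hypothetical encoding describing the subgroup 2D block IO operation; parameter names are illustrative only.
    #block_io = #triton_intel_gpu.subgroup_2d_block<{tileWidth = 16, tileHeight = 32, vBlocks = 2, threadsPerWarp = 16, warpsPerCTA = [8, 4]}>
    // The load result carries the block IO layout instead of the dot_op layout...
    %loaded = tt.load %arg5 {boundaryCheck = array<i32: 0, 1>, triton_intel_gpu.block_io = "row_major"} : !tt.ptr<tensor<256x32xf16, #block_io>>
    // ...and an explicit conversion (still just a register shuffle) produces the DPAS dot operand.
    %a = ttg.convert_layout %loaded : tensor<256x32xf16, #block_io> -> tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 1}>>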
