Use separate layout encodings for block io operations #4192

Open
alexbaden opened this issue May 14, 2025 · 0 comments

Currently, a Triton load operation is identified as a candidate for 2D block load lowering via the triton_intel_gpu.block_io attribute. The load itself produces a result with the dot_op layout, e.g.:

#mma = #triton_intel_gpu.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 2, threadsPerWarp = 16, warpsPerCTA = [8, 4], repCluster = [4, 2], A = [32, 16], B = [16, 32], C = [32, 32]}>
      %21 = tt.load %arg5 {boundaryCheck = array<i32: 0, 1>, triton_intel_gpu.block_io = "row_major"} : !tt.ptr<tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 1}>>> loc(#loc15)

However, the 2D block load typically does not produce a result that exactly matches the DPAS layout used by the tt.dot operand; some register shuffles are required. These register shuffles are currently generated while lowering the load to LLVM (in LoadStoreOpToLLVM.cpp). This effectively hides the layout conversion, so every op appears to have the same layout.
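The hidden conversion shows up only as vector shuffles emitted next to the load during LLVM lowering. As a hand-written illustration (the vector widths and shuffle mask below are made up, not taken from the actual lowering), the emitted IR is of roughly this shape:

    // Sketch: the block load returns rows in load order, and the lowering
    // re-packs them into the register order the DPAS (dot_op) layout expects.
    llvm.func @repack_for_dpas(%row0: vector<8xf16>, %row1: vector<8xf16>) -> vector<8xf16> {
      // Interleave elements of the two loaded rows into one dot-operand slice.
      %packed = llvm.shufflevector %row0, %row1 [0, 8, 1, 9, 2, 10, 3, 11] : vector<8xf16>
      llvm.return %packed : vector<8xf16>
    }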

This approach has several downsides:

  1. It is not easy to introspect the differences between the load layout and the DPAS layout; we have to dump the shuffle vectors or scroll through a large amount of LLVM IR.
  2. It is also difficult to add more complex mappings (e.g. for the A^T load) or to support 2D block load operations for non-DPAS layouts.
  3. We cannot make high-level optimization decisions based on the parameters of the block load, because by the time the block load parameters are set we are quite low in the lowering pipeline. We currently attempt to emit the largest load possible, since the parameters are dictated by the DPAS layout, but we have seen examples where this gives bad performance.

We can introduce a new layout for the load operations to resolve these issues. The new layout would describe the subgroup 2D block IO operation itself. Layout attributes can be introspected using triton-tensor-layout, can be unit tested independently, and are relatively cheap to add and maintain. We can generate the layout and attach it to the load operations in the same place we currently add the block_io tag. Eventually we can insert a layout conversion between the subgroup 2D block IO layout and the DPAS layout, removing the shuffle vector generation from the load's LLVM lowering. The layout conversion will still be a simple register shuffle.
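A rough sketch of what the IR could look like, assuming a hypothetical #triton_intel_gpu.subgroup_2d_block encoding (its name and parameters are placeholders, not a finalized design):

    #mma = #triton_intel_gpu.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 2, threadsPerWarp = 16, warpsPerCTA = [8, 4], repCluster = [4, 2], A = [32, 16], B = [16, 32], C = [32, 32]}>
    // Hypothetical encoding describing the subgroup 2D block IO operation; parameter names are illustrative only.
    #block_io = #triton_intel_gpu.subgroup_2d_block<{tileWidth = 16, tileHeight = 32, vBlocks = 2, threadsPerWarp = 16, warpsPerCTA = [8, 4]}>
    // The load result carries the block IO layout instead of the dot_op layout...
    %loaded = tt.load %arg5 {boundaryCheck = array<i32: 0, 1>, triton_intel_gpu.block_io = "row_major"} : !tt.ptr<tensor<256x32xf16, #block_io>>
    // ...and an explicit conversion (still just a register shuffle) produces the DPAS dot operand.
    %a = ttg.convert_layout %loaded : tensor<256x32xf16, #block_io> -> tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 1}>>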
