Currently, a Triton load operation is identified as a candidate for 2D block load lowering with the attribute triton_intel_gpu.block_io. The operation itself produces a result with the dot_op layout, e.g.
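For example, a candidate load in the current scheme might look like the following sketch. The `block_io` attribute and the `dot_op`/DPAS encodings are real dialect constructs, but the specific parameter values, shapes, and attribute aliases here are illustrative, not taken from a real kernel:

```mlir
// Illustrative only: DPAS parameters and tensor shapes are placeholders.
#dpas = #triton_intel_gpu.dpas<{repeatCount = 8, systolicDepth = 8,
                                executionSize = 16, opsPerChannel = 2,
                                threadsPerWarp = 16, warpsPerCTA = [8, 4],
                                repCluster = [4, 2]}>
#dot_a = #ttg.dot_op<{opIdx = 0, parent = #dpas, kWidth = 2}>

// The load is tagged for 2D block load lowering, and its result is
// already annotated with the dot_op layout of the tt.dot operand.
%a = tt.load %ptr {triton_intel_gpu.block_io = "row_major"}
     : !tt.ptr<tensor<256x32xf16, #dot_a>>
```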
However, the 2d block load typically does not produce a result that exactly matches the DPAS layout used by the tt.dot operand; some register shuffles are required. These register shuffles are currently generated during lowering of the load to LLVM (in LoadStoreOpToLLVM.cpp). This effectively hides the layout conversion, and every op looks like it has the same layout.
This approach has several downsides:
It is not easy to introspect the differences between the load layout and the DPAS layout. We have to dump the shuffle vectors or scroll through a large amount of LLVM IR.
It is also difficult to add more complex mappings (e.g. for the A^T load) or to support 2d block load operations for non-DPAS layouts.
We cannot make high-level optimization decisions based on the parameters of the block load - by the time the block load parameters are set, we are too deep in the lowering pipeline. We currently attempt to make the largest load possible, because the parameters are dictated by the DPAS layout, but we have seen examples where this gives bad performance.
We can introduce a new layout for the load operations to resolve these issues. The new layout will describe the subgroup 2d block io operation. Layout attributes can be introspected using triton-tensor-layout, can be unit tested independently, and are relatively cheap to add and maintain. We can generate the layout and add it to the load operations at the same place we currently add the block_io tag. And eventually we can insert a layout conversion to convert between the subgroup 2d block io operation and the DPAS operation, removing the shuffle vector generation from the Load LLVM lowering. The layout conversion will still be a simple register shuffle.
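With such a layout, the conversion becomes explicit in the IR instead of being hidden in the LLVM lowering. A hypothetical sketch of what that could look like (the `#subgroup_2d_block` layout name is a placeholder for the proposed attribute, and the shapes are illustrative):

```mlir
// Hypothetical: #subgroup_2d_block stands in for the proposed layout
// attribute describing the subgroup 2d block io operation.
%loaded = tt.load %ptr {triton_intel_gpu.block_io = "row_major"}
          : !tt.ptr<tensor<256x32xf16, #subgroup_2d_block>>

// The register shuffles formerly emitted in LoadStoreOpToLLVM.cpp become
// an ordinary, introspectable layout conversion.
%a = ttg.convert_layout %loaded
     : tensor<256x32xf16, #subgroup_2d_block> -> tensor<256x32xf16, #dot_a>
```

Because both layouts describe register assignments within a subgroup, the conversion lowers to the same simple register shuffle that is generated today, just at a visible point in the pipeline.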