[AMD] added slicing ttg.async_copy_global_to_local #797


Open

ravil-mobile wants to merge 47 commits into shared/triton-gfx950-launch from shared/triton-gfx950-launch-update

Conversation

ravil-mobile

> TRITON_HIP_USE_BLOCK_PINGPONG=1 TRITON_HIP_USE_ASYNC_COPY=1 pytest -s -v op_tests/triton_tests/test_gemm_afp4wfp4.py

...

================================================================================================= warnings summary =================================================================================================
op_tests/triton_tests/test_gemm_afp4wfp4.py::test_gemm_afp4_wfp4[dtype0-1024-1024-1024]
  /home/rdorozhi/work/aiter/op_tests/triton_tests/test_gemm_afp4wfp4.py:92: UserWarning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (Triggered internally at /var/lib/jenkins/pytorch/aten/src/ATen/Context.cpp:328.)
    return torch.mm(x_f32, w_f32).to(dtype)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
===================================================================================== 74 passed, 1 warning in 65.59s (0:01:05) =====================================================================================

AlexAUT and others added 30 commits May 13, 2025 17:19
4-stage FA experiment

Cluster assignment
Initial support for already arranged ops.
…ction based on the loop. This is not meant as a permanent solution, just to make this branch usable for other workloads.
Computation part interleaves mfma and ds_read
Placed an extra conditional barrier to overlap the computation part
and the buffer_load part. Dot slicing by plognjen at https://github.com/plognjen/triton/tree/slice_dot_scaled
requires a vmcnt fix to achieve full performance.
Fix incorrect condition used to enable transforms.
Fix missing tokens for the local_load.
Only enable for 256x256x256 tile size.
… BufferLoadToLocal to avoid implicit barrier from Membar"

This reverts commit 012793a.
ravil-mobile force-pushed the shared/triton-gfx950-launch-update branch from 9276703 to 24ae652 on May 16, 2025, 12:23
ravil-mobile (Author)

@jungpark-mlir, @raikonenfnu. Thanks for your comments! I addressed all of them.

auto sizePerThread = encoding.getSizePerThread();
SmallVector<unsigned> threadPerWarp(warpsPerCTA.size(), 0);
for (size_t dim = 0; dim < numDims; ++dim) {
  threadPerWarp[dim] =
Member

I have a silly Q for educational purposes: it looks like we are only updating the threadsPerWarp here; wouldn't it make more sense to update the sizePerThread in the newEncoding?

Author

No, we want to preserve the number of elements per thread (i.e., each thread holds 16 consecutive elements of a tensor). We just want to change the layout of threads, and thus change which part of the tensor is held by each CTA.

Each CTA holds a 64x128 tile of a tensor in the following example:

#blocked = #ttg.blocked<{sizePerThread = [1, 16], threadsPerWarp = [8, 8], warpsPerCTA = [8, 1], order = [1, 0]}>

But if we change threadsPerWarp to [32, 2], then a CTA holds a 256x32 tile of a tensor:

#blocked3 = #ttg.blocked<{sizePerThread = [1, 16], threadsPerWarp = [32, 2], warpsPerCTA = [8, 1], order = [1, 0]}>

In both cases, a thread holds 16 consecutive elements, which determines the width of the load instructions.
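
To make the arithmetic concrete, here is a minimal standalone C++ sketch (illustrative only, not the PR's code; all names are made up) of how the per-CTA tile shape falls out of the three layout parameters:

#include <cstdio>
#include <vector>

// Per dimension d, a CTA tile covers
//   sizePerThread[d] * threadsPerWarp[d] * warpsPerCTA[d]
// elements of the tensor.
std::vector<unsigned> ctaTileShape(const std::vector<unsigned> &sizePerThread,
                                   const std::vector<unsigned> &threadsPerWarp,
                                   const std::vector<unsigned> &warpsPerCTA) {
  std::vector<unsigned> tile(sizePerThread.size());
  for (size_t d = 0; d < tile.size(); ++d)
    tile[d] = sizePerThread[d] * threadsPerWarp[d] * warpsPerCTA[d];
  return tile;
}

int main() {
  auto orig = ctaTileShape({1, 16}, {8, 8}, {8, 1});    // {64, 128}
  auto sliced = ctaTileShape({1, 16}, {32, 2}, {8, 1}); // {256, 32}
  std::printf("%ux%u vs. %ux%u\n", orig[0], orig[1], sliced[0], sliced[1]);
}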

Member

Thanks for the answer, that's really interesting! A follow-up question: if we change which part of the tensor the CTA holds, wouldn't we need an extra global read (or a potentially fused but wider global read) to get that part of the tensor?

Author (ravil-mobile), May 19, 2025

ConvertLayout results in some machine ops if it cannot be fully optimized away (i.e., propagated to the top of a function). If I remember correctly, the layout change is going to happen in LDS. In our case, it is going to be optimized away.

Member

Nice! I didn't realize this was the pre-layout-propagation phase, but this makes a lot of sense, thanks :)

Comment on lines +927 to +931
builder
    .create<ttg::ConvertLayoutOp>(tensor.getLoc(), newType, tensor)
    .getResult();
slicedTensorType =
    RankedTensorType::get(slicedShape, elemType, newEncoding);
Member

Another silly Q: I know that amdgpu.extract_slice src and dst layouts need to match. Is the main purpose of ttg.convert_layout here to set up the layout for ops like extract_slice, where the layout is sure to change? On top of that, does that mean a ttg.convert_layout for slices will most likely not have a layout matching its shape?

Author

The dst layout is determined by the source layout. The problem comes from the following. Let's assume we want to slice a 256x128 tensor into 4 pieces of 256x32 tiles. Let's also assume that the original layout is

# orig-layout
#blocked = #ttg.blocked<{sizePerThread = [1, 16], threadsPerWarp = [8, 8], warpsPerCTA = [8, 1], order = [1, 0]}>

which holds a 64x128 tile per CTA. ExtractSliceOp is a CTA-level op, i.e., it can slice a tensor only if the new size is proportional to the CTA tile (64x128 in our case). Therefore, we cannot apply ExtractSliceOp to our tensor with orig-layout. Thus, we change the source layout to a new one, i.e.,

# new-layout
#blocked3 = #ttg.blocked<{sizePerThread = [1, 16], threadsPerWarp = [32, 2], warpsPerCTA = [8, 1], order = [1, 0]}>

which has its CTA-level tile equal to 256x32. Now we can slice the 256x128 tensor into 4 pieces. (Note: the 256x32 size was determined by dot slicing.)
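
A minimal sketch of the legality check implied above (hypothetical helper, not the PR's code): ExtractSliceOp can only produce slices whose shape is a whole-number multiple of the CTA tile, which is why the 64x128 tile forces a layout change before a 256x32 slice.

#include <vector>

// Returns true if slicing to sliceShape needs a preceding convert_layout,
// i.e. the slice is not a whole-number multiple of the CTA tile.
bool sliceRequiresLayoutChange(const std::vector<unsigned> &sliceShape,
                               const std::vector<unsigned> &ctaTile) {
  for (size_t d = 0; d < sliceShape.size(); ++d)
    if (sliceShape[d] % ctaTile[d] != 0)
      return true;
  return false;
}

// With orig-layout the CTA tile is {64, 128}: a {256, 32} slice fails since
// 32 % 128 != 0. With new-layout the tile is {256, 32}, and the slice is
// exactly one CTA tile, so no further layout change is needed.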

Comment on lines +919 to +921
RankedTensorType newType = nullptr;
Value newTensor = nullptr;
RankedTensorType slicedTensorType = nullptr;
Member

NIT: IIRC, RankedTensorType/Type/Value will default to null even if you don't explicitly set nullptr

Author

You are right. I just like to be explicit in the code about the initialization of all local variables (an old habit).
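
For reference, a tiny sketch of the NIT (assuming standard MLIR headers): MLIR's Type and Value wrappers default-construct to null, so the explicit = nullptr is redundant, though harmless.

#include <cassert>

#include "mlir/IR/BuiltinTypes.h"
#include "mlir/IR/Value.h"

void defaultsAreNull() {
  mlir::RankedTensorType newType; // the wrapped impl pointer is null by default
  mlir::Value newTensor;          // likewise null
  assert(!newType && !newTensor); // both convert to false until assigned
}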

ravil-mobile force-pushed the shared/triton-gfx950-launch-update branch from 24ae652 to 1b2a86b on May 19, 2025, 09:36
ravil-mobile requested a review from raikonenfnu on May 19, 2025, 09:47
antiagainst and others added 7 commits May 19, 2025 09:39
…ton-lang#6844)

This commit improves how we create the mfma-like layout for
optimizing global stores by using linear layout composition.
Along the way, it fixes a few implementation issues.

---------

Co-authored-by: Yi Qian <[email protected]>
Avoid the transform being wrongly enabled.
Requirements to enable the transform:
mxfp4, 128x128x512 tile size, async_copy, num_stages=2, num_warps=8
ravil-mobile force-pushed the shared/triton-gfx950-launch-update branch from 67b292c to 3167930 on May 21, 2025, 10:48
ravil-mobile force-pushed the shared/triton-gfx950-launch-update branch from 3167930 to a89b3b4 on May 21, 2025, 14:33
plognjen and others added 3 commits May 21, 2025 14:51
The "concat" operation combines a list of source n-dimensional tensors
into a single larger destination tensor.

All source tensors must have the same shape, element type, and encoding.
The concatenation dimension is inferred from the source and destination
shapes provided by the user.
For example, two tensors of shape 64x128 can produce a destination shape
of 128x128,
indicating concatenation along dimension 0; or 64x256, indicating
concatenation along dimension 1.

Generally, source tensors passed as op arguments can be arranged into
the resulting shape in multiple ways.
For example, given four tensors of shape 64x64:
  concat s0<64x64>, s1<64x64>, s2<64x64>, s3<64x64> -> <128x128>

They can be laid out in different configurations within the result
tensor:
   1) s0 s1 
       s2 s3  

   2) s0 s2
        s1 s3

From a logical tensor perspective, the source tensors are treated as
elements of a tensor of tensors.
In other words, the 1-D array of input tensors is conceptually reshaped
into an n-D grid.
The semantics of this op assume a row-major order (or its n-D
generalization),
meaning the fastest-varying dimension is filled first, and the
slowest-varying dimension is filled last.
In the example above, this corresponds to layout 1).

The source and destination tensors must have identical linear layouts at
the CTA tile level.
That is, all base vectors for input dimensions must match, except for
the register input dimension.
The register basis must align on the subset that defines the logical
tensor shape of a single CTA tile.

This ensures that the concatenation is a no-op, meaning no data
rearrangement among threads is required
to assemble the destination tensor with the given shape and layout.
However, the order of CTA tiles within the layout does not need to match
between source and destination layouts.
It is the responsibility of the op's lowering logic to handle this
correctly.

This op is designed to work on logical tensors directly, avoiding the
need for complex layout reinterpretation or reshaping.
For example, the `tt.join` operation only supports concatenation along
the innermost dimension,
and requires that the resulting innermost dimension provide 2 elements
per thread, distributed across registers.
In contrast, this `concat` op imposes no constraints on the
concatenation dimension or the size of dimensions.
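
As an illustration of the row-major placement described above, here is a small sketch (hypothetical helper names, not the op's actual lowering) that computes each source tensor's element offset in the destination by unflattening its linear index over the grid of sources:

#include <vector>

// The grid of sources along dimension d is dstShape[d] / srcShape[d].
// Unflatten srcIdx in row-major order (fastest-varying dimension last)
// and scale by the source shape to get element offsets.
std::vector<unsigned> srcOffsetInDst(unsigned srcIdx,
                                     const std::vector<unsigned> &srcShape,
                                     const std::vector<unsigned> &dstShape) {
  size_t rank = srcShape.size();
  std::vector<unsigned> offset(rank);
  for (size_t d = rank; d-- > 0;) {
    unsigned grid = dstShape[d] / srcShape[d];
    offset[d] = (srcIdx % grid) * srcShape[d];
    srcIdx /= grid;
  }
  return offset;
}

// Four 64x64 sources into a 128x128 destination: indices 0..3 map to
// offsets {0,0}, {0,64}, {64,0}, {64,64}, i.e. layout 1): s0 s1 / s2 s3.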

---------

Co-authored-by: Ognjen Plavsic <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
ravil-mobile force-pushed the shared/triton-gfx950-launch-update branch from 20f8e72 to 34538bc on May 26, 2025, 13:37
antiagainst force-pushed the shared/triton-gfx950-launch branch from 77c00fa to a259f0a on May 26, 2025, 17:58