Bypass LDS for scale B operand for skinny gemms #817

plognjen · 2025-05-29T15:20:48Z

Skip LDS for the scale B tensor when warpsPerCTA is {1, numWarps} and
the load layout matches the expected layout for scale B in the dotScaled op.
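The two conditions in the description can be sketched as one predicate. This is only an illustration under stated assumptions: `Layout` and `canBypassLdsForScaleB` are hypothetical stand-ins, whereas the actual pass compares `mlir::triton::LinearLayout` objects inside `StreamPipeliner::assignMemoryLayouts()`.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a layout. The real pass works with
// mlir::triton::LinearLayout; this struct only models equality.
struct Layout {
  std::vector<int> bases;
  bool operator==(const Layout &other) const { return bases == other.bases; }
};

// Returns true when the scale B load may skip LDS under the two
// conditions named in the PR description (assumed 2-D warp layout).
bool canBypassLdsForScaleB(const std::vector<unsigned> &warpsPerCTA,
                           unsigned numWarps, const Layout &loadLayout,
                           const Layout &expectedScaleLayout) {
  assert(numWarps > 0 && "a CTA has at least one warp");
  // Condition 1: all warps are laid out along the second dimension.
  const bool warpsAlongN = warpsPerCTA.size() == 2 && warpsPerCTA[0] == 1 &&
                           warpsPerCTA[1] == numWarps;
  // Condition 2: the load already produces the layout dotScaled expects.
  return warpsAlongN && loadLayout == expectedScaleLayout;
}
```

Both checks must hold; failing either one leaves the load on the regular LDS path.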

guacamoleo · 2025-05-30T20:12:45Z

third_party/amd/lib/TritonAMDGPUTransforms/StreamPipeline.cpp

+          mlir::triton::LinearLayout scaleBLayout =
+              mlir::triton::gpu::toLinearLayout(scaleBTy.getShape(),
+                                                scaleBTy.getEncoding());
+          bypassLDS = bypassLDS ||


What is this doing here? Is it checking if bypassing LDS succeeded?

zhanglx13 · 2025-06-02

I think @plognjen wanted to restore the previous condition, i.e. width < 32 should bypass LDS.
If so, maybe we can store the value of (width < 32) in a separate variable rather than in bypassLDS, to avoid confusion.

plognjen · 2025-06-06

Yes, this was to restore the previous condition. I will rename the variable.
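A rough sketch of the renaming suggested in this thread, keeping the sub-dword check and the scale B check in separately named variables. The function and variable names here are hypothetical and do not come from the PR; the layout comparison is reduced to a boolean input.

```cpp
#include <cassert>

// Hypothetical helper illustrating the naming suggestion only.
// width: maximum number of contiguous bits a single load can read.
// scaleBLayoutMatches: outcome of the scale B layout comparison.
bool shouldBypassLds(unsigned width, bool scaleBLayoutMatches) {
  assert(width > 0 && "a load reads at least one bit");
  // Previous condition: loads narrower than a dword are buffered in registers.
  const bool isSubDwordLoad = width < 32;
  // Condition added by this PR, kept under its own name so neither
  // variable doubles as "the final decision".
  const bool isDirectScaleBLoad = scaleBLayoutMatches;
  return isSubDwordLoad || isDirectScaleBLoad;
}
```

With distinct names, a reader can see which of the two reasons triggered the bypass.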

guacamoleo · 2025-05-30T20:15:33Z

third_party/amd/lib/TritonAMDGPUTransforms/StreamPipeline.cpp

@@ -672,7 +673,40 @@ void StreamPipeliner::assignMemoryLayouts() {
      // Only use shared memory when feeding into a dot op.
      loadInfo.usedByDot = true;
      // If the max continugous bits we can read is < 32, buffer in registers.
-      if (width >= 32) {
+      bool bypassLDS = width < 32;


So, we're only bypassing LDS when we're loading less than a dword, such as with buffer_load_short or buffer_load_ushort?
Are there other cases where bypassing LDS could be beneficial? If so, let's add a comment reminding us of those additional scenarios.

zhanglx13 · 2025-06-02

Due to preshuffling, width is guaranteed to be >= 32, so it is confusing to enable bypassLDS only when width < 32.
More generally, bypassLDS should not check width at all. The later check that the loaded layout is the same as the scale layout already ensures width = 32.

Bypass LDS for scale B operand for skinny gemms

522e8e3

plognjen marked this pull request as ready for review May 29, 2025 15:21

plognjen requested review from antiagainst and zhanglx13 as code owners May 29, 2025 15:21

guacamoleo reviewed May 30, 2025

oplavsic added 3 commits June 6, 2025 14:01

Change block layout of scale B load

92f130c

Remove coalesceOp function

9331d6b

Fix remove layout conversion pass canonicalizer

f5a1263
