
[AMDGPU] Implement workaround for GFX11.5 export priority #99273


Merged
merged 4 commits into llvm:main on Jul 23, 2024

Conversation

perlfu (Contributor) commented Jul 17, 2024

On GFX11.5, shaders that have completed their exports need to execute/wait at a lower priority than shaders still executing exports.
Add code to maintain the normal priority of 2 for shaders that export, and to drop to priority 0 after the exports.

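As a rough model of where the workaround instructions land, the following Python sketch (the names and structure are hypothetical, not the actual C++ hazard-recognizer code) inserts the priority-drop sequence after the last export in each run of consecutive exports, matching the pattern visible in the test expectations below:

```python
def apply_export_priority_wa(insts):
    """Sketch of the GFX11.5 export-priority workaround placement.

    insts: list of opcode strings. After the last export in each run of
    consecutive 'exp' instructions, insert: a drop to priority 0, a wait
    on export count (skipped right before s_endpgm), two s_nop delay
    slots, and a return to normal priority 2 (also skipped at shader end).
    """
    out = []
    n = len(insts)
    for i, op in enumerate(insts):
        out.append(op)
        if op != "exp":
            continue
        nxt = insts[i + 1] if i + 1 < n else None
        if nxt == "exp":
            continue  # only apply at the end of a run of exports
        end_of_shader = nxt == "s_endpgm"
        out.append("s_setprio 0")
        if not end_of_shader:
            out.append("s_waitcnt_expcnt")  # wait for exports to complete
        out += ["s_nop", "s_nop"]
        if not end_of_shader:
            out.append("s_setprio 2")  # return to normal priority
    return out
```

The entry-priority side of the workaround (prepending `s_setprio 2` at function entry for exporting shaders) is omitted here for brevity.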
llvmbot (Member) commented Jul 17, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Carl Ritson (perlfu)

Changes

On GFX11.5, shaders that have completed their exports need to execute/wait at a lower priority than shaders still executing exports.
Add code to maintain the normal priority of 2 for shaders that export, and to drop to priority 0 after the exports.


Patch is 28.94 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/99273.diff

6 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPU.td (+12-3)
  • (modified) llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp (+112)
  • (modified) llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h (+1)
  • (modified) llvm/lib/Target/AMDGPU/GCNSubtarget.h (+3)
  • (added) llvm/test/CodeGen/AMDGPU/required-export-priority.ll (+344)
  • (added) llvm/test/CodeGen/AMDGPU/required-export-priority.mir (+293)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.td b/llvm/lib/Target/AMDGPU/AMDGPU.td
index dfc8eaea66f7b..14fcf6a210a78 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.td
@@ -947,6 +947,12 @@ def FeatureHasRestrictedSOffset : SubtargetFeature<"restricted-soffset",
   "Has restricted SOffset (immediate not supported)."
 >;
 
+def FeatureRequiredExportPriority : SubtargetFeature<"required-export-priority",
+  "HasRequiredExportPriority",
+  "true",
+  "Export priority must be explicitly manipulated on GFX11.5"
+>;
+
 //===------------------------------------------------------------===//
 // Subtarget Features (options and debugging)
 //===------------------------------------------------------------===//
@@ -1597,20 +1603,23 @@ def FeatureISAVersion11_5_0 : FeatureSet<
   !listconcat(FeatureISAVersion11_Common.Features,
     [FeatureSALUFloatInsts,
      FeatureDPPSrc1SGPR,
-     FeatureVGPRSingleUseHintInsts])>;
+     FeatureVGPRSingleUseHintInsts,
+     FeatureRequiredExportPriority])>;
 
 def FeatureISAVersion11_5_1 : FeatureSet<
   !listconcat(FeatureISAVersion11_Common.Features,
     [FeatureSALUFloatInsts,
      FeatureDPPSrc1SGPR,
      FeatureVGPRSingleUseHintInsts,
-     Feature1_5xVGPRs])>;
+     Feature1_5xVGPRs,
+     FeatureRequiredExportPriority])>;
 
 def FeatureISAVersion11_5_2 : FeatureSet<
   !listconcat(FeatureISAVersion11_Common.Features,
     [FeatureSALUFloatInsts,
      FeatureDPPSrc1SGPR,
-     FeatureVGPRSingleUseHintInsts])>;
+     FeatureVGPRSingleUseHintInsts,
+     FeatureRequiredExportPriority])>;
 
 def FeatureISAVersion12 : FeatureSet<
   [FeatureGFX12,
diff --git a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
index a402fc6d7e611..a8b171aa82840 100644
--- a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
@@ -14,6 +14,7 @@
 #include "GCNSubtarget.h"
 #include "MCTargetDesc/AMDGPUMCTargetDesc.h"
 #include "SIMachineFunctionInfo.h"
+#include "llvm/CodeGen/MachineFrameInfo.h"
 #include "llvm/CodeGen/MachineFunction.h"
 #include "llvm/CodeGen/ScheduleDAG.h"
 #include "llvm/TargetParser/TargetParser.h"
@@ -1104,6 +1105,7 @@ void GCNHazardRecognizer::fixHazards(MachineInstr *MI) {
   fixWMMAHazards(MI);
   fixShift64HighRegBug(MI);
   fixVALUMaskWriteHazard(MI);
+  fixRequiredExportPriority(MI);
 }
 
 bool GCNHazardRecognizer::fixVcmpxPermlaneHazards(MachineInstr *MI) {
@@ -2895,3 +2897,113 @@ bool GCNHazardRecognizer::fixVALUMaskWriteHazard(MachineInstr *MI) {
 
   return true;
 }
+
+static bool ensureEntrySetPrio(MachineFunction *MF, int Priority,
+                               const SIInstrInfo &TII) {
+  MachineBasicBlock &EntryMBB = MF->front();
+  if (EntryMBB.begin() != EntryMBB.end()) {
+    auto &EntryMI = *EntryMBB.begin();
+    if (EntryMI.getOpcode() == AMDGPU::S_SETPRIO &&
+        EntryMI.getOperand(0).getImm() >= Priority)
+      return false;
+  }
+
+  BuildMI(EntryMBB, EntryMBB.begin(), DebugLoc(), TII.get(AMDGPU::S_SETPRIO))
+      .addImm(Priority);
+  return true;
+}
+
+bool GCNHazardRecognizer::fixRequiredExportPriority(MachineInstr *MI) {
+  if (!ST.hasRequiredExportPriority())
+    return false;
+
+  // Assume the following shader types will never have exports,
+  // and avoid adding or adjusting S_SETPRIO.
+  MachineBasicBlock *MBB = MI->getParent();
+  MachineFunction *MF = MBB->getParent();
+  auto CC = MF->getFunction().getCallingConv();
+  switch (CC) {
+  case CallingConv::AMDGPU_CS:
+  case CallingConv::AMDGPU_CS_Chain:
+  case CallingConv::AMDGPU_CS_ChainPreserve:
+  case CallingConv::AMDGPU_KERNEL:
+    return false;
+  default:
+    break;
+  }
+
+  const int MaxPriority = 3;
+  const int NormalPriority = 2;
+  const int PostExportPriority = 0;
+
+  auto It = MI->getIterator();
+  switch (MI->getOpcode()) {
+  case AMDGPU::S_ENDPGM:
+  case AMDGPU::S_ENDPGM_SAVED:
+  case AMDGPU::S_ENDPGM_ORDERED_PS_DONE:
+  case AMDGPU::SI_RETURN_TO_EPILOG:
+    // Ensure shader with calls raises priority at entry.
+    // This ensures correct priority if exports exist in callee.
+    if (MF->getFrameInfo().hasCalls())
+      return ensureEntrySetPrio(MF, NormalPriority, TII);
+    return false;
+  case AMDGPU::S_SETPRIO: {
+    // Raise minimum priority unless in workaround.
+    auto &PrioOp = MI->getOperand(0);
+    int Prio = PrioOp.getImm();
+    bool InWA = (Prio == PostExportPriority) &&
+                (It != MBB->begin() && TII.isEXP(*std::prev(It)));
+    if (InWA || Prio >= NormalPriority)
+      return false;
+    PrioOp.setImm(std::min(Prio + NormalPriority, MaxPriority));
+    return true;
+  }
+  default:
+    if (!TII.isEXP(*MI))
+      return false;
+    break;
+  }
+
+  // Check entry priority at each export (as there will only be a few).
+  // Note: amdgpu_gfx can only be a callee, so defer to caller setprio.
+  bool Changed = false;
+  if (CC != CallingConv::AMDGPU_Gfx)
+    Changed = ensureEntrySetPrio(MF, NormalPriority, TII);
+
+  auto NextMI = std::next(It);
+  bool EndOfShader = false;
+  if (NextMI != MBB->end()) {
+    // Only need WA at end of sequence of exports.
+    if (TII.isEXP(*NextMI))
+      return Changed;
+    // Assume appropriate S_SETPRIO after export means WA already applied.
+    if (NextMI->getOpcode() == AMDGPU::S_SETPRIO &&
+        NextMI->getOperand(0).getImm() == PostExportPriority)
+      return Changed;
+    EndOfShader = NextMI->getOpcode() == AMDGPU::S_ENDPGM;
+  }
+
+  const DebugLoc &DL = MI->getDebugLoc();
+
+  // Lower priority.
+  BuildMI(*MBB, NextMI, DL, TII.get(AMDGPU::S_SETPRIO))
+      .addImm(PostExportPriority);
+
+  if (!EndOfShader) {
+    // Wait for exports to complete.
+    BuildMI(*MBB, NextMI, DL, TII.get(AMDGPU::S_WAITCNT_EXPCNT))
+        .addReg(AMDGPU::SGPR_NULL)
+        .addImm(0);
+  }
+
+  BuildMI(*MBB, NextMI, DL, TII.get(AMDGPU::S_NOP)).addImm(0);
+  BuildMI(*MBB, NextMI, DL, TII.get(AMDGPU::S_NOP)).addImm(0);
+
+  if (!EndOfShader) {
+    // Return to normal (higher) priority.
+    BuildMI(*MBB, NextMI, DL, TII.get(AMDGPU::S_SETPRIO))
+        .addImm(NormalPriority);
+  }
+
+  return true;
+}
diff --git a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h
index 3ccca527c626b..f2a64ab48e180 100644
--- a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h
+++ b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h
@@ -107,6 +107,7 @@ class GCNHazardRecognizer final : public ScheduleHazardRecognizer {
   bool fixWMMAHazards(MachineInstr *MI);
   bool fixShift64HighRegBug(MachineInstr *MI);
   bool fixVALUMaskWriteHazard(MachineInstr *MI);
+  bool fixRequiredExportPriority(MachineInstr *MI);
 
   int checkMAIHazards(MachineInstr *MI);
   int checkMAIHazards908(MachineInstr *MI);
diff --git a/llvm/lib/Target/AMDGPU/GCNSubtarget.h b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
index e5817594a4521..def89c785b855 100644
--- a/llvm/lib/Target/AMDGPU/GCNSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
@@ -238,6 +238,7 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
   bool HasVOPDInsts = false;
   bool HasVALUTransUseHazard = false;
   bool HasForceStoreSC0SC1 = false;
+  bool HasRequiredExportPriority = false;
 
   bool RequiresCOV6 = false;
 
@@ -1282,6 +1283,8 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
 
   bool hasRestrictedSOffset() const { return HasRestrictedSOffset; }
 
+  bool hasRequiredExportPriority() const { return HasRequiredExportPriority; }
+
   /// \returns true if the target uses LOADcnt/SAMPLEcnt/BVHcnt, DScnt/KMcnt
   /// and STOREcnt rather than VMcnt, LGKMcnt and VScnt respectively.
   bool hasExtendedWaitCounts() const { return getGeneration() >= GFX12; }
diff --git a/llvm/test/CodeGen/AMDGPU/required-export-priority.ll b/llvm/test/CodeGen/AMDGPU/required-export-priority.ll
new file mode 100644
index 0000000000000..377902f3f0d1a
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/required-export-priority.ll
@@ -0,0 +1,344 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1150 -amdgpu-enable-vopd=0 -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s
+
+define amdgpu_ps void @test_export_zeroes_f32() #0 {
+; GCN-LABEL: test_export_zeroes_f32:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    v_mov_b32_e32 v0, 0
+; GCN-NEXT:    exp mrt0 off, off, off, off
+; GCN-NEXT:    exp mrt0 off, off, off, off done
+; GCN-NEXT:    s_setprio 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_endpgm
+  call void @llvm.amdgcn.exp.f32(i32 0, i32 0, float 0.0, float 0.0, float 0.0, float 0.0, i1 false, i1 false)
+  call void @llvm.amdgcn.exp.f32(i32 0, i32 0, float 0.0, float 0.0, float 0.0, float 0.0, i1 true, i1 false)
+  ret void
+}
+
+define amdgpu_ps void @test_export_en_src0_f32() #0 {
+; GCN-LABEL: test_export_en_src0_f32:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    v_mov_b32_e32 v0, 4.0
+; GCN-NEXT:    v_mov_b32_e32 v1, 0.5
+; GCN-NEXT:    v_mov_b32_e32 v2, 2.0
+; GCN-NEXT:    v_mov_b32_e32 v3, 1.0
+; GCN-NEXT:    exp mrt0 v3, off, off, off done
+; GCN-NEXT:    s_setprio 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_endpgm
+  call void @llvm.amdgcn.exp.f32(i32 0, i32 1, float 1.0, float 2.0, float 0.5, float 4.0, i1 true, i1 false)
+  ret void
+}
+
+define amdgpu_gs void @test_export_gs() #0 {
+; GCN-LABEL: test_export_gs:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    v_mov_b32_e32 v0, 4.0
+; GCN-NEXT:    v_mov_b32_e32 v1, 0.5
+; GCN-NEXT:    v_mov_b32_e32 v2, 2.0
+; GCN-NEXT:    v_mov_b32_e32 v3, 1.0
+; GCN-NEXT:    exp mrt0 off, v2, off, off done
+; GCN-NEXT:    s_setprio 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_endpgm
+  call void @llvm.amdgcn.exp.f32(i32 0, i32 2, float 1.0, float 2.0, float 0.5, float 4.0, i1 true, i1 false)
+  ret void
+}
+
+define amdgpu_hs void @test_export_hs() #0 {
+; GCN-LABEL: test_export_hs:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    v_mov_b32_e32 v0, 4.0
+; GCN-NEXT:    v_mov_b32_e32 v1, 0.5
+; GCN-NEXT:    v_mov_b32_e32 v2, 2.0
+; GCN-NEXT:    v_mov_b32_e32 v3, 1.0
+; GCN-NEXT:    exp mrt0 off, v2, off, off done
+; GCN-NEXT:    s_setprio 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_endpgm
+  call void @llvm.amdgcn.exp.f32(i32 0, i32 2, float 1.0, float 2.0, float 0.5, float 4.0, i1 true, i1 false)
+  ret void
+}
+
+define amdgpu_gfx void @test_export_gfx(float %v) #0 {
+; GCN-LABEL: test_export_gfx:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GCN-NEXT:    v_mov_b32_e32 v1, 4.0
+; GCN-NEXT:    v_mov_b32_e32 v2, 0.5
+; GCN-NEXT:    v_mov_b32_e32 v3, 2.0
+; GCN-NEXT:    exp mrt0 off, v3, off, off done
+; GCN-NEXT:    s_setprio 0
+; GCN-NEXT:    s_waitcnt_expcnt null, 0x0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    s_setpc_b64 s[30:31]
+  call void @llvm.amdgcn.exp.f32(i32 0, i32 2, float %v, float 2.0, float 0.5, float 4.0, i1 true, i1 false)
+  ret void
+}
+
+define amdgpu_cs void @test_export_cs() #0 {
+; GCN-LABEL: test_export_cs:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    v_mov_b32_e32 v0, 4.0
+; GCN-NEXT:    v_mov_b32_e32 v1, 0.5
+; GCN-NEXT:    v_mov_b32_e32 v2, 2.0
+; GCN-NEXT:    v_mov_b32_e32 v3, 1.0
+; GCN-NEXT:    exp mrt0 off, v2, off, off done
+; GCN-NEXT:    s_endpgm
+  call void @llvm.amdgcn.exp.f32(i32 0, i32 2, float 1.0, float 2.0, float 0.5, float 4.0, i1 true, i1 false)
+  ret void
+}
+
+define amdgpu_kernel void @test_export_kernel() #0 {
+; GCN-LABEL: test_export_kernel:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    v_mov_b32_e32 v0, 4.0
+; GCN-NEXT:    v_mov_b32_e32 v1, 0.5
+; GCN-NEXT:    v_mov_b32_e32 v2, 2.0
+; GCN-NEXT:    v_mov_b32_e32 v3, 1.0
+; GCN-NEXT:    exp mrt0 off, v2, off, off done
+; GCN-NEXT:    s_endpgm
+  call void @llvm.amdgcn.exp.f32(i32 0, i32 2, float 1.0, float 2.0, float 0.5, float 4.0, i1 true, i1 false)
+  ret void
+}
+
+define amdgpu_gfx void @test_no_export_gfx(float %v) #0 {
+; GCN-LABEL: test_no_export_gfx:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GCN-NEXT:    s_setpc_b64 s[30:31]
+  ret void
+}
+
+define amdgpu_ps void @test_no_export_ps(float %v) #0 {
+; GCN-LABEL: test_no_export_ps:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_endpgm
+  ret void
+}
+
+define amdgpu_ps void @test_if_export_f32(i32 %flag, float %x, float %y, float %z, float %w) #0 {
+; GCN-LABEL: test_if_export_f32:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    s_mov_b32 s0, exec_lo
+; GCN-NEXT:    v_cmpx_ne_u32_e32 0, v0
+; GCN-NEXT:    s_cbranch_execz .LBB9_2
+; GCN-NEXT:  ; %bb.1: ; %exp
+; GCN-NEXT:    exp mrt0 v1, v2, v3, v4
+; GCN-NEXT:    s_setprio 0
+; GCN-NEXT:    s_waitcnt_expcnt null, 0x0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:  .LBB9_2: ; %end
+; GCN-NEXT:    s_endpgm
+  %cc = icmp eq i32 %flag, 0
+  br i1 %cc, label %end, label %exp
+
+exp:
+  call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float %z, float %w, i1 false, i1 false)
+  br label %end
+
+end:
+  ret void
+}
+
+define amdgpu_ps void @test_if_export_vm_f32(i32 %flag, float %x, float %y, float %z, float %w) #0 {
+; GCN-LABEL: test_if_export_vm_f32:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    s_mov_b32 s0, exec_lo
+; GCN-NEXT:    v_cmpx_ne_u32_e32 0, v0
+; GCN-NEXT:    s_cbranch_execz .LBB10_2
+; GCN-NEXT:  ; %bb.1: ; %exp
+; GCN-NEXT:    exp mrt0 v1, v2, v3, v4
+; GCN-NEXT:    s_setprio 0
+; GCN-NEXT:    s_waitcnt_expcnt null, 0x0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:  .LBB10_2: ; %end
+; GCN-NEXT:    s_endpgm
+  %cc = icmp eq i32 %flag, 0
+  br i1 %cc, label %end, label %exp
+
+exp:
+  call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float %z, float %w, i1 false, i1 true)
+  br label %end
+
+end:
+  ret void
+}
+
+define amdgpu_ps void @test_if_export_done_f32(i32 %flag, float %x, float %y, float %z, float %w) #0 {
+; GCN-LABEL: test_if_export_done_f32:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    s_mov_b32 s0, exec_lo
+; GCN-NEXT:    v_cmpx_ne_u32_e32 0, v0
+; GCN-NEXT:    s_cbranch_execz .LBB11_2
+; GCN-NEXT:  ; %bb.1: ; %exp
+; GCN-NEXT:    exp mrt0 v1, v2, v3, v4 done
+; GCN-NEXT:    s_setprio 0
+; GCN-NEXT:    s_waitcnt_expcnt null, 0x0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:  .LBB11_2: ; %end
+; GCN-NEXT:    s_endpgm
+  %cc = icmp eq i32 %flag, 0
+  br i1 %cc, label %end, label %exp
+
+exp:
+  call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float %z, float %w, i1 true, i1 false)
+  br label %end
+
+end:
+  ret void
+}
+
+define amdgpu_ps void @test_if_export_vm_done_f32(i32 %flag, float %x, float %y, float %z, float %w) #0 {
+; GCN-LABEL: test_if_export_vm_done_f32:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    s_mov_b32 s0, exec_lo
+; GCN-NEXT:    v_cmpx_ne_u32_e32 0, v0
+; GCN-NEXT:    s_cbranch_execz .LBB12_2
+; GCN-NEXT:  ; %bb.1: ; %exp
+; GCN-NEXT:    exp mrt0 v1, v2, v3, v4 done
+; GCN-NEXT:    s_setprio 0
+; GCN-NEXT:    s_waitcnt_expcnt null, 0x0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:  .LBB12_2: ; %end
+; GCN-NEXT:    s_endpgm
+  %cc = icmp eq i32 %flag, 0
+  br i1 %cc, label %end, label %exp
+
+exp:
+  call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float %z, float %w, i1 true, i1 true)
+  br label %end
+
+end:
+  ret void
+}
+
+define amdgpu_ps void @test_export_pos_before_param_across_load(i32 %idx) #0 {
+; GCN-LABEL: test_export_pos_before_param_across_load:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    buffer_load_b32 v0, v0, s[0:3], 0 offen
+; GCN-NEXT:    v_mov_b32_e32 v1, 0
+; GCN-NEXT:    v_mov_b32_e32 v2, 1.0
+; GCN-NEXT:    v_mov_b32_e32 v3, 0.5
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    exp pos0 v1, v1, v1, v0 done
+; GCN-NEXT:    exp invalid_target_32 v2, v2, v2, v2
+; GCN-NEXT:    exp invalid_target_33 v2, v2, v2, v3
+; GCN-NEXT:    s_setprio 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_endpgm
+  call void @llvm.amdgcn.exp.f32(i32 32, i32 15, float 1.0, float 1.0, float 1.0, float 1.0, i1 false, i1 false)
+  call void @llvm.amdgcn.exp.f32(i32 33, i32 15, float 1.0, float 1.0, float 1.0, float 0.5, i1 false, i1 false)
+  %load = call float @llvm.amdgcn.raw.ptr.buffer.load.f32(ptr addrspace(8) undef, i32 %idx, i32 0, i32 0)
+  call void @llvm.amdgcn.exp.f32(i32 12, i32 15, float 0.0, float 0.0, float 0.0, float %load, i1 true, i1 false)
+  ret void
+}
+
+define amdgpu_ps void @test_export_across_store_load(i32 %idx, float %v) #0 {
+; GCN-LABEL: test_export_across_store_load:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    v_mov_b32_e32 v2, 16
+; GCN-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v0
+; GCN-NEXT:    s_delay_alu instid0(VALU_DEP_2)
+; GCN-NEXT:    v_cndmask_b32_e64 v0, v2, 0, vcc_lo
+; GCN-NEXT:    v_mov_b32_e32 v2, 0
+; GCN-NEXT:    scratch_store_b32 v0, v1, off
+; GCN-NEXT:    scratch_load_b32 v0, off, off
+; GCN-NEXT:    v_mov_b32_e32 v1, 1.0
+; GCN-NEXT:    exp pos0 v2, v2, v2, v1 done
+; GCN-NEXT:    s_setprio 0
+; GCN-NEXT:    s_waitcnt_expcnt null, 0x0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    exp invalid_target_32 v0, v2, v1, v2
+; GCN-NEXT:    exp invalid_target_33 v0, v2, v1, v2
+; GCN-NEXT:    s_setprio 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_endpgm
+  %data0 = alloca <4 x float>, align 8, addrspace(5)
+  %data1 = alloca <4 x float>, align 8, addrspace(5)
+  %cmp = icmp eq i32 %idx, 1
+  %data = select i1 %cmp, ptr addrspace(5) %data0, ptr addrspace(5) %data1
+  store float %v, ptr addrspace(5) %data, align 8
+  call void @llvm.amdgcn.exp.f32(i32 12, i32 15, float 0.0, float 0.0, float 0.0, float 1.0, i1 true, i1 false)
+  %load0 = load float, ptr addrspace(5) %data0, align 8
+  call void @llvm.amdgcn.exp.f32(i32 32, i32 15, float %load0, float 0.0, float 1.0, float 0.0, i1 false, i1 false)
+  call void @llvm.amdgcn.exp.f32(i32 33, i32 15, float %load0, float 0.0, float 1.0, float 0.0, i1 false, i1 false)
+  ret void
+}
+
+define amdgpu_ps void @test_export_in_callee(float %v) #0 {
+; GCN-LABEL: test_export_in_callee:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    s_getpc_b64 s[0:1]
+; GCN-NEXT:    s_add_u32 s0, s0, test_export_gfx@gotpcrel32@lo+4
+; GCN-NEXT:    s_addc_u32 s1, s1, test_export_gfx@gotpcrel32@hi+12
+; GCN-NEXT:    v_add_f32_e32 v0, 1.0, v0
+; GCN-NEXT:    s_load_b64 s[0:1], s[0:1], 0x0
+; GCN-NEXT:    s_mov_b32 s32, 0
+; GCN-NEXT:    s_waitcnt lgkmcnt(0)
+; GCN-NEXT:    s_swappc_b64 s[30:31], s[0:1]
+; GCN-NEXT:    s_endpgm
+  %x = fadd float %v, 1.0
+  call void @test_export_gfx(float %x)
+  ret void
+}
+
+define amdgpu_ps void @test_export_in_callee_prio(float %v) #0 {
+; GCN-LABEL: test_export_in_callee_prio:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    s_mov_b32 s32, 0
+; GCN-NEXT:    v_add_f32_e32 v0, 1.0, v0
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    s_getpc_b64 s[0:1]
+; GCN-NEXT:    s_add_u32 s0, s0, test_export_gfx@gotpcrel32@lo+4
+; GCN-NEXT:    s_addc_u32 s1, s1, test_export_gfx@gotpcrel32@hi+12
+; GCN-NEXT:    s_load_b64 s[0:1], s[0:1], 0x0
+; GCN-NEXT:    s_waitcnt lgkmcnt(0)
+; GCN-NEXT:    s_swappc_b64 s[30:31], s[0:1]
+; GCN-NEXT:    s_endpgm
+  %x = fadd float %v, 1.0
+  call void @llvm.amdgcn.s.setprio(i16 0)
+  call void @test_export_gfx(float %x)
+  ret void
+}
+
+declare void @llvm.amdgcn.exp.f32(i32, i32, float, float, float, float, i1, i1) #1
+declare void @llvm.amdgcn.exp.i32(i32, i32, i32, i32, i32, i32, i1, i1) #1
+declare float @llvm.amdgcn.raw.ptr.buffer.load.f32(ptr addrspace(8), i32, i32, i32) #2
+declare void @llvm.amdgcn.s.setprio(i16)
+
+attributes #0 = { nounwind }
+attributes #1 = { nounwind inaccessiblememonly }
+attributes #2 = { nounwind readnone }
diff --git a/llvm/test/CodeGen/AMDGPU/required-export-priority.mir b/llvm/test/CodeGen/AMDGPU/required-export-priority.mir
new file mode 100644
index 0000000000000..eee04468036e5
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/required-expo...
[truncated]

@@ -1597,20 +1603,23 @@ def FeatureISAVersion11_5_0 : FeatureSet<
   !listconcat(FeatureISAVersion11_Common.Features,
     [FeatureSALUFloatInsts,
      FeatureDPPSrc1SGPR,
-     FeatureVGPRSingleUseHintInsts])>;
+     FeatureVGPRSingleUseHintInsts,
+     FeatureRequiredExportPriority])>;
Contributor:

The bug workaround also needs to be added to the generic target.

perlfu (Contributor Author):

I don't think there is a generic target for 11.5; these don't apply to all of GFX11, only the three 11.5 target definitions.

Contributor:

There is ISAVersion11_Generic

Contributor:

gfx11-generic includes gfx115x.

I don't think there is a generic target for 11.5, these don't apply to all of 11, only the three 11.5 targets definitions.

If this is a codegen pessimization to fix a bug, and it's not incompatible with gfx110x (just makes them slower), then it needs to go in the generic target.

perlfu (Contributor Author):

Sure.

There seems to be some inconsistency with generics in general.
GFX10 has separate 10.1 and 10.3 generic targets, I guess because .0 is not defined.
The GFX9 generic target doesn't appear to follow this workaround-absorbing convention, and 9.4 doesn't have a generic target.
I guess splitting GFX11 into .0 and .5 now would not make sense if someone was already using the gfx11-generic target, as it would still need to stay defined?

Contributor:

GFX10 has separate 10.1, 10.3 generic targets

It's inconsistent because the hardware is inconsistent. Ideally they wouldn't have broken compatibility so badly between minor gfx10 revisions

Contributor:

GFX 9 generic target doesn't follow this workaround absorbing convention,

It certainly is supposed to

Contributor:

There seems to be some inconsistency with generics in general.

We have 2 families for GFX10 because GFX101x and GFX103x have incompatible ISAs due to encoding changes.

GFX10 has separate 10.1, 10.3 generic targets, I guess because .0 is not defined.

The naming convention is just:

  • If we have one family for the whole generation, just use the family name (gfx9-generic)
  • If we have more, then name each target after the earliest/lowest number in the family (gfx101x -> gfx10.1-generic)
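The naming convention described above can be sketched as a small helper (a hypothetical illustration, not an LLVM API):

```python
def generic_target_names(gen, families):
    """Return generic-target names for a GFX generation.

    gen: generation number (e.g. 9, 10).
    families: minor-version families covered by separate generic targets.
    One family -> a single generation-wide name; several families ->
    one name per family, keyed by its lowest minor number.
    """
    if len(families) == 1:
        return [f"gfx{gen}-generic"]
    return [f"gfx{gen}.{fam}-generic" for fam in sorted(families)]
```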

GFX 9 generic target doesn't follow this workaround absorbing convention

What do you mean? We should have all workarounds included in gfx9-generic, if not, that's an issue.
Can you please tell me which workaround you think is missing? I'll follow up on it.

and 9.4 doesn't have a generic target.

Generic targets are for the case where one decides that performance isn't essential (or ideal performance cannot be achieved anyway), and it just needs to work and get up and running fast (= easy to distribute). In those cases, building only 2 targets for GFX10 instead of 11 is a huge time and space saver.

It makes little to no sense for accelerators (e.g. gfx90a) to have generic targets, for multiple reasons:

  • They're designed for max performance on compute workloads. Performance >>> build times here.
  • It's pretty much impossible to make a generic target for, e.g. gfx9 that includes MI. The biggest issue IIRC is that the memory model is radically different between consumer and MI gfx9.

perlfu (Contributor Author):

GFX 9 generic target doesn't follow this workaround absorbing convention

What do you mean? We should have all workarounds included in gfx9-generic, if not, that's an issue. Can you please tell me which workaround you think is missing? I'll follow up on it.

Thanks!
On a further read, I was tripped up by the workarounds being in FeatureISAVersion9_0_Consumer_Common, which is included in FeatureISAVersion9_Generic, rather than being directly in FeatureISAVersion9_Generic.
Your explanation helps me understand that "gfx9-generic" basically means "gfx9-consumer-generic",
i.e. FeatureISAVersion9_0_Consumer_Common = FeatureISAVersion9_Generic = "gfx9-generic",
with the MI parts not being in this target, but still applying some of the same workarounds.

I had also forgotten binaries are built for multiple targets in the compute chain; hence why you don't want more targets.

@arsenm arsenm requested a review from Pierre-vh July 17, 2024 07:25
marekolsak commented:
Can this also be backported to LLVM 18?

jayfoad (Contributor) commented Jul 17, 2024

Can this also be backported to LLVM 18?

No more 18.x releases are planned: https://discourse.llvm.org/t/18-1-8-released/79725

Reduce run line complexity.
@perlfu perlfu merged commit 939a662 into llvm:main Jul 23, 2024
5 of 7 checks passed
temeraire-cx pushed a commit to chaotic-cx/mesa-mirror that referenced this pull request Jul 23, 2024
llvm/llvm-project#99273

fossil-db (gfx1150):
Totals from 73996 (93.20% of 79395) affected shaders:
Instrs: 36015357 -> 36807177 (+2.20%)
CodeSize: 189072544 -> 192238748 (+1.67%)
Latency: 245845181 -> 246790550 (+0.38%); split: -0.00%, +0.38%
InvThroughput: 45068018 -> 45116177 (+0.11%); split: -0.00%, +0.11%

Signed-off-by: Rhys Perry <[email protected]>
Reviewed-by: Daniel Schürmann <[email protected]>
Backport-to: 24.2
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30241>
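The percentage deltas in the fossil-db totals above can be re-derived from the raw before/after figures; a quick sanity check:

```python
def pct_change(old, new):
    """Percentage change between two totals, rounded to two decimals
    as in the fossil-db report above."""
    return round((new - old) / old * 100, 2)

# Figures taken from the fossil-db (gfx1150) totals in the commit message.
instrs = pct_change(36015357, 36807177)      # Instrs delta
codesize = pct_change(189072544, 192238748)  # CodeSize delta
```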
temeraire-cx pushed a commit to chaotic-cx/mesa-mirror that referenced this pull request Jul 23, 2024
llvm/llvm-project#99273

(cherry picked from commit 0919ce1)
yuxuanchen1997 pushed a commit that referenced this pull request Jul 25, 2024
Summary:
On GFX11.5 shaders having completed exports need to execute/wait at a
lower priority than shaders still executing exports.
Add code to maintain normal priority of 2 for shaders that export and
drop to priority 0 after exports.

Differential Revision: https://phabricator.intern.facebook.com/D60251077
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Feb 13, 2025
On GFX11.5 shaders having completed exports need to execute/wait at a
lower priority than shaders still executing exports.
Add code to maintain normal priority of 2 for shaders that export and
drop to priority 0 after exports.

(cherry picked from commit 939a662)
rocm-ci pushed a commit to ROCm/llvm-project that referenced this pull request Apr 11, 2025
On GFX11.5 shaders having completed exports need to execute/wait at a
lower priority than shaders still executing exports.
Add code to maintain normal priority of 2 for shaders that export and
drop to priority 0 after exports.

(cherry picked from commit 939a662)