
[AMDGPU] Implement workaround for GFX11.5 export priority #99273


Merged
merged 4 commits into llvm:main on Jul 23, 2024

Conversation

perlfu (Contributor) commented Jul 17, 2024

On GFX11.5, shaders that have completed their exports need to execute/wait at a lower priority than shaders still executing exports.
Add code to maintain the normal priority of 2 for shaders that export, and to drop to priority 0 after the exports.

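As a rough model of where the workaround instructions land, the following Python sketch (the names and structure are hypothetical, not the actual C++ hazard-recognizer code) inserts the priority-drop sequence after the last export in each run of consecutive exports, matching the pattern visible in the test expectations below:

```python
def apply_export_priority_wa(insts):
    """Sketch of the GFX11.5 export-priority workaround placement.

    insts: list of opcode strings. After the last export in each run of
    consecutive 'exp' instructions, insert: a drop to priority 0, a wait
    on export count (skipped right before s_endpgm), two s_nop delay
    slots, and a return to normal priority 2 (also skipped at shader end).
    """
    out = []
    n = len(insts)
    for i, op in enumerate(insts):
        out.append(op)
        if op != "exp":
            continue
        nxt = insts[i + 1] if i + 1 < n else None
        if nxt == "exp":
            continue  # only apply at the end of a run of exports
        end_of_shader = nxt == "s_endpgm"
        out.append("s_setprio 0")
        if not end_of_shader:
            out.append("s_waitcnt_expcnt")  # wait for exports to complete
        out += ["s_nop", "s_nop"]
        if not end_of_shader:
            out.append("s_setprio 2")  # return to normal priority
    return out
```

The entry-priority side of the workaround (prepending `s_setprio 2` at function entry for exporting shaders) is omitted here for brevity.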
llvmbot (Member) commented Jul 17, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Carl Ritson (perlfu)

Changes

On GFX11.5, shaders that have completed their exports need to execute/wait at a lower priority than shaders still executing exports.
Add code to maintain the normal priority of 2 for shaders that export, and to drop to priority 0 after the exports.


Patch is 28.94 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/99273.diff

6 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPU.td (+12-3)
  • (modified) llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp (+112)
  • (modified) llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h (+1)
  • (modified) llvm/lib/Target/AMDGPU/GCNSubtarget.h (+3)
  • (added) llvm/test/CodeGen/AMDGPU/required-export-priority.ll (+344)
  • (added) llvm/test/CodeGen/AMDGPU/required-export-priority.mir (+293)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.td b/llvm/lib/Target/AMDGPU/AMDGPU.td
index dfc8eaea66f7b..14fcf6a210a78 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.td
@@ -947,6 +947,12 @@ def FeatureHasRestrictedSOffset : SubtargetFeature<"restricted-soffset",
   "Has restricted SOffset (immediate not supported)."
 >;
 
+def FeatureRequiredExportPriority : SubtargetFeature<"required-export-priority",
+  "HasRequiredExportPriority",
+  "true",
+  "Export priority must be explicitly manipulated on GFX11.5"
+>;
+
 //===------------------------------------------------------------===//
 // Subtarget Features (options and debugging)
 //===------------------------------------------------------------===//
@@ -1597,20 +1603,23 @@ def FeatureISAVersion11_5_0 : FeatureSet<
   !listconcat(FeatureISAVersion11_Common.Features,
     [FeatureSALUFloatInsts,
      FeatureDPPSrc1SGPR,
-     FeatureVGPRSingleUseHintInsts])>;
+     FeatureVGPRSingleUseHintInsts,
+     FeatureRequiredExportPriority])>;
 
 def FeatureISAVersion11_5_1 : FeatureSet<
   !listconcat(FeatureISAVersion11_Common.Features,
     [FeatureSALUFloatInsts,
      FeatureDPPSrc1SGPR,
      FeatureVGPRSingleUseHintInsts,
-     Feature1_5xVGPRs])>;
+     Feature1_5xVGPRs,
+     FeatureRequiredExportPriority])>;
 
 def FeatureISAVersion11_5_2 : FeatureSet<
   !listconcat(FeatureISAVersion11_Common.Features,
     [FeatureSALUFloatInsts,
      FeatureDPPSrc1SGPR,
-     FeatureVGPRSingleUseHintInsts])>;
+     FeatureVGPRSingleUseHintInsts,
+     FeatureRequiredExportPriority])>;
 
 def FeatureISAVersion12 : FeatureSet<
   [FeatureGFX12,
diff --git a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
index a402fc6d7e611..a8b171aa82840 100644
--- a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
@@ -14,6 +14,7 @@
 #include "GCNSubtarget.h"
 #include "MCTargetDesc/AMDGPUMCTargetDesc.h"
 #include "SIMachineFunctionInfo.h"
+#include "llvm/CodeGen/MachineFrameInfo.h"
 #include "llvm/CodeGen/MachineFunction.h"
 #include "llvm/CodeGen/ScheduleDAG.h"
 #include "llvm/TargetParser/TargetParser.h"
@@ -1104,6 +1105,7 @@ void GCNHazardRecognizer::fixHazards(MachineInstr *MI) {
   fixWMMAHazards(MI);
   fixShift64HighRegBug(MI);
   fixVALUMaskWriteHazard(MI);
+  fixRequiredExportPriority(MI);
 }
 
 bool GCNHazardRecognizer::fixVcmpxPermlaneHazards(MachineInstr *MI) {
@@ -2895,3 +2897,113 @@ bool GCNHazardRecognizer::fixVALUMaskWriteHazard(MachineInstr *MI) {
 
   return true;
 }
+
+static bool ensureEntrySetPrio(MachineFunction *MF, int Priority,
+                               const SIInstrInfo &TII) {
+  MachineBasicBlock &EntryMBB = MF->front();
+  if (EntryMBB.begin() != EntryMBB.end()) {
+    auto &EntryMI = *EntryMBB.begin();
+    if (EntryMI.getOpcode() == AMDGPU::S_SETPRIO &&
+        EntryMI.getOperand(0).getImm() >= Priority)
+      return false;
+  }
+
+  BuildMI(EntryMBB, EntryMBB.begin(), DebugLoc(), TII.get(AMDGPU::S_SETPRIO))
+      .addImm(Priority);
+  return true;
+}
+
+bool GCNHazardRecognizer::fixRequiredExportPriority(MachineInstr *MI) {
+  if (!ST.hasRequiredExportPriority())
+    return false;
+
+  // Assume the following shader types will never have exports,
+  // and avoid adding or adjusting S_SETPRIO.
+  MachineBasicBlock *MBB = MI->getParent();
+  MachineFunction *MF = MBB->getParent();
+  auto CC = MF->getFunction().getCallingConv();
+  switch (CC) {
+  case CallingConv::AMDGPU_CS:
+  case CallingConv::AMDGPU_CS_Chain:
+  case CallingConv::AMDGPU_CS_ChainPreserve:
+  case CallingConv::AMDGPU_KERNEL:
+    return false;
+  default:
+    break;
+  }
+
+  const int MaxPriority = 3;
+  const int NormalPriority = 2;
+  const int PostExportPriority = 0;
+
+  auto It = MI->getIterator();
+  switch (MI->getOpcode()) {
+  case AMDGPU::S_ENDPGM:
+  case AMDGPU::S_ENDPGM_SAVED:
+  case AMDGPU::S_ENDPGM_ORDERED_PS_DONE:
+  case AMDGPU::SI_RETURN_TO_EPILOG:
+    // Ensure shader with calls raises priority at entry.
+    // This ensures correct priority if exports exist in callee.
+    if (MF->getFrameInfo().hasCalls())
+      return ensureEntrySetPrio(MF, NormalPriority, TII);
+    return false;
+  case AMDGPU::S_SETPRIO: {
+    // Raise minimum priority unless in workaround.
+    auto &PrioOp = MI->getOperand(0);
+    int Prio = PrioOp.getImm();
+    bool InWA = (Prio == PostExportPriority) &&
+                (It != MBB->begin() && TII.isEXP(*std::prev(It)));
+    if (InWA || Prio >= NormalPriority)
+      return false;
+    PrioOp.setImm(std::min(Prio + NormalPriority, MaxPriority));
+    return true;
+  }
+  default:
+    if (!TII.isEXP(*MI))
+      return false;
+    break;
+  }
+
+  // Check entry priority at each export (as there will only be a few).
+  // Note: amdgpu_gfx can only be a callee, so defer to caller setprio.
+  bool Changed = false;
+  if (CC != CallingConv::AMDGPU_Gfx)
+    Changed = ensureEntrySetPrio(MF, NormalPriority, TII);
+
+  auto NextMI = std::next(It);
+  bool EndOfShader = false;
+  if (NextMI != MBB->end()) {
+    // Only need WA at end of sequence of exports.
+    if (TII.isEXP(*NextMI))
+      return Changed;
+    // Assume appropriate S_SETPRIO after export means WA already applied.
+    if (NextMI->getOpcode() == AMDGPU::S_SETPRIO &&
+        NextMI->getOperand(0).getImm() == PostExportPriority)
+      return Changed;
+    EndOfShader = NextMI->getOpcode() == AMDGPU::S_ENDPGM;
+  }
+
+  const DebugLoc &DL = MI->getDebugLoc();
+
+  // Lower priority.
+  BuildMI(*MBB, NextMI, DL, TII.get(AMDGPU::S_SETPRIO))
+      .addImm(PostExportPriority);
+
+  if (!EndOfShader) {
+    // Wait for exports to complete.
+    BuildMI(*MBB, NextMI, DL, TII.get(AMDGPU::S_WAITCNT_EXPCNT))
+        .addReg(AMDGPU::SGPR_NULL)
+        .addImm(0);
+  }
+
+  BuildMI(*MBB, NextMI, DL, TII.get(AMDGPU::S_NOP)).addImm(0);
+  BuildMI(*MBB, NextMI, DL, TII.get(AMDGPU::S_NOP)).addImm(0);
+
+  if (!EndOfShader) {
+    // Return to normal (higher) priority.
+    BuildMI(*MBB, NextMI, DL, TII.get(AMDGPU::S_SETPRIO))
+        .addImm(NormalPriority);
+  }
+
+  return true;
+}
diff --git a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h
index 3ccca527c626b..f2a64ab48e180 100644
--- a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h
+++ b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h
@@ -107,6 +107,7 @@ class GCNHazardRecognizer final : public ScheduleHazardRecognizer {
   bool fixWMMAHazards(MachineInstr *MI);
   bool fixShift64HighRegBug(MachineInstr *MI);
   bool fixVALUMaskWriteHazard(MachineInstr *MI);
+  bool fixRequiredExportPriority(MachineInstr *MI);
 
   int checkMAIHazards(MachineInstr *MI);
   int checkMAIHazards908(MachineInstr *MI);
diff --git a/llvm/lib/Target/AMDGPU/GCNSubtarget.h b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
index e5817594a4521..def89c785b855 100644
--- a/llvm/lib/Target/AMDGPU/GCNSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
@@ -238,6 +238,7 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
   bool HasVOPDInsts = false;
   bool HasVALUTransUseHazard = false;
   bool HasForceStoreSC0SC1 = false;
+  bool HasRequiredExportPriority = false;
 
   bool RequiresCOV6 = false;
 
@@ -1282,6 +1283,8 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
 
   bool hasRestrictedSOffset() const { return HasRestrictedSOffset; }
 
+  bool hasRequiredExportPriority() const { return HasRequiredExportPriority; }
+
   /// \returns true if the target uses LOADcnt/SAMPLEcnt/BVHcnt, DScnt/KMcnt
   /// and STOREcnt rather than VMcnt, LGKMcnt and VScnt respectively.
   bool hasExtendedWaitCounts() const { return getGeneration() >= GFX12; }
diff --git a/llvm/test/CodeGen/AMDGPU/required-export-priority.ll b/llvm/test/CodeGen/AMDGPU/required-export-priority.ll
new file mode 100644
index 0000000000000..377902f3f0d1a
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/required-export-priority.ll
@@ -0,0 +1,344 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1150 -amdgpu-enable-vopd=0 -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s
+
+define amdgpu_ps void @test_export_zeroes_f32() #0 {
+; GCN-LABEL: test_export_zeroes_f32:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    v_mov_b32_e32 v0, 0
+; GCN-NEXT:    exp mrt0 off, off, off, off
+; GCN-NEXT:    exp mrt0 off, off, off, off done
+; GCN-NEXT:    s_setprio 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_endpgm
+  call void @llvm.amdgcn.exp.f32(i32 0, i32 0, float 0.0, float 0.0, float 0.0, float 0.0, i1 false, i1 false)
+  call void @llvm.amdgcn.exp.f32(i32 0, i32 0, float 0.0, float 0.0, float 0.0, float 0.0, i1 true, i1 false)
+  ret void
+}
+
+define amdgpu_ps void @test_export_en_src0_f32() #0 {
+; GCN-LABEL: test_export_en_src0_f32:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    v_mov_b32_e32 v0, 4.0
+; GCN-NEXT:    v_mov_b32_e32 v1, 0.5
+; GCN-NEXT:    v_mov_b32_e32 v2, 2.0
+; GCN-NEXT:    v_mov_b32_e32 v3, 1.0
+; GCN-NEXT:    exp mrt0 v3, off, off, off done
+; GCN-NEXT:    s_setprio 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_endpgm
+  call void @llvm.amdgcn.exp.f32(i32 0, i32 1, float 1.0, float 2.0, float 0.5, float 4.0, i1 true, i1 false)
+  ret void
+}
+
+define amdgpu_gs void @test_export_gs() #0 {
+; GCN-LABEL: test_export_gs:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    v_mov_b32_e32 v0, 4.0
+; GCN-NEXT:    v_mov_b32_e32 v1, 0.5
+; GCN-NEXT:    v_mov_b32_e32 v2, 2.0
+; GCN-NEXT:    v_mov_b32_e32 v3, 1.0
+; GCN-NEXT:    exp mrt0 off, v2, off, off done
+; GCN-NEXT:    s_setprio 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_endpgm
+  call void @llvm.amdgcn.exp.f32(i32 0, i32 2, float 1.0, float 2.0, float 0.5, float 4.0, i1 true, i1 false)
+  ret void
+}
+
+define amdgpu_hs void @test_export_hs() #0 {
+; GCN-LABEL: test_export_hs:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    v_mov_b32_e32 v0, 4.0
+; GCN-NEXT:    v_mov_b32_e32 v1, 0.5
+; GCN-NEXT:    v_mov_b32_e32 v2, 2.0
+; GCN-NEXT:    v_mov_b32_e32 v3, 1.0
+; GCN-NEXT:    exp mrt0 off, v2, off, off done
+; GCN-NEXT:    s_setprio 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_endpgm
+  call void @llvm.amdgcn.exp.f32(i32 0, i32 2, float 1.0, float 2.0, float 0.5, float 4.0, i1 true, i1 false)
+  ret void
+}
+
+define amdgpu_gfx void @test_export_gfx(float %v) #0 {
+; GCN-LABEL: test_export_gfx:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GCN-NEXT:    v_mov_b32_e32 v1, 4.0
+; GCN-NEXT:    v_mov_b32_e32 v2, 0.5
+; GCN-NEXT:    v_mov_b32_e32 v3, 2.0
+; GCN-NEXT:    exp mrt0 off, v3, off, off done
+; GCN-NEXT:    s_setprio 0
+; GCN-NEXT:    s_waitcnt_expcnt null, 0x0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    s_setpc_b64 s[30:31]
+  call void @llvm.amdgcn.exp.f32(i32 0, i32 2, float %v, float 2.0, float 0.5, float 4.0, i1 true, i1 false)
+  ret void
+}
+
+define amdgpu_cs void @test_export_cs() #0 {
+; GCN-LABEL: test_export_cs:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    v_mov_b32_e32 v0, 4.0
+; GCN-NEXT:    v_mov_b32_e32 v1, 0.5
+; GCN-NEXT:    v_mov_b32_e32 v2, 2.0
+; GCN-NEXT:    v_mov_b32_e32 v3, 1.0
+; GCN-NEXT:    exp mrt0 off, v2, off, off done
+; GCN-NEXT:    s_endpgm
+  call void @llvm.amdgcn.exp.f32(i32 0, i32 2, float 1.0, float 2.0, float 0.5, float 4.0, i1 true, i1 false)
+  ret void
+}
+
+define amdgpu_kernel void @test_export_kernel() #0 {
+; GCN-LABEL: test_export_kernel:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    v_mov_b32_e32 v0, 4.0
+; GCN-NEXT:    v_mov_b32_e32 v1, 0.5
+; GCN-NEXT:    v_mov_b32_e32 v2, 2.0
+; GCN-NEXT:    v_mov_b32_e32 v3, 1.0
+; GCN-NEXT:    exp mrt0 off, v2, off, off done
+; GCN-NEXT:    s_endpgm
+  call void @llvm.amdgcn.exp.f32(i32 0, i32 2, float 1.0, float 2.0, float 0.5, float 4.0, i1 true, i1 false)
+  ret void
+}
+
+define amdgpu_gfx void @test_no_export_gfx(float %v) #0 {
+; GCN-LABEL: test_no_export_gfx:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GCN-NEXT:    s_setpc_b64 s[30:31]
+  ret void
+}
+
+define amdgpu_ps void @test_no_export_ps(float %v) #0 {
+; GCN-LABEL: test_no_export_ps:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_endpgm
+  ret void
+}
+
+define amdgpu_ps void @test_if_export_f32(i32 %flag, float %x, float %y, float %z, float %w) #0 {
+; GCN-LABEL: test_if_export_f32:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    s_mov_b32 s0, exec_lo
+; GCN-NEXT:    v_cmpx_ne_u32_e32 0, v0
+; GCN-NEXT:    s_cbranch_execz .LBB9_2
+; GCN-NEXT:  ; %bb.1: ; %exp
+; GCN-NEXT:    exp mrt0 v1, v2, v3, v4
+; GCN-NEXT:    s_setprio 0
+; GCN-NEXT:    s_waitcnt_expcnt null, 0x0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:  .LBB9_2: ; %end
+; GCN-NEXT:    s_endpgm
+  %cc = icmp eq i32 %flag, 0
+  br i1 %cc, label %end, label %exp
+
+exp:
+  call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float %z, float %w, i1 false, i1 false)
+  br label %end
+
+end:
+  ret void
+}
+
+define amdgpu_ps void @test_if_export_vm_f32(i32 %flag, float %x, float %y, float %z, float %w) #0 {
+; GCN-LABEL: test_if_export_vm_f32:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    s_mov_b32 s0, exec_lo
+; GCN-NEXT:    v_cmpx_ne_u32_e32 0, v0
+; GCN-NEXT:    s_cbranch_execz .LBB10_2
+; GCN-NEXT:  ; %bb.1: ; %exp
+; GCN-NEXT:    exp mrt0 v1, v2, v3, v4
+; GCN-NEXT:    s_setprio 0
+; GCN-NEXT:    s_waitcnt_expcnt null, 0x0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:  .LBB10_2: ; %end
+; GCN-NEXT:    s_endpgm
+  %cc = icmp eq i32 %flag, 0
+  br i1 %cc, label %end, label %exp
+
+exp:
+  call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float %z, float %w, i1 false, i1 true)
+  br label %end
+
+end:
+  ret void
+}
+
+define amdgpu_ps void @test_if_export_done_f32(i32 %flag, float %x, float %y, float %z, float %w) #0 {
+; GCN-LABEL: test_if_export_done_f32:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    s_mov_b32 s0, exec_lo
+; GCN-NEXT:    v_cmpx_ne_u32_e32 0, v0
+; GCN-NEXT:    s_cbranch_execz .LBB11_2
+; GCN-NEXT:  ; %bb.1: ; %exp
+; GCN-NEXT:    exp mrt0 v1, v2, v3, v4 done
+; GCN-NEXT:    s_setprio 0
+; GCN-NEXT:    s_waitcnt_expcnt null, 0x0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:  .LBB11_2: ; %end
+; GCN-NEXT:    s_endpgm
+  %cc = icmp eq i32 %flag, 0
+  br i1 %cc, label %end, label %exp
+
+exp:
+  call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float %z, float %w, i1 true, i1 false)
+  br label %end
+
+end:
+  ret void
+}
+
+define amdgpu_ps void @test_if_export_vm_done_f32(i32 %flag, float %x, float %y, float %z, float %w) #0 {
+; GCN-LABEL: test_if_export_vm_done_f32:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    s_mov_b32 s0, exec_lo
+; GCN-NEXT:    v_cmpx_ne_u32_e32 0, v0
+; GCN-NEXT:    s_cbranch_execz .LBB12_2
+; GCN-NEXT:  ; %bb.1: ; %exp
+; GCN-NEXT:    exp mrt0 v1, v2, v3, v4 done
+; GCN-NEXT:    s_setprio 0
+; GCN-NEXT:    s_waitcnt_expcnt null, 0x0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:  .LBB12_2: ; %end
+; GCN-NEXT:    s_endpgm
+  %cc = icmp eq i32 %flag, 0
+  br i1 %cc, label %end, label %exp
+
+exp:
+  call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float %z, float %w, i1 true, i1 true)
+  br label %end
+
+end:
+  ret void
+}
+
+define amdgpu_ps void @test_export_pos_before_param_across_load(i32 %idx) #0 {
+; GCN-LABEL: test_export_pos_before_param_across_load:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    buffer_load_b32 v0, v0, s[0:3], 0 offen
+; GCN-NEXT:    v_mov_b32_e32 v1, 0
+; GCN-NEXT:    v_mov_b32_e32 v2, 1.0
+; GCN-NEXT:    v_mov_b32_e32 v3, 0.5
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    exp pos0 v1, v1, v1, v0 done
+; GCN-NEXT:    exp invalid_target_32 v2, v2, v2, v2
+; GCN-NEXT:    exp invalid_target_33 v2, v2, v2, v3
+; GCN-NEXT:    s_setprio 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_endpgm
+  call void @llvm.amdgcn.exp.f32(i32 32, i32 15, float 1.0, float 1.0, float 1.0, float 1.0, i1 false, i1 false)
+  call void @llvm.amdgcn.exp.f32(i32 33, i32 15, float 1.0, float 1.0, float 1.0, float 0.5, i1 false, i1 false)
+  %load = call float @llvm.amdgcn.raw.ptr.buffer.load.f32(ptr addrspace(8) undef, i32 %idx, i32 0, i32 0)
+  call void @llvm.amdgcn.exp.f32(i32 12, i32 15, float 0.0, float 0.0, float 0.0, float %load, i1 true, i1 false)
+  ret void
+}
+
+define amdgpu_ps void @test_export_across_store_load(i32 %idx, float %v) #0 {
+; GCN-LABEL: test_export_across_store_load:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    v_mov_b32_e32 v2, 16
+; GCN-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v0
+; GCN-NEXT:    s_delay_alu instid0(VALU_DEP_2)
+; GCN-NEXT:    v_cndmask_b32_e64 v0, v2, 0, vcc_lo
+; GCN-NEXT:    v_mov_b32_e32 v2, 0
+; GCN-NEXT:    scratch_store_b32 v0, v1, off
+; GCN-NEXT:    scratch_load_b32 v0, off, off
+; GCN-NEXT:    v_mov_b32_e32 v1, 1.0
+; GCN-NEXT:    exp pos0 v2, v2, v2, v1 done
+; GCN-NEXT:    s_setprio 0
+; GCN-NEXT:    s_waitcnt_expcnt null, 0x0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    exp invalid_target_32 v0, v2, v1, v2
+; GCN-NEXT:    exp invalid_target_33 v0, v2, v1, v2
+; GCN-NEXT:    s_setprio 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_nop 0
+; GCN-NEXT:    s_endpgm
+  %data0 = alloca <4 x float>, align 8, addrspace(5)
+  %data1 = alloca <4 x float>, align 8, addrspace(5)
+  %cmp = icmp eq i32 %idx, 1
+  %data = select i1 %cmp, ptr addrspace(5) %data0, ptr addrspace(5) %data1
+  store float %v, ptr addrspace(5) %data, align 8
+  call void @llvm.amdgcn.exp.f32(i32 12, i32 15, float 0.0, float 0.0, float 0.0, float 1.0, i1 true, i1 false)
+  %load0 = load float, ptr addrspace(5) %data0, align 8
+  call void @llvm.amdgcn.exp.f32(i32 32, i32 15, float %load0, float 0.0, float 1.0, float 0.0, i1 false, i1 false)
+  call void @llvm.amdgcn.exp.f32(i32 33, i32 15, float %load0, float 0.0, float 1.0, float 0.0, i1 false, i1 false)
+  ret void
+}
+
+define amdgpu_ps void @test_export_in_callee(float %v) #0 {
+; GCN-LABEL: test_export_in_callee:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    s_getpc_b64 s[0:1]
+; GCN-NEXT:    s_add_u32 s0, s0, test_export_gfx@gotpcrel32@lo+4
+; GCN-NEXT:    s_addc_u32 s1, s1, test_export_gfx@gotpcrel32@hi+12
+; GCN-NEXT:    v_add_f32_e32 v0, 1.0, v0
+; GCN-NEXT:    s_load_b64 s[0:1], s[0:1], 0x0
+; GCN-NEXT:    s_mov_b32 s32, 0
+; GCN-NEXT:    s_waitcnt lgkmcnt(0)
+; GCN-NEXT:    s_swappc_b64 s[30:31], s[0:1]
+; GCN-NEXT:    s_endpgm
+  %x = fadd float %v, 1.0
+  call void @test_export_gfx(float %x)
+  ret void
+}
+
+define amdgpu_ps void @test_export_in_callee_prio(float %v) #0 {
+; GCN-LABEL: test_export_in_callee_prio:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    s_mov_b32 s32, 0
+; GCN-NEXT:    v_add_f32_e32 v0, 1.0, v0
+; GCN-NEXT:    s_setprio 2
+; GCN-NEXT:    s_getpc_b64 s[0:1]
+; GCN-NEXT:    s_add_u32 s0, s0, test_export_gfx@gotpcrel32@lo+4
+; GCN-NEXT:    s_addc_u32 s1, s1, test_export_gfx@gotpcrel32@hi+12
+; GCN-NEXT:    s_load_b64 s[0:1], s[0:1], 0x0
+; GCN-NEXT:    s_waitcnt lgkmcnt(0)
+; GCN-NEXT:    s_swappc_b64 s[30:31], s[0:1]
+; GCN-NEXT:    s_endpgm
+  %x = fadd float %v, 1.0
+  call void @llvm.amdgcn.s.setprio(i16 0)
+  call void @test_export_gfx(float %x)
+  ret void
+}
+
+declare void @llvm.amdgcn.exp.f32(i32, i32, float, float, float, float, i1, i1) #1
+declare void @llvm.amdgcn.exp.i32(i32, i32, i32, i32, i32, i32, i1, i1) #1
+declare float @llvm.amdgcn.raw.ptr.buffer.load.f32(ptr addrspace(8), i32, i32, i32) #2
+declare void @llvm.amdgcn.s.setprio(i16)
+
+attributes #0 = { nounwind }
+attributes #1 = { nounwind inaccessiblememonly }
+attributes #2 = { nounwind readnone }
diff --git a/llvm/test/CodeGen/AMDGPU/required-export-priority.mir b/llvm/test/CodeGen/AMDGPU/required-export-priority.mir
new file mode 100644
index 0000000000000..eee04468036e5
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/required-expo...
[truncated]

@@ -1597,20 +1603,23 @@ def FeatureISAVersion11_5_0 : FeatureSet<
   !listconcat(FeatureISAVersion11_Common.Features,
     [FeatureSALUFloatInsts,
      FeatureDPPSrc1SGPR,
-     FeatureVGPRSingleUseHintInsts])>;
+     FeatureVGPRSingleUseHintInsts,
+     FeatureRequiredExportPriority])>;
Contributor:

The bug workaround also needs to be added to the generic target.

perlfu (Contributor Author):

I don't think there is a generic target for 11.5; these don't apply to all of GFX11, only the three 11.5 target definitions.

Contributor:

There is ISAVersion11_Generic

Contributor:

gfx11-generic includes gfx115x.

I don't think there is a generic target for 11.5, these don't apply to all of 11, only the three 11.5 targets definitions.

If this is a codegen pessimization to fix a bug, and it's not incompatible with gfx110x (just makes them slower), then it needs to go in the generic target.

perlfu (Contributor Author):

Sure.

There seems to be some inconsistency with generics in general.
GFX10 has separate 10.1 and 10.3 generic targets, I guess because .0 is not defined.
The GFX9 generic target doesn't appear to follow this workaround-absorbing convention, and 9.4 doesn't have a generic target.
I guess splitting GFX11 into .0 and .5 now would not make sense if someone was already using the gfx11-generic target, as it would still need to stay defined?

Contributor:

GFX10 has separate 10.1, 10.3 generic targets

It's inconsistent because the hardware is inconsistent. Ideally they wouldn't have broken compatibility so badly between minor gfx10 revisions

Contributor:

GFX 9 generic target doesn't follow this workaround absorbing convention,

It certainly is supposed to

Contributor:

There seems to be some inconsistency with generics in general.

We have 2 families for GFX10 because GFX101x and GFX103x have incompatible ISAs due to encoding changes.

GFX10 has separate 10.1, 10.3 generic targets, I guess because .0 is not defined.

The naming convention is just:

  • If we have one family for the whole generation, just use the family name (gfx9-generic)
  • If we have more, then name each target after the earliest/lowest number in the family (gfx101x -> gfx10.1-generic)
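The naming convention described above can be sketched as a small helper (a hypothetical illustration, not an LLVM API):

```python
def generic_target_names(gen, families):
    """Return generic-target names for a GFX generation.

    gen: generation number (e.g. 9, 10).
    families: minor-version families covered by separate generic targets.
    One family -> a single generation-wide name; several families ->
    one name per family, keyed by its lowest minor number.
    """
    if len(families) == 1:
        return [f"gfx{gen}-generic"]
    return [f"gfx{gen}.{fam}-generic" for fam in sorted(families)]
```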

GFX 9 generic target doesn't follow this workaround absorbing convention

What do you mean? We should have all workarounds included in gfx9-generic, if not, that's an issue.
Can you please tell me which workaround you think is missing? I'll follow up on it.

and 9.4 doesn't have a generic target.

Generic targets are for the case where one decides that performance isn't essential (or ideal performance cannot be achieved anyway), and it just needs to work and get up and running fast (= easy to distribute). In those cases, building only 2 targets for GFX10 instead of 11 is a huge time and space saver.

It makes little to no sense for accelerators (e.g. gfx90a) to have generic targets, for multiple reasons:

  • They're designed for max performance on compute workloads. Performance >>> build times here.
  • It's pretty much impossible to make a generic target for, e.g. gfx9 that includes MI. The biggest issue IIRC is that the memory model is radically different between consumer and MI gfx9.

perlfu (Contributor Author):

GFX 9 generic target doesn't follow this workaround absorbing convention

What do you mean? We should have all workarounds included in gfx9-generic, if not, that's an issue. Can you please tell me which workaround you think is missing? I'll follow up on it.

Thanks!
On a further read, I was tripped up by the workarounds being in FeatureISAVersion9_0_Consumer_Common, which is included in FeatureISAVersion9_Generic, rather than being directly in FeatureISAVersion9_Generic.
Your explanation helps me understand that "gfx9-generic" basically means "gfx9-consumer-generic",
i.e. FeatureISAVersion9_0_Consumer_Common = FeatureISAVersion9_Generic = "gfx9-generic",
with the MI parts not being in this target, but still applying some of the same workarounds.

I had also forgotten binaries are built for multiple targets in the compute chain; hence why you don't want more targets.

@arsenm arsenm requested a review from Pierre-vh July 17, 2024 07:25
marekolsak commented:
Can this also be backported to LLVM 18?

jayfoad (Contributor) commented Jul 17, 2024

Can this also be backported to LLVM 18?

No more 18.x releases are planned: https://discourse.llvm.org/t/18-1-8-released/79725

Reduce run line complexity.
@perlfu perlfu merged commit 939a662 into llvm:main Jul 23, 2024
5 of 7 checks passed
temeraire-cx pushed a commit to chaotic-cx/mesa-mirror that referenced this pull request Jul 23, 2024
llvm/llvm-project#99273

fossil-db (gfx1150):
Totals from 73996 (93.20% of 79395) affected shaders:
Instrs: 36015357 -> 36807177 (+2.20%)
CodeSize: 189072544 -> 192238748 (+1.67%)
Latency: 245845181 -> 246790550 (+0.38%); split: -0.00%, +0.38%
InvThroughput: 45068018 -> 45116177 (+0.11%); split: -0.00%, +0.11%

Signed-off-by: Rhys Perry <[email protected]>
Reviewed-by: Daniel Schürmann <[email protected]>
Backport-to: 24.2
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30241>
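The percentage deltas in the fossil-db totals above can be re-derived from the raw before/after figures; a quick sanity check:

```python
def pct_change(old, new):
    """Percentage change between two totals, rounded to two decimals
    as in the fossil-db report above."""
    return round((new - old) / old * 100, 2)

# Figures taken from the fossil-db (gfx1150) totals in the commit message.
instrs = pct_change(36015357, 36807177)      # Instrs delta
codesize = pct_change(189072544, 192238748)  # CodeSize delta
```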
temeraire-cx pushed a commit to chaotic-cx/mesa-mirror that referenced this pull request Jul 23, 2024
llvm/llvm-project#99273

(cherry picked from commit 0919ce1)
yuxuanchen1997 pushed a commit that referenced this pull request Jul 25, 2024
Summary:
On GFX11.5 shaders having completed exports need to execute/wait at a
lower priority than shaders still executing exports.
Add code to maintain normal priority of 2 for shaders that export and
drop to priority 0 after exports.

Differential Revision: https://phabricator.intern.facebook.com/D60251077
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Feb 13, 2025
On GFX11.5 shaders having completed exports need to execute/wait at a
lower priority than shaders still executing exports.
Add code to maintain normal priority of 2 for shaders that export and
drop to priority 0 after exports.

(cherry picked from commit 939a662)
rocm-ci pushed a commit to ROCm/llvm-project that referenced this pull request Apr 11, 2025
On GFX11.5 shaders having completed exports need to execute/wait at a
lower priority than shaders still executing exports.
Add code to maintain normal priority of 2 for shaders that export and
drop to priority 0 after exports.

(cherry picked from commit 939a662)