[LSV] Insert casts to vectorize mismatched types #134436

gandhi56 · 2025-04-04T19:06:50Z

After collecting equivalence classes, loop over each distinct pair of them and check if they could be merged into one.

Consider classes A and B such that their leaders differ only by their scalar bitwidths. We do not yet merge them otherwise. Let N be the scalar bitwidth of the leader instruction in A. Iterate over all instructions in B and ensure their total bitwidths match the total bitwidth of the leader instruction of A. Finally, cast each instruction in B with a mismatched type to a pointer, integer or floating point type.

Resolve issue #97715

llvmbot · 2025-04-04T19:07:25Z

@llvm/pr-subscribers-backend-nvptx
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-vectorizers

Author: Anshil Gandhi (gandhi56)

Changes

After collecting equivalence classes, loop over each distinct pair of them and check if they could be merged into one.

Consider classes A and B such that their leaders differ only by their scalar bitwidth. (We do not merge them otherwise.) Let N be the scalar bitwidth of the leader instruction in A. Iterate over all instructions in B and ensure their total bitwidths match the total bitwidth of the leader instruction of A. Finally, cast each instruction in B with a mismatched type to an intN type.

Full diff: https://github.com/llvm/llvm-project/pull/134436.diff

2 Files Affected:

(modified) llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp (+81-1)
(modified) llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-vectors.ll (+27-3)

diff --git a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
index 04b392829f0d7..c94f10fb8b855 100644
--- a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
@@ -324,6 +324,10 @@ class Vectorizer {
       Instruction *ChainElem, Instruction *ChainBegin,
       const DenseMap<Instruction *, APInt /*OffsetFromLeader*/> &ChainOffsets);
 
+  /// Merge the equivalence classes if casts could be inserted in one to match
+  /// the scalar bitwidth of the instructions in the other class.
+  void insertCastsToMergeClasses(EquivalenceClassMap &EQClasses);
+
   /// Merges the equivalence classes if they have underlying objects that differ
   /// by one level of indirection (i.e., one is a getelementptr and the other is
   /// the base pointer in that getelementptr).
@@ -1310,6 +1314,82 @@ std::optional<APInt> Vectorizer::getConstantOffsetSelects(
   return std::nullopt;
 }
 
+void Vectorizer::insertCastsToMergeClasses(EquivalenceClassMap &EQClasses) {
+  if (EQClasses.size() < 2)
+    return;
+
+  // Loop over all equivalence classes and try to merge them. Keep track of
+  // classes that are merged into others.
+  DenseSet<EqClassKey> ClassesToErase;
+  for (auto EC1 : EQClasses) {
+    for (auto EC2 : EQClasses) {
+      if (ClassesToErase.contains(EC2.first) || EC1 <= EC2)
+        continue;
+
+      auto [Ptr1, AS1, TySize1, IsLoad1] = EC1.first;
+      auto [Ptr2, AS2, TySize2, IsLoad2] = EC2.first;
+
+      // Attempt to merge EC2 into EC1. Skip if the pointers, address spaces or
+      // whether the leader instruction is a load/store are different. Also skip
+      // if the scalar bitwidth of the first equivalence class is smaller than
+      // the second one to avoid reconsidering the same equivalence class pair.
+      if (Ptr1 != Ptr2 || AS1 != AS2 || IsLoad1 != IsLoad2 || TySize1 < TySize2)
+        continue;
+
+      // Ensure all instructions in EC2 can be bitcasted into NewTy.
+      /// TODO: NewTyBits is needed as stuctured binded variables cannot be
+      /// captured by a lambda until C++20.
+      auto NewTyBits = std::get<2>(EC1.first);
+      if (any_of(EC2.second, [&](Instruction *I) {
+            return DL.getTypeSizeInBits(getLoadStoreType(I)) != NewTyBits;
+          }))
+        continue;
+
+      // Create a new type for the equivalence class.
+      /// TODO: NewTy should be an FP type for an all-FP equivalence class.
+      auto *NewTy = Type::getIntNTy(EC2.second[0]->getContext(), NewTyBits);
+      for (auto *Inst : EC2.second) {
+        auto *Ptr = getLoadStorePointerOperand(Inst);
+        auto *OrigTy = Inst->getType();
+        if (OrigTy == NewTy)
+          continue;
+        if (auto *LI = dyn_cast<LoadInst>(Inst)) {
+          Builder.SetInsertPoint(LI->getIterator());
+          auto *NewLoad = Builder.CreateLoad(NewTy, Ptr);
+          auto *Cast = Builder.CreateBitOrPointerCast(
+              NewLoad, OrigTy, NewLoad->getName() + ".cast");
+          LI->replaceAllUsesWith(Cast);
+          LI->eraseFromParent();
+          EQClasses[EC1.first].emplace_back(NewLoad);
+        } else {
+          auto *SI = cast<StoreInst>(Inst);
+          Builder.SetInsertPoint(SI->getIterator());
+          auto *Cast = Builder.CreateBitOrPointerCast(
+              SI->getValueOperand(), NewTy,
+              SI->getValueOperand()->getName() + ".cast");
+          auto *NewStore = Builder.CreateStore(
+              Cast, getLoadStorePointerOperand(SI), SI->isVolatile());
+          SI->eraseFromParent();
+          EQClasses[EC1.first].emplace_back(NewStore);
+        }
+      }
+
+      // Sort the instructions in the equivalence class by their order in the
+      // basic block. This is important to ensure that the instructions are
+      // vectorized in the correct order.
+      std::sort(EQClasses[EC1.first].begin(), EQClasses[EC1.first].end(),
+                [](Instruction *A, Instruction *B) {
+                  return A && B && A->comesBefore(B);
+                });
+      ClassesToErase.insert(EC2.first);
+    }
+  }
+
+  // Erase the equivalence classes that were merged into others.
+  for (auto Key : ClassesToErase)
+    EQClasses.erase(Key);
+}
+
 void Vectorizer::mergeEquivalenceClasses(EquivalenceClassMap &EQClasses) const {
   if (EQClasses.size() < 2) // There is nothing to merge.
     return;
@@ -1495,7 +1575,7 @@ Vectorizer::collectEquivalenceClasses(BasicBlock::iterator Begin,
          /*IsLoad=*/LI != nullptr}]
         .emplace_back(&I);
   }
-
+  insertCastsToMergeClasses(Ret);
   mergeEquivalenceClasses(Ret);
   return Ret;
 }
diff --git a/llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-vectors.ll b/llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-vectors.ll
index ede2e4066c263..c364bc2da4c5d 100644
--- a/llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-vectors.ll
+++ b/llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-vectors.ll
@@ -95,10 +95,10 @@ entry:
   ret void
 }
 
-; Ideally this would be merged
 ; CHECK-LABEL: @merge_load_i32_v2i16(
-; CHECK: load i32,
-; CHECK: load <2 x i16>
+; CHECK: load <2 x i32>
+; CHECK: extractelement <2 x i32> %0, i32 0
+; CHECK: extractelement <2 x i32> %0, i32 1
 define amdgpu_kernel void @merge_load_i32_v2i16(ptr addrspace(1) nocapture %a) #0 {
 entry:
   %a.1 = getelementptr inbounds i32, ptr addrspace(1) %a, i32 1
@@ -111,3 +111,27 @@ entry:
 
 attributes #0 = { nounwind }
 attributes #1 = { nounwind readnone }
+
+; CHECK-LABEL: @merge_i32_2i16_float_4i8(
+; CHECK: load <4 x i32>
+; CHECK: store <2 x i32>
+; CHECK: store <2 x i32>
+define void @merge_i32_2i16_float_4i8(ptr addrspace(1) %ptr1, ptr addrspace(2) %ptr2) {
+  %gep1 = getelementptr inbounds i32, ptr addrspace(1) %ptr1, i64 0
+  %load1 = load i32, ptr addrspace(1) %gep1, align 4
+  %gep2 = getelementptr inbounds <2 x i16>, ptr addrspace(1) %ptr1, i64 1
+  %load2 = load <2 x i16>, ptr addrspace(1) %gep2, align 4
+  %gep3 = getelementptr inbounds float, ptr addrspace(1) %ptr1, i64 2
+  %load3 = load float, ptr addrspace(1) %gep3, align 4
+  %gep4 = getelementptr inbounds <4 x i8>, ptr addrspace(1) %ptr1, i64 3
+  %load4 = load <4 x i8>, ptr addrspace(1) %gep4, align 4
+  %store.gep1 = getelementptr inbounds i32, ptr addrspace(2) %ptr2, i64 0
+  store i32 %load1, ptr addrspace(2) %store.gep1, align 4
+  %store.gep2 = getelementptr inbounds <2 x i16>, ptr addrspace(2) %ptr2, i64 1
+  store <2 x i16> %load2, ptr addrspace(2) %store.gep2, align 4
+  %store.gep3 = getelementptr inbounds float, ptr addrspace(2) %ptr2, i64 2
+  store float %load3, ptr addrspace(2) %store.gep3, align 4
+  %store.gep4 = getelementptr inbounds <4 x i8>, ptr addrspace(2) %ptr2, i64 3
+  store <4 x i8> %load4, ptr addrspace(2) %store.gep4, align 4
+  ret void
+}

github-actions · 2025-04-09T19:23:22Z

✅ With the latest revision this PR passed the C/C++ code formatter.

llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp

arsenm · 2025-04-13T08:28:31Z

llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp

+        if (NewTyBits == 16)
+          NewTy = Type::getHalfTy(Ctx);
+        else if (NewTyBits == 32)
+          NewTy = Type::getFloatTy(Ctx);
+        else if (NewTyBits == 64)
+          NewTy = Type::getDoubleTy(Ctx);


Really should avoid doing bitwidth -> fptype. This doesn't consider bfloat

So an all-FP class should only be merged with another all-FP class, correct?

I think it's best to avoid casting to bfloat as the bit representation would be different from any-type to bfloat.

This is only performing no-op casts, the interpretation of the type does not matter. The specific FP type or not doesn't matter. If the different pieces have a common element type, you can use that element type to produce a tidier final result with minimized casting

llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp

llvm/test/CodeGen/AMDGPU/bitop3.ll

llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-vectors.ll

This commit adds tests to introduce bitcasts to vectorize loads and stores.

This commit adds tests to introduce bitcasts for increased vectorization of loads and stores. NFC.

After collecting equivalence classes, loop over each distinct pair of them and check if they could be merged into one. Consider classes A and B such that their leaders differ only by their scalar bitwidths. We do not yet merge them otherwise. Let N be the scalar bitwidth of the leader instruction in A. Iterate over all instructions in B and ensure their total bitwidths match the total bitwidth of the leader instruction of A. Finally, cast each instruction in B with a mismatched type to a pointer, integer or floating point type. Resolve issue llvm#97715 Change-Id: Ib64fd98de5c908262947648ad14dc53b61814642

arsenm · 2025-05-01T13:53:21Z

llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp

+        if (NewTyBits == 16)
+          NewTy = Type::getHalfTy(Ctx);
+        else if (NewTyBits == 32)
+          NewTy = Type::getFloatTy(Ctx);
+        else if (NewTyBits == 64)
+          NewTy = Type::getDoubleTy(Ctx);


This is only performing no-op casts, the interpretation of the type does not matter. The specific FP type or not doesn't matter. If the different pieces have a common element type, you can use that element type to produce a tidier final result with minimized casting

arsenm · 2025-05-01T13:53:44Z

llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp

+        NewTy = PointerType::get(Ctx, AS2);
+      }
+
+      for (auto *Inst : EC2.second) {


arsenm · 2025-05-01T13:54:27Z

llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp

+      if (Ptr1 != Ptr2 || AS1 != AS2 || IsLoad1 != IsLoad2 || TySize1 < TySize2)
+        continue;
+
+      // An All-FP class should only be merged into another All-FP class.


FPNess isn't a fundamentally interesting property

arsenm · 2025-05-01T13:55:14Z

llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp

+    auto CommonTypeKind = [](Instruction *I) {
+      if (I->getType()->isIntOrIntVectorTy())
+        return 0;
+      if (I->getType()->isFPOrFPVectorTy())
+        return 1;
+      if (I->getType()->isPtrOrPtrVectorTy())
+        return 2;
+      return -1; // Invalid type kind
+    };


Lambda overuse, move to separate utility

arsenm · 2025-05-01T13:56:13Z

llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp

+  auto CopyMetaDataFromTo = [&](Instruction *Src, Instruction *Dst) {
+    SmallVector<std::pair<unsigned, MDNode *>, 4> MD;
+    Src->getAllMetadata(MD);
+    for (const auto [ID, Node] : MD) {
+      Dst->setMetadata(ID, Node);
+    }
+  };


Move to separate utility function, this is also not correct in general. It's not legal to blindly preserve all the metadata. You need something like copyMetadataForLoad

arsenm · 2025-05-01T13:59:52Z

llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp

+              SI->getValueOperand()->getName() + ".cast");
+          auto *NewStore = Builder.CreateStore(
+              Cast, getLoadStorePointerOperand(SI), SI->isVolatile());
+          CopyMetaDataFromTo(SI, NewStore);


Somehow we don't seem to have the corresponding copyMetadataForStore

arsenm · 2025-05-01T14:00:57Z

llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp

@@ -1495,7 +1629,7 @@ Vectorizer::collectEquivalenceClasses(BasicBlock::iterator Begin,
         /*IsLoad=*/LI != nullptr}]
        .emplace_back(&I);
  }
-
+  insertCastsToMergeClasses(Ret);


This is eagerly mutating the IR before vectorization is performed? Should try to only select a type, and coerce as part of the final vectorization

After merging equivalence classes, LSV converts them into chains at which point it is too late to introduce cast instructions.

Why is it too late to introduce cast instructions at that point?

Theoretically, it may be possible. The necessary changes seem cumbersome to me, however. After collecting classes, LSV extracts chains from each class. These chains are split based on contiguity, alignment and MayAlias instructions, before vectorization. Merging chains after these splits would require careful handling of their instructions as vectorizeChain makes certain assumptions before determining the type of vectorized load/store.

I prefer to insert casts alongside mergeEquivalenceClasses(..) as gatherChains already understands what kinds of chains are handlable by vectorizeChain.

It's important to not make pointless IR changes, and only do this if it vectorization will occur

That's a good point, I will postpone the merging of class until vectorization then. Thanks!

llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-vectors.ll

Related PR: #134436

Autogenerate checks for merge-vectors.ll and introduce merge-vectors-complex.ll with mismatched types. Related PR: #134436 This is a reland of #138155, which was reverted due to missed nits.

llvmbot added backend:AMDGPU vectorizers llvm:transforms labels Apr 4, 2025

gandhi56 mentioned this pull request Apr 4, 2025

[LSV] Load Store Vectorizer failed to combine two related loads. #97715

Open

gandhi56 requested review from arsenm, jayfoad, shiltian and cdevadas April 5, 2025 09:19

gandhi56 force-pushed the issue-97715/lsv-insert-casts branch 3 times, most recently from 87f500d to a37da6c Compare April 9, 2025 19:20

gandhi56 force-pushed the issue-97715/lsv-insert-casts branch from a37da6c to 9d03603 Compare April 9, 2025 23:23

llvmbot added the backend:NVPTX label Apr 9, 2025

gandhi56 requested a review from broxigarchen April 10, 2025 18:24

arsenm reviewed Apr 13, 2025

View reviewed changes

gandhi56 self-assigned this Apr 13, 2025

gandhi56 added 2 commits April 28, 2025 10:02

[NFC][LSV] Precommit tests

091d35a

This commit adds tests to introduce bitcasts to vectorize loads and stores.

[LSV] Precommit tests

d3097b7

This commit adds tests to introduce bitcasts for increased vectorization of loads and stores. NFC.

gandhi56 force-pushed the issue-97715/lsv-insert-casts branch 2 times, most recently from b6c0469 to 706665f Compare April 29, 2025 02:36

gandhi56 force-pushed the issue-97715/lsv-insert-casts branch from 706665f to 3024610 Compare April 30, 2025 19:38

Merge branch 'main' into issue-97715/lsv-insert-casts

f32cd6d

arsenm reviewed May 1, 2025

View reviewed changes

gandhi56 mentioned this pull request May 1, 2025

[NFC] Precommit: Autogenerate checks for an LSV test #138155

Merged

gandhi56 added a commit that referenced this pull request May 1, 2025

[NFC] Precommit: Autogenerate checks for an LSV test (#138155)

0e9740e

Related PR: #134436

gandhi56 mentioned this pull request May 1, 2025

[NFC] Precommit tests for an LSV patch #138167

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[LSV] Insert casts to vectorize mismatched types #134436

[LSV] Insert casts to vectorize mismatched types #134436

gandhi56 commented Apr 4, 2025 •

edited

Loading

llvmbot commented Apr 4, 2025 •

edited

Loading

github-actions bot commented Apr 9, 2025 •

edited

Loading

arsenm Apr 13, 2025

gandhi56 Apr 13, 2025

gandhi56 Apr 25, 2025

arsenm May 1, 2025

arsenm May 1, 2025

arsenm May 1, 2025

arsenm May 1, 2025

arsenm May 1, 2025

arsenm May 1, 2025

arsenm May 1, 2025

arsenm May 1, 2025

gandhi56 May 2, 2025

nikic May 2, 2025

gandhi56 May 5, 2025

arsenm May 5, 2025

gandhi56 May 5, 2025

[LSV] Insert casts to vectorize mismatched types #134436

Are you sure you want to change the base?

[LSV] Insert casts to vectorize mismatched types #134436

Conversation

gandhi56 commented Apr 4, 2025 • edited Loading

llvmbot commented Apr 4, 2025 • edited Loading

github-actions bot commented Apr 9, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gandhi56 commented Apr 4, 2025 •

edited

Loading

llvmbot commented Apr 4, 2025 •

edited

Loading

github-actions bot commented Apr 9, 2025 •

edited

Loading