
New pass Reduce variable liveness #3965


Draft · wants to merge 22 commits into base: main

Conversation

@mfrancepillois (Contributor) commented Apr 18, 2025

Add a new pass to reduce variable liveness by prefetching data and then moving the load op closer to its use op.
Add a test.

@mfrancepillois requested review from whitneywhtsang, etiotto and a team on April 18, 2025 10:52

@mfrancepillois (Contributor, Author) commented:

Performance improvement for FA on PVC1550:
[image]

@mfrancepillois changed the title from "Add pass: Reduce the register pressure" to "New pass Reduce register pressure" on Apr 18, 2025
@mfrancepillois linked an issue on Apr 18, 2025 that may be closed by this pull request
@mfrancepillois changed the title from "New pass Reduce register pressure" to "[Draft] New pass Reduce register pressure" on Apr 18, 2025
@mfrancepillois marked this pull request as draft on April 18, 2025 11:40
@mfrancepillois marked this pull request as ready for review on April 18, 2025 16:31
@mfrancepillois changed the title from "[Draft] New pass Reduce register pressure" to "New pass Reduce register pressure" on Apr 18, 2025
namespace {

/// Return true if the lifespan of the value \p v is considered long.
static bool isLongLifeSpanVariable(Value v, Block *useBlock) {
Contributor:

The heuristic is pretty crude. At some point I wrote an analysis to estimate the live range of a variable: https://github.com/intel/intel-xpu-backend-for-triton/blob/main/third_party/intel/include/Analysis/Liveness.h. I am wondering whether we should attempt to use that analysis to collect the live ranges of variables in a loop, and then sink variables whose live range is "too big".

Can you give that a try?

Contributor:

It is tricky to get this kind of instruction "scheduling" correct at the Triton level. I kind of feel the low-level compiler (e.g. IGC) would have all the information to schedule instructions based on register usage; it is hard to do that well at the abstraction level Triton operates at.

Contributor Author:

I recognize that at this level it is difficult to claim that we are reducing register pressure. Instead, we reduce the variable liveness, hoping that the liveness reduction will allow us to save registers. I have therefore renamed the pass accordingly.
I have also based the heuristic on the MLIR Liveness analysis.

@mfrancepillois changed the title from "New pass Reduce register pressure" to "New pass Reduce variable liveness" on Apr 24, 2025
// CHECK: triton_intel_gpu.prefetch {{.*}} : !tt.ptr<tensor<128x64xf16, #ttg.dot_op<{opIdx = 0, parent = #[[$DPAS]], kWidth = 1}>>>
// CHECK-NOT: tt.load {{.*}} : !tt.ptr<tensor<128x64xf16, #ttg.dot_op<{opIdx = 0, parent = #[[$DPAS]], kWidth = 1}>>>
%1 = tt.make_tensor_ptr %arg1, [%c0_i64, %c0_i64], [%c0_i64, %c0_i64], [%c0_i32, %c0_i32] {order = array<i32: 1, 0>} : <tensor<64x256xf16, #dot1>>
%2 = tt.load %0 {boundaryCheck = array<i32: 0, 1>} : !tt.ptr<tensor<128x64xf16, #dot0>>
Contributor:

We should not sink the load inside the loop in Triton. Sinking the load into the loop means that the value is loaded at every loop iteration, and Triton doesn't have enough information about register pressure to determine whether that is profitable.

@mfrancepillois marked this pull request as draft on April 30, 2025 16:53
Signed-off-by: Maxime France-Pillois <[email protected]>
@mfrancepillois marked this pull request as ready for review on April 30, 2025 17:46
@etiotto requested review from alexbaden and chengjunlu on May 1, 2025 19:37
@whitneywhtsang (Contributor) left a comment:

There is a loop sink pass in IGC. Can you please create an issue for the IGC team to investigate why it doesn't catch the FA case with the shape that gives the most gain?


/// Create a prefetch operation for the given load operation.
static void createPrefetchOp(tt::LoadOp loadOp) {
Operation *op = loadOp.getPtr().getDefiningOp();
Contributor:

When did we check that loadOp.getPtr() is defined by an operation? Do we need to add that check to isLoadCandidate?
Or should we add support for the case where the pointer is a region argument?

Contributor Author:

Thanks for noticing. A check has been added to isLoadCandidate.
As the pass adds a prefetch right after the defining op, I'm concerned that adding this prefetch in another region (in the case where the load pointer has been defined in another region) could have side effects on the cache, as an early data fetch could evict data that are still needed.

Contributor:

Do we care about the case where the pointer comes directly from a function argument?

@chengjunlu (Contributor) commented May 6, 2025

It is good to have the reduce-variable-liveness pass as a starting point for liveness optimization in the Triton middle end. This PR looks good to me as a beginning.

The optimization relies on the cache holding the values that we may reuse in the loop, but the cache system is not fully controllable by the program. It would be better to enhance the pass with the use of shared local memory, making it somewhat like a RegisterToMem pass for the general case.

@etiotto (Contributor) commented May 6, 2025

@mfrancepillois can you do a Triton Benchmark run with this PR to identify improvements (or degradations, hopefully none) in all the microbenchmarks we have?

@mfrancepillois marked this pull request as draft on May 12, 2025 13:11
Operation *forOp) {
  // Only pointers to tensors are considered to be moved
- if (!mlir::triton::isTensorPointerType(loadOp.getPtr().getType()))
+ if (!mlir::triton::isTensorOrTensorPointerType(loadOp.getPtr().getType()))
Contributor:

[optional]

Suggested change:
- if (!mlir::triton::isTensorOrTensorPointerType(loadOp.getPtr().getType()))
+ if (!mlir::triton::isTensorPointerType(loadOp.getResult().getType()))

// Multiple users
if (any_of(loadOp->getUsers(), [&](Operation *user) {
return ((user->getBlock() == forOp->getBlock()) &&
user->isBeforeInBlock(forOp));
Contributor:

What does user->isBeforeInBlock(forOp) mean?
Does user->getBlock() == forOp->getBlock() mean that the user is part of the loop?

Development

Successfully merging this pull request may close these issues.

[FA performance] Improve the Q matrix load strategy
4 participants