Combine parallel dense Optimization pass in ONNX Dialect #3123

Open
wants to merge 23 commits into main

Conversation

@Arkar-Hema (Contributor)

Combine Parallel Dense

CombineParallelDense is an optimization pass that merges multiple parallel ONNXGemmOp (Dense/Fully Connected) operations into a single, larger Gemm. This reduces per-op overhead and redundant reads of the shared input, improves memory efficiency, and enhances hardware utilization.

The pass identifies Dense (Gemm) operations that:

  • Share the same input tensor.
  • Have identical alpha, beta, transA, and transB attributes (ensuring compatibility).
  • May have different output dimensions (number of neurons) but maintain compatible weight shapes for concatenation.
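
For illustration, a compatibility check along these lines could be written against the MLIR C++ API as follows (a condensed sketch; the helper name and exact conditions are assumptions for exposition, not necessarily the code in this PR):

static bool areGemmsCompatible(ONNXGemmOp a, ONNXGemmOp b) {
  // Both Gemms must consume the same input tensor A.
  if (a.getA() != b.getA())
    return false;
  // alpha, beta, transA, and transB must match exactly; comparing the
  // attributes avoids floating-point comparison subtleties.
  if (a.getAlphaAttr() != b.getAlphaAttr() ||
      a.getBetaAttr() != b.getBetaAttr() ||
      a.getTransA() != b.getTransA() || a.getTransB() != b.getTransB())
    return false;
  // Weights must have static shapes and agree on the reduction (K) axis so
  // they can be concatenated along the output-channel axis.
  auto aB = mlir::cast<ShapedType>(a.getB().getType());
  auto bB = mlir::cast<ShapedType>(b.getB().getType());
  if (!aB.hasStaticShape() || !bB.hasStaticShape())
    return false;
  int64_t kAxis = a.getTransB() ? 1 : 0;
  return aB.getShape()[kAxis] == bB.getShape()[kAxis];
}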

Let's assume an example input:

  • Input Shape: (1, 512)
  • Dense A: out_features = 256
  • Dense B: out_features = 128
  • Dense C: out_features = 64
  • Attributes: transB = 0, alpha = 1.0, beta = 1.0

Before Optimization (Three Parallel Gemms)

  • Each Gemm performs one full matrix multiplication (1×512 × 512×N).
  • Three separate weight and bias tensors produce three separate outputs.
    • Memory reads: the full input is read three times (once per Gemm).
  • Post-processing: a Concat(axis=1) merges them into one output Y (1×448).

After Optimization (Combined Dense)

  • Total Output Features: 256 + 128 + 64 = 448
  • All weights are concatenated along the output-channel axis → new weight shape: (512, 448)
  • Biases are also concatenated
  • A single ONNXGemmOp computes Y (1×448) directly
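
The rewrite is exact because concatenation along the output-channel axis commutes with the matrix multiply: concat(X·W_A + b_A, X·W_B + b_B) = X·concat(W_A, W_B) + concat(b_A, b_B). A small self-contained C++ check of this identity, with sizes scaled down from the example above (all names illustrative):

#include <cassert>
#include <cstdio>
#include <vector>

// y = x * w + bias, with x of shape (1 x k) and w of shape (k x n), row-major.
static std::vector<float> gemm(const std::vector<float> &x,
    const std::vector<float> &w, const std::vector<float> &bias, int k, int n) {
  std::vector<float> y(n);
  for (int j = 0; j < n; ++j) {
    float acc = bias[j];
    for (int i = 0; i < k; ++i)
      acc += x[i] * w[i * n + j];
    y[j] = acc;
  }
  return y;
}

int main() {
  const int k = 8, nA = 4, nB = 2; // scaled-down stand-ins for 512/256/128
  std::vector<float> x(k), wA(k * nA), wB(k * nB), bA(nA, 0.5f), bB(nB, -1.0f);
  for (int i = 0; i < k; ++i) x[i] = 0.1f * i;
  for (size_t i = 0; i < wA.size(); ++i) wA[i] = 0.01f * i;
  for (size_t i = 0; i < wB.size(); ++i) wB[i] = -0.02f * i;

  // Concatenate weights along the output-channel axis: (k x nA) and (k x nB)
  // become one (k x (nA + nB)) matrix; biases are concatenated likewise.
  std::vector<float> wAB(k * (nA + nB));
  for (int i = 0; i < k; ++i) {
    for (int j = 0; j < nA; ++j) wAB[i * (nA + nB) + j] = wA[i * nA + j];
    for (int j = 0; j < nB; ++j) wAB[i * (nA + nB) + nA + j] = wB[i * nB + j];
  }
  std::vector<float> bAB(bA);
  bAB.insert(bAB.end(), bB.begin(), bB.end());

  auto yA = gemm(x, wA, bA, k, nA);         // Dense A
  auto yB = gemm(x, wB, bB, k, nB);         // Dense B
  auto yAB = gemm(x, wAB, bAB, k, nA + nB); // combined Dense

  // Concat(yA, yB) must equal the combined Gemm's output bit-for-bit, since
  // each output element is computed by the same sequence of operations.
  for (int j = 0; j < nA; ++j) assert(yA[j] == yAB[j]);
  for (int j = 0; j < nB; ++j) assert(yB[j] == yAB[nA + j]);
  std::printf("combined Gemm matches concatenated outputs\n");
  return 0;
}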

Improvement in performance metrics

Latency Improvement: 7-15%
Throughput Improvement: 8-14%
Memory Usage Improvement: 10-12%

@jenkins-droid (Collaborator)

Can one of the admins verify this patch?

Signed-off-by: Arkar-Hema <[email protected]>

@tungld (Collaborator) commented Apr 17, 2025

@Arkar-Hema A general question: in what kind of models have you seen this kind of pattern: multiple Gemm ops followed by a Concat op? and also similar patterns you have recently created PRs for? Just curious on how practical it is. Thanks!

@Arkar-Hema (Contributor, Author)

> @Arkar-Hema A general question: in what kind of models have you seen this kind of pattern: multiple Gemm ops followed by a Concat op? and also similar patterns you have recently created PRs for? Just curious on how practical it is. Thanks!

  • Models with the CombineParallelDense pattern (#3123):
    These contain multiple Gemm ops, though not always followed by a Concat. I added the Concat condition to the pass so it still handles those cases gracefully when present. Some models with this pattern include:
  1. Bertsquad-8
  2. Bertsquad-10
  3. Bertsquad-12
  4. FasterRCNN-10
  5. ResNet101-DUC-12
  6. ResNet101-DUC-7
  7. emotion-ferplus models
  8. caffenet models
  9. Densenet models
  10. googlenet models
  11. inception models
  12. rcnn-ilsvrc13 models
  13. resnet models
  14. vgg models
  15. retinanet models
  16. version-RFB-320
  17. version-RFB-640
  18. squeezenet models

@Arkar-Hema (Contributor, Author)

@tungld could you please verify this patch?

@tungld (Collaborator) commented Apr 22, 2025

@Arkar-Hema thank you for the information!

I have some general comments:

  • I think that when multiple GEMM ops are followed by a Concat, the performance should in theory be better. But could you run with multiple input sizes to see what the performance benefit is in practice?
  • When multiple GEMM ops are NOT followed by a Concat (this is the case for the models you listed), you need a Split, and I think the split axis is the innermost dimension. I am not sure how slow the Split is and whether we can still get a speedup. Could you do a performance comparison to see if you can achieve a speedup in this case? (See the sketch of the splitting step after this comment.)
  • Are you targeting this optimization at CPUs, or is it beneficial for AI accelerators as well, given that accelerators may use special data layouts that are not convenient for Concat or Split?

Thanks.
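
For context on the Split concern: splitting on the innermost axis makes each output row a contiguous slice of the combined row, so in plain C++ terms the step looks like the following sketch (hypothetical shapes and names; the pass itself would emit an ONNXSplitOp rather than explicit loops):

#include <algorithm>
#include <vector>

// Split a row-major (rows x (nA + nB)) combined result back into
// (rows x nA) and (rows x nB). On the innermost axis each piece of a row
// is a contiguous slice, so the cost is one sequential pass over the data.
static void splitLastAxis(const std::vector<float> &y, int rows, int nA,
    int nB, std::vector<float> &yA, std::vector<float> &yB) {
  yA.resize(static_cast<size_t>(rows) * nA);
  yB.resize(static_cast<size_t>(rows) * nB);
  for (int r = 0; r < rows; ++r) {
    const float *row = &y[static_cast<size_t>(r) * (nA + nB)];
    std::copy(row, row + nA, &yA[static_cast<size_t>(r) * nA]);
    std::copy(row + nA, row + nA + nB, &yB[static_cast<size_t>(r) * nB]);
  }
}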

@Arkar-Hema (Contributor, Author)

I ran performance benchmarks across a range of input sizes for both the GEMM → Concat and the Combined GEMM → Split cases. Results show that:

  • In the Concat case, the optimization provides a consistent latency improvement of 2-7% and a throughput improvement of 1-5%.

  • In the Split case, the optimization provides a consistent latency improvement of 1-7% and a throughput improvement of 1-8%.

  • I’ve currently targeted this pass at CPU backends only.

@tungld (Collaborator) left a comment


Thanks @Arkar-Hema for the experiments! Did you compile your programs with -O3?

Since this parallel fusion may not work for accelerators, could you create a compile option to enable this if needed, for example -fuse-parallel-onnx-gemm?

I don't think you need to handle the case where there is a concat after multiple gemms. Just emit a split op, then later you can write a simple canonicalization rule for concat to fuse Split -> Concat.

Below are my first-round comments; most of them are about simplifying the code and making it easier to follow. The important thing, however, is that you need to check the input C carefully because it is broadcastable.
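
The suggested Split -> Concat fusion could be a small canonicalization pattern roughly like the sketch below (the accessor names follow onnx-mlir's generated op interfaces as I understand them; the exact rule in the repository may differ):

// Sketch: fold Concat(Split(x)) -> x when the Concat consumes every result
// of one Split exactly once, in order, along the same axis.
LogicalResult matchAndRewrite(
    ONNXConcatOp concatOp, PatternRewriter &rewriter) const {
  auto splitOp = concatOp->getOperand(0).getDefiningOp<ONNXSplitOp>();
  if (!splitOp || splitOp.getAxis() != concatOp.getAxis())
    return failure();
  if (concatOp->getNumOperands() != splitOp->getNumResults())
    return failure();
  for (unsigned i = 0, e = concatOp->getNumOperands(); i < e; ++i)
    if (concatOp->getOperand(i) != splitOp->getResult(i))
      return failure();
  rewriter.replaceOp(concatOp, splitOp.getInput());
  return success();
}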


@Arkar-Hema (Contributor, Author)

> Since this parallel fusion may not work for accelerators, could you create a compile option to enable this if needed, for example -fuse-parallel-onnx-gemm?

I have added it, thanks.

Arkar-Hema added 3 commits May 2, 2025 05:00
Signed-off-by: Arkar-Hema <[email protected]>
Signed-off-by: Arkar-Hema <[email protected]>
Signed-off-by: Arkar-Hema <[email protected]>

Signed-off-by: Arkar-Hema <[email protected]>

@AlexandreEichenberger (Collaborator)

@jenkins-droid test this please

// Only rank-1 bias (C) tensors are considered for merging.
auto aCShape = mlir::cast<ShapedType>(aC.getType()).getShape();
auto bCShape = mlir::cast<ShapedType>(bC.getType()).getShape();
if (aCShape.size() != 1 || bCShape.size() != 1)
  return false;
Collaborator

It seems you allow the case where aCShape is tensor<1xf32> and bCShape is tensor<5xf32> (5 is just an example to say it is not 1) and vice versa, but I don't see in the following code how you handle it. In this case, we need to broadcast aC to tensor<5xf32> before concatenating it with bC to make the ConcatOp valid.

It's up to you to support this case or not, but if you do, please add a lit test. Otherwise, check it and return false here. Thanks!

Collaborator

@Arkar-Hema please explain how you solved this comment, given that you marked it "solved"?

Contributor Author

In the current implementation, I decided not to support the case where the bias shapes are different (e.g., tensor<1xf32> and tensor<5xf32>) since our concat operation would require explicit broadcasting to align the shapes before concatenation, and handling this properly would add additional complexity.

To handle this, I’ve added a check in areCompatible() to return false if the bias shapes differ in size when both biases are present - ensuring that we only merge Gemms where both bias tensors have the same shape. This preserves the correctness of the concat operation without requiring extra broadcasting logic.

Collaborator

Thank you!

I see you checked aCShape[0] != bCShape[0]. This is only valid if the shapes are static, e.g. tensor<5xf32>, but it does not work if the shapes are dynamic, e.g. both aC and bC have a shape of tensor<?xf32>.

In the dynamic case, aCShape[0] == bCShape[0] at compile time, but at runtime aC can be tensor<1xf32> and bC can be tensor<5xf32>, for example.

Collaborator

For the static case, I wonder how you handle the following case where both aC and bC have a shape of tensor<1xf32>. For example:

  • gemm1: A: tensor<5x8x16xf32>, B: tensor<16x32xf32>, C: tensor<1xf32>
  • gemm2: A: tensor<5x8x16xf32>, B: tensor<16x32xf32>, C: tensor<1xf32>

They satisfy your conditions here, so how do you combine them?

Contributor Author

Thanks for the thorough review and the great questions!

I’ve updated the areCompatible() function to properly handle the edge cases you pointed out:

Dynamic Bias Shapes:
If either of the bias tensors has a dynamic shape at dimension 0 (i.e., tensor<?xf32>), I now conservatively return false since we can’t guarantee at compile time whether they’ll match or require broadcasting at runtime.

Both Biases as tensor<1xf32>:
If both biases are of shape tensor<1xf32>, I now check their corresponding Gemm output shapes and ensure their output channels (last dimension) match before considering them compatible. If they differ, the function returns false, as merging them without this check would be invalid.

This ensures that both static and dynamic cases are handled correctly and conservatively avoids undefined behavior at runtime.
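
In code, the checks described above might look roughly like this (a sketch against the MLIR shape API; aGemm, bGemm, aC, and bC are assumed names, and this is not necessarily the exact code in the PR):

// Hypothetical excerpt of areCompatible(); aC/bC are the bias (C) operands
// and aGemm/bGemm the two candidate Gemm ops.
auto aCType = mlir::cast<ShapedType>(aC.getType());
auto bCType = mlir::cast<ShapedType>(bC.getType());
if (aCType.getRank() != 1 || bCType.getRank() != 1)
  return false;
// A dynamic size cannot be proven equal at compile time: be conservative.
if (aCType.isDynamicDim(0) || bCType.isDynamicDim(0))
  return false;
if (aCType.getDimSize(0) != bCType.getDimSize(0))
  return false;
// A tensor<1xf32> bias broadcasts over the output channels, so two such
// biases are merged only when both Gemms produce the same output width.
if (aCType.getDimSize(0) == 1) {
  auto aOut = mlir::cast<ShapedType>(aGemm.getY().getType());
  auto bOut = mlir::cast<ShapedType>(bGemm.getY().getType());
  if (aOut.getShape().back() != bOut.getShape().back())
    return false;
}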

Signed-off-by: Arkar-Hema <[email protected]>

Signed-off-by: Arkar-Hema <[email protected]>

@AlexandreEichenberger (Collaborator)

@jenkins-droid test this please

@tungld (Collaborator) commented May 12, 2025

Hi @Arkar-Hema When addressing a comment, could you please provide a brief explanation of how you did so? This will make the review process easier. Thanks!

Signed-off-by: Arkar-Hema <[email protected]>
Signed-off-by: Arkar-Hema <[email protected]>