
aarch64: matmul: Enabling variable N block sizes for jit int8 matmul #3348


Open · wants to merge 1 commit into base: main

Conversation

Shreyas-fuj
Contributor

Description

This PR introduces support for multiple block sizes for the N dimension in a matmul MxK : KxN. Previously only a block size of 24 was supported; it has now been extended to 16 and 8 as well. Currently the N block size is selected based on the N dimension: if N ≤ 16, a block size of 16 is selected, and if N ≤ 8, a block size of 8 is selected. This heuristic can be made more robust in future based on the shapes of the matrices and the number of available threads.

Checklist

General

  • [y] Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
make

98% tests passed, 4 tests failed out of 224

Total Test time (real) = 2486.52 sec

The following tests FAILED:
	172 - test_graph_unit_dnnl_large_partition_cpu (Failed)
	195 - test_benchdnn_modeC_binary_ci_cpu (Failed)
	196 - test_benchdnn_modeC_binary_different_dt_ci_cpu (Failed)
	204 - test_benchdnn_modeC_graph_ci_cpu (Failed)
Errors while running CTest
Output from these tests are in: /home/shreyas/G/shr-fuj/oneDNN_open_source/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
make: *** [Makefile:71: test] Error 8
  • [y] Have you formatted the code using clang-format?

@Shreyas-fuj Shreyas-fuj requested review from a team as code owners May 28, 2025 07:14
@github-actions github-actions bot added platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 component:api Codeowner: @oneapi-src/onednn-arch component:tests Codeowner: @oneapi-src/onednn-arch component:common labels May 28, 2025
@jondea
Contributor

jondea commented May 28, 2025

Thank you! I've not had a detailed look at the code yet, but what difference does this make to performance. Could you give some examples of benchdnn calls which make use of the new code? Also, are the new paths covered by existing ctests?

Contributor

@jondea jondea left a comment


Generally looks great, I just have a few minor questions and comments to understand the why. In general, any extra comments explaining design choices would be greatly appreciated.

@@ -72,7 +72,7 @@ struct jit_int8_matmul_utils_kernel_t : public jit_generator {
void gen_reo_a();
void gen_reo_b();
void reo_A_8x8(int, int);
void reo_B_8x24(int, int);
void reo_B_8xN(int, int);
Contributor


Could we have a short comment here explaining what N means? Also, would it be more informative to say 8N? Because we now support 8, 16 and 24?

Contributor Author


N represents the N block size: N = 8, 16 or 24.

@@ -55,7 +55,8 @@ struct dyn_params_t {
struct brg_int8_t {
int M, K, N;
const int m_blk = 8, n_blk = 4, k_blk = 8;
const int ld_block = 6, rd_block = 4, bd_block = 8;
const int rd_block = 4, bd_block = 8;
int ld_block = 6;
Contributor


Could we have a comment explaining what this is roughly? And what the units are. I.e. I think this is the number of vector registers we load in the N dimension?

Contributor Author


Yes, these are numbers of vector registers. I will add a comment here, thanks.

@@ -1696,6 +1696,12 @@ struct memory : public handle<dnnl_memory_t> {
ABcde8b16a = dnnl_ABcde8b16a,
AcdeB8b16a = dnnl_AcdeB8b16a,
AB8b8a = dnnl_AB8b8a,
abDC8d8c = dnnl_abDC8d8c,
Contributor


Out of interest, why do we need different memory formats for 8, 16 and 24? I think for similar kernels, ACL uses AB8b8a and trusts the prefetcher to pick up the 3 different streams. Is this something you considered? I'm just wondering if it would simplify the implementation.

Contributor Author

@Shreyas-fuj Shreyas-fuj May 28, 2025


Yes, ACL uses a single tag like AB8b8a and picks up the different streams for the 8-multiple block sizes.
We have followed the oneDNN style, which usually uses different reorder tags for varying block sizes (e.g. brgemm matmul). We felt this would give better memory locality (we tested the ACL way but did not get any speedup). Correct me if I am wrong. Thanks!

@Shreyas-fuj
Contributor Author

Thank you! I've not had a detailed look at the code yet, but what difference does this make to performance? Could you give some examples of benchdnn calls which make use of the new code? Also, are the new paths covered by existing ctests?

Thanks for the review @jondea! These block sizes have been enabled so that, in future, anyone who comes up with a proper heuristic to determine the N block size can add it and the code will work seamlessly.
I can give a few examples:
Memory optimization:
If the matmul is 1024x2048 : 2048x16, it makes sense to block N in 16-element chunks rather than 24, as this saves the memory that zero padding would otherwise introduce.
For this example, with N block 16, the size of the B matrix after reorder is 2048x16;
with N block 24, it is 2048x24 (zero padding included).

Thread optimization:
If the matmul is 80x112048 : 112048x48,
and we chunk M with block size 8, m chunks = 10;
if we chunk N with block size 24, n chunks = 2, total parallel work = 2x10 = 20;
if we chunk N with block size 16, n chunks = 3, total parallel work = 3x10 = 30.
So, as seen here, we can utilize more threads.

In future, if we or anyone else comes up with a heuristic, they can add it here and the code will be able to absorb it.

For tests, we also manually set ld_block=2,4,6 (which sets the N block size to 8, 16 and 24 respectively) and ran ./benchdnn --matmul --batch=test_matmul_ci. All tests passed in all configurations.

@Shreyas-fuj Shreyas-fuj force-pushed the aarch64_matmul_int8_var_nblk_sz branch from 10d2eb8 to b71fbad Compare May 28, 2025 11:59