
aarch64: matmul: Enabling variable N block sizes for jit int8 matmul #3348


Open · wants to merge 1 commit into base: main

Conversation

Shreyas-fuj
Contributor

Description

This PR introduces support for multiple block sizes for the N dimension in a matmul MxK : KxN. Previously only a block size of 24 was supported; it has now been extended to 16 and 8 as well. Currently the N block size is selected based on the N dimension: if N ≤ 16, a block size of 16 is selected, and if N ≤ 8, a block size of 8 is selected. This heuristic can be made more robust in future based on the shapes of the matrices and the number of available threads.

Checklist

General

  • [y] Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
make

98% tests passed, 4 tests failed out of 224

Total Test time (real) = 2486.52 sec

The following tests FAILED:
	172 - test_graph_unit_dnnl_large_partition_cpu (Failed)
	195 - test_benchdnn_modeC_binary_ci_cpu (Failed)
	196 - test_benchdnn_modeC_binary_different_dt_ci_cpu (Failed)
	204 - test_benchdnn_modeC_graph_ci_cpu (Failed)
Errors while running CTest
Output from these tests are in: /home/shreyas/G/shr-fuj/oneDNN_open_source/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
make: *** [Makefile:71: test] Error 8
  • [y] Have you formatted the code using clang-format?

@Shreyas-fuj Shreyas-fuj requested review from a team as code owners May 28, 2025 07:14
@github-actions github-actions bot added platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 component:api Codeowner: @oneapi-src/onednn-arch component:tests Codeowner: @oneapi-src/onednn-arch component:common labels May 28, 2025
@jondea
Contributor

jondea commented May 28, 2025

Thank you! I've not had a detailed look at the code yet, but what difference does this make to performance. Could you give some examples of benchdnn calls which make use of the new code? Also, are the new paths covered by existing ctests?

Contributor

@jondea jondea left a comment


Generally looks great, I just have a few minor questions and comments to understand the why. In general, any extra comments explaining design choices would be greatly appreciated.

@@ -72,7 +72,7 @@ struct jit_int8_matmul_utils_kernel_t : public jit_generator {
void gen_reo_a();
void gen_reo_b();
void reo_A_8x8(int, int);
void reo_B_8x24(int, int);
void reo_B_8xN(int, int);
Contributor


Could we have a short comment here explaining what N means? Also, would it be more informative to say 8N? Because we now support 8, 16 and 24?

Contributor Author


N represents the N block size: N = 8, 16 or 24.

@@ -55,7 +55,8 @@ struct dyn_params_t {
struct brg_int8_t {
int M, K, N;
const int m_blk = 8, n_blk = 4, k_blk = 8;
const int ld_block = 6, rd_block = 4, bd_block = 8;
const int rd_block = 4, bd_block = 8;
int ld_block = 6;
Contributor


Could we have a comment explaining what this is roughly? And what the units are. I.e. I think this is the number of vector registers we load in the N dimension?

Contributor Author


Yes, these are numbers of vector registers. I will add a comment here, thanks.

@@ -1696,6 +1696,12 @@ struct memory : public handle<dnnl_memory_t> {
ABcde8b16a = dnnl_ABcde8b16a,
AcdeB8b16a = dnnl_AcdeB8b16a,
AB8b8a = dnnl_AB8b8a,
abDC8d8c = dnnl_abDC8d8c,
Contributor


Out of interest, why do we need different memory formats for 8, 16 and 24? I think for similar kernels, ACL uses AB8b8a and trusts the prefetcher to pick up the 3 different streams. Is this something you considered? I'm just wondering if it would simplify the implementation.

Contributor Author

@Shreyas-fuj Shreyas-fuj May 28, 2025


Yes, ACL uses a single tag like AB8b8a and picks up the different streams for the 8-multiple block sizes.
We have followed the oneDNN style, which usually uses different reorder tags for varying block sizes (e.g. brgemm matmul). We felt this would give better memory locality (we tested the ACL way but did not get any speedup). Correct me if I am wrong. Thanks!

@Shreyas-fuj
Contributor Author

Thank you! I've not had a detailed look at the code yet, but what difference does this make to performance? Could you give some examples of benchdnn calls which make use of the new code? Also, are the new paths covered by existing ctests?

Thanks for the review @jondea! These block sizes have been enabled so that, in future, anyone who comes up with a proper heuristic to determine the N block size can add it and the code will work seamlessly.
I can give a few examples:
Memory optimization:
If the matmul is 1024x2048 : 2048x16, it makes sense to block N in 16-element chunks rather than 24, as this saves the memory that zero padding would otherwise introduce.
For this example, with N block 16, the size of the B matrix after reorder is 2048x16;
with N block 24, it is 2048x24 (zero padding included).

Thread optimization:
If the matmul is 80x112048 : 112048x48,
and we chunk M with block size 8, m chunks = 10;
if we chunk N with block size 24, n chunks = 2, total parallel work = 2x10 = 20;
if we chunk N with block size 16, n chunks = 3, total parallel work = 3x10 = 30.
So, as seen here, we can utilize more threads.

In future, if we or anyone else comes up with a heuristic, they can add it here and the code will be able to absorb it.

For tests, we also manually set ld_block=2,4,6 (which sets the N block size to 8, 16 and 24 respectively) and ran ./benchdnn --matmul --batch=test_matmul_ci. All tests passed in all configurations.

@Shreyas-fuj Shreyas-fuj force-pushed the aarch64_matmul_int8_var_nblk_sz branch from 10d2eb8 to b71fbad Compare May 28, 2025 11:59