-
Notifications
You must be signed in to change notification settings - Fork 1k
aarch64: matmul: Enabling variable N block sizes for jit int8 matmul #3348
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
aarch64: matmul: Enabling variable N block sizes for jit int8 matmul #3348
Conversation
Thank you! I've not had a detailed look at the code yet, but what difference does this make to performance. Could you give some examples of benchdnn calls which make use of the new code? Also, are the new paths covered by existing ctests? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks great, I just have a few minor questions and comments to understand the why. In general, any extra comments explaining design choices would be greatly appreciated.
@@ -72,7 +72,7 @@ struct jit_int8_matmul_utils_kernel_t : public jit_generator { | |||
void gen_reo_a(); | |||
void gen_reo_b(); | |||
void reo_A_8x8(int, int); | |||
void reo_B_8x24(int, int); | |||
void reo_B_8xN(int, int); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we have a short comment here explaining what N means? Also, would it be more informative to say 8N? Because we now support 8, 16 and 24?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
N represents the N block size, N=8 or 16 or 24.
@@ -55,7 +55,8 @@ struct dyn_params_t { | |||
struct brg_int8_t { | |||
int M, K, N; | |||
const int m_blk = 8, n_blk = 4, k_blk = 8; | |||
const int ld_block = 6, rd_block = 4, bd_block = 8; | |||
const int rd_block = 4, bd_block = 8; | |||
int ld_block = 6; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we have a comment explaining what this is roughly? And what the units are. I.e. I think this is the number of vector registers we load in the N dimension?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, these are no.of vector registers. I will add some comment here, thanks.
@@ -1696,6 +1696,12 @@ struct memory : public handle<dnnl_memory_t> { | |||
ABcde8b16a = dnnl_ABcde8b16a, | |||
AcdeB8b16a = dnnl_AcdeB8b16a, | |||
AB8b8a = dnnl_AB8b8a, | |||
abDC8d8c = dnnl_abDC8d8c, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of interest, why do we need different memory formats for 8, 16 and 24? I think for similar kernels, ACL uses AB8b8a and trusts the prefetcher to pick up the 3 different streams. Is this something you considered? I'm just wondering if it would simplify the implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, ACL uses a single tag like AB8b8a and picks x different streams for 8x block sizes.
We have followed the oneDNN style, which usually uses different reorder tags for varying block sizes.(For eg: brgemm matmul). We felt this would give a better memory locality(tested the ACL way but did not get any speed up). Correct me if I am wrong. Thanks!
Thanks for the review @jondea ! These block sizes have being enabled so that in future anyone who comes up with a proper heuristics to determine the N block size can add it and the code would work seamlessly. Thread optimization: In future if we or anyone comes up with any heuristics they can add it here and the code will be to absorb that. For tests, we have also manually set ld_block=2,4,6(which sets N block size as 8,16,24 respectively) and ran the test |
10d2eb8
to
b71fbad
Compare
Description
This PR introduces support for multiple block sizes for the N dimension in a matrix : MxK:KxN. Previously only 24 was supported, now it has been extended to 16 and 8 as well. Currently N block size is selected based on the N dimension, if N<16 then 16 is selected if N < 8, then 8 is selected. But this heuristics can be made more robust in future based on shapes of matrices and no.of available of threads.
Checklist
General
make test
andmake test_benchdnn_*
) pass locally for each commit?