Many applications (e.g. in machine learning) end up with matrix products that have small(ish) `m` and `n` dimensions, but a large `k` dimension. For these cases, one would need to parallelize the `pc` loop to take advantage of threading in an efficient way. An example application is #536. The downside of parallelizing the `pc` loop is that extra workspace is required to store the partial C matrices, and then these matrices have to be reduced (probably in parallel).
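
For reference, with `nt_pc` denoting the number of thread groups along the `pc` loop, this amounts to splitting the `k` dimension so that

```math
C \leftarrow \beta C + \alpha \sum_{i=0}^{nt_{pc}-1} A_i B_i ,
```

where `A_i` and `B_i` are the `i`-th column panel of A and row panel of B along `k`. Each thread group computes one `A_i B_i` term, and every term except the one belonging to the "main" group has to land in a temporary `m*n` buffer before being summed into C.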
To support parallelizing this loop, I propose the following plan:
- Determine the maximum size `max_size_c` of the temporary C matrices that we are willing to store. This could be a constant (e.g. 8MiB) or it could depend on the cache blocking parameters in some way (e.g. not larger than the sum of the A block and B panel sizes).
- Modify the automatic partitioning code to set the parallelism of the `pc` loop to the maximum value `nt_pc` (which is a factor of `nt`?) for which `(nt_pc-1)*m*n <= max_size_c`, and then assign the remaining ways of parallelism based on `nt -> nt/nt_pc`. Alternatively, the current algorithm could be extended to handle `nt_ic`, `nt_jc`, and `nt_pc` simultaneously and also obey the `(nt_pc-1)*m*n <= max_size_c` restriction. If the user sets `BLIS_PC_NT` then that can be used as-is (without obeying the size restriction?). A rough sketch of this heuristic is given after the list.
- In `bli_gemm_blk_var3.c`:
  - The main thread in each thread group (after splitting for parallelism in the `pc` direction), except the "main" thread group, should check out a block from the existing pool for C, which is of size `m*n`.
  - The main thread group gets the "real" C and keeps `beta` as-is; all other threads get a chunk of the C buffer and set `beta = 0`.
  - Parallelize the `pc` loop as in `var1` and `var2`.
  - After the loop, reduce (sum) the additional blocks back into the main block. This can be parallelized over `m` and `n`, and use e.g. `axpym` (see the reduction sketch after the list).
- If the blocks of `C` should be at most of size `m_c` along the `m` dimension, then the reduction step must happen in `bli_gemm_blk_var1.c`.
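
As a rough sketch of the partitioning heuristic described in the second bullet (all names here, such as `choose_nt_pc`, are hypothetical and not existing BLIS API; a real version would live in the automatic partitioning code and account for the datatype size):

```c
#include <stddef.h>

// Hypothetical helper (not existing BLIS code): pick the largest factor
// nt_pc of nt such that the workspace needed for the (nt_pc - 1) temporary
// C matrices stays within max_size_c bytes.
static size_t choose_nt_pc( size_t nt, size_t m, size_t n,
                            size_t dt_size, size_t max_size_c )
{
    size_t nt_pc = 1;

    for ( size_t cand = 1; cand <= nt; cand++ )
    {
        // Only factors of nt are considered so that the remaining ways of
        // parallelism can be assigned from nt / nt_pc.
        if ( nt % cand != 0 ) continue;

        // The "main" thread group writes into the real C, so only
        // (cand - 1) temporary m x n matrices are needed.
        if ( ( cand - 1 ) * m * n * dt_size <= max_size_c )
            nt_pc = cand;
    }

    return nt_pc;
}
```

Here `max_size_c` could be the constant 8 MiB or something like `(m_c + n_c) * k_c * dt_size`, i.e. the combined footprint of the A block and B panel; if the user sets `BLIS_PC_NT`, that value would simply be used instead of the result of this search.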
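And a minimal sketch of the reduction step at the end of `bli_gemm_blk_var3.c`, assuming each non-main thread group's partial result sits in an `obj_t` (`c_partial` and the function name are placeholders; `bli_axpym` is the existing level-1m operation computing `y := y + alpha * x`):

```c
#include "blis.h"

// Hypothetical reduction (not existing BLIS code): accumulate a thread
// group's temporary C block, which was computed with beta = 0, back into
// the real C.
static void reduce_partial_c( obj_t* c_partial, obj_t* c )
{
    // c := c + 1 * c_partial. In practice this step would itself be
    // partitioned over m and n so that all threads help with the summation.
    bli_axpym( &BLIS_ONE, c_partial, c );
}
```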