Many applications (e.g. in machine learning) end up with matrix products that have small(ish) `m` and `n` dimensions, but a large `k` dimension. For these cases, one would need to parallelize the `pc` loop to take advantage of threading in an efficient way. An example application is #536. The downside of parallelizing the `pc` loop is that extra workspace is required to store the partial C matrices, and then these matrices have to be reduced (probably in parallel).
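
For reference, with `nt_pc` denoting the number of thread groups along the `pc` loop, this amounts to splitting the `k` dimension so that

```math
C \leftarrow \beta C + \alpha \sum_{i=0}^{nt_{pc}-1} A_i B_i ,
```

where `A_i` and `B_i` are the `i`-th column panel of A and row panel of B along `k`. Each thread group computes one `A_i B_i` term, and every term except the one belonging to the "main" group has to land in a temporary `m*n` buffer before being summed into C.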
To support parallelizing this loop, I propose the following plan:
- Determine the maximum size `max_size_c` of the temporary C matrices that we are willing to store. This could be a constant (e.g. 8MiB) or it could depend on the cache blocking parameters in some way (e.g. not larger than the sum of the A block and B panel sizes).
- Modify the automatic partitioning code to set the parallelism of the `pc` loop to the maximum value `nt_pc` (which is a factor of `nt`?) for which `(nt_pc-1)*m*n <= max_size_c`, and then assign the remaining ways of parallelism based on `nt -> nt/nt_pc`. Alternatively, the current algorithm could be extended to handle `nt_ic`, `nt_jc`, and `nt_pc` simultaneously and also obey the `(nt_pc-1)*m*n <= max_size_c` restriction. If the user sets `BLIS_PC_NT` then that can be used as-is (without obeying the size restriction?). A rough sketch of this heuristic is given after the list.
- In `bli_gemm_blk_var3.c`:
  - The main thread in each thread group (after splitting for parallelism in the `pc` direction), except the "main" thread group, should check out a block from the existing pool for C, which is of size `m*n`.
  - The main thread group gets the "real" C and keeps `beta` as-is; all other threads get a chunk of the C buffer and set `beta = 0`.
  - Parallelize the `pc` loop as in `var1` and `var2`.
  - After the loop, reduce (sum) the additional blocks back into the main block. This can be parallelized over `m` and `n`, and use e.g. `axpym` (see the reduction sketch after the list).
- If the blocks of `C` should be at most of size `m_c` along the `m` dimension, then the reduction step must happen in `bli_gemm_blk_var1.c`.
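
As a rough sketch of the partitioning heuristic described in the second bullet (all names here, such as `choose_nt_pc`, are hypothetical and not existing BLIS API; a real version would live in the automatic partitioning code and account for the datatype size):

```c
#include <stddef.h>

// Hypothetical helper (not existing BLIS code): pick the largest factor
// nt_pc of nt such that the workspace needed for the (nt_pc - 1) temporary
// C matrices stays within max_size_c bytes.
static size_t choose_nt_pc( size_t nt, size_t m, size_t n,
                            size_t dt_size, size_t max_size_c )
{
    size_t nt_pc = 1;

    for ( size_t cand = 1; cand <= nt; cand++ )
    {
        // Only factors of nt are considered so that the remaining ways of
        // parallelism can be assigned from nt / nt_pc.
        if ( nt % cand != 0 ) continue;

        // The "main" thread group writes into the real C, so only
        // (cand - 1) temporary m x n matrices are needed.
        if ( ( cand - 1 ) * m * n * dt_size <= max_size_c )
            nt_pc = cand;
    }

    return nt_pc;
}
```

Here `max_size_c` could be the constant 8 MiB or something like `(m_c + n_c) * k_c * dt_size`, i.e. the combined footprint of the A block and B panel; if the user sets `BLIS_PC_NT`, that value would simply be used instead of the result of this search.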
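And a minimal sketch of the reduction step at the end of `bli_gemm_blk_var3.c`, assuming each non-main thread group's partial result sits in an `obj_t` (`c_partial` and the function name are placeholders; `bli_axpym` is the existing level-1m operation computing `y := y + alpha * x`):

```c
#include "blis.h"

// Hypothetical reduction (not existing BLIS code): accumulate a thread
// group's temporary C block, which was computed with beta = 0, back into
// the real C.
static void reduce_partial_c( obj_t* c_partial, obj_t* c )
{
    // c := c + 1 * c_partial. In practice this step would itself be
    // partitioned over m and n so that all threads help with the summation.
    bli_axpym( &BLIS_ONE, c_partial, c );
}
```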