Add multithreading to most level-3 operations #8

Merged
merged 50 commits into from
May 20, 2014

Conversation

tlrmchlsmth
Member

Our paper "Anatomy of High-Performance Many-Threaded Matrix Multiplication" identified 5 loops around the micro-kernel as opportunities for parallelization. This pull request enables parallelism for 4 of those loops and extends to the rest of the level-3 operations except for TRSM.

The table below describes those loops. For now, the only way to control the amount of parallelism is through environment variables, though we hope to add a proper threading API in the future. The total number of threads is the product of the thread counts used for each loop. The 4th loop is not currently enabled because it requires a reduction: each iteration of that loop updates the same block of C.

| Loop around micro-kernel | Environment variable | Direction | Notes |
| --- | --- | --- | --- |
| 1st loop | BLIS_IR_NT | M | |
| 2nd loop | BLIS_JR_NT | N | |
| 3rd loop | BLIS_IC_NT | M | |
| 4th loop | None | K | Not enabled |
| 5th loop | BLIS_JC_NT | N | |
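
As a rough illustration of the product rule described above, the following C sketch reads the four enabled environment variables and multiplies them. The helper names `read_nt` and `total_threads` are hypothetical and not part of BLIS, and treating an unset or invalid variable as 1 is an assumption made for this example only.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper (not BLIS API): read a positive thread count from an
   environment variable. Falling back to 1 when the variable is unset or
   invalid is an assumption of this sketch. */
static long read_nt(const char *name)
{
    const char *s = getenv(name);
    if (s == NULL) return 1;

    char *end = NULL;
    long v = strtol(s, &end, 10);
    if (end == s || *end != '\0' || v < 1) return 1;
    return v;
}

/* Total threads = product of the per-loop counts for the four enabled loops. */
long total_threads(void)
{
    static const char *vars[] = {
        "BLIS_IR_NT", "BLIS_JR_NT", "BLIS_IC_NT", "BLIS_JC_NT"
    };
    long total = 1;
    for (size_t i = 0; i < sizeof vars / sizeof vars[0]; ++i)
        total *= read_nt(vars[i]);
    assert(total >= 1);
    return total;
}
```

Under these assumptions, setting BLIS_JC_NT=2 and BLIS_IC_NT=4 while leaving the other two unset would yield 8 threads.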

tlrmchlsmth and others added 30 commits February 27, 2014 11:55
Added a multithreading infrastructure that should be independent of multithreading implementation in the future.
Currently, gemm blocked variants 1f and 2f and packm blocked variant 1 are parallelized.
…thread infos

In packm variant 1, the variable p_begin was incremented each iteration, causing a dependency.
This dependency was removed, allowing each iteration to be executed in parallel.
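
The kind of change described in this commit can be sketched as follows. This is not the code from bli_packm_blk_var1.c; the function names and the uniform `panel_size` are assumptions chosen to show how a running pointer or offset can be replaced by one computed from the loop index.

```c
#include <assert.h>
#include <stddef.h>

/* Serial form: p_begin carries state from one iteration to the next,
   so iteration i cannot start until iteration i - 1 has finished. */
void pack_offsets_serial(size_t n_iter, size_t panel_size, size_t *offsets)
{
    size_t p_begin = 0;
    for (size_t i = 0; i < n_iter; ++i) {
        offsets[i] = p_begin;
        p_begin += panel_size;   /* loop-carried dependency */
    }
}

/* Independent form: each iteration derives its offset from i alone,
   so iterations may be handed to different threads in any order.
   Assumes every panel has the same size (an assumption of this sketch). */
void pack_offsets_independent(size_t n_iter, size_t panel_size, size_t *offsets)
{
    assert(offsets != NULL || n_iter == 0);
    for (size_t i = 0; i < n_iter; ++i) {
        size_t p_begin = i * panel_size;
        offsets[i] = p_begin;
    }
}
```

Both functions produce the same offsets; only the second has iterations that do not depend on each other.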

Somewhere in bli_threading.c, I was allocating an array of pointers instead of an array of structs.
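
For readers unfamiliar with this class of bug, here is a minimal sketch of the difference. The struct `thread_info_t` and the function `alloc_thread_infos` are hypothetical stand-ins, not the definitions in bli_threading.c.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-thread info struct; not the BLIS definition. */
typedef struct {
    int   thread_id;
    int   n_threads;
    void *shared;
} thread_info_t;

thread_info_t *alloc_thread_infos(int n)
{
    assert(n > 0);

    /* The buggy form would be malloc(n * sizeof(thread_info_t *)), which
       sizes each element as a pointer and under-allocates whenever the
       struct is larger than a pointer. Using sizeof *infos ties the
       element size to the type actually stored. */
    thread_info_t *infos = malloc((size_t)n * sizeof *infos);
    if (infos == NULL) return NULL;

    for (int i = 0; i < n; ++i) {
        infos[i].thread_id = i;
        infos[i].n_threads = n;
        infos[i].shared    = NULL;
    }
    return infos;
}
```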
Conflicts:
	frame/1m/packm/bli_packm_blk_var1.c
This change makes each operation have its own thread info type,
allowing more fine control of threading in operations that have different types of suboperations
Changed microkernel tests to use the new BLIS_PACKM_SINGLE_THREADED
instead of BLIS_SINGLE_THREADED
The environment variables all follow the format BLIS_X_NT,
where X is the index of the loop as described in our paper
Anatomy of High Performance Many-Threaded Matrix Multiplication.
These indices are IR, JR, IC, KC, and JC.

Also enabled parallelism for hemm and symm, but these are currently untested.
Will allow for easy support for different threading models
…ate the same state

Now just performed by the master thread.
Also fixed bugs in packm
Also enabled weighted partitioning for herk, trmm
Fixed bug where multiple threads would try to modify the same state in the internal level 3 functions
Correctly computed a_next and b_next for gemm, herk macrokernels
a_next and b_next point to the current micropanels in trmm
was inappropriately only having thread chief do some things.
This also fixed a bug where barriers in the blocked variants were inserted after the inner packing routines,
but not the outer packing routines.
This allowed, for instance, computation to begin before the block of B had finished being packed.
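
The hazard described here can be illustrated with a small sketch. It uses OpenMP pragmas purely for illustration and does not reproduce the barrier implementation in this pull request; the function `pack_then_sum` and the "packing" step (a simple scaling) are hypothetical. Compiled without OpenMP support, the pragmas are ignored and the code runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Threads cooperatively "pack" b into b_packed, then every thread reads
   entries that other threads may have packed. */
double pack_then_sum(const double *b, double *b_packed, size_t n)
{
    double total = 0.0;
    assert(b != NULL && b_packed != NULL);

    #pragma omp parallel
    {
        /* Packing step. nowait removes the implicit barrier at the end of
           the loop so that the explicit barrier below is what protects
           the compute step. */
        #pragma omp for nowait
        for (long i = 0; i < (long)n; ++i)
            b_packed[i] = 2.0 * b[i];

        /* Without this barrier, a fast thread could start computing with
           entries of b_packed that have not been written yet. */
        #pragma omp barrier

        /* Compute step: reads the packed buffer in reverse order, so a
           thread typically touches entries packed by another thread. */
        #pragma omp for reduction(+:total)
        for (long i = 0; i < (long)n; ++i)
            total += b_packed[(long)n - 1 - i];
    }
    return total;
}
```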
MR and NR for double complex were wrong
Default fusing factor for double precision was wrong as well
tlrmchlsmth and others added 20 commits April 4, 2014 09:54
Conflicts:
	kernels/bgq/1/bli_axpyv_opt_var1.c
	kernels/bgq/1/bli_dotv_opt_var1.c
Also made herk IC and JC loops do weighted partitioning
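
The commit messages do not spell out how the weighting works. One plausible reading, given that herk and trmm operate on triangular matrices, is that ranges are chosen so each thread gets a similar amount of work rather than a similar number of columns. The sketch below illustrates that idea for a lower-triangular region; `weighted_range` is a hypothetical function, not `bli_get_range_weighted`, and the greedy area-based rule is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Split columns [0, n) of a lower-triangular n x n region among
   n_threads so each contiguous range holds roughly the same number of
   stored elements. Column j holds n - j elements. */
void weighted_range(size_t tid, size_t n_threads, size_t n,
                    size_t *start, size_t *end)
{
    assert(n_threads > 0 && tid < n_threads);

    size_t total     = n * (n + 1) / 2;               /* triangle area */
    size_t lo_target = total * tid / n_threads;
    size_t hi_target = total * (tid + 1) / n_threads;

    size_t area = 0, j = 0;
    /* Advance to the first column at which the cumulative area reaches
       this thread's lower target. */
    while (j < n && area < lo_target) { area += n - j; ++j; }
    *start = j;
    /* Keep taking columns until the upper target is reached. */
    while (j < n && area < hi_target) { area += n - j; ++j; }
    *end = j;
}
```

For n = 8 and two threads, this rule assigns columns [0, 3) (21 elements) and [3, 8) (15 elements), whereas an even split of columns would give 26 and 10 elements.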
Fixed up some stuff in the thread info free functions
Disabled threading for TRSM so that it actually works when threading environment variables are set
Removed barrier after unpackm in all level3 blocked variants
Now there is an implicit barrier inside unpackm that only occurs if C is packed (which is usually not the case)

Moved the enabling of the tree barriers into bli_config.h
Fed the default MR and NR for double precision into bli_get_range instead of the number 8
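
To show why the block factor passed to a range-partitioning routine matters, here is a minimal sketch that keeps each thread's boundaries on multiples of a block factor `bf` (for example MR or NR). The function `get_range` below is hypothetical and does not reproduce the signature or behavior of `bli_get_range`.

```c
#include <assert.h>
#include <stddef.h>

/* Give thread tid a contiguous sub-range of [0, n) whose boundaries are
   multiples of bf, so no micro-panel is split between two threads. */
void get_range(size_t tid, size_t n_threads, size_t n, size_t bf,
               size_t *start, size_t *end)
{
    assert(n_threads > 0 && tid < n_threads && bf > 0);

    size_t n_blocks = (n + bf - 1) / bf;   /* last block may be partial */
    size_t lo = n_blocks * tid / n_threads;
    size_t hi = n_blocks * (tid + 1) / n_threads;

    *start = lo * bf;
    *end   = hi * bf;
    if (*start > n) *start = n;            /* clamp the partial last block */
    if (*end   > n) *end   = n;
}
```

With a hard-coded factor of 8, the boundaries would fall in the middle of a micro-panel whenever the register block size is not a divisor of 8, which is presumably why the default MR and NR are passed instead.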
No longer requires OpenMP to compile
Define the following in bli_config.h in order to enable multithreading:
BLIS_ENABLE_MULTITHREADING
BLIS_ENABLE_OPENMP
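
In bli_config.h, the two definitions named above would look like the fragment below. The function `multithreading_enabled` is a hypothetical compile-time check added only to make the fragment self-checking; it is not part of BLIS.

```c
/* Fragment for bli_config.h: both macros are needed to enable the
   OpenMP-based multithreading added in this pull request. */
#define BLIS_ENABLE_MULTITHREADING
#define BLIS_ENABLE_OPENMP

/* Hypothetical check (not BLIS API): returns 1 when both macros are set. */
int multithreading_enabled(void)
{
#if defined(BLIS_ENABLE_MULTITHREADING) && defined(BLIS_ENABLE_OPENMP)
    return 1;
#else
    return 0;
#endif
}
```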

Also fixes a bug with bli_get_range_weighted
The loop has dependent iterations.
Now they are unchanged from the main branch of BLIS
fgvanzee added a commit that referenced this pull request May 20, 2014
Added multithreading to most level-3 operations.
@fgvanzee fgvanzee merged commit 77a2d8d into flame:master May 20, 2014
@songmaotian songmaotian mentioned this pull request Apr 22, 2016
@loveshack loveshack mentioned this pull request Mar 5, 2018
loveshack pushed a commit to loveshack/blis that referenced this pull request Sep 24, 2019
This needs fixing properly somehow, but using -O3 (at least with gcc 8.3),
we get this:

Program received signal SIGILL, Illegal instruction.
0x000000001004c660 in bli_cntx_init_power9_ref (cntx=0x103e06b0)
    at ref_kernels/bli_cntx_ref.c:456
456             for ( i = 0; i < BLIS_NUM_LEVEL3_OPS; ++i ) vfuncs[ i ] = NULL;
(gdb) bt
#0  0x000000001004c660 in bli_cntx_init_power9_ref (cntx=0x103e06b0)
    at ref_kernels/bli_cntx_ref.c:456
#1  0x000000001004c0a8 in bli_cntx_init_power9 (cntx=<optimized out>)
    at config/power9/bli_cntx_init_power9.c:42
#2  0x000000001003c85c in bli_gks_register_cntx (id=BLIS_ARCH_POWER9,
    nat_fp=0x1004c090 <bli_cntx_init_power9>,
    ref_fp=0x1004c0d0 <bli_cntx_init_power9_ref>, ind_fp=<optimized out>)
    at frame/base/bli_gks.c:373
#3  0x000000001003c97c in bli_gks_init () at frame/base/bli_gks.c:155
#4  0x000000001003cfe8 in bli_init_apis () at frame/base/bli_init.c:78
#5  0x00007ffff7e045a8 in __pthread_once_slow () from /lib64/libpthread.so.0
#6  0x00000000100492e8 in bli_pthread_once (once=<optimized out>,
    init=<optimized out>) at frame/thread/bli_pthread.c:314
#7  0x000000001003d138 in bli_init_once () at frame/base/bli_init.c:104
#8  bli_init_auto () at frame/base/bli_init.c:54
#9  0x0000000010011300 in cdotc_ (n=<optimized out>, x=<optimized out>,
    incx=<optimized out>, y=<optimized out>, incy=<optimized out>)
    at frame/compat/bla_dot.c:89
#10 0x0000000010002a48 in check2_ (sfac=0x103d14dc <sfac>)
    at blastest/src/cblat1.c:529
#11 0x0000000010001ef4 in main () at blastest/src/cblat1.c:112
xrq-phys added a commit to xrq-phys/blis that referenced this pull request May 29, 2021
- x7, x8: Used to store address for Alpha and Beta.
  As Alpha & Beta were not used in k-loops, use x0, x1 to load
  Alpha & Beta's addresses after k-loops are completed, since A & B's
  addresses are no longer needed there.
  This "ldr [addr]; -> ldr val, [addr]" would not cause much performance
  drawback since it is done outside k-loops and there are plenty of
  instructions between Alpha & Beta's loading and usage.
- x9: Used to store cs_c. x9 is multiplied by 8 into x10 and not used
  any longer. Loading cs_c directly into x10 and scaling it by 8 frees
  x9.
- x11, x12: Not used at all. Simply remove from clobber list.
- x13: Like x9, loaded and scaled by 8 into x14, except that x13 is
  also used in a conditional branch, so "cmp x13, #1" needs to be
  modified into "cmp x14, #8" to completely free x13.
- x3, x4: Used to store next_a & next_b. Untouched in k-loops. Load
  these addresses into x0 and x1 after Alpha & Beta are both loaded,
  since then neither the address of A/B nor the address of Alpha/Beta is needed.
Aaron-Hutchinson referenced this pull request in sifive/sifive-blis Apr 4, 2023