Add multithreading to most level-3 operations #8
Merged
Conversation
Added a multithreading infrastructure that should remain independent of the particular multithreading implementation in the future. Currently, gemm blocked variants 1f and 2f and packm blocked variant 1 are parallelized.
…thread infos
In packm variant 1, the variable p_begin was incremented on each iteration, creating a loop-carried dependency. This dependency was removed, allowing each iteration to be executed in parallel. Somewhere in bli_threading.c, I was allocating an array of pointers instead of an array of structs.
Conflicts: frame/1m/packm/bli_packm_blk_var1.c
This change gives each operation its own thread info type, allowing finer control of threading in operations that have different types of suboperations.
Changed the microkernel tests to use the new BLIS_PACKM_SINGLE_THREADED instead of BLIS_SINGLE_THREADED.
The environment variables all follow the format BLIS_X_NT, where X is the index of the loop as described in our paper "Anatomy of High-Performance Many-Threaded Matrix Multiplication". These indices are IR, JR, IC, KC, and JC. Also enabled parallelism for hemm and symm, but these are currently untested.
Will allow for easy support for different threading models
…ate the same state
Now just performed by the master thread.
Also fixed bugs in packm
Also enabled weighted partitioning for herk and trmm. Fixed a bug where multiple threads would try to modify the same state in the internal level-3 functions. Correctly computed a_next and b_next for the gemm and herk macrokernels; a_next and b_next now point to the current micropanels in trmm.
was inappropriately having only the thread chief do some things.
This also fixed a bug where barriers in the blocked variants were inserted after the inner packing routines but not the outer packing routines. This allowed, for instance, computation to begin before the block of B had finished being packed.
MR and NR for double complex were wrong. The default fusing factor for double precision was wrong as well.
Conflicts:
    kernels/bgq/1/bli_axpyv_opt_var1.c
    kernels/bgq/1/bli_dotv_opt_var1.c
Also made herk IC and JC loops do weighted partitioning
Fixed up some things in the thread info free functions. Disabled threading for TRSM so that it actually works when threading environment variables are set.
Removed the barrier after unpackm in all level-3 blocked variants. There is now an implicit barrier inside unpackm that occurs only if C is packed (which is usually not the case). Moved the enabling of the tree barriers into bli_config.h. Fed the default MR and NR for double precision into bli_get_range instead of the literal 8.
… for determining parallelism granularity
No longer requires OpenMP to compile. Define the following in bli_config.h in order to enable multithreading: BLIS_ENABLE_MULTITHREADING and BLIS_ENABLE_OPENMP. Also fixes a bug with bli_get_range_weighted.
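Based on that commit message, enabling multithreading would look like this in bli_config.h (a sketch of the two defines named above, not the verbatim file):

```c
/* bli_config.h (sketch): enable the threading infrastructure, and
   select OpenMP as the underlying threading implementation. */
#define BLIS_ENABLE_MULTITHREADING
#define BLIS_ENABLE_OPENMP
```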
The loop has dependent iterations.
Now they are unchanged from the main branch of BLIS
fgvanzee added a commit that referenced this pull request on May 20, 2014:
Added multithreading to most level-3 operations.
loveshack pushed a commit to loveshack/blis that referenced this pull request on Sep 24, 2019:
This needs fixing properly somehow, but using -O3 (at least with gcc 8.3), we get this:

Program received signal SIGILL, Illegal instruction.
0x000000001004c660 in bli_cntx_init_power9_ref (cntx=0x103e06b0) at ref_kernels/bli_cntx_ref.c:456
456         for ( i = 0; i < BLIS_NUM_LEVEL3_OPS; ++i ) vfuncs[ i ] = NULL;
(gdb) bt
#0  0x000000001004c660 in bli_cntx_init_power9_ref (cntx=0x103e06b0) at ref_kernels/bli_cntx_ref.c:456
#1  0x000000001004c0a8 in bli_cntx_init_power9 (cntx=<optimized out>) at config/power9/bli_cntx_init_power9.c:42
#2  0x000000001003c85c in bli_gks_register_cntx (id=BLIS_ARCH_POWER9, nat_fp=0x1004c090 <bli_cntx_init_power9>, ref_fp=0x1004c0d0 <bli_cntx_init_power9_ref>, ind_fp=<optimized out>) at frame/base/bli_gks.c:373
#3  0x000000001003c97c in bli_gks_init () at frame/base/bli_gks.c:155
#4  0x000000001003cfe8 in bli_init_apis () at frame/base/bli_init.c:78
#5  0x00007ffff7e045a8 in __pthread_once_slow () from /lib64/libpthread.so.0
#6  0x00000000100492e8 in bli_pthread_once (once=<optimized out>, init=<optimized out>) at frame/thread/bli_pthread.c:314
#7  0x000000001003d138 in bli_init_once () at frame/base/bli_init.c:104
#8  bli_init_auto () at frame/base/bli_init.c:54
#9  0x0000000010011300 in cdotc_ (n=<optimized out>, x=<optimized out>, incx=<optimized out>, y=<optimized out>, incy=<optimized out>) at frame/compat/bla_dot.c:89
#10 0x0000000010002a48 in check2_ (sfac=0x103d14dc <sfac>) at blastest/src/cblat1.c:529
#11 0x0000000010001ef4 in main () at blastest/src/cblat1.c:112
xrq-phys added a commit to xrq-phys/blis that referenced this pull request on May 29, 2021:
- x7, x8: Used to store the addresses of Alpha and Beta. Since Alpha & Beta are not used in the k-loops, use x0, x1 to load their addresses after the k-loops complete, as A & B's addresses are no longer needed there. This "ldr [addr]; -> ldr val, [addr]" would not cause much performance drawback since it is done outside the k-loops and there are plenty of instructions between Alpha & Beta's loading and their use.
- x9: Used to store cs_c. x9 is multiplied by 8 into x10 and not used any longer. Directly loading cs_c into x10 and scaling it by 8 spares x9 straightforwardly.
- x11, x12: Not used at all. Simply removed from the clobber list.
- x13: Like x9, loaded and scaled by 8 into x14, except that x13 is also used in a conditional branch, so "cmp x13, #1" needs to be modified into "cmp x14, #8" to completely free x13.
- x3, x4: Used to store next_a & next_b. Untouched in the k-loops. Load these addresses into x0 and x1 after Alpha & Beta are both loaded, since by then neither the address of A/B nor the address of Alpha/Beta is needed.
Aaron-Hutchinson referenced this pull request in sifive/sifive-blis on Apr 4, 2023.
Our paper "Anatomy of High-Performance Many-Threaded Matrix Multiplication" identified 5 loops around the micro-kernel as opportunities for parallelization. This pull request enables parallelism for 4 of those loops and extends to the rest of the level-3 operations except for TRSM.
The chart below describes those loops. Right now the only way to control the amount of parallelism is with environment variables, but we hope to add a proper threading API in the future. The total number of threads is equal to the product of the number of threads used for each loop. The 4th loop is not currently enabled because each iteration of that loop updates the same block of C, which would require a reduction.