Add multithreading to most level-3 operations #8
Merged
Conversation
Added a multithreading infrastructure that should remain independent of the particular multithreading implementation in the future. Currently, gemm blocked variants 1f and 2f and packm blocked variant 1 are parallelized.
…thread infos
In packm variant 1, the variable p_begin was incremented on each iteration, creating a loop-carried dependency. This dependency was removed, allowing each iteration to be executed in parallel. Somewhere in bli_threading.c, I was allocating an array of pointers instead of an array of structs.
Conflicts: frame/1m/packm/bli_packm_blk_var1.c
This change gives each operation its own thread info type, allowing finer control of threading in operations that have different types of suboperations.
Changed the microkernel tests to use the new BLIS_PACKM_SINGLE_THREADED instead of BLIS_SINGLE_THREADED.
The environment variables all follow the format BLIS_X_NT, where X is the index of the loop as described in our paper "Anatomy of High-Performance Many-Threaded Matrix Multiplication". These indices are IR, JR, IC, KC, and JC. Also enabled parallelism for hemm and symm, but these are currently untested.
Will allow for easy support for different threading models
…ate the same state
Now just performed by the master thread.
Also fixed bugs in packm
Also enabled weighted partitioning for herk and trmm. Fixed a bug where multiple threads would try to modify the same state in the internal level-3 functions. Correctly computed a_next and b_next for the gemm and herk macrokernels; a_next and b_next now point to the current micropanels in trmm.
was inappropriately having only the thread chief do some things.
This also fixed a bug where barriers in the blocked variants were inserted after the inner packing routines but not the outer packing routines. This allowed, for instance, computation to begin before the block of B had finished being packed.
MR and NR for double complex were wrong. The default fusing factor for double precision was wrong as well.
Conflicts:
    kernels/bgq/1/bli_axpyv_opt_var1.c
    kernels/bgq/1/bli_dotv_opt_var1.c
Also made herk IC and JC loops do weighted partitioning
Fixed up some things in the thread info free functions. Disabled threading for TRSM so that it actually works when threading environment variables are set.
Removed the barrier after unpackm in all level-3 blocked variants. There is now an implicit barrier inside unpackm that occurs only if C is packed (which is usually not the case). Moved the enabling of the tree barriers into bli_config.h. Fed the default MR and NR for double precision into bli_get_range instead of the literal 8.
… for determining parallelism granularity
No longer requires OpenMP to compile. Define the following in bli_config.h in order to enable multithreading: BLIS_ENABLE_MULTITHREADING and BLIS_ENABLE_OPENMP. Also fixes a bug with bli_get_range_weighted.
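Based on that commit message, enabling multithreading would look like this in bli_config.h (a sketch of the two defines named above, not the verbatim file):

```c
/* bli_config.h (sketch): enable the threading infrastructure, and
   select OpenMP as the underlying threading implementation. */
#define BLIS_ENABLE_MULTITHREADING
#define BLIS_ENABLE_OPENMP
```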
The loop has dependent iterations.
Now they are unchanged from the main branch of BLIS
fgvanzee added a commit that referenced this pull request on May 20, 2014:
Added multithreading to most level-3 operations.
loveshack pushed a commit to loveshack/blis that referenced this pull request on Sep 24, 2019:
This needs fixing properly somehow, but using -O3 (at least with gcc 8.3), we get this:

Program received signal SIGILL, Illegal instruction.
0x000000001004c660 in bli_cntx_init_power9_ref (cntx=0x103e06b0) at ref_kernels/bli_cntx_ref.c:456
456         for ( i = 0; i < BLIS_NUM_LEVEL3_OPS; ++i ) vfuncs[ i ] = NULL;
(gdb) bt
#0  0x000000001004c660 in bli_cntx_init_power9_ref (cntx=0x103e06b0) at ref_kernels/bli_cntx_ref.c:456
#1  0x000000001004c0a8 in bli_cntx_init_power9 (cntx=<optimized out>) at config/power9/bli_cntx_init_power9.c:42
#2  0x000000001003c85c in bli_gks_register_cntx (id=BLIS_ARCH_POWER9, nat_fp=0x1004c090 <bli_cntx_init_power9>, ref_fp=0x1004c0d0 <bli_cntx_init_power9_ref>, ind_fp=<optimized out>) at frame/base/bli_gks.c:373
#3  0x000000001003c97c in bli_gks_init () at frame/base/bli_gks.c:155
#4  0x000000001003cfe8 in bli_init_apis () at frame/base/bli_init.c:78
#5  0x00007ffff7e045a8 in __pthread_once_slow () from /lib64/libpthread.so.0
#6  0x00000000100492e8 in bli_pthread_once (once=<optimized out>, init=<optimized out>) at frame/thread/bli_pthread.c:314
#7  0x000000001003d138 in bli_init_once () at frame/base/bli_init.c:104
#8  bli_init_auto () at frame/base/bli_init.c:54
#9  0x0000000010011300 in cdotc_ (n=<optimized out>, x=<optimized out>, incx=<optimized out>, y=<optimized out>, incy=<optimized out>) at frame/compat/bla_dot.c:89
#10 0x0000000010002a48 in check2_ (sfac=0x103d14dc <sfac>) at blastest/src/cblat1.c:529
#11 0x0000000010001ef4 in main () at blastest/src/cblat1.c:112
xrq-phys added a commit to xrq-phys/blis that referenced this pull request on May 29, 2021:
- x7, x8: Used to store the addresses of Alpha and Beta. Since Alpha & Beta are not used in the k-loops, use x0, x1 to load their addresses after the k-loops complete, as A & B's addresses are no longer needed there. This "ldr [addr]; -> ldr val, [addr]" would not cause much performance drawback since it is done outside the k-loops and there are plenty of instructions between Alpha & Beta's loading and their use.
- x9: Used to store cs_c. x9 is multiplied by 8 into x10 and not used any longer. Directly loading cs_c into x10 and scaling it by 8 spares x9 straightforwardly.
- x11, x12: Not used at all. Simply removed from the clobber list.
- x13: Like x9, loaded and scaled by 8 into x14, except that x13 is also used in a conditional branch, so "cmp x13, #1" needs to be modified into "cmp x14, #8" to completely free x13.
- x3, x4: Used to store next_a & next_b. Untouched in the k-loops. Load these addresses into x0 and x1 after Alpha & Beta are both loaded, since by then neither the address of A/B nor the address of Alpha/Beta is needed.
Aaron-Hutchinson referenced this pull request in sifive/sifive-blis on Apr 4, 2023.
Our paper "Anatomy of High-Performance Many-Threaded Matrix Multiplication" identified 5 loops around the micro-kernel as opportunities for parallelization. This pull request enables parallelism for 4 of those loops and extends to the rest of the level-3 operations except for TRSM.
The chart below describes those loops. Right now the only way to control the amount of parallelism is with environment variables, but we hope to add a proper threading API in the future. The total number of threads is equal to the product of the number of threads used for each loop. The 4th loop is not currently enabled because each iteration of that loop updates the same block of C, which would require a reduction.