
Mis-balanced work between JR/IR threads in edge-macro-blocks #437


Open

hominhquan opened this issue Aug 12, 2020 · 13 comments

@hominhquan
Contributor

Let us call a macro-block an MC-by-NC block, which is traditionally processed by a single call to the macro-kernel.
Let us call an edge-macro-block either an ME-by-NC or an MC-by-NE block, where 0 < ME < MC and 0 < NE < NC (E stands for edge). In short, edge-macro-blocks occur when the C matrix size (M or N) is not a multiple of MC (or NC).

I observed this load imbalance on edge-macro-blocks in my old version of BLIS. After checking the latest version of BLIS, even with the slab or round-robin dispatch, I think the issue still remains when all of the following conditions are met:

  1. An edge-macro-block occurs, where n_iter (or m_iter) in the macro-kernel is smaller than the usual NC/NR (or MC/MR).
  2. The user sets BLIS_JR_NT > 1, for example BLIS_JR_NT = NC/NR (one n_iter per JR thread) or BLIS_JR_NT = NC/(2*NR) (two n_iter per JR thread). Likewise for BLIS_IR_NT > 1, for example BLIS_IR_NT = MC/MR (one m_iter per IR thread), and so on.
  3. The user has tuned all BLIS_*_NT so that their product equals the number of physical cores.

If these conditions hold, we can have, for example, an edge-macro-block with n_iter smaller than the user-defined BLIS_JR_NT. Since all of those threads are spawned, the "trailing" JR threads will stand by while the few remaining JR threads work on the edge n_iter blocks. The situation is the same for the IR loop. The slab/rr dispatch cannot help in this case, since the work is dispatched separately in the JR loop and the IR loop. As a result, up to half of the cores (or even more) can end up doing no computation at all.
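
To make the idle-thread count concrete, here is a minimal standalone sketch (plain C, independent of BLIS; it reuses the blocksizes and problem size from the example given later in this thread, and the plain slab split of n_iter is an assumption about the dispatch):

#include <stdio.h>

/* Ceiling division, used to count micropanel iterations. */
static long ceil_div( long a, long b ) { return ( a + b - 1 ) / b; }

int main( void )
{
    /* Values from the example later in this thread. */
    const long NC = 256, NR = 16;
    const long n  = 3000;
    const long jr_nt = 16;                  /* user-requested BLIS_JR_NT */

    /* Edge-macro-block width: the leftover when N is not a multiple of NC. */
    const long ne     = n % NC;             /* 3000 % 256 = 184 */
    const long n_iter = ceil_div( ne, NR ); /* ceil(184/16) = 12 */

    /* Under a plain slab split of n_iter over jr_nt threads, any thread
       whose slab is empty simply idles. */
    const long busy = ( n_iter < jr_nt ) ? n_iter : jr_nt;

    printf( "NE = %ld, n_iter = %ld, busy JR threads = %ld, idle = %ld\n",
            ne, n_iter, busy, jr_nt - busy );
    return 0;
}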

One solution I can see is to detect this edge case and dynamically "fuse" the two loops into a flattened 1D workspace of c11 micro-blocks, then dispatch them equally over the jr_nt * ir_nt threads.

For instance, I did this in the gemm macro-kernel of my BLIS version 0.2.2, and it fixes the load-imbalance issue:

	/* fused num_threads in JR and IR */ 
	dim_t num_threads = jr_num_threads * ir_num_threads; 
	
	/* fused number of c11 micro-blocks */
	dim_t num_blocks = m_iter * n_iter; 
	dim_t num_blocks_per_thread = num_blocks / num_threads; 

	/* possible trailing c11 blocks */
	dim_t num_blocks_trailing = num_blocks % num_threads;

	/* my thread id in the fused thread-domain */
	dim_t ithread_ji = ir_thread_id * jr_num_threads + jr_thread_id; 

	/* index of my first c11 block in a flattened 1D range of blocks */
	dim_t iblock_begin = ithread_ji * num_blocks_per_thread; 
	/* nudge my first block in case of num_blocks_trailing */
	iblock_begin += bli_min(ithread_ji, num_blocks_trailing); 

	/* index of my last c11 block in a flattened 1D range of blocks */
	dim_t iblock_end = iblock_begin + num_blocks_per_thread; 
	/* nudge my last block in case of num_blocks_trailing */
	iblock_end += ((ithread_ji < num_blocks_trailing) ? 1 : 0); 

	/* Next panels */ 
	dim_t jnext = iblock_begin % n_iter; 
	dim_t inext = iblock_begin / n_iter; 
	ctype* restrict a2 = a_cast + inext * rstep_a; 
	ctype* restrict b2 = b_cast + jnext * cstep_b; 

	/* Flattened 1D loop over the fused JR/IR loops */
	for (dim_t iblock = iblock_begin; iblock < iblock_end; iblock++) 
	{
		/* copy next to current and compute next panels */
		j = jnext; 
		i = inext; 
		a1 = a2; 
		b1 = b2; 

		/* Next panels */ 
		jnext = (iblock + 1) % n_iter; 
		inext = (iblock + 1) / n_iter; 
		a2 = a_cast + inext * rstep_a; 
		b2 = b_cast + jnext * cstep_b; 

		... 
	}
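
As a standalone sanity check of the begin/end arithmetic above (extracted into plain C; the m_iter, n_iter, and thread counts below are just the values discussed in this thread), the per-thread ranges should partition the flattened block index space exactly:

#include <assert.h>
#include <stdio.h>

static long min_l( long a, long b ) { return a < b ? a : b; }

int main( void )
{
    /* Edge-macro-block and threading parameters from this thread's example. */
    const long m_iter = 23, n_iter = 12;
    const long jr_nt  = 16, ir_nt  = 1;
    const long num_threads = jr_nt * ir_nt;

    const long num_blocks            = m_iter * n_iter;
    const long num_blocks_per_thread = num_blocks / num_threads;
    const long num_blocks_trailing   = num_blocks % num_threads;

    long next = 0;  /* where the next thread's range must start */
    for ( long t = 0; t < num_threads; t++ )
    {
        const long begin = t * num_blocks_per_thread + min_l( t, num_blocks_trailing );
        const long end   = begin + num_blocks_per_thread
                                 + ( ( t < num_blocks_trailing ) ? 1 : 0 );

        assert( begin == next );  /* ranges are contiguous and disjoint */
        next = end;

        printf( "thread %2ld: blocks [%3ld, %3ld)\n", t, begin, end );
    }
    assert( next == num_blocks ); /* every c11 micro-block is covered once */
    return 0;
}
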
@devinamatthews
Member

@hominhquan thanks for your analysis. In practice, BLIS_IR_NT is always 1, as threading this loop just doesn't make sense on any architecture we've seen. Without diving into the details on this, can you comment whether an issue still exists if we assume BLIS_IR_NT == 1?

@devinamatthews
Member

I should also add that realistically BLIS_JR_NT <= 4.

@hominhquan
Contributor Author

@hominhquan thanks for your analysis. In practice, BLIS_IR_NT is always 1, as threading this loop just doesn't make sense on any architecture we've seen. Without diving into the details on this, can you comment whether an issue still exists if we assume BLIS_IR_NT == 1?

@devinamatthews Yes, I did indeed observe the issue on the JR loop with JR_NT larger than the edge n_iter and IR_NT = 1. The IR loop was just the generalized case of the JR loop.

@devinamatthews
Member

OK. Do you have any performance numbers? It sounds like we can improve general parallel performance then.

@hominhquan
Contributor Author

Here is an example of the imbalance: MC = NC = 256, MR = 8, NR = 16, M = N = 3000.
The possible edge-macro-block is (3000 % 256) 184-by-184, so m_iter = 23 and n_iter = 12 (ceil(11.5)).
If I set BLIS_JR_NT = 16, only 12 threads will be working in the JR loop, and 4 threads will stand by.

@hominhquan
Contributor Author

hominhquan commented Aug 12, 2020

OK. Do you have any performance numbers? It sounds like we can improve general parallel performance then.

For example, on the Kalray MPPA3 processor, I get a 1.3x speedup on the macro-kernel itself. The performance gain depends closely on the matrix size and on the MC/NC parameters. It is hard to quote a single number for an arbitrary architecture X, but I believe it can improve the overall parallel scheme. That's why I opened this ticket.

@hominhquan
Contributor Author

hominhquan commented Aug 12, 2020

In fact, we can reuse the current slab/rr dispatch, but over the fused workspace:

/* construct a fused JR/IR thrinfo_t */
thrinfo_t thread_jrir = ... ;

/* compute this thread's range over the fused (flattened) JR/IR iteration space */
dim_t jrir_start, jrir_end, jrir_inc;
bli_thread_range_jrir( &thread_jrir, n_iter * m_iter, 1, FALSE, &jrir_start, &jrir_end, &jrir_inc );

/* fused loop */
for ( dim_t jrir = jrir_start; jrir < jrir_end; jrir += jrir_inc )
{
   ...
}
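
Inside such a fused loop the flattened index maps back to the (i, j) micro-block coordinates exactly as in the earlier macro-kernel snippet; a tiny self-contained illustration (plain C with toy values, not BLIS code) is:

#include <stdio.h>

int main( void )
{
    /* Toy edge-macro-block: 3 micropanel rows by 4 micropanel columns. */
    const long m_iter = 3, n_iter = 4;
    const long num_blocks = m_iter * n_iter;

    /* Pretend a range-partitioning routine assigned this thread
       the fused range [5, 9) with unit stride. */
    const long jrir_start = 5, jrir_end = 9, jrir_inc = 1;

    for ( long jrir = jrir_start; jrir < jrir_end && jrir < num_blocks; jrir += jrir_inc )
    {
        const long i = jrir / n_iter;  /* IR index: row of c11 micro-blocks    */
        const long j = jrir % n_iter;  /* JR index: column of c11 micro-blocks */
        printf( "fused index %ld -> (i, j) = (%ld, %ld)\n", jrir, i, j );
    }
    return 0;
}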

hominhquan pushed a commit to hominhquan/blis that referenced this issue Oct 20, 2021
Details:
- In some multithreading schemes, JR_NT and IR_NT may produce idle threads
  that perform no computation.
- This commit detects such situations and implements a collapse of the JR/IR loops.
@devinamatthews
Member

@hominhquan is the general issue that BLIS_JR_NT should be a divisor of BLIS_NC/BLIS_NR? As I mentioned before, BLIS_JR_NT also shouldn't be very large, maybe 4 at most.

@hominhquan
Contributor Author

hominhquan commented Mar 14, 2022

@devinamatthews Yes, but the full condition should be (a quick check is sketched after the example below):

  • BLIS_JR_NT should ideally be a divisor of BLIS_NC/BLIS_NR, and
  • BLIS_IR_NT should ideally be a divisor of BLIS_MC/BLIS_MR.

BLIS_JR_NT also shouldn't be very large, maybe 4 at most.

Unless BLIS implicitly re-tweaks the user's BLIS_JR_NT and BLIS_IR_NT to respect the condition you mentioned, an ordinary user is free to set these values however they like, whether through the global environment variables or a local rntm_t, right?

And let me repeat that this issue does not manifest in all cases. Here is the example again:

MC = NC = 256, MR = 8, NR = 16, M = N = 3000
The possible edge-macro-block is (3000 % 256) 184-by-184, so m_iter = 23 and n_iter = 12 (ceil(11.5)).
If BLIS_JR_NT = 16, only 12 threads will be working in the JR loop, and 4 threads will stand by.
The same holds for the IR loop.
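
A quick way to check a configuration against the divisibility conditions above is a few lines of arithmetic; this is only an illustrative sketch (the blocksizes are the ones from this example, and the final comment notes why divisibility of the full-block counts alone is not sufficient):

#include <stdio.h>

int main( void )
{
    /* Blocksizes and thread counts from the example above. */
    const long MC = 256, NC = 256, MR = 8, NR = 16;
    const long jr_nt = 16, ir_nt = 1;

    const long n_iter_full = NC / NR;  /* n_iter of a full macro-block: 16 */
    const long m_iter_full = MC / MR;  /* m_iter of a full macro-block: 32 */

    if ( n_iter_full % jr_nt != 0 )
        printf( "BLIS_JR_NT = %ld does not divide NC/NR = %ld\n", jr_nt, n_iter_full );
    if ( m_iter_full % ir_nt != 0 )
        printf( "BLIS_IR_NT = %ld does not divide MC/MR = %ld\n", ir_nt, m_iter_full );

    /* Even when both conditions hold (as they do here), an edge-macro-block
       such as the 184-by-184 one above has n_iter = 12 < BLIS_JR_NT = 16,
       so 4 JR threads still idle on that block. */
    return 0;
}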

@devinamatthews
Member

No, BLIS can't override the user's choices, but then I guess it's up to the user not to do that. I just added some notes to the Multithreading documentation.

@fgvanzee
Member

fgvanzee commented Mar 14, 2022

I'm a bit confused as to what this issue is really about. And to the extent that I understand it, I'm not quite sure why this issue was labeled as a bug.

I'm not sure if this is causing any confusion, but let's all remember that matrix dimensions are first partitioned for parallelism (oblivious to cache blocksizes), and then each thread (or thread group) applies its cache blocksizes to whatever portion of the partitioned dimension it is assigned. (And if there is not enough data along a dimension so that all threads get at least one micropanel of work, then some threads will idle.)
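
As a purely hypothetical illustration of that ordering (the numbers below are assumptions chosen to mirror the earlier example, and the simple slab split ignores details such as rounding the partition to micropanel multiples):

#include <stdio.h>

int main( void )
{
    /* Hypothetically: n = 3000 columns of C, 2-way parallelism in an outer
       loop, then each thread applies NC = 256 to its own portion. */
    const long n = 3000, nt = 2, NC = 256;

    for ( long t = 0; t < nt; t++ )
    {
        /* Partition the n dimension for parallelism first, oblivious to NC. */
        const long start = ( n * t ) / nt;
        const long end   = ( n * ( t + 1 ) ) / nt;
        const long width = end - start;

        /* Then each thread applies its cache blocksize to its own portion. */
        const long full_nc_blocks = width / NC;
        const long edge           = width % NC;

        printf( "thread %ld: columns [%ld, %ld) -> %ld full NC panels + edge of %ld\n",
                t, start, end, full_nc_blocks, edge );
    }
    return 0;
}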

@hominhquan
Contributor Author

@fgvanzee You are right; functionally speaking, this is not a bug, but a sub-optimality in micro-kernel dispatch. As you said, it is possible to have idle threads not doing any computation.

@devinamatthews
Member

@hominhquan I just had an interesting conversation in which some cases were pointed out where BLIS_IR_NT > 1 makes sense performance-wise. Does BLIS_IR_NT = BLIS_JR_NT = 4 give you reasonable performance and no idling threads?

fgvanzee added a commit that referenced this issue Dec 9, 2022
Details:
- Reimplemented parallelization of the JR loop in gemmt (which is
  recycled for herk, her2k, syrk, and syr2k). Previously, the
  rectangular region of the current MC x NC panel of C would be
  parallelized separately from the diagonal region of that same
  submatrix, with the rectangular portion being assigned to threads via
  slab or round-robin (rr) partitioning (as determined at configure-
  time) and the diagonal region being assigned via round-robin. This
  approach did not work well when extracting lots of parallelism from
  the JR loop and was often suboptimal even for smaller degrees of
  parallelism. This commit implements tile-level load balancing (tlb) in
  which the IR loop is effectively subjugated in service of more
  equitably dividing work in the JR loop. This approach is especially
  potent for certain situations where the diagonal region of the MC x NC
  panel of C is significant relative to the entire region. However, it
  also seems to benefit many problem sizes of other level-3 operations
  (excluding trsm, which has an inherent algorithmic dependency in the
  IR loop that prevents the application of tlb). For now, tlb is
  implemented as _var2b.c macrokernels for gemm (which forms the basis
  for gemm, hemm, and symm), gemmt (which forms the basis of herk,
  her2k, syrk, and syr2k), and trmm (which forms the basis of trmm and
  trmm3). Which function pointers (_var2() or _var2b()) are embedded in
  the control tree will depend on whether the BLIS_ENABLE_JRIR_TLB cpp
  macro is defined, which is controlled by the value passed to the
  existing --thread-part-jrir=METHOD (or -r METHOD) configure option.
  This script adds 'tlb' as a valid option alongside the previously
  supported values of 'slab' and 'rr'. ('tlb' is now the default.)
  Thanks to Leick Robinson for abstractly inspiring this work, and to
  Minh Quan Ho for inquiring (in PR #562, and before that in Issue #437)
  about the possibility of improved load balance in macrokernel loops,
  and even prototyping what it might look like, long before I fully
  understood the problem.
- In bli_thread_range_weighted_sub(), tweaked the way we compute the
  area of the current MC x NC trapezoidal panel of C by better taking
  into account the microtile structure along the diagonal. Previously,
  it was an underestimate, as it assumed MR = NR = 1 (that is, it
  assumed that the microtile column of C that overlapped with the diagonal
  exactly coincided with the diagonal). Now, we only assume MR = NR.
  This is still a slight underestimate when MR != NR, so the additional
  area is scaled by 1.5 in a hackish attempt to compensate for this, as
  well as other additional effects that are difficult to model (such as
  the increased cost of writing to temporary tiles before finally
  updating C). The net effect of this better estimation of the
  trapezoidal area should be (on average) slightly larger regions
  assigned to threads that have little or no overlap with the diagonal
  region (and correspondingly slightly smaller regions in the diagonal
  region), which we expect will lead to slightly better load balancing
  in most situations.
- Spun off the contents of bli_thread.[ch] that relate to computing
  thread ranges into one of three source/header file pairs:
  - bli_thread_range.[ch], which define functions that are not specific
    to the jr/ir loops;
  - bli_thread_range_slab_rr.[ch], which define functions that implement
    slab or round-robin partitioning for the jr/ir loops;
  - bli_thread_range_tlb.[ch], which define functions that implement
    tlb for the jr/ir loops.
- Fixed the computation of a_next in the last iteration of the IR loop
  in bli_gemmt_l_ker_var2(). Previously, it always "wrapped" back around
  to the first micropanel of the current MC x KC packed block of A.
  However, this is almost never actually the micropanel that is used
  next. A new macro, bli_gemmt_l_wrap_a_upanel(), computes a_next
  correctly, with a similarly named bli_gemmt_u_wrap_a_upanel() for use
  in the upper-stored case (which *does* actually always choose the
  first micropanel of A as its a_next at the end of the IR loop).
- Removed adjustments for a_next/b_next (a2/b2) for the diagonal-
  intersecting case of gemmt_l_ker_var2() and the above-diagonal case
  of gemmt_u_ker_var2() since these cases will only coincide with the
  last iteration of the IR loop in very small problems.
- Defined bli_is_last_iter_l() and bli_is_last_iter_u(), the latter of
  which explicitly considers whether the current microtile is the last
  tile that intersects the diagonal. (The former does the same, but the
  computation coincides with the original bli_is_last_iter().) These
  functions are now used in gemmt to test when a_next (or a2) should
  "wrap" (as discussed above). Also defined bli_is_last_iter_tlb_l()
  and bli_is_last_iter_tlb_u(), which are similar to the aforementioned
  functions but are used when employing tlb in gemmt.
- Redefined macros in bli_packm_thrinfo.h, which test whether an
  iteration of work is assigned to a thread, as static inline functions
  in bli_param_macro_defs.h (and then deleted bli_packm_thrinfo.h).
  In the process of redefining these macros, I also renamed them from
  bli_packm_my_iter_rr/sl() to bli_is_my_iter_rr/sl().
- Renamed
    bli_thread_range_jrir_rr() -> bli_thread_range_rr()
    bli_thread_range_jrir_sl() -> bli_thread_range_sl()
    bli_thread_range_jrir()    -> bli_thread_range_slrr()
- Renamed
    bli_is_last_iter() -> bli_is_last_iter_slrr()
- Defined
    bli_info_get_thread_jrir_tlb()
  and renamed:
  - bli_info_get_thread_part_jrir_slab() ->
    bli_info_get_thread_jrir_slab()
  - bli_info_get_thread_part_jrir_rr() ->
    bli_info_get_thread_jrir_rr()
- Modified bli_rntm_set_ways_for_op() to redirect IR loop parallelism
  into the JR loop when tlb is enabled for non-trsm level-3 operations.
- Added a sanity check to prevent bli_prune_unref_mparts() from being
  used on packed objects. This prohibition is necessary because the
  current implementation does not take into account the atomicity of
  packed micropanel widths relative to the diagonal of structured
  matrices. That is, the function prunes greedily without regard to
  whether doing so would prune off part of a micropanel *which has
  already been packed* and assigned to a thread for inclusion in the
  computation.
- Further restricted early returns in bli_prune_unref_mparts() to
  situations where the primary matrix is not only of general structure
  but also dense (in terms of its uplo_t value). The addition of the
  matrix's dense-ness to the conditional is required because gemmt is
  somewhat unusual in that its C matrix has general structure but is
  marked as lower- or upper-stored via its uplo_t. By only checking
  for general structure, attempts to prune gemmt C matrices would
  incorrectly result in early returns, even though that operation
  effectively treats the matrix as symmetric (and stored in only one
  triangle).
- Fixed a latent bug in bli_thread_range_rr() wherein incorrect ranges
  were computed when 1 < bf. Thankfully, this bug was not yet
  manifesting since all current invocations used bf == 1.
- Fixed a latent bug in some unexercised code in bli_?gemmt_l_ker_var2()
  that would perform incorrect pruning of unreferenced regions above
  where the diagonal of a lower-stored matrix intersects the right edge.
  Thankfully, the bug was not harming anything since those unreferenced
  regions were being pruned prior to the macrokernel.
- Rewrote slab/rr-based gemmt macrokernels so that they no longer carved
  C into rectangular and diagonal regions prior to parallelizing each
  separately. The new macrokernels use a unified loop structure where
  quadratic (slab) partitioning is used.
- Updated all level-3 macrokernels to have a more uniform coding style,
  such as with respect to combining variable declarations with initializations as
  well as the use of const.
- Removed old prototypes in bli_gemmt_var.h and bli_trmm_var.h that
  corresponded to functions that were removed in aeb5f0c.
- Other very minor cleanups.
- Comment updates.
fgvanzee added a commit that referenced this issue Jan 11, 2023
fgvanzee added a commit that referenced this issue May 20, 2024