Tile-level partitioning in jr/ir loops (ex-trsm). #695

Merged
merged 10 commits into master from tlb on Jan 11, 2023

Conversation

fgvanzee
Member

@fgvanzee commented Dec 9, 2022

Details:

  • Reimplemented parallelization of the JR loop in gemmt (which is recycled for herk, her2k, syrk, and syr2k). Previously, the rectangular region of the current MC x NC panel of C would be parallelized separately from the diagonal region of that same submatrix, with the rectangular portion being assigned to threads via slab or round-robin (rr) partitioning (as determined at configure-time) and the diagonal region being assigned via round-robin. This approach did not work well when extracting lots of parallelism from the JR loop and was often suboptimal even for smaller degrees of parallelism. This commit implements tile-level load balancing (tlb), in which the IR loop is effectively subjugated in service of more equitably dividing work in the JR loop. (A minimal sketch of the idea appears after this list.) This approach is especially potent for certain situations where the diagonal region of the MC x NR panel of C is significant relative to the entire region. However, it also seems to benefit many problem sizes of other level-3 operations (excluding trsm, which has an inherent algorithmic dependency in the IR loop that prevents the application of tlb). For now, tlb is implemented as _var2b.c macrokernels for gemm (which forms the basis for gemm, hemm, and symm), gemmt (which forms the basis of herk, her2k, syrk, and syr2k), and trmm (which forms the basis of trmm and trmm3). Which function pointers (_var2() or _var2b()) are embedded in the control tree will depend on whether the BLIS_ENABLE_JRIR_TLB cpp macro is defined, which is controlled by the value passed to the existing --thread-part-jrir=METHOD (or -r METHOD) configure option. The configure script now accepts 'tlb' as a valid option alongside the previously supported values of 'slab' and 'rr'. ('tlb' is now the default.) Thanks to Leick Robinson for abstractly inspiring this work, and to Minh Quan Ho for inquiring (in PR #562, "Detect and deal with mis-balancing in GEMM macro-kernel", and before that in Issue #437, "Mis-balanced work between JR/IR threads in edge-macro-blocks") about the possibility of improved load balance in macrokernel loops, and even prototyping what it might look like, long before I fully understood the problem.
  • In bli_thread_range_weighted_sub(), tweaked the way we compute the area of the current MC x NC trapezoidal panel of C by better taking into account the microtile structure along the diagonal. Previously, it was an underestimate, as it assumed MR = NR = 1 (that is, it assumed that the column of microtiles of C that overlapped with the diagonal coincided exactly with the diagonal). Now, we only assume MR = NR. This is still a slight underestimate when MR != NR, so the additional area is scaled by 1.5 in a hackish attempt to compensate for this, as well as for other effects that are difficult to model (such as the increased cost of writing to temporary tiles before finally updating C). The net effect of this better estimation of the trapezoidal area should be (on average) slightly larger regions assigned to threads that have little or no overlap with the diagonal region (and correspondingly slightly smaller regions in the diagonal region), which we expect will lead to slightly better load balancing in most situations.
  • Spun off the contents of bli_thread.[ch] that relate to computing thread ranges into one of three source/header file pairs:
    • bli_thread_range.[ch], which define functions that are not specific to the jr/ir loops;
    • bli_thread_range_slab_rr.[ch], which define functions that implement slab or round-robin partitioning for the jr/ir loops;
    • bli_thread_range_tlb.[ch], which define functions that implement tlb for the jr/ir loops.
  • Fixed the computation of a_next in the last iteration of the IR loop in bli_gemmt_l_ker_var2(). Previously, it always "wrapped" back around to the first micropanel of the current MC x KC packed block of A. However, this is almost never actually the micropanel that is used next. A new macro, bli_gemmt_l_wrap_a_upanel(), computes a_next correctly, with a similarly named bli_gemmt_u_wrap_a_upanel() for use in the upper-stored case (which does actually always choose the first micropanel of A as its a_next at the end of the IR loop).
  • Removed adjustments for a_next/b_next (a2/b2) for the diagonal-intersecting case of gemmt_l_ker_var2() and the above-diagonal case of gemmt_u_ker_var2() since these cases will only coincide with the last iteration of the IR loop in very small problems.
  • Defined bli_is_last_iter_l() and bli_is_last_iter_u(), the latter of which explicitly considers whether the current microtile is the last tile that intersects the diagonal. (The former does the same, but the computation coincides with the original bli_is_last_iter().) These functions are now used in gemmt to test when a_next (or a2) should "wrap" (as discussed above). Also defined bli_is_last_iter_tlb_l() and bli_is_last_iter_tlb_u(), which are similar to the aforementioned functions but are used when employing tlb in gemmt.
  • Redefined macros in bli_packm_thrinfo.h, which test whether an iteration of work is assigned to a thread, as static inline functions in bli_param_macro_defs.h (and then deleted bli_packm_thrinfo.h). In the process of redefining these macros, I also renamed them from bli_packm_my_iter_rr/sl() to bli_is_my_iter_rr/sl(). (A toy illustration of the two assignment policies appears after this list.)
  • Renamed
    • bli_thread_range_jrir_rr() -> bli_thread_range_rr()
    • bli_thread_range_jrir_sl() -> bli_thread_range_sl()
    • bli_thread_range_jrir() -> bli_thread_range_slrr()
  • Renamed
    • bli_is_last_iter() -> bli_is_last_iter_slrr()
  • Defined
    • bli_info_get_thread_jrir_tlb()
      and renamed:
    • bli_info_get_thread_part_jrir_slab() -> bli_info_get_thread_jrir_slab()
    • bli_info_get_thread_part_jrir_rr() -> bli_info_get_thread_jrir_rr()
  • Modified bli_rntm_set_ways_for_op() to redirect IR loop parallelism into the JR loop when tlb is enabled for non-trsm level-3 operations. (See the sketch after this list.)
  • Added a sanity check to prevent bli_prune_unref_mparts() from being used on packed objects. This prohibition is necessary because the current implementation does not take into account the atomicity of packed micropanel widths relative to the diagonal of structured matrices. That is, the function prunes greedily without regard to whether doing so would prune off part of a micropanel which has already been packed and assigned to a thread for inclusion in the computation.
  • Further restricted early returns in bli_prune_unref_mparts() to situations where the primary matrix is not only of general structure but also dense (in terms of its uplo_t value). The addition of the matrix's dense-ness to the conditional is required because gemmt is somewhat unusual in that its C matrix has general structure but is marked as lower- or upper-stored via its uplo_t. By only checking for general structure, attempts to prune gemmt C matrices would incorrectly result in early returns, even though that operation effectively treats the matrix as symmetric (and stored in only one triangle).
  • Fixed a latent bug in bli_thread_range_rr() wherein incorrect ranges were computed when 1 < bf. Thankfully, this bug was not yet manifesting since all current invocations used bf == 1.
  • Fixed a latent bug in some unexercised code in bli_?gemmt_l_ker_var2() that would perform incorrect pruning of unreferenced regions above where the diagonal of a lower-stored matrix intersects the right edge. Thankfully, the bug was not harming anything since those unreferenced regions were being pruned prior to the macrokernel.
  • Rewrote the slab/rr-based gemmt macrokernels so that they no longer carve C into rectangular and diagonal regions prior to parallelizing each separately. The new macrokernels use a unified loop structure in which quadratic (slab) partitioning is applied.
  • Updated all level-3 macrokernels to use a more uniform coding style, e.g. combining variable declarations with initializations and making more use of const.
  • Removed old prototypes in bli_gemmt_var.h and bli_trmm_var.h that corresponded to functions that were removed in aeb5f0c.
  • Other very minor cleanups.
  • Comment updates.
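
To make the tlb idea in the first bullet concrete, here is a minimal, self-contained sketch (an illustration under stated assumptions, not the BLIS implementation: the typedef, the helper name tlb_range(), and the ir-fastest tile ordering are all invented here) of how the jr/ir iteration space over a dense panel can be flattened into one range of microtiles and divided evenly among threads:

```c
// Hypothetical sketch of tile-level load balancing (tlb): the jr/ir iteration
// space over a panel is flattened into a single range of microtiles, which is
// then split as evenly as possible across threads.
#include <stdio.h>

typedef long dim_t; // stand-in for BLIS's dim_t in this standalone sketch

// Split m_iter * n_iter microtiles among nt threads; return how many belong to
// thread tid and the (jr, ir) coordinates of its first microtile.
static dim_t tlb_range( dim_t m_iter, dim_t n_iter, dim_t nt, dim_t tid,
                        dim_t* jr_st, dim_t* ir_st )
{
    const dim_t n_ut        = m_iter * n_iter;  // total microtiles in the panel
    const dim_t n_ut_base   = n_ut / nt;        // every thread gets at least this many
    const dim_t n_ut_rem    = n_ut % nt;        // the first n_ut_rem threads get one extra
    const dim_t n_ut_for_me = n_ut_base + ( tid < n_ut_rem ? 1 : 0 );

    // Offset of this thread's first microtile within the flattened range.
    const dim_t ut_st = tid * n_ut_base + ( tid < n_ut_rem ? tid : n_ut_rem );

    // Map the flat index back to jr/ir coordinates (ir varies fastest here).
    *jr_st = ut_st / m_iter;
    *ir_st = ut_st % m_iter;

    return n_ut_for_me;
}

int main( void )
{
    dim_t jr_st, ir_st;
    // Example: a 7 x 5 grid of microtiles divided among 4 threads.
    for ( dim_t tid = 0; tid < 4; tid++ )
    {
        const dim_t n_ut = tlb_range( 7, 5, 4, tid, &jr_st, &ir_st );
        printf( "thread %ld: %ld microtiles starting at (jr, ir) = (%ld, %ld)\n",
                ( long )tid, ( long )n_ut, ( long )jr_st, ( long )ir_st );
    }
    return 0;
}
```

In this sketch each thread walks its n_ut_for_me consecutive microtiles starting at (jr_st, ir_st), wrapping from the end of one microtile column to the start of the next, which is what allows the diagonal region to be balanced together with the rectangular region rather than being parallelized separately.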
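For reference, a toy illustration (not the exact BLIS definitions of bli_is_my_iter_rr/sl(); the names and signatures here are simplified) of the round-robin vs. slab assignment policies mentioned above:

```c
// Toy illustration: round-robin (rr) hands iteration i to thread (i % nt),
// while slab (sl) hands each thread one contiguous chunk of iterations.
#include <stdbool.h>

static bool is_my_iter_rr( long i, long tid, long nt )
{
    return ( i % nt ) == tid;
}

static bool is_my_iter_sl( long i, long start, long end )
{
    return ( start <= i ) && ( i < end ); // [start, end) computed per thread
}
```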
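And a hedged sketch of the "redirect IR parallelism into the JR loop" behavior described for bli_rntm_set_ways_for_op() (the struct and helper below are illustrative stand-ins, not the rntm_t API):

```c
// Illustrative stand-in: when tlb is enabled, fold the ir-loop ways into the
// jr-loop ways for every level-3 operation except trsm.
typedef struct { long jr_ways; long ir_ways; } ways_t;

static void fold_ir_into_jr_for_tlb( ways_t* ways, int is_trsm, int tlb_enabled )
{
    // trsm keeps its ir-loop parallelism because of the algorithmic dependency
    // in its ir loop; all other level-3 operations give those ways to the jr loop.
    if ( tlb_enabled && !is_trsm )
    {
        ways->jr_ways *= ways->ir_ways;
        ways->ir_ways  = 1;
    }
}
```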

@hominhquan You'll be interested in this PR. 🎉

Details:
- Applied the following function renames to the 'gemmlike' sandbox:
    bli_thread_range_jrir() -> bli_thread_range_slrr()
    bli_is_last_iter()      -> bli_is_last_iter_slrr()
    bli_packm_my_iter()     -> bli_is_my_iter()
  This change was intended for 62d383f.
- Updated bls_l3_packm_var[123].c to use bli_thrinfo_n_way() and
  bli_thrinfo_work_id() instead of bli_thrinfo_num_threads() and
  bli_thrinfo_thread_id(), respectively. This change probably should
  have been included in aeb5f0c.
@hominhquan
Contributor

Great. I'll look deeply at this as soon as I can 👍

dim_t jr_st, ir_st; \
const dim_t n_ut_for_me = bli_thread_range_tlb( thread, 0, BLIS_DENSE, m, n, MR, NR, &jr_st, &ir_st );
@hominhquan
Contributor
I would rather pass in m_iter, n_iter instead of m, n, since this function recomputes them from m, n, MR, NR.
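
(For context, the recomputation referred to here is presumably the usual ceiling division of the panel dimensions by the register blocksizes; the snippet below is illustrative, not the exact BLIS code.)

```c
/* Illustrative only: the microtile grid dimensions that would otherwise be
   recomputed inside the callee from m, n, MR, NR. */
#include <stdio.h>

int main( void )
{
    const long m = 1000, n = 1000, MR = 6, NR = 8;   /* example values */
    const long m_iter = ( m + MR - 1 ) / MR;         /* ceil( m / MR ) */
    const long n_iter = ( n + NR - 1 ) / NR;         /* ceil( n / NR ) */
    printf( "m_iter = %ld, n_iter = %ld\n", m_iter, n_iter );
    return 0;
}
```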

@fgvanzee
Member Author
Good catch! I'll make that change.

Details:
- Always pass in nt and tid into thread_range_tlb functions. Same with
  m_iter and n_iter. Thanks to Minh Quan Ho for suggesting this extra
  consistency.
- Inlined jr_inc, ir_inc in bli_gemm_ker_var2b.c and removed stray
  backslash.
- Added functions that are specific to lower/upper (for gemmt) and
  ll/lu/rl/ru (for trmm).
- Added const to bli_thread_range_tlb() (which now isn't even called).
Details:
- Renamed tlb implementation functions.
- Reordered functions in bli_thread_range_tlb.c.
- Whitespace changes.
@fgvanzee merged commit 2e1ba9d into master Jan 11, 2023
@fgvanzee deleted the tlb branch January 11, 2023 03:05
fgvanzee added a commit that referenced this pull request May 20, 2024
- (cherry picked from commit 2e1ba9d)