Test PR #4

dzambare · 2021-11-22T14:48:24Z

Experimenting with merging flame/blis using GitHub.

Details: - Switched the small block allocator (sba), as defined in bli_sba.c and bli_apool.c, to static initialization of its internal mutex. Did a similar thing for the packing block allocator (pba), which appears as global_membrk in bli_membrk.c. - Commented out bli_membrk_init_mutex() and bli_membrk_finalize_mutex() to ensure they won't be used in the future. - In bli_thrcomm_pthreads.c and .h, removed old, commented-out cpp blocks guarded by BLIS_USE_PTHREAD_MUTEX.
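A minimal sketch of the static-initialization pattern described above, assuming POSIX threads. The names `sketch_pool_mutex`, `sketch_blocks_in_use`, and `sketch_pool_checkout()` are hypothetical and are not taken from bli_sba.c, bli_apool.c, or bli_membrk.c.

```c
#include <pthread.h>

/* Hypothetical stand-in for an allocator's internal lock. Because the mutex
   is statically initialized, no init/finalize call is needed at runtime. */
static pthread_mutex_t sketch_pool_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical counter protected by the mutex. */
static int sketch_blocks_in_use = 0;

/* Check out one block and return the number of blocks now in use. */
int sketch_pool_checkout( void )
{
    int n;

    pthread_mutex_lock( &sketch_pool_mutex );
    n = ++sketch_blocks_in_use;
    pthread_mutex_unlock( &sketch_pool_mutex );

    return n;
}
```

The mutex is usable from the first call onward, which is what makes separate init/finalize functions unnecessary.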

Details: - Renamed the files, variables, and functions relating to the packing block allocator from its legacy name (membrk) to its current name (pba). This more clearly contrasts the packing block allocator with the small block allocator (sba). - Fixed a typo in bli_pack_set_pack_b(), defined in bli_pack.c, that caused the function to erroneously change the value of the pack_a field of the global rntm_t instead of the pack_b field. (Apparently nobody has used this API yet.) - Comment updates.

Details: - Removed the option to finalize BLIS after every BLAS call, which also means that BLIS would initialize at the beginning of every BLAS call. This option never really made sense and wasn't even implemented properly to begin with. (Because bli_init_auto() and _finalize_auto() were implemented in terms of bli_init_once() and _finalize_once(), respectively, the application would have only been able to call one BLAS routine before BLIS would find itself in an unusable, permanently uninitialized state.) Because this option was never meant for regular use, it never made it into configure as an actual configure-time option, and therefore this commit only removes parts of the code affected by the cpp macro guard BLIS_ENABLE_STAY_AUTO_INITIALIZED.

Details: - Added an err_t* parameter to memory allocation functions including bli_malloc_intl(), bli_calloc_intl(), bli_malloc_user(), bli_fmalloc_align(), and bli_fmalloc_noalign(). Since these functions already use the return value to return the allocated memory address, they can't communicate errors to the caller through the return value. This commit does not employ any error checking within these functions or their callers, but this sets up BLIS for a more comprehensive commit that moves in that direction. - Moved the typedefs for malloc_ft and free_ft from bli_malloc.h to bli_type_defs.h. This was done so that what remains of bli_malloc.h can be included after the definition of the err_t enum. (This ordering was needed because bli_malloc.h now contains function prototypes that use err_t.) - Defined bli_is_success() and bli_is_failure() static functions in bli_param_macro_defs.h. These functions provide easy checks for error codes and will be used more heavily in future commits. - Unfortunately, the additional err_t* argument discussed above breaks the API for bli_malloc_user(), which is an exported symbol in the shared library. However, it's quite possible that the only application that calls bli_malloc_user()--indeed, the reason it was marked for symbol exporting to begin with--is the BLIS testsuite. And if that's the case, this breakage won't affect anyone. Nonetheless, the "major" part of the so_version file has been updated accordingly to 4.0.0.
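A rough sketch of the out-parameter pattern this commit sets up. All names here (`sketch_err_t`, `sketch_malloc()`, `sketch_is_success()`, `sketch_is_failure()`) are hypothetical stand-ins; the real err_t enum and function signatures are not reproduced. The sketch also goes one step further than the commit by actually setting the error code.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical error codes; the real err_t enum has many more values. */
typedef enum
{
    SKETCH_SUCCESS       =  0,
    SKETCH_MALLOC_FAILED = -1
} sketch_err_t;

/* Easy checks for error codes, in the spirit of bli_is_success() and
   bli_is_failure(). */
static inline bool sketch_is_success( sketch_err_t r ) { return r == SKETCH_SUCCESS; }
static inline bool sketch_is_failure( sketch_err_t r ) { return r != SKETCH_SUCCESS; }

/* The return value carries the address, so the status is reported through
   the pointer argument instead. */
void* sketch_malloc( size_t size, sketch_err_t* r_val )
{
    void* p = malloc( size );

    *r_val = ( p == NULL && size != 0 ) ? SKETCH_MALLOC_FAILED
                                        : SKETCH_SUCCESS;
    return p;
}
```

A caller would check `sketch_is_failure( r )` before touching the returned pointer.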

Needed for compiling on e.g. Mac M1. AFAIK clang supports the same -mcpu flag for ThunderX2 as gcc.

Details: - Changed bli_pack_get_pack_a() and bli_pack_get_pack_b() so that instead of returning a bool, they set a bool that is passed in by address. This does break the public exported API, but I expect very few users actually use this function. (This change is being made in preparation for a much more extensive commit relating to error checking.)

Details: - Defined getijv, setijv operations to get and set elements of a vector, in bli_setgetijv.c and .h. - Renamed bli_setgetij.c and .h to bli_setgetijm.c and .h, respectively. - Added additional bounds checking to getijm and setijm to prevent actions with negative indices. - Added documentation to BLISObjectAPI.md and BLISTypedAPI.md for getijv and setijv. - Added documentation to BLISTypedAPI.md for getijm and setijm, which were inadvertently missing. - Added a new entry to the FAQ titled "Why does BLIS have vector (level-1v) and matrix (level-1m) variations of most level-1 operations?" - Comment updates.
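A simplified sketch of element access with the kind of bounds checking described above, restricted to real double-precision vectors. `sketch_vec_t`, `sketch_getijv()`, and `sketch_setijv()` are hypothetical; the real operations work on BLIS objects or typed buffers and their exact signatures are not shown here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical strided view of a vector: n elements spaced inc apart. */
typedef struct
{
    double*   buf;
    ptrdiff_t n;
    ptrdiff_t inc;
} sketch_vec_t;

/* Read element i into *val. Returns false for negative or too-large i. */
bool sketch_getijv( ptrdiff_t i, const sketch_vec_t* x, double* val )
{
    if ( i < 0 || i >= x->n ) return false;

    *val = x->buf[ i * x->inc ];
    return true;
}

/* Write val to element i. Returns false for negative or too-large i. */
bool sketch_setijv( double val, ptrdiff_t i, sketch_vec_t* x )
{
    if ( i < 0 || i >= x->n ) return false;

    x->buf[ i * x->inc ] = val;
    return true;
}
```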

Details: - Added new implementations of bli_slamch() and bli_dlamch() that use constants from the standard C library in lieu of dynamically-computed values (via code inherited from netlib). The previous implementation is still available when the cpp macro BLIS_ENABLE_LEGACY_LAMCH is defined by the subconfiguration at compile-time. Thanks to Devin Matthews for providing this patch, and to Stefano Zampini for reporting the issue (#497) that prompted Devin to propose the patch.
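To illustrate the idea of answering lamch queries from standard C library constants, here is a sketch built on `<float.h>`. The function name `sketch_dlamch()` is hypothetical, the query letters follow the usual LAPACK dlamch convention, and round-to-nearest arithmetic is assumed; this is not the patch itself.

```c
#include <ctype.h>
#include <float.h>

/* Return a double-precision machine parameter selected by cmach. */
double sketch_dlamch( char cmach )
{
    /* Relative machine epsilon, assuming round-to-nearest. */
    const double eps = DBL_EPSILON * 0.5;

    switch ( toupper( ( unsigned char )cmach ) )
    {
        case 'E': return eps;                      /* epsilon            */
        case 'S':                                  /* safe minimum       */
        {
            double sfmin = DBL_MIN;
            double small = 1.0 / DBL_MAX;

            /* Guard against 1/sfmin overflowing. */
            if ( small >= sfmin ) sfmin = small * ( 1.0 + eps );
            return sfmin;
        }
        case 'B': return ( double )FLT_RADIX;      /* base               */
        case 'P': return eps * FLT_RADIX;          /* precision          */
        case 'N': return ( double )DBL_MANT_DIG;   /* digits in mantissa */
        case 'R': return 1.0;                      /* rounding occurs    */
        case 'M': return ( double )DBL_MIN_EXP;    /* min exponent       */
        case 'U': return DBL_MIN;                  /* underflow          */
        case 'L': return ( double )DBL_MAX_EXP;    /* max exponent       */
        case 'O': return DBL_MAX;                  /* overflow           */
        default:  return 0.0;
    }
}
```

Nothing is computed iteratively at runtime, which is the point of the change: every value comes straight from compile-time constants.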

Details: - Defined eqsc, eqv, and eqm operations, which set a bool depending on whether the two scalars, two vectors, or two matrix operands are equal (element-wise). eqsc and eqv support implicit conjugation and eqm supports diagonal offset, diag, uplo, and trans parameters (in a manner consistent with other level-1m operations). These operations are currently housed under frame/util, at least for now, because they are not computational in nature. - Redefined bli_obj_equals() in terms of eqsc, eqv, and eqm. - Documented eqsc, eqv, and eqm in BLISObjectAPI.md and BLISTypedAPI.md. Also: - Documented getsc and setsc in both docs. - Reordered entry for setijv in BLISTypedAPI.md, and added separator bars to both docs. - Added missing "Observed object properties" clauses to various level-1v entries in BLISObjectAPI.md. - Defined bli_apply_trans() in bli_param_macro_defs.h. - Defined supporting _check() function, bli_l0_xxbsc_check(), in bli_l0_check.c for eqsc. - Programming style and whitespace updates to bli_l1m_unb_var1.c. - Whitespace updates to bli_l0_oapi.c, bli_l1m_oapi.c - Consolidated redundant macro redefinition for copym function pointer type in bli_l1m_ft.h. - Added macros to bli_oapi_ba.h, _ex.h, and bli_tapi_ba.h, _ex.h that allow oapi and tapi source files to forego defining certain expert functions. (Certain operations such as printv and printm do not need to have both basic and expert interfaces. This also includes eqsc, eqv, and eqm.)
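A minimal sketch of an eqv-style operation for real double-precision vectors: it sets a bool through a pointer rather than returning it. The name `sketch_deqv()` and its argument order are assumptions for illustration; conjugation is omitted because the sketch handles only real data.

```c
#include <stdbool.h>
#include <stddef.h>

/* Set *is_eq to true if x and y are element-wise equal, false otherwise.
   incx and incy are the strides between consecutive elements. */
void sketch_deqv
     (
       ptrdiff_t     n,
       const double* x, ptrdiff_t incx,
       const double* y, ptrdiff_t incy,
       bool*         is_eq
     )
{
    *is_eq = true;

    for ( ptrdiff_t i = 0; i < n; ++i )
    {
        if ( x[ i * incx ] != y[ i * incy ] )
        {
            /* One mismatch is enough; stop early. */
            *is_eq = false;
            return;
        }
    }
}
```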

Details: - Changed #ifdef BLIS_OAPI_BASIC to #ifdef BLIS_TAPI_BASIC in bli_util_ft.h. This typo was causing some types to be redefined when they weren't supposed to be.

Details: - Added frame/include/bli_xapi_undef.h, which explicitly undefines all macros defined in bli_oapi_ba.h, bli_oapi_ex.h, bli_tapi_ba.h, and bli_tapi_ex.h. (This is for safety and good cpp coding practice, not because it fixes anything.) - Added #include "bli_xapi_undef.h" to bli_l1v.h, bli_l1d.h, bli_l1f.h, bli_l1m.h, bli_l2.h, bli_l3.h, and bli_util.h. - Comment updates to bli_oapi_ba.h, bli_oapi_ex.h, bli_tapi_ba.h, and bli_tapi_ex.h. - Moved frame/3/bli_l3_ft_ex.h to local 'old' directory after realizing that nothing in BLIS used those function pointer types. Also commented out the "#include bli_l3_ft_ex.h" directive in frame/3/bli_l3.h.

Details: - Inserted a "#include bli_xapi_undef.h" after each usage of the basic and expert API macro setup headers: bli_oapi_ba.h, bli_oapi_ex.h, bli_tapi_ba.h, and bli_tapi_ex.h. This is functionally equivalent to the previous status quo, in which each header made minimal #undef prior to its own definitions and then a single instance of "#include bli_xapi_undef.h" cleaned up any remaining macro defs after all other headers were used. This commit will guarantee that macro defs from the setup of one header (say, bli_oapi_ex.h) don't "infect" the definitions made in a subsequent header. As with this previous commit, this change does not fix any issue but rather attempts to avoid creating orphaned macro definitions that are only needed within a very limited scope. - Removed minimal #undef from bli_?api_[ba|ex].h. - Removed old commented-out lines from bli_?api_[ba|ex].h.
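The define/use/undefine discipline described above can be sketched in a single file. The macro names (`SKETCH_FUNC`, `SKETCH_EXTRA_PARAMS`, `SKETCH_EXTRA_VALUE`) and the function `sketch_addone` are hypothetical; they only mimic the role that the setup headers and bli_xapi_undef.h play.

```c
/* "Setup header" for the expert interface: adds a suffix and a parameter. */
#define SKETCH_FUNC( name )   name ## _ex
#define SKETCH_EXTRA_PARAMS   , int bias
#define SKETCH_EXTRA_VALUE    bias

/* Instantiates sketch_addone_ex( int x, int bias ). */
int SKETCH_FUNC( sketch_addone )( int x SKETCH_EXTRA_PARAMS )
{
    return x + 1 + SKETCH_EXTRA_VALUE;
}

/* "Undef header": clean up immediately so nothing leaks into later code. */
#undef SKETCH_FUNC
#undef SKETCH_EXTRA_PARAMS
#undef SKETCH_EXTRA_VALUE

/* "Setup header" for the basic interface: no suffix, no extra parameter. */
#define SKETCH_FUNC( name )   name
#define SKETCH_EXTRA_PARAMS
#define SKETCH_EXTRA_VALUE    0

/* Instantiates sketch_addone( int x ). */
int SKETCH_FUNC( sketch_addone )( int x SKETCH_EXTRA_PARAMS )
{
    return x + 1 + SKETCH_EXTRA_VALUE;
}

#undef SKETCH_FUNC
#undef SKETCH_EXTRA_PARAMS
#undef SKETCH_EXTRA_VALUE
```

Because every setup is followed by its own cleanup, the second setup starts from a clean slate instead of relying on redefinition.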

Details: - Added 512-bit specific 'a64fx' subconfiguration that uses empirically tuned block size by Stepan Nassyr. This subconfig also sets the sector cache size and enables memory-tagging code in SVE gemm kernels. This subconfig utilizes (16, k) and (10, k) DPACKM kernels. - Added a vector-length agnostic 'armsve' subconfiguration that computes blocksizes according to the analytical model. This part is ported from Stepan Nassyr's repository. - Implemented vector-length-agnostic [d/s/sh] gemm kernels for Arm SVE at size (2*VL, 10). These kernels use unindexed FMLA instructions because indexed FMLA takes 2 FMA units in many implementations. PS: There are indexed-FMLA kernels in Stepan Nassyr's repository. - Implemented 512-bit SVE dpackm kernels with in-register transpose support for sizes (16, k) and (10, k). - Extended 256-bit SVE dpackm kernels by Linaro Ltd. to 512-bit for size (12, k). This dpackm kernel is not currently used by any subconfiguration. - Implemented several experimental dgemmsup kernels which would improve performance in a few cases. However, those dgemmsup kernels generally underperform, hence they are not currently used in any subconfig. - Note: This commit squashes several commits submitted by RuQing Xu via PR #424.

- Updated distro to 20.04 focal aarch64-gcc-10. This is the minimal version required by aarch64-gcc-10. SVE intrinsics would not compile without GCC >=10. - x86 toolchains use the official repo instead of ubuntu-toolchain-r/test. 20.04 focal is not supported by that PPA at the moment. - Add extra configuration-time options to .travis.yml. - Add Arm SVE entry to .travis.yml.

- ArmSVE: don't test gemmt (seems to be a Qemu-only problem); - Clang: use the TravisCI-provided version instead of pinning clang-8, since clang-8 seems to conflict with TravisCI's clang-7.

- Removed `V=1` in make line - Removed `CFLAGS` in configure line - Restored `pwd` surrounding OOT line

AMD requested removal due to unclear licensing terms; original code was from stackoverflow. The function is unused but could easily be replaced by new implementation.

Remove `rm-dupls` function in common.mk.

Check the C compiler for a predefined macro `_WIN32` to indicate (cross-)compilation for Windows. Fixes #463.

Upgrade Travis CI for Arm SVE

Add explicit compiler check for Windows.

Details: - Updated the performance graphs (pdfs and pngs) for the Fugaku/a64fx entry within Performance.md, and also updated the experiment details accordingly. Thanks to RuQing Xu for re-running the BLIS and SSL2 experiments reflected in this commit. - In Performance.md, added an English translation of the project name under which the Fugaku results were gathered, courtesy of RuQing Xu.

Details: - Added a new sandbox called 'gemmlike', which implements sequential and multithreaded gemm in the style of gemmsup but also unconditionally employs packing. The purpose of this sandbox is to (1) avoid select abstractions, such as objects and control trees, in order to allow readers to better understand how a real-world implementation of high-performance gemm can be constructed; (2) provide a starting point for expert users who wish to build something that is gemm-like without "reinventing the wheel." Thanks to Jeff Diamond, Tze Meng Low, Nicholai Tukanov, and Devangi Parikh for requesting and inspiring this work. - The functions defined in this sandbox currently use the "bls_" prefix instead of "bli_" in order to avoid any symbol collisions in the main library. - The sandbox contains two variants, each of which implements gemm via a block-panel algorithm. The only difference between the two is that variant 1 calls the microkernel directly while variant 2 calls the microkernel indirectly, via a function wrapper, which allows the edge case handling to be abstracted away from the classic five loops. - This sandbox implementation utilizes the conventional gemm microkernel (not the skinny/unpacked gemmsup kernels). - Updated some typos in the comments of a few files in the main framework.

Apple+Arm64 requires additional "tagging" of local symbols.

- x7, x8: Used to store address for Alpha and Beta. As Alpha & Beta were not used in k-loops, use x0, x1 to load Alpha & Beta's addresses after k-loops are completed, since A & B's addresses are no longer needed there. This "ldr [addr]; -> ldr val, [addr]" would not cause much performance drawback since it is done outside k-loops and there are plenty of instructions between Alpha & Beta's loading and usage. - x9: Used to store cs_c. x9 is multiplied by 8 into x10 and not used any longer. Directly loading cs_c into x10 and scaling by 8 frees x9 straightforwardly. - x11, x12: Not used at all. Simply remove from clobber list. - x13: Like x9, loaded and scaled by 8 into x14, except that x13 is also used in a conditional branch so that "cmp x13, #1" needs to be modified into "cmp x14, #8" to completely free x13. - x3, x4: Used to store next_a & next_b. Untouched in k-loops. Load these addresses into x0 and x1 after Alpha & Beta are both loaded, since then neither address of A/B nor address of Alpha/Beta is needed.

Roughly the same as 916e1fa, additionally with x15 clobbering removed. - x15: Not used at all. Compilation w/ Clang shows warning about x18 reservation, but compilation itself is OK and all tests passed.

Avoid x18 use in FP32 kernel: - C address lines x[18-26] renamed to x[19-27] (reg index +1) - Original role of x27 fulfilled by x5 which is free after k-loop part. FP64 does not require changing since x18 is not used there.

Details: - Fixed intermittent bugs in bli_packm_haswell_asm_c3xk.c and bli_packm_haswell_asm_c8xk.c whereby the imaginary component of the kappa scalar was incorrectly loaded at an offset of 8 bytes (instead of 4 bytes) from the real component. This was almost certainly a copy-paste bug carried over from the corresponding zpackm kernels. Thanks to Devin Matthews for bringing this to my attention. - Added missing code to gemmlike sandbox files bls_gemm_bp_var1.c and bls_gemm_bp_var2.c that initializes the elements of the temporary microtile to zero. (This bug was never observed in output but rather noticed analytically. It probably would have also manifested as intermittent failures, this time involving edge cases.) - Minor commented-out/disabled changes to testsuite/src/test_gemm.c relating to debugging.
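The 4-byte versus 8-byte offsets mentioned above follow directly from the layout of the complex types. A small sketch, using hypothetical struct names and assuming the usual 4-byte float and 8-byte double:

```c
#include <stddef.h>

/* Hypothetical single- and double-precision complex layouts. */
typedef struct { float  real; float  imag; } sketch_scomplex;
typedef struct { double real; double imag; } sketch_dcomplex;

/* Byte offset of the imaginary part from the real part. */
size_t sketch_imag_offset_c( void ) { return offsetof( sketch_scomplex, imag ); }
size_t sketch_imag_offset_z( void ) { return offsetof( sketch_dcomplex, imag ); }
```

On such platforms the single-precision offset is 4 and the double-precision offset is 8, so an assembly kernel copied from the double-precision version without adjusting the offset reads past the imaginary component.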

Horizontal subtraction instructions added to bli_x86_asm_macros.h, currently unused [ci skip].

FMOV [hsd]M, #imm does not allow zero immediate. Use wzr, xzr instead.

Arm SVE CGEMM / ZGEMM Natural Kernels

ARMSVE Block SVE-Intrinsic Kernels for GCC 8-9

Details: - The current mechanism for growing a pool_t doubles the length of the block_ptrs array every time the array length needs to be increased due to new blocks being added. However, that logic did not take into account the new total number of blocks, and the fact that the caller may be requesting more blocks than would fit even after doubling the current length of block_ptrs. The code comments now contain two illustrating examples that show why, even after doubling, we must always have at least enough room to fit all of the old blocks plus the newly requested blocks. - This commit also happens to fix a memory corruption issue that stems from growing any pool_t that is initialized with a block_ptrs length of 0. (Previously, the memory pool for packed buffers of C was initialized with a block_ptrs length of 0, but because it is unused this bug did not manifest by default.) - Co-authored-by: Minh Quan Ho <[email protected]>
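One way to express the growth rule described above is to take the larger of the doubled length and the total number of blocks that must fit. The helper `sketch_grown_len()` and its parameter names are hypothetical, and the actual pool_t code may be structured differently.

```c
#include <stddef.h>

/* Return the new length for the block_ptrs array.
   cur_len:        current array length
   num_blocks_cur: blocks already stored
   num_blocks_add: blocks the caller wants to add */
size_t sketch_grown_len
     (
       size_t cur_len,
       size_t num_blocks_cur,
       size_t num_blocks_add
     )
{
    const size_t needed  = num_blocks_cur + num_blocks_add;
    const size_t doubled = 2 * cur_len;

    /* Doubling alone is not enough when the request is large, or when
       cur_len is 0 (doubling 0 stays 0). */
    return ( doubled < needed ) ? needed : doubled;
}
```

For instance, an array of length 4 holding 4 blocks that receives a request for 10 more needs room for 14, while doubling would only provide 8.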

Details: - Previously, the block_ptrs field of the pool_t was allowed to be initialized as any unsigned integer, including 0. However, a length of 0 could be problematic given that the behavior of malloc(0) is implementation-defined and therefore varies across implementations. As a safety measure, we check for block_ptrs array lengths of 0 and, in that case, increase them to 1. - Co-authored-by: Minh Quan Ho <[email protected]>

Details: - Fixed a bug that broke the use of 1m for dcomplex when the single-precision real and double-precision real ukernels had opposing I/O preferences (row-preferential sgemm ukernel + column-preferential dgemm ukernel, or vice versa). The fix involved adjusting the API to bli_cntx_set_ind_blkszs() so that the induced method context init function (e.g., bli_cntx_init_<subconfig>_ind()) could call that function for only one datatype at a time. This allowed the blocksize scaling (which varies depending on whether we're doing 1m_r or 1m_c) to happen on a per-datatype basis. This fixes issue #557. Thanks to Devin Matthews and RuQing Xu for helping discover and report this bug. - The aforementioned 1m fix required moving the 1m_r/1m_c logic from bli_cntx_ref.c into a new function, bli_l3_set_schemas(), which is called from each level-3 _front() function. The pack_t schemas in the cntx_t were also removed entirely, along with the associated accessor functions. This in turn required updating the trsm1m-related virtual ukernels to read the pack schema for B from the auxinfo_t struct rather than the context. This also required slight tweaks to bli_gemm_md.c. - Repositioned the logic for transposing the operation to accommodate the microkernel IO preference. This mostly only affects gemm. Thanks to Devin Matthews for his help with this. - Updated dpackm pack ukernels in the 'armsve' kernel set to avoid querying pack_t schemas from the context. - Removed the num_t dt argument from the ind_cntx_init_ft type defined in bli_gks.c. The context initialization functions for induced methods were previously passed a dt argument, but I can no longer figure out *why* they were passed this value. To reduce confusion, I've removed the dt argument (including also from the function definition + prototype). - Commented out setting of cntx_t schemas in bli_cntx_ind_stage.c. 
This breaks high-level implementations of 3m and 4m, but this is okay since those implementations will be removed very soon. - Removed some older blocks of preprocessor-disabled code. - Comment update to test_libblis.c.

Details: - Fixed a bug in configure related to the building of the so-called config list. When processing the contents of config_registry, configure creates a series of structures and lists that allow for various mappings related to configuration families, subconfigs, and kernel sets. Two of those lists are built via substitution of umbrella families with their subconfig members, and one of those lists was improperly performing the substitution in a way that would erroneously match on partial umbrella family names. That code was changed to match the code that was already doing the substitution properly, via substitute_words(). - Added comments noting the importance of using substitute_words() in both instances.

Details: - Skip testing on piledriver, steamroller, and excavator platforms in travis/do_sde.sh.

Details: - Updated travis/do_sde.sh so that the script downloads the SDE tarball from a new ci-utils repository on GitHub rather than from Intel's website. This change is being made in an attempt to circumvent Travis CI's recent troubles with downloading the SDE from Intel's website via curl. Thanks to Devin Matthews for suggesting the idea.

Details: - Removed support for all induced methods except for 1m. This included removing code related to 3mh, 3m1, 4mh, 4m1a, and 4m1b as well as any code that existed only to support those implementations. These implementations were rarely used and posed code maintenance challenges for BLIS's maintainers going forward. - Removed reference kernels for packm that pack 3m and 4m micropanels, and removed 3m/4m-related code from bli_cntx_ref.c. - Removed support for 3m/4m from the code in frame/ind, then reorganized and streamlined the remaining code in that directory. The *ind(), *nat(), and *1m() APIs were all removed. (These additional API layers no longer made as much sense with only one induced method (1m) being supported.) The bli_ind.c file (and header) were moved to frame/base and bli_l3_ind.c (and header) and bli_l3_ind_tapi.h were moved to frame/3. - Removed 3m/4m support from the code in frame/1m/packm. - Removed 3m/4m support from trmm/trsm macrokernels and simplified some pointer arithmetic that was previously expressed in terms of the bli_ptr_inc_by_frac() static inline function (whose definition was also removed). - Removed the following subdirectories of level-0 macro headers from frame/include/level0: ri3, rih, ri, ro, rpi. The level-0 scalar macros defined in these directories were used exclusively for 3m and 4m method codes. - Simplified bli_cntx_set_blkszs() and bli_cntx_set_ind_blkszs() in light of 1m being the only induced method left within BLIS. - Removed dt_on_output field within auxinfo_t and its associated accessor functions. - Re-indexed the 1e/1r pack schemas after removing those associated with variants of the 3m and 4m methods. This leaves two bits unused within the pack format portion of the schema bitfield. (See bli_type_defs.h for more info.) - Spun off the basic and expert interfaces to the object and typed APIs into separate files: bli_l3_oapi.c and bli_l3_oapi_ex.c; bli_l3_tapi.c and bli_l3_tapi_ex.c. 
- Moved the level-3 operation-specific _check function calls from the operations' _front() functions to the corresponding _ex() function of the object API. (This change roughly maintains where the _check() functions are called in the call stack but lays the groundwork for future changes that may come to the level-3 object APIs.) Minor modifications to bli_l3_check.c to allow the check() functions to be called from the expert interface APIs. - Removed support within the testsuite for testing the aforementioned induced methods, and updated the standalone test drivers in the 'test' directory to reflect the retirement of those induced methods. - Modified the sandbox contract so that the user is obliged to define bli_gemm_ex() instead of bli_gemmnat(). (This change was made in light of the *nat() functions no longer existing.) Also updated the existing 'power10' and 'gemmlike' sandboxes to come into compliance with the new sandbox rules. - Updated BLISObjectAPI.md, BLISTypedAPI.md, Testsuite.md documentation to reflect the retirement of 3m/4m, and also modified Sandboxes.md to bring the document into alignment with new conventions. - Updated various comments; removed segments of commented-out code.

@ArcadioN09

Details: - Fixed a bug that broke certain mixed-datatype gemm behavior. This bug was introduced recently in e9da642 when the code that performs the operation transposition (for microkernel IO preference purposes) was moved up so that it occurred sooner. However, when I moved that code, I failed to notice that there was a cpp-protected "if" conditional that applied to the entire code block that was moved. Once the code block was relocated, the orphaned if-statement was now (erroneously) glomming on to the next thing that happened to be in the function, which happened to be the call to bli_rntm_set_ways_for_op(), causing a rather odd memory exhaustion error in the sba due to the num_threads field of the rntm_t still being -1 (because the rntm_t fields were never processed as they should have been). Thanks to @ArcadioN09 (Snehith) for reporting this error and helpfully including relevant memory trace output.
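The failure mode described above, a braceless cpp-guarded "if" that ends up controlling whatever statement follows it, can be reproduced in a few lines. Everything here is hypothetical: the macro `SKETCH_ENABLE_MD`, the condition, and the functions are illustrative stand-ins and do not reproduce the actual BLIS code.

```c
#include <stdbool.h>

/* Hypothetical feature macro standing in for the cpp guard. */
#define SKETCH_ENABLE_MD 1

/* Buggy version: the block that used to follow the "if" was moved away,
   so the "if" now binds to the next statement in the function. */
void sketch_front_buggy( bool is_mixed, int* num_threads )
{
#ifdef SKETCH_ENABLE_MD
    if ( !is_mixed )
#endif
    *num_threads = 1; /* stands in for processing the runtime fields */
}

/* Fixed version: the orphaned conditional is gone, so the field is
   always processed. */
void sketch_front_fixed( bool is_mixed, int* num_threads )
{
    ( void )is_mixed;

    *num_threads = 1;
}
```

In the buggy version, a mixed-datatype call leaves `*num_threads` at whatever value it had on entry (such as -1), mirroring the unprocessed field described in the commit.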

Details: - Renamed herk macrokernels and supporting files and functions to gemmt, which is possible since at the macrokernel level they are identical. Then recast herk/her2k/syrk/syr2k in terms of gemmt within the expert level-3 oapi (bli_l3_oapi_ex.c) while also redefining them as literal functions rather than cpp macros that instantiate multiple functions. Thanks to Devin Matthews for his efforts on this issue (#531). - Check that the maximum stack buffer size is sufficiently large relative to the register blocksizes for each datatype, and do so when the context is initialized rather than when an operation is called. Note that with this change, users who pass in their own contexts into the expert interfaces currently will *not* have any checks performed. Thanks to Devin Matthews for suggesting this change.

Details: - Expanded the BLAS compatibility layer to include support for ?axpby_() and ?gemm_batch_(). The former is a straightforward BLAS-like interface into the axpbyv operation while the latter implements a batched gemm via loops over bli_?gemm(). Also expanded the CBLAS compatibility layer to include support for cblas_?axpby() and cblas_?gemm_batch(), which serve as wrappers to the corresponding (new) BLAS-like APIs. Thanks to Meghana Vankadari for submitting these new APIs via #566. - Fixed a long-standing bug in common.mk that for some reason never manifested until now. Previously, CBLAS source files were compiled *without* the location of cblas.h being specified via a -I flag. I'm not sure why this worked, but it may be due to the fact that the cblas.h file resided in the same directory as all of the CBLAS source, and perhaps compilers implicitly add a -I flag for the directory that corresponds to the location of the source file being compiled. This bug only showed up because some CBLAS-like source code was moved into an 'extra' subdirectory of that frame/compat/cblas/src directory. After moving the code, compilation for those files failed (because the cblas.h header file, presumably, could not be found in the same location). This bug was fixed within common.mk by explicitly adding the cblas.h directory to the list of -I flags passed to the compiler. - Added test_axpbyv.c and test_gemm_batch.c files to 'test' directory, and updated test/Makefile to build those drivers. - Fixed typo in error message string in cblas_sgemm.c.
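The "batched gemm via loops over gemm" idea can be sketched as follows. This is a deliberately simplified illustration: `sketch_dgemm()` is a naive column-major stand-in for the real gemm, `sketch_dgemm_batch()` handles a single group with shared dimensions and scalars, and neither signature matches the actual ?gemm_batch_() interface.

```c
#include <stddef.h>

/* Naive column-major C := alpha * A * B + beta * C (no transposition). */
static void sketch_dgemm
     (
       size_t m, size_t n, size_t k,
       double alpha, const double* a, size_t lda,
                     const double* b, size_t ldb,
       double beta,        double* c, size_t ldc
     )
{
    for ( size_t j = 0; j < n; ++j )
        for ( size_t i = 0; i < m; ++i )
        {
            double ab = 0.0;

            for ( size_t p = 0; p < k; ++p )
                ab += a[ i + p * lda ] * b[ p + j * ldb ];

            c[ i + j * ldc ] = alpha * ab + beta * c[ i + j * ldc ];
        }
}

/* Apply the same gemm to each (A, B, C) triple in the batch. */
void sketch_dgemm_batch
     (
       size_t batch, size_t m, size_t n, size_t k,
       double alpha, const double* const* a, size_t lda,
                     const double* const* b, size_t ldb,
       double beta,        double* const* c, size_t ldc
     )
{
    for ( size_t i = 0; i < batch; ++i )
        sketch_dgemm( m, n, k, alpha, a[ i ], lda, b[ i ], ldb,
                      beta, c[ i ], ldc );
}
```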

Details: - Implemented a new feature called addons, which are similar to sandboxes except that there is no requirement to define gemm or any other particular operation. - Updated configure to accept --enable-addon=<name> or -a <name> syntax for requesting an addon be included within a BLIS build. configure now outputs the list of enabled addons into config.mk. It also outputs the corresponding #include directives for the addons' headers to a new companion to the bli_config.h header file named bli_addon.h. Because addons may wish to make use of existing BLIS types within their own definitions, the addons' headers must be included sometime after that of bli_config.h (which currently is #included before bli_type_defs.h). This is why the #include directives needed to go into a new top-level header file rather than the existing bli_config.h file. - Added a markdown document, docs/Addons.md, to explain addons, how to build with them, and what assumptions their authors should keep in mind as they create them. - Added a gemmlike-like implementation of sandwich gemm called 'gemmd' as an addon in addon/gemmd. The code uses a 'bao_' prefix for local functions, including the user-level object and typed APIs. - Updated .gitignore so that git ignores bli_addon.h files.

Details: - Inserted a new 'Example Code' section into the README.md immediately after the 'Getting Started' section. Thanks to Devin Matthews for recommending this addition. - Moved the 'Performance' section of the README down slightly so that it appears after the 'Documentation' section.

Details: - Annotated the code blocks that represent shell commands and output as 'bash' in README.md and BuildSystem.md.

Details: - Reverted the annotation of some markdown code blocks with 'bash' after realizing that the in-browser syntax highlighting was not worthwhile.

Details: - Added a new 'zen3' subconfiguration targeting support for the AMD Zen3 microarchitecture (#561). Thanks to AMD for this contribution. - Restructured clang and AOCC support for zen, zen2, and zen3 make_defs.mk files. The clang and AOCC version detection now happens in configure, not in the subconfigurations' makefile fragments. That is, we've added logic to configure that detects the version of clang/AOCC, outputs an appropriate variable to config.mk (ie: CLANG_OT_*, AOCC_OT_*), and then checks for it within the makefile fragment (as is currently done for the GCC_OT_* variables). - Added configure support for a GCC_OT_10_1_0 variable (and associated substitution anchor) to communicate whether the gcc version is older than 10.1.0, and use this variable to check for recent enough versions of gcc to use -march=znver3 in the zen3 subconfig. - Inlined the contents of config/zen/amd_config.mk into the zen and zen2 make_defs.mk so that the files are self-contained, harmonizing the format of all three Zen-based subconfigurations' make_defs.mk files. - Added indenting (with spaces) of GNU make conditionals for easier reading in zen, zen2, and zen3 make_defs.mk files. - Adjusted the range of models checked by bli_cpuid_is_zen() (which was previously 0x00 ~ 0xff and is now 0x00 ~ 0x2f) so that it is completely disjoint from the models checked by bli_cpuid_is_zen2() (0x30 ~ 0xff). This is normally necessary because Zen and Zen2 microarchitectures share the same family (23, or 0x17), and so the model code is the only way to differentiate the two. But in our case, fixing the model range for zen *wasn't* actually necessary since we checked for zen2 first, and therefore the wide zen range acted like the 'else' of an 'if-else' statement. That said, the change helps improve clarity for the reader by encoding useful knowledge, which was obtained from https://en.wikichip.org/wiki/amd/cpuid . - Added zen2.def and zen3.def files to the collection in travis/cpuid. 
Note that support for zen, zen2, and zen3 is now present, and while all the three microarchitectures have identical instruction sets from the perspective of BLIS microkernels, they each correspond to different subconfigurations and therefore merit separate testing. Thanks to Devin Matthews for his guidance in hacking these files as slight modifications of zen.def. - Enabled testing of zen2 and zen3 via the SDE in travis/do_sde.sh. Now, zen, zen2, and zen3 are tested through the SDE via Travis CI builds. - Updated travis/do_sde.sh to grab the SDE tarball from a new ci-utils repository on GitHub rather than on Intel's website. This change was made in an attempt to circumvent recent troubles with Travis CI not being able to download the SDE directly from Intel's website via curl. Thanks to Devin Matthews for suggesting the idea. - Updated travis/do_sde.sh to grab the latest version (8.69.1) of the Intel SDE from the flame/ci-utils repository. - Updated .travis.yml to use gcc 9. The file was previously using gcc 8, which did not support -march=znver2. - Created amd64_legacy umbrella family in config_registry for targeting older (bulldozer, piledriver, steamroller, and excavator) microarchitectures and moved those same subconfigs out of the amd64 umbrella family. However, x86_64 retains amd64_legacy as a constituent member. - Fixed a bug in configure related to the building of the so-called config list. When processing the contents of config_registry, configure creates a series of structures and lists that allow for various mappings related to configuration families, subconfigs, and kernel sets. Two of those lists are built via substitution of umbrella families with their subconfig members, and one of those lists was improperly performing the substitution in a way that would erroneously match on partial umbrella family names. That code was changed to match the code that was already doing the substitution properly, via substitute_words(). 
Also added comments noting the importance of using substitute_words() in both instances. - Comment updates.
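The disjoint model ranges for Zen and Zen2 mentioned above can be written out as a sketch. The function names are hypothetical and only the family/model test from the commit message is shown; any additional checks the real bli_cpuid_is_zen() and bli_cpuid_is_zen2() perform are not reproduced.

```c
#include <stdbool.h>
#include <stdint.h>

/* Zen and Zen2 share family 23 (0x17); only the model distinguishes them. */
#define SKETCH_FAMILY_ZEN 0x17u

/* Zen: models 0x00 ~ 0x2f. */
bool sketch_is_zen( uint32_t family, uint32_t model )
{
    return family == SKETCH_FAMILY_ZEN && model <= 0x2f;
}

/* Zen2: models 0x30 ~ 0xff. */
bool sketch_is_zen2( uint32_t family, uint32_t model )
{
    return family == SKETCH_FAMILY_ZEN && model >= 0x30 && model <= 0xff;
}
```

With disjoint ranges, the result no longer depends on the order in which the two checks are made.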

Details: - Replaced the hard-coded calls to double-precision real syr, syr2, syrk, and syr2k in the corresponding standalone test drivers in the 'test' directory with conditional branches that will call the appropriate BLAS interface depending on which datatype is enabled. Thanks to Madan mohan Manokar for this improvement. - CREDITS file update.

Details: - Add a blurb about the new addons feature to the "Documentation for BLIS developers" section of the README.md, which also links to the Addons.md document.

Details: - Add additional mentions of addons to README.md, including in the "What's New" section. - Removed mention of sandboxes from the long list of advantages provided by BLIS. - Very minor description update to opening line of Addons.md.

Details: - Added a recursive sed script to the 'build' directory.

Details: - Added four new fields to obj_t: .pack_fn, .pack_params, .ker_fn, and .ker_params. These fields store pointers to functions and data that will allow the user to more flexibly create custom operations while recycling BLIS's existing partitioning infrastructure. - Updated typed API to packm variant and structure-aware kernels to replace the diagonal offset with panel offsets, and changed strides of both C and P to inc/ldim semantics. Updated object API to the packm variant to include rntm_t*. - Removed the packm variant function pointer from the packm cntl_t node definition since it has been replaced by the .pack_fn pointer in the obj_t. - Updated bli_packm_int() to read the new packm variant function pointer from the obj_t and call it instead of from the cntl_t node. - Moved some of the logic of bli_l3_packm.c to a new file, bli_packm_alloc.c. - Rewrote bli_packm_blk_var1.c so that it uses byte (char*) pointers instead of typed pointers, allowing a single function to be used regardless of datatype. This obviated having a separate implementation in bli_packm_blk_var1_md.c. Also relegated handling of scalars to a new function, bli_packm_scalar(). - Employed a new standard whereby right-hand matrix operands ("B") are always packed as column-stored row panels -- that is, identically to that of left-hand matrix operands ("A"). This means that while we pack matrix A normally, we actually pack B in a transposed state. This allowed us to simplify a lot of code throughout the framework, and also affected some of the logic in bli_l3_packa() and _packb(). - Simplified bli_packm_init.c in light of the new B^T convention described above. bli_packm_init()--which is now called from within bli_packm_blk_var1()--also now calls bli_packm_alloc() and returns a bool that indicates whether packing should be performed (or skipped). 
- Consolidated bli_gemm_int() and bli_trsm_int() into a bli_l3_int(), which, among other things, defaults the new .pack_fn field of the obj_t to bli_packm_blk_var1() if the field is NULL. - Defined a new function, bli_obj_reset_origin(), which permanently refocuses the view of an object so that it "forgets" any offsets from its original pointer. This function also sets the object's root field to itself. Calls to bli_obj_reset_origin() for each matrix operand appear in the _front() functions, after the obj_t's are aliased. This resetting of the underlying matrices' origins is needed in preparation for more advanced features from within custom packm kernels. - Redefined bli_pba_rntm_set_pba() from a regular function to a static inline function. - Updated gemm_ukr, gemmtrsm_ukr, and trsm_ukr testsuite modules to use libblis_test_pobj_create() to create local packed objects. Previously, these packed objects were created by calling lower-level functions.
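Two of the ideas above, a per-object packing function pointer that falls back to a default when NULL, and a single packing routine that works on byte (char*) pointers regardless of datatype, can be sketched in a few lines of C. Everything here is a hypothetical, heavily simplified stand-in: `demo_obj_t`, `demo_pack_default()`, and `demo_pack()` are not BLIS types or functions, and the real obj_t, bli_l3_int(), and bli_packm_blk_var1() are considerably more involved.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for obj_t (not the BLIS definition). */
typedef struct demo_obj demo_obj_t;
typedef void (*demo_pack_fn)(const demo_obj_t *src, void *dst);

struct demo_obj {
    const void  *buffer;      /* matrix data */
    size_t       elem_size;   /* bytes per element (4 for float, 8 for double, ...) */
    size_t       m, n;        /* number of rows and columns */
    size_t       rs, cs;      /* row and column strides, in elements */
    demo_pack_fn pack_fn;     /* custom packing function, or NULL */
    void        *pack_params; /* opaque data for a custom pack_fn */
};

/* Default packer: copies the matrix into contiguous column-major storage.
   It addresses elements through byte pointers scaled by elem_size, so the
   same function serves every datatype. */
static void demo_pack_default(const demo_obj_t *src, void *dst)
{
    const char *a = (const char *)src->buffer;
    char       *p = (char *)dst;

    for (size_t j = 0; j < src->n; j++) {
        for (size_t i = 0; i < src->m; i++) {
            memcpy(p + (j * src->m + i) * src->elem_size,
                   a + (i * src->rs + j * src->cs) * src->elem_size,
                   src->elem_size);
        }
    }
}

/* Loosely analogous to the defaulting described above: if no packing
   function was set on the object, install the default one, then call it. */
void demo_pack(demo_obj_t *obj, void *dst)
{
    if (obj->pack_fn == NULL) obj->pack_fn = demo_pack_default;
    obj->pack_fn(obj, dst);
}
```

A caller who wants custom packing would set `pack_fn` (and optionally `pack_params`) before calling `demo_pack()`; a caller who leaves the field NULL gets the default behavior without any extra setup.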

fgvanzee and others added 30 commits March 27, 2021 15:15

Merge branch 'master' into dev

f9ad55c

Allow clang for ThunderX2 config

6548ceb

Needed for compiling on e.g. Mac M1. AFAIK clang supports the same -mcpu flag for ThunderX2 as gcc.

Fixed typo in cpp guard in bli_util_ft.h.

5aa63cd

Details: - Changed #ifdef BLIS_OAPI_BASIC to #ifdef BLIS_TAPI_BASIC in bli_util_ft.h. This typo was causing some types to be redefined when they weren't supposed to be.

Adjust TravisCI

bd156a2

- ArmSVE: don't test gemmt (seems to be a QEMU-only problem); - Clang: use the Travis CI-provided version instead of pinning clang-8, because clang-8 seems to conflict with Travis CI's clang-7.

Travis CI Revert Unnecessary Extras from 91d3636

932dfe6

- Removed `V=1` in make line - Removed `CFLAGS` in configure line - Restored `pwd` surrounding OOT line

Remove rm-dupls function in common.mk.

859fb77

AMD requested removal due to unclear licensing terms; the original code was from Stack Overflow. The function is unused but could easily be replaced by a new implementation.

Merge pull request #502 from flame/rm-rm-dupls

6d4ab02

Remove `rm-dupls` function in common.mk.

Add explicit compiler check for Windows.

5feb04e

Check the C compiler for a predefined macro `_WIN32` to indicate (cross-)compilation for Windows. Fixes #463.
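As a minimal sketch of the idea (not the logic configure actually uses), the predefined `_WIN32` macro can be tested from C itself. The function name `demo_targets_windows` is hypothetical. Because the macro is predefined by the compiler for the target platform, the result reflects what the compiler is building for, which is what matters when cross-compiling.

```c
/* Hypothetical helper: returns 1 when the compiler predefines _WIN32
   (the compilation target is Windows, including cross-compilation from
   another host), and 0 otherwise. */
int demo_targets_windows(void)
{
#ifdef _WIN32
    return 1;
#else
    return 0;
#endif
}
```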

Merge pull request #500 from xrq-phys/armsve+travis

cbd8d39

Upgrade Travis CI for Arm SVE

Merge pull request #503 from flame/windows-compiler-check

e5c85da

Add explicit compiler check for Windows.

Asm Flag Mingling for Darwin_Aarch64

7fabd89

Apple+Arm64 requires additional "tagging" of local symbols.

Armv8A Rename Regs for Clang Compile: FP32 Part

9f4a4a3

Roughly the same as 916e1fa, with x15 clobbering additionally removed. - x15: Not used at all. Compilation with Clang shows a warning about x18 reservation, but compilation itself succeeds and all tests pass.

Armv8A Rename Regs for Safe Darwin Compile

5fc93e2

Avoid x18 use in FP32 kernel: - C address lines x[18-26] renamed to x[19-27] (reg index +1) - Original role of x27 fulfilled by x5, which is free after the k-loop part. FP64 does not require changing since x18 is not used there.

Add vhsubps/vhsubpd.

7c3eb44

Horizontal subtraction instructions added to bli_x86_asm_macros.h, currently unused [ci skip].

xrq-phys and others added 27 commits October 8, 2021 12:13

Arm SVE Config armsve Use ZGEMM/CGEMM

4b648e4

Arm SVE C/ZGEMM Support *beta==0

1749dfa

SH Kernel Unused Either

82b6128

Arm SVE C/ZGEMM Fix FMOV 0 Mistake

ccf1628

FMOV [hsd]M, #imm does not allow zero immediate. Use wzr, xzr instead.

Merge pull request #542 from xrq-phys/armsve-zgemm

408906f

Arm SVE CGEMM / ZGEMM Natural Kernels

Merge pull request #543 from xrq-phys/armsve-packm-fix

32a6d93

ARMSVE Block SVE-Intrinsic Kernels for GCC 8-9

Disable SDE testing of old AMD microarchitectures.

290ff4b

Details: - Skip testing on piledriver, steamroller, and excavator platforms in travis/do_sde.sh.

Marked some markdown shell code blocks as 'bash'.

cbc88fe

Details: - Annotated the code blocks that represent shell commands and output as 'bash' in README.md and BuildSystem.md.

Reverted cbc88fe.

74c0c62

Details: - Reverted the annotation of some markdown code blocks with 'bash' after realizing that the in-browser syntax highlighting was not worthwhile.

Merge branch 'dev'

b727645

Brief mention/link to Addons.md in README.md.

a4bc03b

Details: - Add a blurb about the new addons feature to the "Documentation for BLIS developers" section of the README.md, which also links to the Addons.md document.

Minor updates to README.md, docs/Addons.md.

12c66a4

Details: - Add additional mentions of addons to README.md, including in the "What's New" section. - Removed mention of sandboxes from the long list of advantages provided by BLIS. - Very minor description update to opening line of Addons.md.

Added recu-sed.sh script to 'build' directory.

e229e04

Details: - Added a recursive sed script to the 'build' directory.

dzambare deleted the branch amd:integration_dipal December 3, 2021 10:45

dzambare closed this Dec 3, 2021

floatingCatty mentioned this pull request Mar 1, 2025

Memory management when using amd blis with block2 #30

Open
