OpenCL based ACC-backend and SMM library #406
Codecov Report
```diff
@@           Coverage Diff            @@
##           develop     #406   +/-  ##
========================================
  Coverage     63.1%    63.1%
========================================
  Files           86       86
  Lines        25625    25625
========================================
  Hits         16190    16190
  Misses        9435     9435
```
The build system may pull the kernel sources into the transpose and SMM locations, which allows the executable to stay independent of the search paths otherwise needed for separate kernel sources (keeping the executable literally portable). Embedding the sources requires some minor processing to bring the text files into the format of string literals. Kernel sources are templated such that all cases are handled. There is no limit currently implemented, but such arguments are already part of the ACC/LIBSMM interface (and marked unused in the current implementation). Similarly, processing heterogeneous stacks requires a certain result code such that DBCSR picks up the stack with more general code; this is similar to the CUDA/HIP backend. Btw, certain values are currently hard-coded across backends and might be covered by the ACC/LIBSMM interface in the future (the number of parameters per stack, i.e., 3 on the device side and 7 on the host side, as well as the return code to reject heterogeneous stacks, etc.).
This PR now builds with CMake. The work is in the middle of passing the DBCSR tests (a good number do not pass at the moment).
Specifically,
Sorry about dropping the ball here; I will take a look at it again next week (please ping me should I forget).
No worries, you already helped with your review. I can take some more time to get all tests running, though this is the boring/time-consuming part: finding where things break. However, it looks good already; I can see it running on my integrated GPU.
@hfp Thanks a lot for this PR! This work will definitely go into v2.2. I will take a look and comment next week. I have a general comment on the reason for having the OpenCL backend: I wonder which target you have in mind for it. Is this for the Intel GPU or FPGA? Sorry for these questions... OpenCL is good to have, for sure, but we should avoid having too many backends to support...
The OpenCL backend is meant for upcoming Intel GPUs. I could have done this with DPC++, and I may try that as we go forward. Meanwhile, I believe the OpenCL backend can be useful for FPGA/GPU targets in general (though Intel FPGAs also allow for DPC++). Anyway, OpenCL is an industry standard, and in the case of Intel the runtime shares the bits with our Level Zero runtime (https://github.com/intel/compute-runtime). I hence consider it slightly more low-level than DPC++ and wanted to try it first. Regarding non-Intel hardware, this backend should work fine on AMD, and we can try/compare when ready.
OK, thanks for the explanation, and I agree with you that we should have it for v2.2.
The OpenMP backend was meant to be a proof of concept, and it was found not suitable for implementing the ACC interface (at least not until a "stream" clause is in reach). Generally, an API-driven approach (as implemented with the ACC interface) is somewhat at odds with a directive-based approach (pragmas are naturally scoped at the code site). I think additional/future acceleration in DBCSR can choose the easier approach, whether API- or directive-based, especially if the ACC interface had to grow. Though, there was no point touching the highly sophisticated Cannon implementation, and it is way easier to write a backend.
I definitely agree... the two approaches are somewhat orthogonal, and I liked your attempt to add ACC-OpenMP a lot... The situation may change in the future, but I agree with your comment.
I am deferring the auto-tuning for the time being (this PR). I was looking at the CUDA/HIP approach but found I had to rewrite the benchmark code, etc. (that code is unfortunately written using CUDA/HIP for no real reason; it barely uses anything beyond the ACC interface). For auto-tuning OpenCL kernels, I would step back to a fixed set of triplets. My plan is to use the
OK Hans, thanks for the further clarification. I agree with the idea of having a static optimizer, i.e., no JIT kernel. We can always consider having a "default" kernel for the remaining cases or even a fall-back to OpenCL BLAS. Great!
One more question, and this is for @dev-zero too: how do we test the OpenCL backend going forward?
Well, this backend already uses JIT code generation, and the tuning approach would also load a set of optimized parameters (perhaps even GPU-specific ones). The JIT code generation uses poor man's templates by passing some '-DSomething' on OpenCL's build line (and the kernel is potentially written in terms of "Something"). Effectively, there is no difference compared to C++ templates and loading high-level C++ code into the CUDA runtime.
Well, an OCI-compatible image with all the required runtimes would be nice, and if we can run on a CPU as a fallback for the OpenCL part, it becomes a lot easier. If that is indeed possible, but only with an Intel CPU, we can use our tcopt4-tcopt8 machines as a test runner.
Regarding tests, perhaps the HIP image is a good choice? It should already carry the OpenCL runtime as well.
Sure, but we should find a way to get a test that is actually executed, to be able to establish a baseline.
DBCSR's unit tests are passing. I have enabled DP and SP for the OpenCL backend (CUDA/HIP only implements DP).
This is very, very relevant... There was a request for SP, and the reply was "we don't support it".
…pose kernel.
* Replaced OPENCL_LIBSMM_TRANS_WGSIZE in favor of OPENCL_LIBSMM_TRANS_BLOCK_M.
* Sanitize command-line arguments similar to acc_bench_smm.
* Folded in-place transpose into the general transpose.cl.
I am going to merge this PR when tests are passing. I added Daint-CI tests for OpenCL but disabled the actual tests (just the "build" test). All runtime tests pass locally when working on Daint with settings similar to the CI scripts. Also, the Makefile-based build (
Hans, this is great! Give me a few days to review what you did before merging... I assume @dev-zero will take a look as well...
ACK |
Prior to the documentation update, here is a quick hint on how to test-drive auto-tuning:

```
cd dbcsr/src/acc/opencl
make [DBG=0]
cd smm
./tune_multiply.sh 100 1 1 23, 4 9
```

All arguments for tune_multiply.sh are optional, but the default set of triplets already amounts to ~1400 kernels.

An auto-tuning session (regardless of using the wrapper) starts like:

```
cd dbcsr/src/acc/opencl
make [DBG=0]
```

Now, the backend, or to be more accurate the LIBSMM library, embeds the CSV file content into the binary/executable. However, an environment variable … The auto-tuning simply (re-)uses the benchmarks and parses console output such as timing and data type, etc. (the benchmarks do not know about auto-tuning, and the tuning does not know about timing, etc.).
…enCL backend as well as the OpenCL based LIBSMM. Added documentation for both the OpenCL backend and the OpenCL based LIBSMM.
Documentation for the OpenCL backend as well as for the OpenCL-based LIBSMM was incorporated into the Ford documentation. Existing documentation is adjusted to accommodate/distinguish the OpenCL backend and the OpenCL-based LIBSMM (in addition to the CUDA/HIP-based backend/libsmm_acc).
(I am happy to merge #419 if it is merged earlier into develop.)
Thanks Hans!
Awesome, it made it. Thank you! I am looking forward to perhaps seeing others try this, and I also look forward to improving the kernels. At the moment the kernels are relatively simple (although they look cluttered, like my code in general, i.e., preprocessor definitions). It would in fact be interesting to tune a limited variety of kernels on AMD GPUs and to see what the gap is compared to the CUDA/HIP backend. Unfortunately, I only have a Vega 56 inside a macOS-based platform (not exactly representative, also because of missing DP).
This is definitely in my plan... But first I have to make CP2K build based on DBCSR's CMake compilation and drop the makefiles inside DBCSR... I have a draft for that...
Also note, there is a new test for Daint-CI. I disabled the runtime tests but believe there is only a minor issue. The build test, however, is fully enabled, and I adjusted things for
This PR just misses the build integration into DBCSR (CMake, etc.), which can be a follow-up PR (any help appreciated). Transpose and SMM both work for general matrices, and the code is tested based on stand-alone reproducers.