
OpenCL based ACC-backend and SMM library #406


Merged
merged 59 commits into cp2k:develop on Feb 2, 2021

Conversation

@hfp (Member) commented Dec 3, 2020

This PR just misses the build integration into DBCSR (CMake, etc.), which can be a follow-up PR (any help appreciated). Transpose and SMM both work for general matrices, and the code is tested based on stand-alone reproducers.

@hfp hfp marked this pull request as draft December 3, 2020 16:06
@codecov (bot) commented Dec 3, 2020

Codecov Report

Merging #406 (711d289) into develop (21dae0f) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           develop    #406   +/-   ##
=======================================
  Coverage     63.1%   63.1%           
=======================================
  Files           86      86           
  Lines        25625   25625           
=======================================
  Hits         16190   16190           
  Misses        9435    9435           
Flag Coverage Δ
unittests 63.1% <ø> (ø)
with-blas 63.1% <ø> (ø)
with-libxsmm 63.2% <ø> (ø)
with-mpi 63.6% <ø> (ø)
with-openmp 62.3% <ø> (ø)
without-mpi 59.4% <ø> (ø)
without-openmp 62.3% <ø> (ø)

Flags with carried forward coverage won't be shown.


Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 21dae0f...711d289.

@hfp hfp marked this pull request as ready for review December 3, 2020 17:17
@hfp (Member, Author) commented Dec 3, 2020

The build system may pull the kernel sources into the transpose and SMM locations, which allows staying independent of the search paths otherwise needed for the separate kernel sources (and keeps the executable literally portable). Embedding the sources requires some minor processing to bring the text files into the format of string literals (see the sketch below).
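For illustration, here is a minimal sketch of such processing as a Python helper; the file names and the symbol name are hypothetical and this is not the actual build step:

```python
# Minimal sketch (hypothetical file/symbol names): turn an OpenCL kernel source
# file into a C header that carries the same text as a string literal.
def embed_kernel(cl_path, header_path, symbol):
    with open(cl_path) as src:
        lines = src.read().splitlines()
    with open(header_path, "w") as out:
        out.write("static const char {}[] =\n".format(symbol))
        for line in lines:
            # Escape backslashes and quotes so each source line becomes a valid literal piece.
            escaped = line.replace("\\", "\\\\").replace('"', '\\"')
            out.write('  "{}\\n"\n'.format(escaped))
        out.write("  ;\n")

embed_kernel("transpose.cl", "transpose.h", "opencl_source_transpose")
```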

Kernel sources are templated such that all cases are handled. No limit is currently implemented, but such arguments are already part of the ACC/LIBSMM interface (and marked unused in the current implementation). Similarly, processing heterogeneous stacks requires a certain result code so that DBCSR picks up such stacks with more general code; this is similar to the CUDA/HIP backend. Btw, certain values are currently hard-coded across backends and might be covered by the ACC/LIBSMM interface in the future (the number of parameters per stack, i.e., 3 on the device side and 7 on the host side, as well as the return code to reject heterogeneous stacks, etc.).

@alazzaro alazzaro self-assigned this Dec 4, 2020
@hfp (Member, Author) commented Dec 7, 2020

This PR now builds with CMake. The work is in the middle of passing DBCSR's tests (a large number do not pass at the moment).

@hfp (Member, Author) commented Dec 7, 2020

Specifically, tests/dbcsr_unittest2 passes single-threaded, tests/dbcsr_unittest4 passes in general, but tests/dbcsr_unittest1 and tests/dbcsr_unittest3 are failing.

@dev-zero (Contributor) commented Dec 9, 2020

This PR just misses the build integration into DBCSR (CMake, etc.), which can be a follow-up PR (any help appreciated).

sorry about dropping the ball here, will take a look at it again next week (please ping me should I forget)

@hfp (Member, Author) commented Dec 9, 2020

sorry about dropping the ball here, will take a look at it again next week (please ping me should I forget)

No worries, you already helped with your review. I can take some more time to get all tests running, though this is the boring/time-consuming part: finding where things break. However, it looks good already; I can see it running on my integrated GPU.

@alazzaro alazzaro added this to the v2.2 milestone Dec 9, 2020
@alazzaro (Member) commented Dec 9, 2020

@hfp Thanks a lot for this PR! This work will definitely go into v2.2. I will take a look and comment next week.

I have a general comment on the reason for having the OpenCL backend.
Andreas did a preliminary OpenCL implementation years ago. At that time, the idea was to have OpenCL for the GPUs, but then we realized that CUDA was the best option for Nvidia (no surprise) and the OpenCL implementation became useless (we removed it during the transition from SVN to Git). For the general case, we now support HIP. Shoshana did nice work so that CUDA and HIP can share almost the entire code. In this way, we can cover NVIDIA and AMD GPUs (and whatever else HIP can support).

Now, I wonder which target you have in mind for the OpenCL backend. Is this for Intel GPUs or FPGAs?
And what about your OpenMP backend (#260)?

Sorry for these questions... OpenCL is good to have for sure, but we should avoid having too many backends to support...

@hfp (Member, Author) commented Dec 9, 2020

The OpenCL backend is meant for upcoming Intel GPUs. I could have done this with DPC++ and I may try that as we go forward. Meanwhile, I believe the OpenCL backend can be useful for FPGA/GPU targets in general (though Intel FPGAs also allow for DPC++). Anyway, OpenCL is an industry standard, and in the case of Intel the runtime shares its bits with our Level-0 runtime (https://github.com/intel/compute-runtime). I hence consider this slightly more low-level than DPC++ and wanted to try it first. Regarding non-Intel hardware, this backend should work fine on AMD and we can try/compare when ready.

@alazzaro (Member) commented Dec 9, 2020

OK, thanks for the explanation, and I agree with you that we should have it for v2.2.
Next question then: how do we optimize the kernels, autotuning in the same way we do for CUDA/HIP?

@hfp (Member, Author) commented Dec 9, 2020

And what about your OpenMP backend

The OpenMP backend was meant to be a proof of concept, and it was found unsuitable for implementing the ACC interface (at least not until a "stream" clause is within reach). Generally, an API-driven approach (as implemented with the ACC interface) is somewhat at odds with a directive-based approach (pragmas are naturally scoped with the code they annotate). I think additional/future acceleration in DBCSR can choose the easier approach, whether API- or directive-based, especially if the ACC interface had to grow. Though, there was no point touching the highly sophisticated Cannon implementation, and it is way easier to write a backend.

@alazzaro (Member) commented Dec 9, 2020

And what about your OpenMP backend

The OpenMP backend was meant to be a proof of concept, and it was found unsuitable for implementing the ACC interface [...]

I definitely agree... the two approaches are somewhat orthogonal, and I liked your attempt to add ACC-OpenMP a lot... The situation may change in the future, but I agree with your comment.

@hfp (Member, Author) commented Dec 9, 2020

how do we optimize the kernels, autotuning in the same way we do for CUDA/HIP?

I am deferring the auto-tuning for the time being (this PR). I was looking at the CUDA/HIP approach but found I had to rewrite the benchmark code, etc. (that code is unfortunately written using CUDA/HIP for no real reason; it barely uses anything beyond the ACC interface). For auto-tuning OpenCL kernels, I would step back to a fixed set of triplets. My plan is to use the acc_bench_smm and acc_bench_trans drivers for tuning, perhaps with OpenTuner. I have good experience with the latter, and it is typically a few lines of code (see here or here). OpenTuner does a decent job w.r.t. hyperparameter tuning and can avoid the old way of exhaustive exploration. Btw, the predictive modeling in the CUDA/HIP auto-tuning approach in particular is rather expensive/complex given the advantage it offers over a fixed set of triplets. For this backend, the OpenTuner approach would simply run the aforementioned drivers and set some environment variables ("tunables") embedded into the backend (e.g., ACC_OPENCL_TRANS_WGSIZE and ACC_OPENCL_TRANS_INPLACE, just to speak about transpose; I am aware transpose is just bandwidth-bound and there is not much to tune). The SMM code can easily expose "tunables" going forward with the kernel implementation.
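For illustration, a minimal OpenTuner sketch along these lines; the tuner class, the parameter range, and the way the environment variable is applied are assumptions and this is not the actual tune_multiply.py:

```python
# Minimal sketch, not the actual tuner: drive acc_bench_trans with OpenTuner by
# setting the ACC_OPENCL_TRANS_WGSIZE "tunable" and measuring the runtime.
import opentuner
from opentuner import ConfigurationManipulator, IntegerParameter, MeasurementInterface, Result

class TransTuner(MeasurementInterface):
    def manipulator(self):
        m = ConfigurationManipulator()
        # Assumed range for the work-group size; real limits depend on the device.
        m.add_parameter(IntegerParameter("wgsize", 1, 256))
        return m

    def run(self, desired_result, input, limit):
        wgsize = desired_result.configuration.data["wgsize"]
        # Run the benchmark driver with the tunable exported as an environment variable.
        result = self.call_program(
            "ACC_OPENCL_TRANS_WGSIZE={} ./acc_bench_trans".format(wgsize))
        return Result(time=result["time"])

if __name__ == "__main__":
    TransTuner.main(opentuner.default_argparser().parse_args())
```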

@alazzaro (Member) commented Dec 9, 2020

OK Hans, thanks for the other clarification. I agree with the idea of having a static optimizer, i.e., no JIT kernels. We can always consider having a "default" kernel for the remaining cases, or even a fallback to an OpenCL BLAS. Great!

@alazzaro (Member) commented Dec 9, 2020

One more question, this is for @dev-zero too: How do we test the OpenCL backend on the way forward?

@hfp (Member, Author) commented Dec 9, 2020

Well, this backend already uses JIT code generation, and the tuning approach would also load a set of optimized parameters (perhaps even GPU-specific ones). The JIT code generation uses poor man's templates by passing '-DSomething' on OpenCL's build line (and the kernel is potentially written in terms of "Something"). Effectively, there is no difference compared to C++ templates and loading high-level C++ code into the CUDA runtime.
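As an illustration of the '-D' mechanism, here is a sketch using PyOpenCL rather than the backend's actual C code; the kernel and the SCALE macro are made up:

```python
# Illustration only: specialize an OpenCL kernel at JIT time by passing a -D
# definition on the build line ("poor man's template").
import numpy as np
import pyopencl as cl

source = """
kernel void scale(global float* data) {
  const int i = get_global_id(0);
  data[i] *= (float)SCALE;  /* SCALE is injected via -D on the build line */
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
program = cl.Program(ctx, source).build(options=["-DSCALE=2"])

host = np.ones(16, dtype=np.float32)
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf=host)
program.scale(queue, host.shape, None, buf)
cl.enqueue_copy(queue, host, buf)
print(host)  # all elements scaled by the value passed via -DSCALE
```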

@dev-zero (Contributor) commented Dec 9, 2020

One more question, this is for @dev-zero too: How do we test the OpenCL backend on the way forward?

Well, an OCI-compatible image with all the required runtimes would be nice, and if we can run on a CPU as a fallback for the OpenCL part it becomes a lot easier. If that is indeed possible, but only with an Intel CPU, we can use our tcopt4-tcopt8 machines as test runners.

@hfp (Member, Author) commented Dec 9, 2020

Wrt tests, perhaps the HIP image is a good choice? It should carry the OpenCL runtime as well/already.

@dev-zero (Contributor) commented Dec 9, 2020

Sure, but we should find a way to get a test that is actually executed, to be able to establish a baseline.

@hfp (Member, Author) commented Dec 14, 2020

DBCSR's unit tests are passing. I have enabled DP and SP for the OpenCL backend (CUDA/HIP only implements DP).

@alazzaro (Member) commented

DBCSR's unit tests are passing. I have enabled DP and SP for the OpenCL backend (CUDA/HIP only implements DP).

This is very, very relevant... There was a request for SP, and the reply was "we don't support it".
Let me review, and we can merge whenever you are ready...

@hfp (Member, Author) commented Jan 22, 2021

I am going to merge this PR when the tests are passing. I added Daint-CI tests for OpenCL, but disabled the actual runtime tests (keeping just the "build" test). All runtime tests pass locally when working on Daint with settings similar to the CI scripts. Also, the Makefile-based build (acc_bench_trans and acc_bench_smm) passes for the OpenCL and CUDA backends (described here). The missing bits are only incremental improvements (kernels) and documentation.

@alazzaro (Member) commented

Hans, this is great! Give me a few days to review what you did before merging... I assume @dev-zero will take a look as well...

@hfp (Member, Author) commented Jan 22, 2021

ACK

@hfp (Member, Author) commented Jan 23, 2021

Prior to the documentation update, here is a quick hint on how to test-drive auto-tuning:

cd dbcsr/src/acc/opencl
make [DBG=0]
cd smm
./tune_multiply.sh  100  1 1  23, 4 9

All arguments of tune_multiply.sh are optional, but the default set of triplets already amounts to ~1400 kernels. The 100 given above is the number of seconds spent per kernel during auto-tuning; typically 300 (5 minutes per kernel) should be aimed for. Later implementations may use a more informed value than just a fixed duration per kernel. The second group of arguments controls the work split (first the number of partitions/nodes, then the actual part number). The script displays a time estimate before starting the whole process. The last group of command-line arguments (23, 4 9) is an example of a triplet specification; each comma-separated group is expanded into a series of triplets using the Cartesian product (23, 4 9 generates 9 triplets; see the sketch below).
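A minimal sketch of this expansion, assuming each comma-separated group expands to the Cartesian product of its values (so the group 23 yields one triplet and the group 4 9 yields eight, i.e., nine in total); this is an illustration of the assumed semantics, not the actual parser:

```python
# Illustration only (assumed semantics): expand a triplet specification such as
# "23, 4 9" into (M, N, K) triplets.
from itertools import product

def expand_triplets(spec):
    triplets = set()
    for group in spec.split(","):
        values = [int(v) for v in group.split()]
        triplets.update(product(values, repeat=3))  # Cartesian product within a group
    return sorted(triplets)

print(expand_triplets("23, 4 9"))  # (23,23,23) plus the 8 combinations of {4,9}^3 -> 9 triplets
```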

An auto-tuning session (regardless of whether the wrapper tune_multiply.sh or the underlying tuning script tune_multiply.py is used) produces a (series of) JSON file(s), which are summarized into a CSV file. After auto-tuning, simply rebuild:

cd dbcsr/src/acc/opencl
make [DBG=0]

Now the backend, or more accurately the embedded LIBSMM library, carries the CSV file content inside the binary/executable. However, the environment variable OPENCL_LIBSMM_SMM_PARAMS allows controlling the adoption of tuned parameters (OPENCL_LIBSMM_SMM_PARAMS=0 disables tuned parameters, and OPENCL_LIBSMM_SMM_PARAMS=/path/to/my.csv supplies a file, thereby overriding the parameters/defaults embedded in the binary). Tuning parameters for SP and DP can be mixed, whether embedded or given in a single file.

The auto-tuning simply (re)uses the benchmark drivers and parses their console output (timing, data type, etc.); the benchmarks know nothing about auto-tuning, and the tuner relies only on the parsed output.

hfp added 3 commits January 26, 2021 22:13
…enCL backend as well as the OpenCL based LIBSMM. Added documentation for both the OpenCL backend and the OpenCL based LIBSMM.
@hfp (Member, Author) commented Jan 26, 2021

Documentation for the OpenCL backend as well as for the OpenCL-based LIBSMM has been incorporated into the Ford documentation. Existing documentation is adjusted to accommodate/distinguish the OpenCL backend and the OpenCL-based LIBSMM (in addition to the CUDA/HIP-based backend/libsmm_acc).

@hfp (Member, Author) commented Jan 28, 2021

(I am happy to merge #419 if it is merged into develop earlier.)

@alazzaro alazzaro merged commit ba7f143 into cp2k:develop Feb 2, 2021
@alazzaro (Member) commented Feb 2, 2021

Thanks Hans!

@hfp (Member, Author) commented Feb 2, 2021

Awesome, it made it -- Thank you! I am looking forward to perhaps seeing others try this, and I also look forward to improving the kernels. At the moment the kernels are relatively simple (although they look cluttered, like my code in general, due to preprocessor definitions). It would in fact be interesting to tune a limited variety of kernels on AMD GPUs and to see what the gap is compared to the CUDA/HIP backend. Unfortunately, I only have a Vega 56 inside a macOS-based platform (not exactly representative, also because of missing DP support).

@alazzaro (Member) commented Feb 2, 2021

This is definitely in my plans... But first I have to base CP2K on the DBCSR CMake build and drop the makefiles inside DBCSR... I have a draft for that...

@hfp (Member, Author) commented Feb 2, 2021

Also note, there is a new test for Daint-CI. I disabled the runtime tests, but I believe there is only a minor issue. The build test, however, is fully enabled, and I adjusted things for USE_ACCEL=opencl to build on Daint and to pick up Nvidia's OpenCL implementation.

@hfp hfp deleted the openclacc branch March 1, 2021 14:17