
OpenCL based ACC-backend and SMM library #406


Merged
merged 59 commits into cp2k:develop on Feb 2, 2021

Conversation

@hfp (Member) commented Dec 3, 2020

This PR just misses the build integration into DBCSR (CMake, etc.), which can be a follow-up PR (any help appreciated). Transpose and SMM both work for general matrices, and the code is tested based on stand-alone reproducers.

@hfp hfp marked this pull request as draft December 3, 2020 16:06
@codecov (bot) commented Dec 3, 2020

Codecov Report

Merging #406 (711d289) into develop (21dae0f) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           develop    #406   +/-   ##
=======================================
  Coverage     63.1%   63.1%           
=======================================
  Files           86      86           
  Lines        25625   25625           
=======================================
  Hits         16190   16190           
  Misses        9435    9435           
Flag Coverage Δ
unittests 63.1% <ø> (ø)
with-blas 63.1% <ø> (ø)
with-libxsmm 63.2% <ø> (ø)
with-mpi 63.6% <ø> (ø)
with-openmp 62.3% <ø> (ø)
without-mpi 59.4% <ø> (ø)
without-openmp 62.3% <ø> (ø)

Flags with carried forward coverage won't be shown.


Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 21dae0f...711d289.

@hfp hfp marked this pull request as ready for review December 3, 2020 17:17
@hfp (Member, Author) commented Dec 3, 2020

The build system may pull the kernel sources into the transpose and SMM locations, which allows staying independent of the search paths otherwise needed for the separate kernel sources (and keeps the executable literally portable). Embedding the sources requires some minor processing to bring the text files into the format of string literals (see the sketch below).
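For illustration, here is a minimal sketch of such processing as a Python helper; the file names and the symbol name are hypothetical and this is not the actual build step:

```python
# Minimal sketch (hypothetical file/symbol names): turn an OpenCL kernel source
# file into a C header that carries the same text as a string literal.
def embed_kernel(cl_path, header_path, symbol):
    with open(cl_path) as src:
        lines = src.read().splitlines()
    with open(header_path, "w") as out:
        out.write("static const char {}[] =\n".format(symbol))
        for line in lines:
            # Escape backslashes and quotes so each source line becomes a valid literal piece.
            escaped = line.replace("\\", "\\\\").replace('"', '\\"')
            out.write('  "{}\\n"\n'.format(escaped))
        out.write("  ;\n")

embed_kernel("transpose.cl", "transpose.h", "opencl_source_transpose")
```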

Kernel sources are templated such that all cases are handled. No limit is currently implemented, but such arguments are already part of the ACC/LIBSMM interface (and marked unused in the current implementation). Similarly, processing heterogeneous stacks requires a certain result code so that DBCSR picks up such stacks with more general code; this is similar to the CUDA/HIP backend. Btw, certain values are currently hard-coded across backends and might be covered by the ACC/LIBSMM interface in the future (the number of parameters per stack, i.e., 3 on the device side and 7 on the host side, as well as the return code to reject heterogeneous stacks, etc.).

@alazzaro alazzaro self-assigned this Dec 4, 2020
@hfp (Member, Author) commented Dec 7, 2020

This PR now builds with CMake. The work is in the middle of passing DBCSR's tests (a large number do not pass at the moment).

@hfp (Member, Author) commented Dec 7, 2020

Specifically, tests/dbcsr_unittest2 passes single-threaded, tests/dbcsr_unittest4 passes in general, but tests/dbcsr_unittest1 and tests/dbcsr_unittest3 are failing.

@dev-zero (Contributor) commented Dec 9, 2020

This PR just misses the build integration into DBCSR (CMake, etc.), which can be a follow-up PR (any help appreciated).

sorry about dropping the ball here, will take a look at it again next week (please ping me should I forget)

@hfp (Member, Author) commented Dec 9, 2020

sorry about dropping the ball here, will take a look at it again next week (please ping me should I forget)

No worries, you already helped with your review. I can take some more time to get all tests running, though this is the boring/time-consuming part: finding where things break. However, it looks good already; I can see it running on my integrated GPU.

@alazzaro alazzaro added this to the v2.2 milestone Dec 9, 2020
@alazzaro (Member) commented Dec 9, 2020

@hfp Thanks a lot for this PR! This work will definitely go into v2.2. I will take a look and comment next week.

I have a general comment on the reason for having the OpenCL backend.
Andreas did a preliminary OpenCL implementation years ago. At that time, the idea was to have OpenCL for the GPUs, but then we realized that CUDA was the best option for Nvidia (no surprise) and the OpenCL implementation became useless (we removed it during the transition from SVN to Git). For the general case, we now support HIP. Shoshana did nice work so that CUDA and HIP can share almost the entire code. In this way, we can cover NVIDIA and AMD GPUs (and whatever else HIP can support).

Now, I wonder which target you have in mind for the OpenCL backend. Is this for Intel GPUs or FPGAs?
And what about your OpenMP backend (#260)?

Sorry for these questions... OpenCL is good to have for sure, but we should avoid having too many backends to support...

@hfp (Member, Author) commented Dec 9, 2020

The OpenCL backend is meant for upcoming Intel GPUs. I could have done this with DPC++ and I may try that as we go forward. Meanwhile, I believe the OpenCL backend can be useful for FPGA/GPU targets in general (though Intel FPGAs also allow for DPC++). Anyway, OpenCL is an industry standard, and in the case of Intel the runtime shares its bits with our Level-0 runtime (https://github.com/intel/compute-runtime). I hence consider this slightly more low-level than DPC++ and wanted to try it first. Regarding non-Intel hardware, this backend should work fine on AMD and we can try/compare when ready.

@alazzaro (Member) commented Dec 9, 2020

OK, thanks for the explanation, and I agree with you that we should have it for v2.2.
Next question then: how do we optimize the kernels, autotuning in the same way we do for CUDA/HIP?

@hfp (Member, Author) commented Dec 9, 2020

And what about your OpenMP backend

The OpenMP backend was meant to be a proof of concept, and it was found unsuitable for implementing the ACC interface (at least not until a "stream" clause is within reach). Generally, an API-driven approach (as implemented with the ACC interface) is somewhat at odds with a directive-based approach (pragmas are naturally scoped with the code they annotate). I think additional/future acceleration in DBCSR can choose the easier approach, whether API- or directive-based, especially if the ACC interface had to grow. Though, there was no point touching the highly sophisticated Cannon implementation, and it is way easier to write a backend.

@alazzaro (Member) commented Dec 9, 2020

And what about your OpenMP backend

The OpenMP backend was meant to be a proof of concept, and it was found unsuitable for implementing the ACC interface [...]

I definitely agree... the two approaches are somewhat orthogonal, and I liked your attempt to add ACC-OpenMP a lot... The situation may change in the future, but I agree with your comment.

@hfp (Member, Author) commented Dec 9, 2020

how do we optimize the kernels, autotuning in the same way we do for CUDA/HIP?

I am deferring the auto-tuning for the time being (this PR). I was looking at the CUDA/HIP approach but found I had to rewrite the benchmark code, etc. (that code is unfortunately written using CUDA/HIP for no real reason; it barely uses anything beyond the ACC interface). For auto-tuning OpenCL kernels, I would step back to a fixed set of triplets. My plan is to use the acc_bench_smm and acc_bench_trans drivers for tuning, perhaps with OpenTuner. I have good experience with the latter, and it is typically a few lines of code (see here or here). OpenTuner does a decent job w.r.t. hyperparameter tuning and can avoid the old way of exhaustive exploration. Btw, the predictive modeling in the CUDA/HIP auto-tuning approach in particular is rather expensive/complex given the advantage it offers over a fixed set of triplets. For this backend, the OpenTuner approach would simply run the aforementioned drivers and set some environment variables ("tunables") embedded into the backend (e.g., ACC_OPENCL_TRANS_WGSIZE and ACC_OPENCL_TRANS_INPLACE, just to speak about transpose; I am aware transpose is just bandwidth-bound and there is not much to tune). The SMM code can easily expose "tunables" going forward with the kernel implementation.
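For illustration, a minimal OpenTuner sketch along these lines; the tuner class, the parameter range, and the way the environment variable is applied are assumptions and this is not the actual tune_multiply.py:

```python
# Minimal sketch, not the actual tuner: drive acc_bench_trans with OpenTuner by
# setting the ACC_OPENCL_TRANS_WGSIZE "tunable" and measuring the runtime.
import opentuner
from opentuner import ConfigurationManipulator, IntegerParameter, MeasurementInterface, Result

class TransTuner(MeasurementInterface):
    def manipulator(self):
        m = ConfigurationManipulator()
        # Assumed range for the work-group size; real limits depend on the device.
        m.add_parameter(IntegerParameter("wgsize", 1, 256))
        return m

    def run(self, desired_result, input, limit):
        wgsize = desired_result.configuration.data["wgsize"]
        # Run the benchmark driver with the tunable exported as an environment variable.
        result = self.call_program(
            "ACC_OPENCL_TRANS_WGSIZE={} ./acc_bench_trans".format(wgsize))
        return Result(time=result["time"])

if __name__ == "__main__":
    TransTuner.main(opentuner.default_argparser().parse_args())
```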

@alazzaro (Member) commented Dec 9, 2020

OK Hans, thanks for the other clarification. I agree with the idea of having a static optimizer, i.e., no JIT kernels. We can always consider having a "default" kernel for the remaining cases, or even a fallback to an OpenCL BLAS. Great!

@alazzaro (Member) commented Dec 9, 2020

One more question, this is for @dev-zero too: How do we test the OpenCL backend on the way forward?

@hfp (Member, Author) commented Dec 9, 2020

Well, this backend already uses JIT code generation, and the tuning approach would also load a set of optimized parameters (perhaps even GPU-specific ones). The JIT code generation uses poor man's templates by passing '-DSomething' on OpenCL's build line (and the kernel is potentially written in terms of "Something"). Effectively, there is no difference compared to C++ templates and loading high-level C++ code into the CUDA runtime.
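As an illustration of the '-D' mechanism, here is a sketch using PyOpenCL rather than the backend's actual C code; the kernel and the SCALE macro are made up:

```python
# Illustration only: specialize an OpenCL kernel at JIT time by passing a -D
# definition on the build line ("poor man's template").
import numpy as np
import pyopencl as cl

source = """
kernel void scale(global float* data) {
  const int i = get_global_id(0);
  data[i] *= (float)SCALE;  /* SCALE is injected via -D on the build line */
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
program = cl.Program(ctx, source).build(options=["-DSCALE=2"])

host = np.ones(16, dtype=np.float32)
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf=host)
program.scale(queue, host.shape, None, buf)
cl.enqueue_copy(queue, host, buf)
print(host)  # all elements scaled by the value passed via -DSCALE
```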

@dev-zero (Contributor) commented Dec 9, 2020

One more question, this is for @dev-zero too: How do we test the OpenCL backend on the way forward?

Well, an OCI-compatible image with all the required runtimes would be nice, and if we can run on a CPU as a fallback for the OpenCL part it becomes a lot easier. If that is indeed possible, but only with an Intel CPU, we can use our tcopt4-tcopt8 machines as test runners.

@hfp (Member, Author) commented Dec 9, 2020

Wrt tests, perhaps the HIP image is a good choice? It should carry the OpenCL runtime as well/already.

@dev-zero (Contributor) commented Dec 9, 2020

Sure, but we should find a way to get a test that is actually executed, to be able to establish a baseline.

@hfp (Member, Author) commented Dec 14, 2020

DBCSR's unit tests are passing. I have enabled DP and SP for the OpenCL backend (CUDA/HIP only implements DP).

@alazzaro (Member) commented

DBCSR's unit tests are passing. I have enabled DP and SP for the OpenCL backend (CUDA/HIP only implements DP).

This is very, very relevant... There was a request for SP, and the reply was "we don't support it".
Let me review, and we can merge whenever you are ready...

@hfp (Member, Author) commented Jan 22, 2021

I am going to merge this PR when the tests are passing. I added Daint-CI tests for OpenCL, but disabled the actual runtime tests (keeping just the "build" test). All runtime tests pass locally when working on Daint with settings similar to the CI scripts. Also, the Makefile-based build (acc_bench_trans and acc_bench_smm) passes for the OpenCL and CUDA backends (described here). The missing bits are only incremental improvements (kernels) and documentation.

@alazzaro (Member) commented

Hans, this is great! Give me a few days to review what you did before merging... I assume @dev-zero will take a look as well...

@hfp (Member, Author) commented Jan 22, 2021

ACK

@hfp (Member, Author) commented Jan 23, 2021

Prior to the documentation update, here is a quick hint on how to test-drive auto-tuning:

cd dbcsr/src/acc/opencl
make [DBG=0]
cd smm
./tune_multiply.sh  100  1 1  23, 4 9

All arguments of tune_multiply.sh are optional, but the default set of triplets already amounts to ~1400 kernels. The 100 given above is the number of seconds spent per kernel during auto-tuning; typically 300 (5 minutes per kernel) should be aimed for. Later implementations may use a more informed value than just a fixed duration per kernel. The second group of arguments controls the work split (first the number of partitions/nodes, then the actual part number). The script displays a time estimate before starting the whole process. The last group of command-line arguments (23, 4 9) is an example of a triplet specification; each comma-separated group is expanded into a series of triplets using the Cartesian product (23, 4 9 generates 9 triplets; see the sketch below).
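A minimal sketch of this expansion, assuming each comma-separated group expands to the Cartesian product of its values (so the group 23 yields one triplet and the group 4 9 yields eight, i.e., nine in total); this is an illustration of the assumed semantics, not the actual parser:

```python
# Illustration only (assumed semantics): expand a triplet specification such as
# "23, 4 9" into (M, N, K) triplets.
from itertools import product

def expand_triplets(spec):
    triplets = set()
    for group in spec.split(","):
        values = [int(v) for v in group.split()]
        triplets.update(product(values, repeat=3))  # Cartesian product within a group
    return sorted(triplets)

print(expand_triplets("23, 4 9"))  # (23,23,23) plus the 8 combinations of {4,9}^3 -> 9 triplets
```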

An auto-tuning session (regardless of whether the wrapper tune_multiply.sh or the underlying tuning script tune_multiply.py is used) produces a (series of) JSON file(s), which are summarized into a CSV file. After auto-tuning, simply rebuild:

cd dbcsr/src/acc/opencl
make [DBG=0]

Now the backend, or more accurately the embedded LIBSMM library, carries the CSV file content inside the binary/executable. However, the environment variable OPENCL_LIBSMM_SMM_PARAMS allows controlling the adoption of tuned parameters (OPENCL_LIBSMM_SMM_PARAMS=0 disables tuned parameters, and OPENCL_LIBSMM_SMM_PARAMS=/path/to/my.csv supplies a file, thereby overriding the parameters/defaults embedded in the binary). Tuning parameters for SP and DP can be mixed, whether embedded or given in a single file.

The auto-tuning simply (re)uses the benchmark drivers and parses their console output (timing, data type, etc.); the benchmarks know nothing about auto-tuning, and the tuner relies only on the parsed output.

hfp added 3 commits January 26, 2021 22:13
…enCL backend as well as the OpenCL based LIBSMM. Added documentation for both the OpenCL backend and the OpenCL based LIBSMM.
@hfp (Member, Author) commented Jan 26, 2021

Documentation for the OpenCL backend as well as for the OpenCL-based LIBSMM has been incorporated into the Ford documentation. Existing documentation is adjusted to accommodate/distinguish the OpenCL backend and the OpenCL-based LIBSMM (in addition to the CUDA/HIP-based backend/libsmm_acc).

@hfp (Member, Author) commented Jan 28, 2021

(I am happy to merge #419 if it is merged into develop earlier.)

@alazzaro alazzaro merged commit ba7f143 into cp2k:develop Feb 2, 2021
@alazzaro (Member) commented Feb 2, 2021

Thanks Hans!

@hfp (Member, Author) commented Feb 2, 2021

Awesome, it made it -- Thank you! I am looking forward to perhaps seeing others try this, and I also look forward to improving the kernels. At the moment the kernels are relatively simple (although they look cluttered, like my code in general, due to preprocessor definitions). It would in fact be interesting to tune a limited variety of kernels on AMD GPUs and to see what the gap is compared to the CUDA/HIP backend. Unfortunately, I only have a Vega 56 inside a macOS-based platform (not exactly representative, also because of missing DP support).

@alazzaro (Member) commented Feb 2, 2021

This is definitely in my plans... But first I have to base CP2K on the DBCSR CMake build and drop the makefiles inside DBCSR... I have a draft for that...

@hfp (Member, Author) commented Feb 2, 2021

Also note, there is a new test for Daint-CI. I disabled the runtime tests, but I believe there is only a minor issue. The build test, however, is fully enabled, and I adjusted things for USE_ACCEL=opencl to build on Daint and to pick up Nvidia's OpenCL implementation.

@hfp hfp deleted the openclacc branch March 1, 2021 14:17