@@ -28,7 +28,8 @@ Requirements:
- ``pycuda`` or ``cupy`` and CUDA development tools (``nvcc``) for the cuda backend
- ``numpy``
- on Windows, this requires visual studio (c++ tools) and a cuda toolkit installation,
- with either CUDA_PATH or CUDA_HOME environment variable.
+ with either the CUDA_PATH or CUDA_HOME environment variable. However, it should be
+ simpler to install using ``conda``, as detailed below.
- *Only when installing from source*: ``vkfft.h`` installed in the usual include
directories, or in the 'src' directory
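The backend requirements above can be checked programmatically. A minimal sketch, assuming only that the backend package names are ``pycuda``, ``cupy`` and ``pyopencl`` (the helper function itself is made up for illustration, and only verifies that the packages are installed, not that a working GPU or driver is present):

```python
import importlib.util


def available_backends():
    """Return the subset of GPU backend packages that are importable.

    This only checks package installation, not GPU/driver availability.
    """
    candidates = ("pycuda", "cupy", "pyopencl")
    return [m for m in candidates if importlib.util.find_spec(m) is not None]


print(available_backends())
```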
@@ -105,8 +106,8 @@ Features
- unit tests for all transforms: see test sub-directory. Note that these take a **long**
time to finish due to the exhaustive number of sub-tests.
- Note that out-of-place C2R transform currently destroys the complex array for FFT dimensions >=2
- - tested on macOS (10.13.6), Linux (Debian/Ubuntu, x86-64 and power9), and Windows 10
- (Anaconda python 3.8 with Visual Studio 2019 and the CUDA toolkit 11.2)
+ - tested on macOS (10.13.6/x86, 12.6/M1), Linux (Debian/Ubuntu, x86-64 and power9),
+ and Windows 10 (Anaconda python 3.8 with Visual Studio 2019 and the CUDA toolkit 11.2)
- GPUs tested: mostly nVidia cards, but also some AMD cards and macOS with M1 GPUs.
- inplace transforms do not require an extra buffer or work area (as in cuFFT), unless the x
size is larger than 8192, or if the y and z FFT size are larger than 2048. In that case
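That size rule can be sketched as a small predicate. The thresholds 8192 and 2048 are taken from the text above; the helper name and the exact axis convention are illustrative, not pyvkfft API:

```python
def needs_work_area(shape):
    """Guess whether an in-place transform needs an extra buffer,
    following the thresholds quoted above: x > 8192, or a y/z
    size > 2048. shape is (..., y, x) with x the fast axis.
    """
    x = shape[-1]
    yz = shape[:-1]
    return x > 8192 or any(n > 2048 for n in yz)


assert needs_work_area((256, 256)) is False      # small 2D: no buffer
assert needs_work_area((16384,)) is True         # x > 8192
assert needs_work_area((4096, 1024)) is True     # y > 2048
```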
@@ -131,9 +132,9 @@ Performance
See the benchmark notebook, which allows plotting OpenCL and CUDA backend throughput, as well as comparing
with cuFFT (using scikit-cuda) and clFFT (using gpyfft).
- Example result for batched 2D FFT with array dimensions of batch x N x N using a Titan V:
+ Example result for batched 2D, single precision FFT with array dimensions of batch x N x N using a V100:
- .. image:: https://raw.githubusercontent.com/vincefn/pyvkfft/master/doc/benchmark-2DFFT-TITAN_V-Linux.png
+ .. image:: https://raw.githubusercontent.com/vincefn/pyvkfft/master/doc/benchmark-2DFFT-NVIDIA-Tesla_V100-Linux.png
Notes regarding this plot:
@@ -143,23 +144,29 @@ Notes regarding this plot:
* the batch size is adapted for each N so the transform takes long enough; in practice the
transformed array is at around 600MB. Transforms on small arrays with small batch sizes
could produce lower performance, or better performance when fully cached.
- * a number of blue + (CuFFT) are actually performed as radix-N transforms with 7<N<127 (e.g. 11)
- -hence the performance similar to the blue dots- but the list of supported radix transforms
- is undocumented (?) so they are not correctly labeled.
+ * The dots which are labelled as using a Bluestein algorithm can also be using a Rader one,
+ hence the better performance of many sizes, both for vkFFT and cuFFT.
The general results are:
* vkFFT throughput is similar to cuFFT up to N=1024. For N>1024 vkFFT is much more
efficient than cuFFT due to the smaller number of reads and writes per FFT axis
(apart from isolated radix-2 3 sizes)
* the OpenCL and CUDA backends of vkFFT perform similarly, though there are ranges
- where CUDA performs better, due to different cache . [Note that if the card is also used for display,
+ where CUDA performs better, due to different cache. [Note that if the card is also used for display,
then the difference can increase, e.g. for nVidia cards the OpenCL performance is more affected
when being used for display than the cuda backend]
* clFFT (via gpyfft) generally performs much worse than the other transforms, though this was
tested using nVidia cards. (Note that the clFFT/gpyfft benchmark tries all FFT axis permutations
to find the fastest combination)
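The throughput plotted in such benchmarks is typically an idealized memory bandwidth: each transformed axis costs at least one read and one write of the whole array, which is why fewer passes per axis (as noted above for vkFFT at large N) translates directly into higher throughput. A small sketch of that arithmetic, with the formula and numbers purely illustrative (not taken from the benchmark notebook):

```python
def idealized_throughput_gbs(shape, dtype_bytes, ndim_fft, dt_seconds):
    """Idealized throughput in GB/s for one batched FFT:
    (one read + one write per transformed axis) * array size / time.
    """
    n_elements = 1
    for n in shape:
        n_elements *= n
    nbytes = n_elements * dtype_bytes
    return 2 * ndim_fft * nbytes / dt_seconds / 1e9


# e.g. a 64 x 1024 x 1024 single-precision complex (8 bytes/element)
# batched 2D FFT taking 1 ms
gbs = idealized_throughput_gbs((64, 1024, 1024), 8, 2, 1e-3)
print(f"{gbs:.1f} GB/s")
```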
+ Another example on an A40 card (only with radix<=13 transforms):
+
+ .. image:: https://raw.githubusercontent.com/vincefn/pyvkfft/master/doc/benchmark-2DFFT-NVIDIA-Tesla_A40-Linux-radix13.png
+
+ On this card cuFFT is significantly better, even if the radix-11 and radix-13 transforms
+ supported by vkFFT give globally better results.
+
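Whether a size can use a pure radix kernel (rather than the slower Bluestein/Rader path) depends on its prime factorization. A sketch of the radix<=13 check mentioned above; the function is illustrative, not pyvkfft API:

```python
def is_radix_supported(n, max_radix=13):
    """True if n factors entirely into primes <= max_radix
    (from 2, 3, 5, 7, 11, 13), i.e. a pure radix FFT kernel
    can be used instead of a Bluestein/Rader fallback.
    """
    for p in (2, 3, 5, 7, 11, 13):
        if p > max_radix:
            break
        while n % p == 0:
            n //= p
    return n == 1


assert is_radix_supported(2 * 3 * 11 * 13)       # 858 = radix-supported
assert not is_radix_supported(17)                # prime > 13: Bluestein/Rader
assert not is_radix_supported(11, max_radix=7)   # 11 unsupported with radix<=7
```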
Accuracy
--------
See the accuracy notebook, which allows comparing the accuracy for different