@@ -20,7 +20,7 @@ Requirements:
- ``vkfft.h`` installed in the usual include directories, or in the 'src' directory
- ``pyopencl`` and the opencl libraries/development tools for the opencl backend
- - ``pycuda`` or `cupy` and CUDA developments tools (`nvcc`) for the cuda backend
+ - ``pycuda`` or ``cupy`` and CUDA developments tools (`nvcc`) for the cuda backend
- ``numpy``
This package can be installed from source using ``python setup.py install`` or ``pip install .``.
@@ -101,17 +101,17 @@ Notes regarding this plot:
transformed array is at around 600MB. Transforms on small arrays with small batch sizes
could produce smaller performances, or better ones when fully cached.
* a number of blue + (CuFFT) are actually performed as radix-N transforms with 7<N<127 (e.g. 11)
- -hence the performance similar to the blue dots- but the actual supported radix transforms
+ -hence the performance similar to the blue dots- but the list of supported radix transforms
is undocumented so they are not correctly labeled.
The general results are:
* vkFFT throughput is similar to cuFFT up to N=1024. For N>1024 vkFFT is much more
efficient than cuFFT due to the smaller number of read and write per FFT axis
(apart from isolated radix-2 3 sizes)
- * the OpenCL and CUDA backends of vkFFT perform similarly, as expected. [Note that this should
- be true *as long as the card is only used for computing*. If it is also used for display,
- then performance may be different, e.g. for nvidia cards opencl performance is more affected
+ * the OpenCL and CUDA backends of vkFFT perform similarly, though there are ranges
+ where CUDA performs better. [Note that if the card is also used for display,
+ then difference can increase, e.g. for nvidia cards opencl performance is more affected
when being used for display than the cuda backend]
* clFFT (via gpyfft) generally performs much worse than the other transforms, though this was
tested using nVidia cards. (Note that the clFFT/gpyfft benchmark tries all FFT axis permutations