Commit e1a39a6: Update doc. Release 2023.1
1 parent: 986493e

7 files changed: +29 additions, -15 deletions


CHANGELOG.rst (7 additions, 1 deletion)

@@ -1,5 +1,7 @@
-Version 2023.1 (2023-XX-XX)
+Version 2023.1 (2023-01-19)
 -----------------------------
+* VkFFT 1.2.33, now using the Rader algorithm for better performance
+  with many non-radix sizes.
 * Fix R2C tests when using numpy (scipy unavailable) [#19]
 * Add support for F-ordered arrays (C2C and R2C)
 * Allow selection of backend for non-systematic pyvkfft-test

@@ -12,6 +14,10 @@ Version 2023.1 (2023-XX-XX)
   (from @isuruf, https://github.com/vincefn/pyvkfft/pull/17)
 * Fix simple fft interface import when only pycuda is used
 * Add cuda_driver_version, cuda_compile_version, cuda_runtime_version
+  functions.
+* Add simpler interface to run benchmarks, using separate processes.
+* Add pyvkfft-test-suite for long tests (up to 30 hours) for validation
+  before new releases.

 Version 2022.1.1 (2022-02-14)
 -----------------------------
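The F-ordered array support added in this release refers to numpy's two memory layouts, which an FFT backend must handle differently. As a quick refresher (pure numpy, no GPU needed; this snippet is an illustration, not pyvkfft code), the two layouts hold the same data but differ in which axis is contiguous:

```python
import numpy as np

# C order: the last axis is contiguous in memory;
# F (Fortran) order: the first axis is contiguous.
a_c = np.zeros((4, 8), dtype=np.complex64, order='C')
a_f = np.asfortranarray(a_c)

assert a_c.flags['C_CONTIGUOUS']
assert a_f.flags['F_CONTIGUOUS']
# complex64 is 8 bytes: strides are (64, 8) for C order, (8, 32) for F order
```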

README.rst (16 additions, 9 deletions)

@@ -28,7 +28,8 @@ Requirements:
 - ``pycuda`` or ``cupy`` and CUDA development tools (`nvcc`) for the cuda backend
 - ``numpy``
 - on Windows, this requires visual studio (c++ tools) and a cuda toolkit installation,
-  with either CUDA_PATH or CUDA_HOME environment variable.
+  with either the CUDA_PATH or CUDA_HOME environment variable set. However it should be
+  simpler to install using ``conda``, as detailed below
 - *Only when installing from source*: ``vkfft.h`` installed in the usual include
   directories, or in the 'src' directory

@@ -105,8 +106,8 @@ Features
 - unit tests for all transforms: see the test sub-directory. Note that these take a **long**
   time to finish due to the exhaustive number of sub-tests.
 - Note that the out-of-place C2R transform currently destroys the complex array for FFT dimensions >=2
-- tested on macOS (10.13.6), Linux (Debian/Ubuntu, x86-64 and power9), and Windows 10
-  (Anaconda python 3.8 with Visual Studio 2019 and the CUDA toolkit 11.2)
+- tested on macOS (10.13.6/x86, 12.6/M1), Linux (Debian/Ubuntu, x86-64 and power9),
+  and Windows 10 (Anaconda python 3.8 with Visual Studio 2019 and the CUDA toolkit 11.2)
 - GPUs tested: mostly nVidia cards, but also some AMD cards and macOS with M1 GPUs.
 - inplace transforms do not require an extra buffer or work area (as in cuFFT), unless the x
   size is larger than 8192, or if the y and z FFT sizes are larger than 2048. In that case

@@ -131,9 +132,9 @@ Performance
 See the benchmark notebook, which allows plotting the OpenCL and CUDA backend throughput, as well
 as comparing with cuFFT (using scikit-cuda) and clFFT (using gpyfft).

-Example result for batched 2D FFT with array dimensions of batch x N x N using a Titan V:
+Example result for a batched 2D, single precision FFT with array dimensions of batch x N x N using a V100:

-.. image:: https://raw.githubusercontent.com/vincefn/pyvkfft/master/doc/benchmark-2DFFT-TITAN_V-Linux.png
+.. image:: https://raw.githubusercontent.com/vincefn/pyvkfft/master/doc/benchmark-2DFFT-NVIDIA-Tesla_V100-Linux.png

 Notes regarding this plot:

@@ -143,23 +144,29 @@ Notes regarding this plot:
 * the batch size is adapted for each N so the transform takes long enough; in practice the
   transformed array is around 600MB. Transforms on small arrays with small batch sizes
   could yield lower performance, or better performance when fully cached.
-* a number of blue + (CuFFT) are actually performed as radix-N transforms with 7<N<127 (e.g. 11)
-  -hence the performance similar to the blue dots- but the list of supported radix transforms
-  is undocumented (?) so they are not correctly labeled.
+* the dots which are labelled as using a Bluestein algorithm can also be using a Rader one,
+  hence the better performance of many sizes, both for vkFFT and cuFFT

 The general results are:

 * vkFFT throughput is similar to cuFFT up to N=1024. For N>1024 vkFFT is much more
   efficient than cuFFT due to the smaller number of reads and writes per FFT axis
   (apart from isolated radix-2 and radix-3 sizes)
 * the OpenCL and CUDA backends of vkFFT perform similarly, though there are ranges
-  where CUDA performs better, due to different cache . [Note that if the card is also used for display,
+  where CUDA performs better, due to different cache behaviour. [Note that if the card is also used
   for display, the difference can increase, e.g. for nvidia cards opencl performance is more
   affected when being used for display than the cuda backend]
 * clFFT (via gpyfft) generally performs much worse than the other transforms, though this was
   tested using nVidia cards. (Note that the clFFT/gpyfft benchmark tries all FFT axis permutations
   to find the fastest combination)

+Another example on an A40 card (only with radix<=13 transforms):
+
+.. image:: https://raw.githubusercontent.com/vincefn/pyvkfft/master/doc/benchmark-2DFFT-NVIDIA-Tesla_A40-Linux-radix13.png
+
+On this card cuFFT is significantly faster, even if the radix-11 and radix-13 transforms
+supported by vkFFT give globally better results.
+
 Accuracy
 --------
 See the accuracy notebook, which allows comparing the accuracy for different
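The A40 benchmark above is restricted to sizes whose prime factors are all <=13, i.e. the sizes handled by pure radix kernels rather than Bluestein/Rader. A small, hypothetical helper (not part of pyvkfft) sketching that size filter:

```python
def max_prime_factor(n: int) -> int:
    """Largest prime factor of n (for n >= 2)."""
    largest, p = 1, 2
    while p * p <= n:
        while n % p == 0:
            largest = p
            n //= p
        p += 1
    # whatever remains above 1 is itself a prime factor
    return n if n > 1 else largest

def is_radix(n: int, max_radix: int = 13) -> bool:
    # A size is covered by radix kernels when all its prime
    # factors are <= max_radix (13 for the A40 plot above)
    return max_prime_factor(n) <= max_radix

# sizes below 50 usable in a radix<=13-only benchmark
radix_sizes = [n for n in range(2, 50) if is_radix(n)]
```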

pyvkfft/benchmark.py (2 additions, 1 deletion)

@@ -477,7 +477,8 @@ def run(nmin, nmax, radix_max, ndim, precision="single", nb_repeat=3, gpu_name=N
     Run the benchmark, measuring the idealised memory throughput (assuming a single
     read+write operation per axis) for an inplace C2C transform using the different
     fft backends available.
+    Note that each test is run in a separate process, so this can
+    take a long time.

     :param nmin: smallest size N of the array, e.g. with a shape (batch, N, N)
         for a 2D transform.
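The docstring above defines the idealised throughput as one read plus one write of the whole array per transformed axis. A sketch of that metric (the function name and argument names are hypothetical; the actual pyvkfft.benchmark code may compute it differently):

```python
def ideal_throughput_gbs(shape, itemsize, ndim_fft, dt):
    # One read + one write of the full array per transformed axis,
    # divided by the measured time dt (in seconds) -> GB/s
    nbytes = itemsize
    for s in shape:
        nbytes *= s
    return ndim_fft * 2 * nbytes / dt / 1e9

# e.g. an (8, 1024, 1024) single precision complex array (8 bytes/element),
# 2D FFT measured at 5 ms:
gbs = ideal_throughput_gbs((8, 1024, 1024), 8, 2, 5e-3)  # ~53.7 GB/s
```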

pyvkfft/version.py (2 additions, 2 deletions)

@@ -2,9 +2,9 @@

 __authors__ = ["Vincent Favre-Nicolin (pyvkfft), Dmitrii Tolmachev (VkFFT)"]
 __license__ = "MIT"
-__date__ = "2022/02/14"
+__date__ = "2023/01/19"
 # Valid numbering includes 3.1, 3.1.0, 3.1.2, 3.1dev0, 3.1a0, 3.1b0
-__version__ = "2022.1.1"
+__version__ = "2023.1"


 def vkfft_version():

setup.py (2 additions, 2 deletions)

@@ -173,11 +173,11 @@ def run(self):

         for k, v in os.environ.items():
             if "VKFFT_BACKEND" in k:
-                # Kludge to manually select vkfft backends. useful e.g. if nvidia tools
+                # Environment variable to manually select vkfft backends, useful e.g. if nvidia tools
                 # are installed but not functional
                 # e.g. use:
-                # VKFFT_BACKEND=cuda,opencl python setup.py install
                 # VKFFT_BACKEND=opencl pip install pyvkfft
+                # VKFFT_BACKEND=cuda pip install .
                 if 'opencl' not in v.lower():
                     exclude_packages.append('opencl')
                 if 'cuda' not in v.lower():
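The VKFFT_BACKEND logic above decides which backend sub-packages are excluded at install time. A standalone sketch of that selection (a hypothetical helper mirroring, not reusing, the setup.py code; it takes a mapping so it can be tried without touching the real environment):

```python
import os

def excluded_backends(environ=None):
    # Any variable whose name contains "VKFFT_BACKEND" lists the
    # backends to keep; the others are added to the exclusion list.
    environ = os.environ if environ is None else environ
    exclude = []
    for k, v in environ.items():
        if "VKFFT_BACKEND" in k:
            if 'opencl' not in v.lower():
                exclude.append('opencl')
            if 'cuda' not in v.lower():
                exclude.append('cuda')
    return exclude

# e.g. VKFFT_BACKEND=opencl keeps only the opencl sub-package:
print(excluded_backends({"VKFFT_BACKEND": "opencl"}))  # ['cuda']
```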
