Description
Failing to inline a function within a GPU kernel can have very bad consequences. I had a hard time coming up with a good MWE of the kinds of issues I commonly experience, because whenever I write a simple MWE the compiler manages to inline it. The consequences, however, are easy to show by switching the annotation on the inner function between @inline and @noinline:
using KernelAbstractions, CUDA
using BenchmarkTools

function moveto(device::Backend, A::AbstractArray)
    Ad = allocate(device, eltype(A), size(A)...)
    copyto!(Ad, A)
end

# or @noinline
@inline innerfunc1(A, idx) = A[idx]^2 + 1

@kernel function _kernf1!(B::AbstractArray, @Const(A::AbstractArray))
    j = @index(Global)
    idx = CartesianIndices(size(B))[j]
    B[idx] = innerfunc1(A, idx)
    nothing
end

function f1!(B::AbstractArray, A::AbstractArray)
    _kernf1!(get_backend(B))(B, A, ndrange=length(B))
    B
end

function main(n::Integer=10^6; device::Backend=CPU())
    B = moveto(device, zeros(Float32, 4, n))
    A = moveto(device, ones(Float32, 4, n))
    @btime CUDA.@sync f1!($B, $A)
end
On CPU I get
# @inline
775.743 μs (304 allocations: 21.41 KiB)
# @noinline
926.466 μs (304 allocations: 21.41 KiB)
while on GPU (NVIDIA RTX 4090) I get
# @inline
20.518 μs (55 allocations: 1.34 KiB)
# @noinline
100.539 μs (55 allocations: 1.34 KiB)
So in this simple example, the missed inline costs about 20% on CPU, but on GPU it is a factor of 5! In my anecdotal experience the consequences in real code can be even worse: I have seen a factor of 10 loss a number of times (though I can't guarantee it was only from a single inline being missed).
Currently it is necessary to use a lot of @inline annotations to prevent this from happening unexpectedly. I personally would very much like it if there were some sort of always_inline option for @kernel, since in the overwhelming majority of use cases this is what I want.
Note that @maleadt mentioned on Slack that @cuda already has an always_inline option.
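For reference, here is a minimal sketch of what that looks like when launching a kernel directly with CUDA.jl rather than through KernelAbstractions. The helper `inner` and the kernel `square_plus_one!` are hypothetical names made up for this illustration, and I am assuming `always_inline=true` is passed as a compiler keyword to `@cuda` as described above; check the exact spelling against the `@cuda`/`cufunction` docs of your installed CUDA.jl version. The GPU part only runs when a functional CUDA device is present.

```julia
using CUDA

# Hypothetical helper, deliberately left without an @inline annotation.
inner(A, i) = A[i]^2 + 1

# Hypothetical hand-written CUDA kernel (no KernelAbstractions involved).
function square_plus_one!(B, A)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(B)
        @inbounds B[i] = inner(A, i)
    end
    return nothing
end

# Host-side reference result, used to check the GPU output.
A_host = ones(Float32, 4, 1024)
expected = [inner(A_host, i) for i in eachindex(A_host)]

if CUDA.functional()
    A = CuArray(A_host)
    B = CUDA.zeros(Float32, size(A_host)...)
    threads = 256
    blocks = cld(length(B), threads)
    # Assumption: always_inline is accepted as a compiler keyword by @cuda.
    CUDA.@sync @cuda always_inline=true threads=threads blocks=blocks square_plus_one!(B, A)
    @assert vec(Array(B)) == expected
end
```

Something equivalent on the KA side would mean the flag only has to be set once at the launch site instead of annotating every inner function.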
I realize this is perhaps an issue more appropriate for GPUCompiler.jl, but I opened it here since I would really like for KA to expose such an option if GPUCompiler had it.