options for more aggressive inlining #628

Failing to inline a function within a GPU kernel can be very costly. I had a hard time coming up with a good MWE of the issues I commonly run into, because whenever I write something simple the compiler manages to inline it. The consequences, however, are easy to show by forcing the call not to inline:
```julia
using KernelAbstractions, CUDA
using BenchmarkTools


function moveto(device::Backend, A::AbstractArray)
    Ad = allocate(device, eltype(A), size(A)...)
    copyto!(Ad, A)
end

# swap @inline for @noinline to reproduce the slow case
@inline innerfunc1(A, idx) = A[idx]^2 + 1

@kernel function _kernf1!(B::AbstractArray, @Const(A::AbstractArray))
    j = @index(Global)
    idx = CartesianIndices(size(B))[j]
    B[idx] = innerfunc1(A, idx)
    nothing
end

function f1!(B::AbstractArray, A::AbstractArray)
    _kernf1!(get_backend(B))(B, A, ndrange=length(B))
    B
end

function main(n::Integer=10^6; device::Backend=CPU())
    B = moveto(device, zeros(Float32, 4, n))
    A = moveto(device, ones(Float32, 4, n))
    @btime CUDA.@sync f1!($B, $A)
end
```
On CPU I get
```julia
# @inline
  775.743 μs (304 allocations: 21.41 KiB)

# @noinline
  926.466 μs (304 allocations: 21.41 KiB)
```
while on GPU (NVIDIA RTX 4090) I get
```julia
# @inline
  20.518 μs (55 allocations: 1.34 KiB)

# @noinline
  100.539 μs (55 allocations: 1.34 KiB)
```
So in this simple example the missed inline costs about 20% on CPU, but a factor of 5 on GPU! In my anecdotal experience the consequences in real code can be even worse: I have seen a factor of 10 loss a number of times (though I can't guarantee it was only from a single missed inline).
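
For reference, here is the arithmetic behind those two figures, using only the timings quoted above (the variable names are just for illustration):

```julia
# Timings copied from the benchmark output above, in microseconds.
cpu_inline, cpu_noinline = 775.743, 926.466
gpu_inline, gpu_noinline = 20.518, 100.539

# Slowdown factor = time without inlining / time with inlining.
cpu_slowdown = cpu_noinline / cpu_inline
gpu_slowdown = gpu_noinline / gpu_inline

println("CPU slowdown: ", round(cpu_slowdown, digits=2), "x")
println("GPU slowdown: ", round(gpu_slowdown, digits=2), "x")
```

That comes out to roughly 1.19x on CPU and 4.9x on GPU.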

Currently it is necessary to use *a lot* of `@inline` annotations to prevent this from happening unexpectedly. I personally would very much like it if there were some sort of `always_inline` option for `@kernel`, since in the overwhelming majority of use cases this is what I want.
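
To illustrate the kind of annotation I mean, here is a minimal sketch using a call-site `@inline` (available since Julia 1.8) inside a KA kernel. `helper` and `_callsite_kern!` are made-up names for this example, and it runs on the CPU backend only so that it does not need a GPU. I have not benchmarked this variant, so take it as a sketch of the workaround rather than a recommendation.

```julia
using KernelAbstractions

# Hypothetical helper with no annotation on its definition, so the
# compiler's own cost model decides whether to inline it.
helper(A, idx) = A[idx]^2 + 1

@kernel function _callsite_kern!(B, @Const(A))
    j = @index(Global)
    idx = CartesianIndices(size(B))[j]
    # Call-site annotation: request inlining for this particular call.
    B[idx] = @inline helper(A, idx)
end

A = ones(Float32, 4, 16)
B = zeros(Float32, 4, 16)
backend = get_backend(B)

_callsite_kern!(backend)(B, A, ndrange=length(B))
KernelAbstractions.synchronize(backend)
```

The downside is the same as before: every call (or every definition) that matters has to be annotated by hand, which is exactly what a kernel-wide option would avoid.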

Note that @maleadt mentioned on Slack that `@cuda` already has an `always_inline` option.
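
As far as I understand it, that option is passed as a keyword to `@cuda` alongside the launch configuration. Below is a rough sketch of what that looks like with plain CUDA.jl (no KA involved). `sq_plus_one` and `_sketch_kern!` are hypothetical names, the launch sizes are arbitrary, and I have not checked how the option interacts with an explicit `@noinline`, so the helper is left unannotated.

```julia
using CUDA

# Hypothetical helper, deliberately left without an inlining annotation.
sq_plus_one(A, i) = A[i]^2 + 1

function _sketch_kern!(B, A)
    # Standard 1D global thread index.
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(B)
        @inbounds B[i] = sq_plus_one(A, i)
    end
    return nothing
end

# Only launch when a working CUDA device is available.
if CUDA.functional()
    A = CUDA.ones(Float32, 4, 1024)
    B = CUDA.zeros(Float32, 4, 1024)
    threads = 256
    blocks = cld(length(B), threads)
    # always_inline is a compiler option, given next to the launch options.
    CUDA.@sync @cuda always_inline=true threads=threads blocks=blocks _sketch_kern!(B, A)
end
```

Something equivalent on `@kernel`, or on the kernel constructor, is what I am asking for here.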

I realize this is perhaps an issue more appropriate for GPUCompiler.jl, but I opened it here since I would really like KA to expose such an option if GPUCompiler has it.
