options for more aggressive inlining #628

@ExpandingMan

Failing to inline a function within a GPU kernel can have very bad consequences. I had a very hard time coming up with a good MWE of the kinds of issues I commonly run into, because in a simple MWE the compiler is able to inline on its own. It is easy to show the consequences, though, by forcing the choice with @inline or @noinline:

using KernelAbstractions, CUDA
using BenchmarkTools


function moveto(device::Backend, A::AbstractArray)
    Ad = allocate(device, eltype(A), size(A)...)
    copyto!(Ad, A)
end

# swap @inline for @noinline here to reproduce the slow case
@inline innerfunc1(A, idx) = A[idx]^2 + 1

@kernel function _kernf1!(B::AbstractArray, @Const(A::AbstractArray))
    j = @index(Global)
    idx = CartesianIndices(size(B))[j]
    B[idx] = innerfunc1(A, idx)
    nothing
end

function f1!(B::AbstractArray, A::AbstractArray)
    _kernf1!(get_backend(B))(B, A, ndrange=length(B))
    B
end

function main(n::Integer=10^6; device::Backend=CPU())
    B = moveto(device, zeros(Float32, 4, n))
    A = moveto(device, ones(Float32, 4, n))
    @btime CUDA.@sync f1!($B, $A)
end

On CPU I get

# @inline
  775.743 μs (304 allocations: 21.41 KiB)

# @noinline
  926.466 μs (304 allocations: 21.41 KiB)

while on GPU (NVIDIA RTX 4090) I get

# @inline
  20.518 μs (55 allocations: 1.34 KiB)

# @noinline
  100.539 μs (55 allocations: 1.34 KiB)

So in this simple example the missed inline costs about 20% on CPU, but on GPU it is a factor of 5! In my anecdotal experience the consequences in real code can be even worse: I have seen a factor-of-10 slowdown a number of times (though I can't guarantee it came from only a single missed inline).
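The two ratios can be checked directly from the timings reported above. This is only a small arithmetic sketch; the variable names are made up for illustration, and the numbers are copied from the benchmark output rather than re-measured. It should come out at roughly 1.19 on CPU and 4.9 on GPU.

```julia
# Timings copied from the benchmark output above, in microseconds.
cpu_inline, cpu_noinline = 775.743, 926.466
gpu_inline, gpu_noinline = 20.518, 100.539

# Slowdown factor = time without inlining / time with inlining.
cpu_slowdown = cpu_noinline / cpu_inline
gpu_slowdown = gpu_noinline / gpu_inline

println("CPU slowdown: ", round(cpu_slowdown, digits=2), "x")
println("GPU slowdown: ", round(gpu_slowdown, digits=2), "x")
```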

Currently it is necessary to use a lot of @inline annotations to prevent this from happening unexpectedly. I personally would very much like some sort of always_inline option for @kernel, since in the overwhelming majority of use cases this is what I want.

Note that @maleadt mentioned on Slack that @cuda already has an always_inline option.
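For reference, a minimal sketch of what that might look like when launching with @cuda directly rather than through KernelAbstractions. It assumes a CUDA.jl version that accepts the `always_inline` keyword and a functional NVIDIA GPU. The names `inner` and `kern!` are hypothetical, and the helper is deliberately left without an @inline or @noinline annotation, since how the flag interacts with an explicit @noinline is not assumed here.

```julia
using CUDA

# Hypothetical helper with no inlining annotation, so the decision is
# left to the compiler (and to the always_inline flag below).
inner(a, i) = @inbounds a[i]^2 + 1

# Plain CUDA.jl kernel: one thread per element, with a bounds guard
# because the grid may be larger than the array.
function kern!(b, a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(b)
        @inbounds b[i] = inner(a, i)
    end
    return nothing
end

n = 4 * 10^6
a = CUDA.ones(Float32, n)
b = CUDA.zeros(Float32, n)

threads = 256
blocks = cld(n, threads)

# always_inline=true asks the compiler to inline the calls made inside the kernel.
CUDA.@sync @cuda always_inline=true threads=threads blocks=blocks kern!(b, a)
```

The request in this issue is for @kernel (or the backend object) to accept an equivalent switch, so that the same effect is available without dropping down to @cuda.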

I realize this is perhaps an issue more appropriate for GPUCompiler.jl, but I opened it here since I would really like for KA to expose such an option if GPUCompiler had it.
