
Initial changes: Refactor Attention #2156


Open · wants to merge 1 commit into main

Conversation


@Itssshikhar commented Mar 22, 2025

Hey @danielhanchen!! I wanted to take a stab at refactoring attention from the puzzles themselves.

Currently, every model uses its own implementation of attention and calls it directly. I took some reference from vLLM's unified attention package, which simply uses a global_attention_variable to keep track of the attention_module currently in use.
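
Roughly, the pattern looks something like this (just a minimal sketch - `register_attention`, `set_attention_backend`, and `unified_attention` are illustrative names, not the actual code in this commit):

```python
from typing import Callable, Dict, Optional

import torch

# Registry of available attention implementations, keyed by name.
_ATTENTION_REGISTRY: Dict[str, Callable] = {}
# The attention_module currently in use (the "global attention variable").
_GLOBAL_ATTENTION_FN: Optional[Callable] = None


def register_attention(name: str):
    """Decorator that adds an implementation to the registry."""
    def wrap(fn: Callable) -> Callable:
        _ATTENTION_REGISTRY[name] = fn
        return fn
    return wrap


def set_attention_backend(name: str) -> None:
    """Point the global attention variable at a registered backend."""
    global _GLOBAL_ATTENTION_FN
    _GLOBAL_ATTENTION_FN = _ATTENTION_REGISTRY[name]


@register_attention("sdpa")
def sdpa_attention(q, k, v, causal: bool = True):
    # PyTorch's built-in scaled dot product attention as one example backend.
    return torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=causal)


def unified_attention(q, k, v, causal: bool = True):
    """Single call site used by every model; dispatches to the active backend."""
    return _GLOBAL_ATTENTION_FN(q, k, v, causal=causal)


# Pick a backend once at load time; every model call then goes through unified_attention.
set_attention_backend("sdpa")
```

Each model would then only ever call `unified_attention`, and swapping backends becomes a single call at load time.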

Just wanted to run it by you and see if this is a good enough basis to proceed with implementing the other attention_modules behind a similar interface.

Thanks.

@danielhanchen (Contributor) commented

Oh this is exactly what I was looking for! A unified attention calling mechanism would work wonders!

So in terms of perf, SDPA might technically be the fastest (confusingly enough) on new GPUs, I think because it calls cuDNN directly [TODO benchmarks]
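
For reference, one way to benchmark that claim is to pin SDPA to the cuDNN backend; a rough sketch (`SDPBackend.CUDNN_ATTENTION` needs a fairly recent PyTorch, roughly 2.5+):

```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

# Shapes are (batch, heads, seq_len, head_dim); values are illustrative only.
q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)

# Restrict dispatch to the cuDNN fused kernel; SDPA otherwise picks a backend itself.
with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
```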

Xformers is next in terms of speed I think - interestingly, it also leverages FA2 / FA3 kernels and cuDNN.
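
The xFormers call for comparison looks roughly like this (a sketch - note it expects `[batch, seq, heads, head_dim]`, unlike SDPA's `[batch, heads, seq, head_dim]`):

```python
import torch
import xformers.ops as xops

q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.bfloat16)

# xFormers picks a kernel (FlashAttention / cuDNN / Triton) based on inputs and hardware.
out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())
```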

FA2 / FA3 is technically next. However, the good thing FA has is softcapping - although this is now not needed for anything other than Gemma 2 (Gemma 3 removed it). FA also has sequence packing, packed QKV (Xformers uses packed by default), etc.
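
For reference, the FA features mentioned above look roughly like this (a sketch; `softcap` needs flash-attn >= 2.6, and 50.0 is just the Gemma 2 attention softcap used as an example):

```python
import torch
from flash_attn import flash_attn_func, flash_attn_varlen_func

# Padded layout: (batch, seq_len, heads, head_dim); values are illustrative.
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.bfloat16)

# Logit softcapping (the Gemma 2 case): scores pass through softcap * tanh(x / softcap).
out = flash_attn_func(q, k, v, causal=True, softcap=50.0)

# Sequence packing: concatenate variable-length sequences along dim 0 and pass
# cumulative sequence lengths instead of padding.
total_tokens = 1024 + 512
qp = torch.randn(total_tokens, 8, 64, device="cuda", dtype=torch.bfloat16)
kp = torch.randn(total_tokens, 8, 64, device="cuda", dtype=torch.bfloat16)
vp = torch.randn(total_tokens, 8, 64, device="cuda", dtype=torch.bfloat16)
cu_seqlens = torch.tensor([0, 1024, total_tokens], device="cuda", dtype=torch.int32)

out_packed = flash_attn_varlen_func(
    qp, kp, vp,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=1024, max_seqlen_k=1024,
    causal=True,
)
```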

Flex Attention is also good - the issue is sadly that it's still slower (for now) than pure kernels. Flex was good for, say, jagged sequences / packing, but technically FA2 / FA3 and Xformers support that too.
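
For the jagged / packed case specifically, the Flex Attention version is roughly this (a sketch; needs a recent PyTorch, ~2.5+, and `torch.compile` for reasonable speed):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

S, H, D = 1024, 8, 64
q = torch.randn(1, H, S, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, H, S, D, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, H, S, D, device="cuda", dtype=torch.bfloat16)

# Two packed documents: tokens 0..511 belong to doc 0, 512..1023 to doc 1.
doc_id = torch.arange(S, device="cuda") // 512

def packed_causal_mask(b, h, q_idx, kv_idx):
    # Causal within a document, no attention across document boundaries.
    return (q_idx >= kv_idx) & (doc_id[q_idx] == doc_id[kv_idx])

# B=None / H=None broadcasts the mask over batch and heads.
block_mask = create_block_mask(packed_causal_mask, B=None, H=None, Q_LEN=S, KV_LEN=S)
out = torch.compile(flex_attention)(q, k, v, block_mask=block_mask)
```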

In general, a good first start!

@zyklotomic commented

Hope I'm not hijacking too much - I attempted to get Flex Attention to perform better in #1960, but to no avail, so it is a bit of a relief to hear that corroborated by @danielhanchen.

In case you do end up implementing a backend for Flex Attention, I wonder if it would be possible to integrate it with my attempt. I believe my attempt does have issues with being made part of a unified backend, namely that it doesn't support a dynamic batch size and number of heads. But that can be fixed on my end too.
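
If the constraint comes from compile-time specialization, one possible mitigation (an assumption on my part, not what #1960 actually does) is to build the block mask with `B=None, H=None` so it broadcasts over batch and heads, and to mark the batch dimension dynamic for `torch.compile`:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# dynamic=True plus mark_dynamic avoids re-specializing on every new batch size.
compiled_flex = torch.compile(flex_attention, dynamic=True)

def attention_with_dynamic_batch(q, k, v, block_mask):
    # Tell the compiler the batch dimension may vary between calls.
    torch._dynamo.mark_dynamic(q, 0)
    torch._dynamo.mark_dynamic(k, 0)
    torch._dynamo.mark_dynamic(v, 0)
    return compiled_flex(q, k, v, block_mask=block_mask)
```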

I found it a very clean interface too!
