
Initial changes: Refactor Attention #2156


Open · wants to merge 1 commit into main

Conversation


@Itssshikhar commented Mar 22, 2025

Hey @danielhanchen!! I wanted to take a stab at refactoring attention from the puzzles themselves.

Currently, every model uses its own implementation of attention and calls it directly. I took some reference from vLLM's unified attention package, which simply uses a global_attention_variable to keep track of the attention_module currently in use.
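
Roughly, the pattern looks something like this (just a minimal sketch - `register_attention`, `set_attention_backend`, and `unified_attention` are illustrative names, not the actual code in this commit):

```python
from typing import Callable, Dict, Optional

import torch

# Registry of available attention implementations, keyed by name.
_ATTENTION_REGISTRY: Dict[str, Callable] = {}
# The attention_module currently in use (the "global attention variable").
_GLOBAL_ATTENTION_FN: Optional[Callable] = None


def register_attention(name: str):
    """Decorator that adds an implementation to the registry."""
    def wrap(fn: Callable) -> Callable:
        _ATTENTION_REGISTRY[name] = fn
        return fn
    return wrap


def set_attention_backend(name: str) -> None:
    """Point the global attention variable at a registered backend."""
    global _GLOBAL_ATTENTION_FN
    _GLOBAL_ATTENTION_FN = _ATTENTION_REGISTRY[name]


@register_attention("sdpa")
def sdpa_attention(q, k, v, causal: bool = True):
    # PyTorch's built-in scaled dot product attention as one example backend.
    return torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=causal)


def unified_attention(q, k, v, causal: bool = True):
    """Single call site used by every model; dispatches to the active backend."""
    return _GLOBAL_ATTENTION_FN(q, k, v, causal=causal)


# Pick a backend once at load time; every model call then goes through unified_attention.
set_attention_backend("sdpa")
```

Each model would then only ever call `unified_attention`, and swapping backends becomes a single call at load time.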

Just wanted to run it by you and see if this is a good enough basis to proceed with implementing the other attention_modules behind a similar interface.

Thanks.

@danielhanchen (Contributor) commented

Oh this is exactly what I was looking for! A unified attention calling mechanism would work wonders!

So in terms of perf, SDPA might technically be the fastest (confusingly enough) on new GPUs, I think because it calls cuDNN directly [TODO benchmarks]
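
For reference, one way to benchmark that claim is to pin SDPA to the cuDNN backend; a rough sketch (`SDPBackend.CUDNN_ATTENTION` needs a fairly recent PyTorch, roughly 2.5+):

```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

# Shapes are (batch, heads, seq_len, head_dim); values are illustrative only.
q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)

# Restrict dispatch to the cuDNN fused kernel; SDPA otherwise picks a backend itself.
with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
```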

Xformers is next in terms of speed I think - interestingly, it also leverages FA2 / FA3 kernels and cuDNN.
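
The xFormers call for comparison looks roughly like this (a sketch - note it expects `[batch, seq, heads, head_dim]`, unlike SDPA's `[batch, heads, seq, head_dim]`):

```python
import torch
import xformers.ops as xops

q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.bfloat16)

# xFormers picks a kernel (FlashAttention / cuDNN / Triton) based on inputs and hardware.
out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())
```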

FA2 / FA3 is technically next. However, the good thing FA has is softcapping - although this is now not needed for anything other than Gemma 2 (Gemma 3 removed it). FA also has sequence packing, packed QKV (Xformers uses packed by default), etc.
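
For reference, the FA features mentioned above look roughly like this (a sketch; `softcap` needs flash-attn >= 2.6, and 50.0 is just the Gemma 2 attention softcap used as an example):

```python
import torch
from flash_attn import flash_attn_func, flash_attn_varlen_func

# Padded layout: (batch, seq_len, heads, head_dim); values are illustrative.
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.bfloat16)

# Logit softcapping (the Gemma 2 case): scores pass through softcap * tanh(x / softcap).
out = flash_attn_func(q, k, v, causal=True, softcap=50.0)

# Sequence packing: concatenate variable-length sequences along dim 0 and pass
# cumulative sequence lengths instead of padding.
total_tokens = 1024 + 512
qp = torch.randn(total_tokens, 8, 64, device="cuda", dtype=torch.bfloat16)
kp = torch.randn(total_tokens, 8, 64, device="cuda", dtype=torch.bfloat16)
vp = torch.randn(total_tokens, 8, 64, device="cuda", dtype=torch.bfloat16)
cu_seqlens = torch.tensor([0, 1024, total_tokens], device="cuda", dtype=torch.int32)

out_packed = flash_attn_varlen_func(
    qp, kp, vp,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=1024, max_seqlen_k=1024,
    causal=True,
)
```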

Flex Attention is also good - the issue is sadly that it's still slower (for now) than pure kernels. Flex was good for, say, jagged sequences / packing, but technically FA2 / FA3 and Xformers support that too.
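
For the jagged / packed case specifically, the Flex Attention version is roughly this (a sketch; needs a recent PyTorch, ~2.5+, and `torch.compile` for reasonable speed):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

S, H, D = 1024, 8, 64
q = torch.randn(1, H, S, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, H, S, D, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, H, S, D, device="cuda", dtype=torch.bfloat16)

# Two packed documents: tokens 0..511 belong to doc 0, 512..1023 to doc 1.
doc_id = torch.arange(S, device="cuda") // 512

def packed_causal_mask(b, h, q_idx, kv_idx):
    # Causal within a document, no attention across document boundaries.
    return (q_idx >= kv_idx) & (doc_id[q_idx] == doc_id[kv_idx])

# B=None / H=None broadcasts the mask over batch and heads.
block_mask = create_block_mask(packed_causal_mask, B=None, H=None, Q_LEN=S, KV_LEN=S)
out = torch.compile(flex_attention)(q, k, v, block_mask=block_mask)
```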

In general, a good first start!

@zyklotomic commented

Hope I'm not hijacking too much - I attempted to get Flex Attention to perform better in #1960, but to no avail, so it is a bit of a relief to hear that corroborated by @danielhanchen.

In case you do end up implementing a backend for Flex Attention, I wonder if it would be possible to integrate it with my attempt. I believe my attempt does have issues with being made part of a unified backend, namely that it doesn't support a dynamic batch size and number of heads. But that can be fixed on my end too.
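
If the constraint comes from compile-time specialization, one possible mitigation (an assumption on my part, not what #1960 actually does) is to build the block mask with `B=None, H=None` so it broadcasts over batch and heads, and to mark the batch dimension dynamic for `torch.compile`:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# dynamic=True plus mark_dynamic avoids re-specializing on every new batch size.
compiled_flex = torch.compile(flex_attention, dynamic=True)

def attention_with_dynamic_batch(q, k, v, block_mask):
    # Tell the compiler the batch dimension may vary between calls.
    torch._dynamo.mark_dynamic(q, 0)
    torch._dynamo.mark_dynamic(k, 0)
    torch._dynamo.mark_dynamic(v, 0)
    return compiled_flex(q, k, v, block_mask=block_mask)
```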

I found it a very clean interface too!
