Fused CPU attention kernels (~4x performance increase) #2973

EricLBuehler · 2025-05-29T01:50:42Z

This PR introduces fused CPU attention kernels for optimized CPU inference. Fusing the computation removes the need to materialize the attention matrices, which dramatically improves throughput.

On an M3 Max with Llama 3.2 3b at 4-bit quantization, I am measuring a 4x increase in decode T/s. This is faster than llama.cpp, even with llama.cpp CPU FlashAttention enabled.

These kernels are loosely based on the work in FlashAttention and CPU implementations in vLLM and ggml, but have been modified for higher performance.

Algorithm

`run_flash_attn_cpu`

Choose execution path
- Decode path: if the query length S_q == 1, invoke a specialized “single-Q” routine
- Batched path: otherwise, invoke the general batched attention routine
Compute attention
- Parallel setup
  - Uses a custom Rayon thread-pool (FLASH_ATTN_POOL) with macOS QoS hints
  - Installs the pool via FLASH_ATTN_POOL.install(...) to isolate flash-attention tasks
- Work distribution
  - Batched: flattens the output into chunks of size D (`dv` in the snippet) and calls
```
out.par_chunks_mut(dv)
   .with_min_len(64)
   .enumerate()
   .for_each(...)
```
    to assign each (batch, head, query_pos) row to a Rayon worker
  - Decode: further splits the KV axis into cache-friendly tiles, then does
```
(0..kv_tiles)
  .into_par_iter()
  .map(...)     // per-tile map
  .reduce(...)  // numerically-stable softmax reduce
```
    achieving nested parallelism for long KV sequences
- Per-row computation
  1. Gather the query vector
  2. Loop over all key/value positions:
    - Apply mask and positional bias
    - Compute dot-product between query and key
    - Update an online softmax (log-sum-exp) in a streaming fashion
    - Weight and accumulate the value vectors
  3. Normalize the accumulated value sum by the softmax denominator
Assemble result
- Collect all per-row outputs into a flat buffer
- Reshape into the final tensor of shape (B, S_q, H, D)
- Return the result on the CPU device
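
The per-row computation and the decode-path tile reduce described above can be sketched in plain Rust. This is a minimal illustration and not the code from this PR: the names `RowState`, `attend_tile`, and `attend_tiled` are hypothetical, the tiles are processed sequentially where the PR uses Rayon, and the mask and positional bias are assumed to be already added to the score before it is pushed.

```rust
/// Running softmax state for one query row (hypothetical helper type):
/// the largest score seen so far, the softmax denominator relative to that
/// maximum, and the weighted sum of value vectors.
#[derive(Clone, Debug)]
struct RowState {
    max: f32,
    denom: f32,
    acc: Vec<f32>,
}

impl RowState {
    fn new(dv: usize) -> Self {
        RowState { max: f32::NEG_INFINITY, denom: 0.0, acc: vec![0.0; dv] }
    }

    /// Fold in one key/value position. `score` is assumed to be already
    /// scaled, with any mask or positional bias added.
    fn push(&mut self, score: f32, value: &[f32]) {
        if score == f32::NEG_INFINITY {
            return; // a fully masked position contributes nothing
        }
        let new_max = self.max.max(score);
        // Rescale everything accumulated under the old maximum.
        let correction = (self.max - new_max).exp();
        let weight = (score - new_max).exp();
        self.denom = self.denom * correction + weight;
        for (a, v) in self.acc.iter_mut().zip(value) {
            *a = *a * correction + weight * v;
        }
        self.max = new_max;
    }

    /// Combine two partial states computed over disjoint KV tiles.
    fn merge(mut self, other: RowState) -> RowState {
        if other.max == f32::NEG_INFINITY {
            return self;
        }
        if self.max == f32::NEG_INFINITY {
            return other;
        }
        let new_max = self.max.max(other.max);
        let ca = (self.max - new_max).exp();
        let cb = (other.max - new_max).exp();
        self.denom = self.denom * ca + other.denom * cb;
        for (a, b) in self.acc.iter_mut().zip(&other.acc) {
            *a = *a * ca + *b * cb;
        }
        self.max = new_max;
        self
    }

    /// Normalize the accumulated value sum by the softmax denominator.
    /// A row with every position masked yields zeros in this sketch.
    fn finish(self) -> Vec<f32> {
        let inv = if self.denom > 0.0 { 1.0 / self.denom } else { 0.0 };
        self.acc.iter().map(|a| a * inv).collect()
    }
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Stream one query over the key/value positions in `start..end`.
fn attend_tile(
    q: &[f32],
    keys: &[Vec<f32>],
    values: &[Vec<f32>],
    scale: f32,
    start: usize,
    end: usize,
) -> RowState {
    let mut state = RowState::new(values[0].len());
    for i in start..end {
        state.push(dot(q, &keys[i]) * scale, &values[i]);
    }
    state
}

/// Decode-style path: map over KV tiles, then reduce the partial states.
/// Sequential here; `tile` must be non-zero.
fn attend_tiled(
    q: &[f32],
    keys: &[Vec<f32>],
    values: &[Vec<f32>],
    scale: f32,
    tile: usize,
) -> Vec<f32> {
    let n = keys.len();
    (0..n)
        .step_by(tile)
        .map(|s| attend_tile(q, keys, values, scale, s, (s + tile).min(n)))
        .fold(RowState::new(values[0].len()), RowState::merge)
        .finish()
}
```

Because `merge` is associative up to floating-point rounding, the same reduce can be run over tiles in any grouping, which is what makes a parallel map/reduce over the KV axis possible. Only one running vector of size D per row is kept, so the full row of attention scores is never stored.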

EricLBuehler · 2025-05-29T11:40:45Z

@LaurentMazare could you please review this PR?

AlpineVibrations · 2025-05-30T17:13:06Z

Should this help with quantized Qwen 3 on a Mac running on CPU?
Does it help on a Mac M1?
Thanks

EricLBuehler · 2025-05-30T17:25:19Z

This PR doesn't integrate it into any models yet, but that would be relatively easy.

Once that is done, yes. I saw a ~4x T/s increase for CPU inference.

AlpineVibrations · 2025-05-30T17:26:17Z

OK, I see. Is there a way we can test it right now, or a sample of how you integrated it?
Thanks

EricLBuehler · 2025-05-30T17:30:44Z

I didn't include it in this PR for ease of review, but you would replace the attention block of a model with a call to this function.

For Qwen 3:

candle/candle-transformers/src/models/qwen3.rs

Lines 222 to 228 in cd7b877

```
let scale = 1.0 / (self.head_dim as f64).sqrt();
let mut scores = (q.matmul(&k.transpose(2, 3)?)? * scale)?;
if let Some(m) = attn_mask {
    scores = scores.broadcast_add(m)?;
}
let probs = candle_nn::ops::softmax_last_dim(&scores)?;
let ctx = probs.matmul(&v)?; // (B, H, L, D)
```

Note that Qwen 3's q/k/v shapes are (b, h, seq_len, d), but this kernel requires (b, seq_len, h, d). Therefore you need to transpose q/k/v with `.transpose(1, 2)`.
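
To make the layout change concrete, here is a small sketch in plain Rust of what moving from (b, h, seq_len, d) to (b, seq_len, h, d) means for a contiguous buffer. The helper `bhsd_to_bshd` is hypothetical and is not part of candle or this PR. In candle the same reordering is expressed with `.transpose(1, 2)`, and whether the kernel additionally needs the result to be contiguous is not stated in this thread.

```rust
/// Copy a contiguous (b, h, s, d) buffer into (b, s, h, d) order.
/// Hypothetical helper for illustration only.
fn bhsd_to_bshd(src: &[f32], b: usize, h: usize, s: usize, d: usize) -> Vec<f32> {
    assert_eq!(src.len(), b * h * s * d);
    let mut dst = vec![0.0f32; src.len()];
    for bi in 0..b {
        for hi in 0..h {
            for si in 0..s {
                // Offset of the d-length vector in each layout.
                let from = ((bi * h + hi) * s + si) * d;
                let to = ((bi * s + si) * h + hi) * d;
                dst[to..to + d].copy_from_slice(&src[from..from + d]);
            }
        }
    }
    dst
}
```

In the target layout, all heads for one sequence position sit next to each other, so the (batch, head, query_pos) row that a worker handles is a single length-d slice at a computable offset.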

For a real-world use case, I would look at the mistral.rs attention backend and dispatch code, and at how it's used in a model (like Qwen 3).

AlpineVibrations · 2025-06-13T22:20:06Z

How's this looking? Is it done? It would be great to merge it if it is. If it's not done, what is still needed?

EricLBuehler added 3 commits May 28, 2025 21:29

- Add cpu flash attention (c7506cb)
- Add test (7f592ab)
- Format (2abb1fd)
- Fix docs shape (2174a67)
