Add multi-latent attention, profiling instrumentation, other perf fixes #8
Adds a version of multi-latent attention and profiling instrumentation. Also adds a CLI option `-L` to "pin" (lock) model weights in memory, which is needed to fix an issue on my test machine (AWS r6a.12xlarge) where pages of the exported tensors kept getting evicted by the OS, causing severe performance issues due to thrashing.
MLA requires the model to be re-exported with `python convert.py --mla ...`. The engine will automatically use MLA when running a model exported with this option.

Currently, MLA is slower than MHA on short-context generations (~2.6 tok/s vs ~4 tok/s for 128-token generations with negligible prompt length, on DeepSeek-V3 quantized to Q2K). Model active bytes, ignoring the KV cache, are slightly higher for MLA (16.29 GB vs 14.99 GB for MHA), so some regression is not unexpected, but one of this size is surprising and indicates the effective bandwidth is lower.
Model active bytes including the KV cache (at a context size of 4096) are much better: 16.58 GB for MLA vs 39.55 GB for MHA. I haven't yet tested the token throughput difference.
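For reference, a rough back-of-envelope reproduces those cache deltas (39.55 − 14.99 ≈ 24.6 GB for MHA, 16.58 − 16.29 ≈ 0.29 GB for MLA) under assumptions of mine rather than measurements from this PR: an fp16 cache, DeepSeek-V3's 61 layers and 128 heads, keys and values both stored 192-wide in the MHA path, and a 512-dim latent plus a 64-dim decoupled RoPE key per layer in the MLA path.

```python
# Per-context KV-cache footprint, assuming an fp16 cache (2 bytes/value).
layers, heads, ctx, bytes_per_val = 61, 128, 4096, 2
head_dim_kv = 192                  # assumed: K and V both stored 192-wide in the MHA path
mla_latent, mla_rope = 512, 64     # compressed latent + decoupled RoPE key per layer

mha = 2 * heads * head_dim_kv * bytes_per_val * layers * ctx   # separate K and V per head
mla = (mla_latent + mla_rope) * bytes_per_val * layers * ctx   # one shared latent per layer

print(f"MHA cache: {mha / 1e9:.2f} GB, MLA cache: {mla / 1e9:.2f} GB")
# -> MHA cache: 24.56 GB, MLA cache: 0.29 GB
```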