
Add multi-latent attention, profiling instrumentation, other perf fixes #8


Merged: 16 commits merged into main on May 2, 2025

Conversation


@andrewkchan (Owner) commented on Apr 30, 2025

Adds a version of multi-latent attention and profiling instrumentation. Also adds a CLI option -L to "pin" (lock) model weights in memory, which is needed to fix an issue on my test machine (AWS r6a.12xlarge) where pages of the exported tensors kept getting evicted by the OS, causing severe performance degradation due to thrashing.
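For reference, "pinning" here means mlock(2) on the mapped weights. A minimal sketch of the mechanism, assuming the weights are mmap'd read-only from the export file (an illustration only, not necessarily how the engine does it):

```cpp
// Sketch: map a weights file and pin its pages so the OS cannot evict them.
// Assumes a plain read-only mmap of the export file; error handling and
// layout are illustrative, not this repo's actual code.
#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void* map_and_pin(const char* path, size_t* out_size) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return nullptr;
  struct stat st;
  if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
  void* data = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);  // the mapping remains valid after closing the fd
  if (data == MAP_FAILED) return nullptr;
  // mlock faults in and pins every page. It can fail if RLIMIT_MEMLOCK is
  // too low (needs ulimit -l headroom or CAP_IPC_LOCK), so treat failure
  // as a warning rather than a fatal error.
  if (mlock(data, (size_t)st.st_size) != 0) {
    perror("mlock");
    fprintf(stderr, "warning: weights not pinned; pages may be evicted\n");
  }
  *out_size = (size_t)st.st_size;
  return data;
}
```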

MLA requires the model to be re-exported with `python convert.py --mla ...`. The engine will automatically use MLA when running a model exported with this option.
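Presumably the engine keys this off metadata written by the exporter; a hypothetical sketch of the dispatch (the header field and type names below are made up for illustration, the real export format may differ):

```cpp
// Hypothetical: choose the attention implementation based on a flag the
// exporter writes into the model header. Names are invented for
// illustration; not this repo's actual code.
struct ModelHeader {
  bool mla;  // assumed flag set by `python convert.py --mla ...`
};

enum class AttentionKind { MHA, MLA };

AttentionKind pick_attention(const ModelHeader& header) {
  // Models exported with --mla carry the latent KV projection weights,
  // so the MLA path is usable; otherwise fall back to standard MHA.
  return header.mla ? AttentionKind::MLA : AttentionKind::MHA;
}
```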

Currently, MLA is slower than MHA on short-context generations (~2.6 tok/s vs. ~4 tok/s for 128-token generations with negligible prompt length, on DeepSeek-V3 quantized to Q2K). Model active bytes ignoring KV cache are slightly higher at 16.29 GB for MLA vs. 14.99 GB for MHA, so some regression is not unexpected, but one of this size is surprising and indicates that the effective bandwidth is lower.

Model active bytes including KV cache (at context size 4096) are much better: 16.58 GB for MLA vs. 39.55 GB for MHA. I haven't yet tested the token throughput difference.
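For what it's worth, those deltas match a back-of-the-envelope estimate, assuming an fp16 cache and DeepSeek-V3's published dims (61 layers, 128 heads, 192-dim key heads, kv_lora_rank 512, 64-dim decoupled rope keys), and assuming values are cached at the full 192-dim head size:

$$
\text{MLA: } 4096 \times 61 \times (512 + 64) \times 2\,\text{B} \approx 0.29\,\text{GB}
$$

$$
\text{MHA: } 4096 \times 61 \times 2 \times 128 \times 192 \times 2\,\text{B} \approx 24.56\,\text{GB}
$$

which lines up with the gaps above: 16.58 − 16.29 = 0.29 GB and 39.55 − 14.99 = 24.56 GB.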

@andrewkchan changed the title from "Add multi-latent attention and profiling instrumentation" to "Add multi-latent attention, profiling instrumentation, other perf fixes" on May 2, 2025
@andrewkchan merged commit 2c99d65 into main on May 2, 2025