Add multi-latent attention, profiling instrumentation, other perf fixes #8
Adds a version of multi-latent attention and profiling instrumentation. Also adds a CLI option `-L` to "pin" (lock) model weights in memory, which is needed to fix an issue on my test machine (AWS r6a.12xlarge) where pages of the exported tensors kept getting evicted by the OS, causing severe performance issues due to thrashing.
MLA requires the model to be re-exported with `python convert.py --mla ...`. The engine will automatically use MLA when running a model exported with this option.

Currently, MLA is slower than MHA on short-context generations (~2.6 tok/s vs ~4 tok/s for 128-token generations with negligible prompt length, on DeepSeek-V3 quantized to Q2K). Model active bytes, ignoring the KV cache, are slightly higher for MLA (16.29 GB vs 14.99 GB for MHA), so some regression is not unexpected, but one of this size is surprising and indicates the effective bandwidth is lower.
Model active bytes including the KV cache (at a context size of 4096) are much better: 16.58 GB for MLA vs 39.55 GB for MHA. I haven't yet tested the token throughput difference.
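For reference, a rough back-of-envelope reproduces those cache deltas (39.55 − 14.99 ≈ 24.6 GB for MHA, 16.58 − 16.29 ≈ 0.29 GB for MLA) under assumptions of mine rather than measurements from this PR: an fp16 cache, DeepSeek-V3's 61 layers and 128 heads, keys and values both stored 192-wide in the MHA path, and a 512-dim latent plus a 64-dim decoupled RoPE key per layer in the MLA path.

```python
# Per-context KV-cache footprint, assuming an fp16 cache (2 bytes/value).
layers, heads, ctx, bytes_per_val = 61, 128, 4096, 2
head_dim_kv = 192                  # assumed: K and V both stored 192-wide in the MHA path
mla_latent, mla_rope = 512, 64     # compressed latent + decoupled RoPE key per layer

mha = 2 * heads * head_dim_kv * bytes_per_val * layers * ctx   # separate K and V per head
mla = (mla_latent + mla_rope) * bytes_per_val * layers * ctx   # one shared latent per layer

print(f"MHA cache: {mha / 1e9:.2f} GB, MLA cache: {mla / 1e9:.2f} GB")
# -> MHA cache: 24.56 GB, MLA cache: 0.29 GB
```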