llama-bench : Add --override-tensors arg #12922
base: master
Conversation
Sketchy performance comparison on my laptop to show why this is useful. My hardware is an ASUS TUF A14 gaming laptop, so a Ryzen 9 AI HX 370 with 7500MHz LPDDR5 and an RTX 4060 Mobile. I ran it in the ASUS-standard "Turbo" mode for these tests.

First, a CPU-only test on my hardware (used 0.3 GB of VRAM during prompt processing):
.\build\bin\Release\llama-bench.exe -m ..\models\OLMoE-1B-7B-0924-Instruct-Q8_0.gguf -t 8 -ngl 0 -p 4096 -n 4096
Next, running with a few layers offloaded to the GPU (-ngl 4):
.\build\bin\Release\llama-bench.exe -m ..\models\OLMoE-1B-7B-0924-Instruct-Q8_0.gguf -t 8 -ngl 4 -p 4096 -n 4096
Next, enabling the --override-tensors flag, offloading everything except the expert FFNs:
.\build\bin\Release\llama-bench.exe -m ..\models\OLMoE-1B-7B-0924-Instruct-Q8_0.gguf -t 8 -ngl 99 -ot "\d+\.ffn_.*exp.=CPU" -p 4096 -n 4096
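For anyone new to the flag: as I understand it, -ot / --override-tensors takes comma-separated <tensor name regex>=<buffer type> pairs, so the pattern above pins every layer's conditional-expert FFN weights to the CPU buffer while -ngl 99 leaves attention and everything else on the GPU. A generic sketch of the syntax (model path is a placeholder):
REM Sketch only: each comma-separated entry is <tensor name regex>=<buffer type>.
REM Here the expert FFN weights go to the CPU buffer; the rest follow -ngl.
.\build\bin\Release\llama-bench.exe -m ..\models\<your-moe-model>.gguf ^
    -ngl 99 -ot "\d+\.ffn_.*exp.=CPU"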
Effects are significantly more pronounced in larger MoE models, especially with more experts and some experts that are re-used for every pass (e.g. Llama 4 Scout and Maverick, although those models are beyond my devices' capabilities). I tried to demonstrate with Deepseek-V2-Lite, but ran into CUDA errors if I tried to apply flash attention, cache quantization, or override-tensors. I don't have the experience with llama.cpp's codebase to track those down, but another Beaver has suggested it may be related to #12798.
PR #12891 has resolved my issue running flash attention and override-tensors with Deepseek-V2-Lite. Some performance numbers for that, same hardware as my last set:

CPU Only (Used 0.8GB of VRAM during prompt processing):
.\build\bin\Release\llama-bench.exe -m ..\models\DeepSeek-Coder-V2-Lite-Base-Q6_K_L.gguf ^
-p 4096 -n 4096 -t 8 ^
-fa 1 -ctk q8_0 -ctv q8_0 -ngl 0
Completely Filled GPU (Used 8.0GB of VRAM during prompt processing):
.\build\bin\Release\llama-bench.exe -m ..\models\DeepSeek-Coder-V2-Lite-Base-Q6_K_L.gguf ^
-p 4096 -n 4096 -t 8 ^
-fa 1 -ctk q8_0 -ctv q8_0 -ngl 14
Comparable VRAM GPU (Used 2.8GB of VRAM during prompt processing):
.\build\bin\Release\llama-bench.exe -m ..\models\DeepSeek-Coder-V2-Lite-Base-Q6_K_L.gguf ^
-p 4096 -n 4096 -t 8 ^
-fa 1 -ctk q8_0 -ctv q8_0 -ngl 4
Override-Tensors Run (Used 1.8GB of VRAM during prompt processing):
.\build\bin\Release\llama-bench.exe -m ..\models\DeepSeek-Coder-V2-Lite-Base-Q6_K_L.gguf ^
-p 4096 -n 4096 -t 8 ^
-fa 1 -ctk q8_0 -ctv q8_0 -ngl 99 -ot "\d+\.ffn_.*exp.=CPU"
Tuned Override-Tensors (Used 6.3GB of VRAM during prompt processing): for this run, I'm leaving 6 of the 26 layers' conditional experts on the GPU as well as all the shared experts.
.\build\bin\Release\llama-bench.exe -m ..\models\DeepSeek-Coder-V2-Lite-Base-Q6_K_L.gguf ^
-p 4096 -n 4096 -t 8 ^
-fa 1 -ctk q8_0 -ctv q8_0 -ngl 99 -ot "[12]\d\.ffn_.*exps.=CPU"
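To be clear about the tuning knob: only the leading part of the regex changes, widening or narrowing the range of layer numbers whose experts get pushed to the CPU. For instance, something like the following (an illustrative split I did not benchmark) would keep the first twenty layers' experts on the GPU and override only layers 20 and up:
REM Illustration only, not a measured configuration.
.\build\bin\Release\llama-bench.exe -m ..\models\DeepSeek-Coder-V2-Lite-Base-Q6_K_L.gguf ^
    -p 4096 -n 4096 -t 8 ^
    -fa 1 -ctk q8_0 -ctv q8_0 -ngl 99 -ot "2\d\.ffn_.*exps.=CPU"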
Turns out my GPU was far more underpowered than I expected, but y'all can see the point of being able to benchmark this kind of thing.
Ran another set of experiments on another device (RTX 3070 and an AMD Ryzen 7 5800X 8-Core with two sticks of 2133MHz DDR4).

CPU Only (Used 836MB of VRAM during prompt processing):
./build/bin/llama-bench -m ../models/DeepSeek-Coder-V2-Lite-Base-Q6_K_L.gguf \
-p 4096 -n 4096 -t 4 \
-fa 1 -ctk q8_0 -ctv q8_0 -ngl 0
Full GPU (Used 7626MB of VRAM during prompt processing):
./build/bin/llama-bench -m ../models/DeepSeek-Coder-V2-Lite-Base-Q6_K_L.gguf \
-p 4096 -n 4096 -t 4 \
-fa 1 -ctk q8_0 -ctv q8_0 -ngl 13
Comparable VRAM GPU (Used 2930MB of VRAM during prompt processing):
./build/bin/llama-bench -m ../models/DeepSeek-Coder-V2-Lite-Base-Q6_K_L.gguf \
-p 4096 -n 4096 -t 4 \
-fa 1 -ctk q8_0 -ctv q8_0 -ngl 4
Override-Tensors, Full CPU Experts (except shared) (Used 2276MB of VRAM during prompt processing):
./build/bin/llama-bench -m ../models/DeepSeek-Coder-V2-Lite-Base-Q6_K_L.gguf \
-p 4096 -n 4096 -t 4 \
-fa 1 -ctk q8_0 -ctv q8_0 -ngl 99 -ot "\d+.ffn_.*exps.=CPU"
Override-Tensors, Tuned (Used 7034MB of VRAM during prompt processing):
./build/bin/llama-bench -m ../models/DeepSeek-Coder-V2-Lite-Base-Q6_K_L.gguf \
-p 4096 -n 4096 -t 4 \
-fa 1 -ctk q8_0 -ctv q8_0 -ngl 99 -ot "[2.]\d.ffn_.*exps.=CPU"
Now, as this processor has neither AVX-512 nor particularly high-bandwidth memory, we see the GPU eking out a performance boost and override-tensors helping significantly.
You can also use this to offload the entire KV cache to GPU while keeping the model on CPU.
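A minimal sketch of that kind of run (assumed pattern and placeholder model path, not one of the benchmarks above): offload all layers so the KV cache is allocated in VRAM, then use a catch-all regex to override every model weight back to the CPU buffer.
# Sketch only: all layers offloaded (-ngl 99) so the KV cache sits on the GPU,
# while every weight tensor is overridden back to the CPU buffer.
./build/bin/llama-bench -m ../models/<your-model>.gguf \
    -p 4096 -n 4096 \
    -ngl 99 -ot ".*=CPU"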
A small group over at BeaverAI have been making extensive use of the --override-tensors (-ot) flag for running massive MoE models faster by keeping attention on the GPU and offloading the expert FFNs to the CPU. Informal experimentation in llama-server or llama-cli doesn't compare to the proper llama-bench, though, so this PR adds the --override-tensors arg (and the -ot short form) to llama-bench.

I noticed the // FIXME about leaking memory in args.cpp when copying the --override-tensors argument parsing, and chose to stamp null terminators into the argv, rather than accept the memory leak, as llama-bench calls parse_cmd_params only once. Let me know if you'd like that swapped out for the memory-leaking version from the common arg parser, as it's only a handful of user-entered bytes leaked.

Also planning to do some documentation of --override-tensors a little later on, as it's proving very useful and we'd love to spread the word.