Llama 4 Scout hybrid quant experiment #13040
Replies: 3 comments 2 replies
-
Take two on the Llama 4 Scout quant experiment. I increased the sophistication of the layer quant control, allowing both the quant and the output type to be specified per layer. In the first try I only specified the quant, but the output type was frozen at Q3_K_M. With the new approach I get true Q3_K_M and Q2_K_M on the layers.

Code:
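(The patch itself isn't included here. Purely as an illustration of the idea, and not the actual code, a per-layer type map could look like the sketch below; the hook point in llama-quant.cpp, the helper names, and the single-type-per-layer simplification are assumptions, since the real Q3_K_M/Q2_K_M mixes vary the type per tensor within a layer.)

```cpp
// Sketch only: per-layer quant control via an explicit layer -> type map,
// plus separate overrides for the embedding and output tensors.
// The hook point inside llama-quant.cpp's tensor-type selection is assumed.
#include <cstdio>
#include <cstring>
#include "ggml.h"

// Parse the block index out of a tensor name like "blk.17.ffn_down.weight";
// returns -1 for non-block tensors (token embedding, output, norms).
static int block_index(const char * name) {
    int il = -1;
    return sscanf(name, "blk.%d.", &il) == 1 ? il : -1;
}

// Example map for a Q3_24_Q2_24-style scheme: layers 0..23 -> Q3_K,
// layers 24..47 -> Q2_K, token embedding -> Q3_K, output tensor -> Q5_K.
static ggml_type pick_type(const char * name, ggml_type fallback) {
    if (strcmp(name, "token_embd.weight") == 0) return GGML_TYPE_Q3_K;
    if (strcmp(name, "output.weight")     == 0) return GGML_TYPE_Q5_K;
    const int il = block_index(name);
    if (il < 0)  return fallback;        // norms etc. keep the default type
    if (il < 24) return GGML_TYPE_Q3_K;  // first half of the stack
    return GGML_TYPE_Q2_K;               // second half of the stack
}
```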
Description of quant: Q3_24_Q2_24_K_M_3_5 = Q3_K_M for layers 0..23, Q2_K_M for layers 24..47, embedding Q3_K_M, output Q5_K_M.

Results:
Conclusion:
-
Take three on the Llama 4 Scout quant experiment. In this test I cover all quants from Q3_K_S to Q6_K:
Description:

Comparison:
This quant is optimized for my 3x4070 RPC setup so it just barely offloads to the 3 GPUs (I kept adding Q3_K_M layers until I railed the VRAM with NGL=32). It could be improved by packing more Q3_K_M layers into the second block; however, I find the performance extremely good.

EDIT: I tweaked the layer quants and got better results, and uploaded the final version, which I call Q4_K_H, to HF here: https://huggingface.co/steampunque/Llama-4-Scout-17B-16E-Instruct-GGUF
-
Read up on Unsloth's dynamic quants...
-
I spent some time working on quantizing Llama 4 Scout. This was my first experience trying to get a model to work at quants below 4 bits. Due to the 108B model size it was necessary to go to Q2 or Q3 for this model to get usable TG on my setup (I currently have 3 4070s on a local LAN).
SUBJECTIVE RESULTS:
Q2_K : unusable. The model intermittently generated strange looking artifacts/nonsense words in the text. I tried stock Q2_K and also Q2_K with Q5_K embeddings and output. Both generated the intermittent nonsense words.
IQ3_XS : unusable, resulted in super slow generation around 5t/s even with speculation. It did not show artifacts or nonsense words.
Q3_K_S : unusable. Chinese characters appear where numbers should be in answering numeric questions.
Q3_K_M : excellent. No artifacts/nonsense words; the model is very smart and knowledgeable on prompts. It feels subjectively like the model woke up from a haze when moving to this quant. I backed off the output layer to Q5_K from Q6_K to save a small amount of space. The disadvantage was that only 31 out of 49 layers could offload to my 3 4070s.
Hybrid quant experiment: I modified llama-quant.cpp to force odd layers to a specified quant so I could quant to Q3_K_M on even layers and Q2_K on odd layers as follows:
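(The actual diff isn't shown above; the following is only a rough sketch of the odd/even override idea, with the hook point in llama-quant.cpp's type selection and the names assumed rather than taken from the patch.)

```cpp
// Sketch only, not the original patch: drop odd-numbered transformer blocks
// to Q2_K while leaving even blocks at the default choice (e.g. the Q3_K_M mix).
// Assumes GGUF tensor names of the form "blk.<N>.<...>".
#include <cstdio>
#include "ggml.h"

static ggml_type force_odd_layers(const char * tensor_name, ggml_type default_type) {
    int il = -1;
    if (sscanf(tensor_name, "blk.%d.", &il) == 1 && (il % 2) == 1) {
        return GGML_TYPE_Q2_K; // odd layers forced down to Q2_K
    }
    return default_type;       // even layers and non-block tensors untouched
}
```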
I called this quant Q3_Q2_K_M_3_5. I used Q3_K embed and Q5_K output. It produced a 48G gguf vs 51G for Q3_K_M.
Q3_Q2_K_M_3_5 : very good. No artifacts or strange words noticed, still smart in answering all prompts, and it boosted prompt processing by +10 t/s and token gen by +1 t/s over Q3_K_M (3 more layers offloaded). Subjectively, on creative tasks I think it generated less rich output though (I asked it to write some scifi stories in the styles of PKD, Niven, and Van Vogt, and Q3_K_M seemed noticeably richer than Q3_Q2_K_M_3_5).
I am using Llama 3.2 1B as a speculator, with dynamic translation between the vocabs of Scout and Llama 3.2 1B, together with my custom speculation code (a conceptual sketch of the vocab translation follows the table).

Summary table:

| Quant | PN | PP | TG | DN | DA |
|---|---|---|---|---|---|
| Q3_K_M_3_5 | 426 | 44.26 | 8.74 | 468 | 270 |
| Q3_Q2_K_M_3_5 | 418 | 56.93 | 9.34 | 498 | 252 |

- PN = generated tokens
- PP = prompt processing speed (t/s)
- TG = token generation speed (t/s)
- DN = drafted tokens
- DA = accepted draft tokens
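(Purely as a conceptual illustration of the vocab-translation step, and not the author's custom speculation code: the draft model's proposed tokens are detokenized to text, and that text is re-tokenized with the target model's vocab before verification. The toy vocabs and helper names below are invented for the example.)

```cpp
// Toy illustration of cross-vocab draft translation for speculative decoding.
// Not real llama.cpp API calls; the vocabs here are tiny stand-ins.
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

using Vocab = std::unordered_map<int, std::string>;

// Detokenize with the draft model's vocab: concatenate the text pieces.
static std::string detokenize(const std::vector<int> & toks, const Vocab & vocab) {
    std::string out;
    for (int t : toks) out += vocab.at(t);
    return out;
}

// Greedy longest-match tokenizer over the target vocab (toy stand-in for the
// real target tokenizer).
static std::vector<int> tokenize(const std::string & text, const Vocab & vocab) {
    std::vector<int> out;
    size_t pos = 0;
    while (pos < text.size()) {
        int    best_id  = -1;
        size_t best_len = 0;
        for (const auto & [id, piece] : vocab) {
            if (piece.size() > best_len && text.compare(pos, piece.size(), piece) == 0) {
                best_id  = id;
                best_len = piece.size();
            }
        }
        if (best_id < 0) break; // unknown span: a real tokenizer has byte fallback
        out.push_back(best_id);
        pos += best_len;
    }
    return out;
}

int main() {
    // Toy vocabs: same text, different ids and different segmentation.
    const Vocab draft_vocab  = {{1, "hel"}, {2, "lo "}, {3, "world"}};
    const Vocab target_vocab = {{10, "hello"}, {11, " world"}, {12, "hel"}, {13, "lo "}};

    const std::vector<int> draft_tokens = {1, 2, 3};            // draft model proposal
    const std::string text              = detokenize(draft_tokens, draft_vocab);
    const std::vector<int> target_tokens = tokenize(text, target_vocab);

    // target_tokens are what the target model would be asked to verify.
    for (int t : target_tokens) std::cout << t << ' ';
    std::cout << '\n';
}
```

A real implementation also has to deal with byte fallback, partial merges at the block boundary, and special tokens, which this toy ignores.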
Conclusion: For my setup I think Q3_K_M_3_5 is the best. Token gen is not that much faster with the Q3_Q2_K_M hybrid quant, perplexity took a pretty big percentage hit, and creative writing is not as good with the hybrid, but it does not generate artifacts or Chinese characters in place of numbers, so it would still be usable.