imatrix: add option to display importance score statistics for a given imatrix file #12718
base: master
Conversation
Nice idea, seems like something we discussed last time? @bartowski1182 Btw is it possible to show importance scores from an existing imatrix file @EAddario ? |
Thank you @ngxson. Yes, it will process any imatrix file produced by llama-imatrix, but it is restricted to a single file (does not deal with multiple --in-file) |
Isn't this just related to the hidden state norms getting larger as you move through the different layers? If so, then it won't really account for the accumulation of errors caused by an early layer on the final output? |
Not sure if I'm understanding the comment correctly @jukofyork, but the logic I'm using to identify the most influential tensors/layers is to simply average the importance scores (IS) for each, add those averages together, and then compute their individual contributions to the total. The logic llama-imatrix uses to calculate the IS is to square the value of the corresponding activation during inference, keep a running total of how many times that particular value has been updated, and then save the average when inference has finished. This only applies to 2d or larger tensors, so it will ignore norms (1d), but since errors influence which weights get updated (and how frequently), the IS does account for errors, albeit indirectly. Make sense? |
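For illustration only, a rough sketch of the accumulation logic described above (not the actual llama-imatrix implementation; the names and structures are made up):

```cpp
// Rough sketch (illustrative only): per tensor, keep a running sum of the
// squared activations and a count of updates, then average at save time.
#include <cstddef>
#include <vector>

struct TensorStats {
    std::vector<float> sum_sq; // running sum of squared activations per value
    std::vector<int>   counts; // how many times each value has been updated
};

// called for each row of activations feeding a 2d (or larger) tensor
void accumulate(TensorStats & s, const float * x, size_t n) {
    if (s.sum_sq.empty()) {
        s.sum_sq.assign(n, 0.0f);
        s.counts.assign(n, 0);
    }
    for (size_t j = 0; j < n; ++j) {
        s.sum_sq[j] += x[j] * x[j];
        s.counts[j] += 1;
    }
}

// at save time: the stored importance score is the mean squared activation
std::vector<float> importance_scores(const TensorStats & s) {
    std::vector<float> out(s.sum_sq.size());
    for (size_t j = 0; j < out.size(); ++j) {
        out[j] = s.counts[j] > 0 ? s.sum_sq[j] / s.counts[j] : 0.0f;
    }
    return out;
}
```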
I think the mean squared activations (which would be their variance assuming a mean of 0) cannot really be compared across tensors without some kind of normalization, because the values of the model weights can also affect the relative importance of the activations.

The goal here is to find which layers need more precision, right? I'm not sure if the mean squared activations really are what you're looking for. There might be other measures like skewness and kurtosis which may be useful. But I'm not sure if taking only the activations into account is the right way to get the insights you seek.

What I'd like to try eventually would be to use a simultaneous quantization algorithm to try multiple bit-widths at once in a reasonable amount of time, so that the errors can be compared per tensor to help with the choice of quantization type. This would be possible for

I still think it can be useful to have some way to visualize what is in an imatrix file.

In the paper you link (https://arxiv.org/pdf/2406.17415), the closest thing to what you propose would be the LIM (layer input modification) score, which is calculated as follows (in Section 3.1), where $I_i$ and $O_i$ are the input and output hidden states of layer $i$:

$$\mathrm{LIM}(L_i) = -\frac{I_i \cdot O_i}{\lVert I_i \rVert \, \lVert O_i \rVert}$$
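For illustration, a minimal sketch of that LIM score, assuming the per-layer input and output hidden states are available as flat vectors (this helper is hypothetical, not code from the paper or from this PR):

```cpp
// Hypothetical helper: the LIM score as the negative cosine similarity
// between a layer's input and output hidden states. A larger score means the
// layer modifies its input more.
#include <cmath>
#include <cstddef>
#include <vector>

float lim_score(const std::vector<float> & in, const std::vector<float> & out) {
    double dot = 0.0, norm_in = 0.0, norm_out = 0.0;
    const size_t n = in.size() < out.size() ? in.size() : out.size();
    for (size_t i = 0; i < n; ++i) {
        dot      += (double) in[i] * out[i];
        norm_in  += (double) in[i] * in[i];
        norm_out += (double) out[i] * out[i];
    }
    if (norm_in == 0.0 || norm_out == 0.0) {
        return 0.0; // degenerate case: define the score as 0
    }
    return (float) (-dot / (std::sqrt(norm_in) * std::sqrt(norm_out)));
}
```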
|
Very clear now, thanks @compilade. You're correct, I'm using the mean squared activations, averaged per tensor/layer, to identify which tensors/layers produce large magnitude activations.

I had a quick look at your PRs. I definitely like the idea of storing imatrix data in GGUF format and can appreciate how it would improve the generation of these types of stats. #12557 is quite intriguing, but truth be told I haven't had a chance to really digest it fully (there's a lot going on!) but would love to see it merged, especially if it improves ternary quants |
Had a chance to think about this more thoroughly and now get the implications of @jukofyork and @compilade's comments. Agree my current approach is not really identifying influence but rather score "growth". Back to the drawing board 😆 |
I can help you with this, but it will need a fair bit of compute to calculate. I've not got time to explain fully but basically:
You will likely have to transform the loss measure somehow:
Assuming Finite-Differences is too costly to perform, then you can use a stochastic approximation (FDSA) or its extension SPSA to estimate the gradients using whatever compute you can muster up. |
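As a rough illustration of the SPSA idea mentioned above (a sketch assuming a scalar loss can be evaluated for a vector of per-layer perturbation parameters; nothing here comes from the PR):

```cpp
// Rough SPSA sketch (illustrative only): estimate the gradient of a scalar
// loss with respect to a vector of parameters (e.g. per-layer perturbation
// scales) using just two loss evaluations per iteration.
#include <functional>
#include <random>
#include <vector>

std::vector<double> spsa_gradient(
        const std::function<double(const std::vector<double> &)> & loss,
        const std::vector<double> & theta,
        double c,          // size of the simultaneous perturbation
        std::mt19937 & rng) {
    std::bernoulli_distribution coin(0.5);
    std::vector<double> delta(theta.size());
    std::vector<double> plus(theta), minus(theta);
    for (size_t i = 0; i < theta.size(); ++i) {
        delta[i]  = coin(rng) ? 1.0 : -1.0; // Rademacher ±1 perturbation
        plus[i]  += c * delta[i];
        minus[i] -= c * delta[i];
    }
    const double diff = loss(plus) - loss(minus);
    std::vector<double> grad(theta.size());
    for (size_t i = 0; i < theta.size(); ++i) {
        grad[i] = diff / (2.0 * c * delta[i]);
    }
    return grad; // in practice, average this over several random draws
}
```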
I've edited the post above quite a lot so it should hopefully make more sense (in case you're reading from the email notification). |
Thank you, now I know what I'm doing over the weekend 😁 On a serious note, much appreciated @jukofyork. Plenty of food for thought. I'll give it proper consideration |
No problem and just remember the most important thing to figure out is exactly what you are optimising first! There are actually a lot of compelling options for this; each with their own reasons for and against... All have different costs to compute too:
|
*WARNING*: This is mostly vibe code. Hope I'm not wasting y'alls time.

Compute Layer Importance Modification (LIM) Scores

The goal of this PR is to rank layers of a given tensor in order of sensitivity to quantization error. Given that it is now possible to use `llama-quantize --custom-q ...` regex, it may be possible to use these LIM Scores to decide which layers of a given tensor to quantize more or less in an attempt to preserve generation quality (e.g. low perplexity) while reducing memory footprint as compared to using the same quant size across all layers of a given tensor.

This experimental PR was motivated by this comment and PR: ggml-org/llama.cpp#12718

I may force-push this after more testing and experimenting to see if it is actually doing the right thing and if the output is actually useful to improve quantization quality e.g. PPL per GiB... This may just be a big mistake, lol.

This is built on the existing imatrix computation and assumes that the values of `x[j]` are the "activations" coming right in/out of the given tensor layer. I don't know GGML and generally work in python or vanilla c, not so much c++. So a lot of this was vibe coded running [ubergarm/DeepSeek-V3-0324-GGUF IQ4_K_R4 quant](https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF/tree/main/DeepSeek-V3-0324-IQ4_K_R4). So this is partially an experiment actually trying to use an LLM instead of just enjoying the meta of manual quantization min-maxing.

```
@misc{dumitru2024layerwisequantizationpragmaticeffective,
      title={Layer-Wise Quantization: A Pragmatic and Effective Method for Quantizing LLMs Beyond Integer Bit-Levels},
      author={Razvan-Gabriel Dumitru and Vikas Yadav and Rishabh Maheshwary and Paul-Ioan Clotan and Sathwik Tejaswi Madhusudhan and Mihai Surdeanu},
      year={2024},
      eprint={2406.17415},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.17415},
      code={https://github.com/RazvanDu/LayerwiseQuant/},
}
```
Following from @jukofyork and @compilade's remarks and suggestions, I've made some changes to my approach. To set the context and explain exactly what problem I'm trying to solve, I have two objectives in mind:
The direct implication of constraint "2" is no changes to the existing imatrix collection process.

As noted by @compilade, IS "...cannot really be compared across tensors without some kind of normalization, because the values of the model weights can also affect the relative importance of the activations..." However, IS are a direct measurement of how active a particular weight was during inference, based on a given input prompt (more on this later), and therefore can be used as an (arguably suboptimal) proxy for "influence", but instead of relying on the average, a better metric is to use the sum of IS per tensor/layer (the higher the number, the "busier" the tensor/layer and the more it contributes to upstream computations). Although there are better metrics (e.g. gradient of loss, covariance, LIM, etc.), those would require changes to the imatrix collection process, which is beyond the scope of what I'm trying to do, at least for now. Having said that, it's worth keeping an eye on the work @ubergarm is doing in WIP Compute per layer LIM Scores during imatrix.

Tests performed during quantization of DeepSeek-R1-Distill-Qwen-7B seem to confirm that Σ(Bias), which is what I'm calling the sum of IS per tensor, is a good influence indicator, as can be seen in the table below, where (↑) represents quantizing half of the most influential tensors (as per Σ(Bias)) at a higher bit level, and (↓) represents quantizing half of the least influential tensors at a higher bit level:
For reference, compared to the naive Q4_K_M model, the layer-wise quantized model is 10.7% smaller (4.68GB vs 4.18GB) with only a 0.35% penalty on μPPL:
Whilst I was considering @jukofyork's feedback, I came to think of how much the benefit of using an imatrix depends on the quality of the prompt used during its generation, and how difficult it is to determine how well a given prompt "exercises" all of the model's capabilities, so I added additional statistics to help in that regard. As things stand at the moment, the report includes the following per-tensor statistics:

Σ(Bias): the sum of all squared activations across the tensor (i.e. the Importance Scores) |
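For reference, a minimal sketch (illustrative only, not the PR implementation) of how Σ(Bias) and a few of the basic per-tensor statistics could be derived from a tensor's per-value importance scores:

```cpp
// Illustrative sketch: derive Σ(Bias) and some of the per-tensor statistics
// from a tensor's per-value importance scores.
#include <algorithm>
#include <vector>

struct TensorSummary {
    double sum_bias;   // Σ(Bias): sum of all importance scores in the tensor
    float  min_score;  // Min
    float  max_score;  // Max
    double mean;       // μ
    double pct_active; // % of values with a non-zero score
};

TensorSummary summarize(const std::vector<float> & scores) {
    TensorSummary s{0.0, scores.empty() ? 0.0f : scores[0], 0.0f, 0.0, 0.0};
    size_t active = 0;
    for (float v : scores) {
        s.sum_bias += v;
        s.min_score = std::min(s.min_score, v);
        s.max_score = std::max(s.max_score, v);
        if (v > 0.0f) ++active;
    }
    if (!scores.empty()) {
        s.mean       = s.sum_bias / scores.size();
        s.pct_active = 100.0 * active / scores.size();
    }
    return s;
}
```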
Thanks for the update and defining the statistics gleaned from an existing imatrix.dat file. I pulled your branch and gave it a try on LLaMA-2-13B to compare against the same model used in that paper, computing an imatrix and then showing the statistics.

Compute imatrix

$ git branch | grep '*'
* (HEAD detached at EAddario/imatrix)
$ git rev-parse --short HEAD
200d88c8
$ ./build/bin/llama-imatrix --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
version: 5136 (200d88c8)
built with cc (GCC) 14.2.1 20250128 for x86_64-pc-linux-gnu
$ ./build/bin/llama-imatrix \
--verbosity 1 \
-m /mnt/astrodata/llm/models/TheBloke/Llama-2-13B-chat-GGUF/llama-2-13b-chat.Q8_0.gguf \
-f wiki.test.raw \
-o imatrix-wiki-test-llama-2-13b-chat-Q8_0-gguf.dat \
--ctx-size 512 \
--threads 16
...
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q8_0: 282 tensors
...
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 397.256 ms
compute_imatrix: computing over 655 chunks with batch_size 512
compute_imatrix: 1.44 seconds per pass - ETA 15.73 minutes
[1]4.8087,[2]5.4272,[3]6.3040,[4]7.0129,[5]7.1984,[6]7.0947,[7]7.2490,[8]7.3314,[9]7.5682,
...
Final estimate: PPL = 6.5257 +/- 0.04210
save_imatrix: stored collected data after 655 chunks in imatrix-wiki-test-llama-2-13b-chat-Q8_0-gguf.dat
llama_perf_context_print: load time = 22623.39 ms
llama_perf_context_print: prompt eval time = 861807.99 ms / 335360 tokens ( 2.57 ms per token, 389.14 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 891205.70 ms / 335361 tokens

Show Statistics

$ ./build/bin/llama-imatrix \
--in-file imatrix-wiki-test-llama-2-13b-chat-Q8_0-gguf.dat \
--show-statistics
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
Computing statistics for imatrix-wiki-test-llama-2-13b-chat-Q8_0-gguf.dat (280 tensors)
Layer Tensor Σ(Bias) Min Max μ σ % Active N Entropy E (norm) ZD Score
==========================================================================================================================================================================
30 attn_q 1321.16 0.0000 22.1645 0.2580 0.5248 99.98% 5120 11.8988 96.57% 5.4688
30 attn_v 1321.16 0.0000 22.1645 0.2580 0.5248 99.98% 5120 11.8988 96.57% 5.4688
30 attn_k 1321.16 0.0000 22.1645 0.2580 0.5248 99.98% 5120 11.8988 96.57% 5.4688
39 ffn_down 1290.84 0.0042 29.1379 0.0934 0.4147 100.00% 13824 12.1372 88.24% 25.9693
32 attn_v 1285.53 0.0000 17.6335 0.2511 0.4668 99.98% 5120 11.9402 96.90% 5.4688
32 attn_k 1285.53 0.0000 17.6335 0.2511 0.4668 99.98% 5120 11.9402 96.90% 5.4688
32 attn_q 1285.53 0.0000 17.6335 0.2511 0.4668 99.98% 5120 11.9402 96.90% 5.4688
34 attn_q 1256.21 0.0000 14.0536 0.2454 0.4260 99.98% 5120 11.9679 97.13% 5.6641
34 attn_v 1256.21 0.0000 14.0536 0.2454 0.4260 99.98% 5120 11.9679 97.13% 5.6641
34 attn_k 1256.21 0.0000 14.0536 0.2454 0.4260 99.98% 5120 11.9679 97.13% 5.6641
29 attn_k 1204.44 0.0000 23.4754 0.2352 0.5280 99.98% 5120 11.8456 96.13% 5.4688
29 attn_v 1204.44 0.0000 23.4754 0.2352 0.5280 99.98% 5120 11.8456 96.13% 5.4688
29 attn_q 1204.44 0.0000 23.4754 0.2352 0.5280 99.98% 5120 11.8456 96.13% 5.4688
33 attn_q 1183.21 0.0000 14.3861 0.2311 0.3921 99.98% 5120 11.9785 97.21% 5.4688
33 attn_v 1183.21 0.0000 14.3861 0.2311 0.3921 99.98% 5120 11.9785 97.21% 5.4688
33 attn_k 1183.21 0.0000 14.3861 0.2311 0.3921 99.98% 5120 11.9785 97.21% 5.4688
31 attn_k 1182.86 0.0000 20.5292 0.2310 0.4778 99.98% 5120 11.8971 96.55% 5.4688
31 attn_v 1182.86 0.0000 20.5292 0.2310 0.4778 99.98% 5120 11.8971 96.55% 5.4688
31 attn_q 1182.86 0.0000 20.5292 0.2310 0.4778 99.98% 5120 11.8971 96.55% 5.4688
35 attn_k 1173.15 0.0000 12.3308 0.2291 0.3496 99.98% 5120 12.0212 97.56% 5.6641
35 attn_v 1173.15 0.0000 12.3308 0.2291 0.3496 99.98% 5120 12.0212 97.56% 5.6641
35 attn_q 1173.15 0.0000 12.3308 0.2291 0.3496 99.98% 5120 12.0212 97.56% 5.6641
28 attn_v 1161.62 0.0000 24.2086 0.2269 0.5975 99.98% 5120 11.7171 95.09% 5.6641
28 attn_q 1161.62 0.0000 24.2086 0.2269 0.5975 99.98% 5120 11.7171 95.09% 5.6641
28 attn_k 1161.62 0.0000 24.2086 0.2269 0.5975 99.98% 5120 11.7171 95.09% 5.6641
27 attn_q 1152.05 0.0000 21.7389 0.2250 0.5541 99.98% 5120 11.7706 95.53% 5.4688
27 attn_k 1152.05 0.0000 21.7389 0.2250 0.5541 99.98% 5120 11.7706 95.53% 5.4688
27 attn_v 1152.05 0.0000 21.7389 0.2250 0.5541 99.98% 5120 11.7706 95.53% 5.4688
36 attn_q 1125.94 0.0000 12.8438 0.2199 0.3751 99.98% 5120 11.9677 97.13% 5.8594
36 attn_k 1125.94 0.0000 12.8438 0.2199 0.3751 99.98% 5120 11.9677 97.13% 5.8594
36 attn_v 1125.94 0.0000 12.8438 0.2199 0.3751 99.98% 5120 11.9677 97.13% 5.8594
38 attn_k 1072.28 0.0151 12.4462 0.2094 0.3015 100.00% 5120 12.0386 97.70% 6.4453
38 attn_v 1072.28 0.0151 12.4462 0.2094 0.3015 100.00% 5120 12.0386 97.70% 6.4453
38 attn_q 1072.28 0.0151 12.4462 0.2094 0.3015 100.00% 5120 12.0386 97.70% 6.4453
37 attn_v 1071.17 0.0126 14.2128 0.2092 0.3167 100.00% 5120 12.0204 97.55% 6.2500
37 attn_k 1071.17 0.0126 14.2128 0.2092 0.3167 100.00% 5120 12.0204 97.55% 6.2500
37 attn_q 1071.17 0.0126 14.2128 0.2092 0.3167 100.00% 5120 12.0204 97.55% 6.2500
25 attn_v 1037.08 0.0000 23.9319 0.2026 0.6313 99.98% 5120 11.5734 93.93% 5.4688
25 attn_q 1037.08 0.0000 23.9319 0.2026 0.6313 99.98% 5120 11.5734 93.93% 5.4688
25 attn_k 1037.08 0.0000 23.9319 0.2026 0.6313 99.98% 5120 11.5734 93.93% 5.4688
26 attn_k 1031.55 0.0031 25.6229 0.2015 0.6353 100.00% 5120 11.5771 93.96% 5.6641
26 attn_v 1031.55 0.0031 25.6229 0.2015 0.6353 100.00% 5120 11.5771 93.96% 5.6641
26 attn_q 1031.55 0.0031 25.6229 0.2015 0.6353 100.00% 5120 11.5771 93.96% 5.6641
24 attn_k 955.35 0.0000 20.3266 0.1866 0.5947 99.98% 5120 11.5271 93.55% 5.8594
24 attn_q 955.35 0.0000 20.3266 0.1866 0.5947 99.98% 5120 11.5271 93.55% 5.8594
24 attn_v 955.35 0.0000 20.3266 0.1866 0.5947 99.98% 5120 11.5271 93.55% 5.8594
23 attn_k 950.08 0.0000 22.1702 0.1856 0.6765 99.98% 5120 11.3836 92.39% 5.4688
23 attn_v 950.08 0.0000 22.1702 0.1856 0.6765 99.98% 5120 11.3836 92.39% 5.4688
23 attn_q 950.08 0.0000 22.1702 0.1856 0.6765 99.98% 5120 11.3836 92.39% 5.4688
39 attn_q 926.54 0.0431 16.0860 0.1810 0.2805 100.00% 5120 12.0610 97.88% 5.8594
39 attn_k 926.54 0.0431 16.0860 0.1810 0.2805 100.00% 5120 12.0610 97.88% 5.8594
39 attn_v 926.54 0.0431 16.0860 0.1810 0.2805 100.00% 5120 12.0610 97.88% 5.8594
22 attn_v 916.79 0.0000 18.9033 0.1791 0.5414 99.98% 5120 11.5694 93.89% 5.8594
22 attn_q 916.79 0.0000 18.9033 0.1791 0.5414 99.98% 5120 11.5694 93.89% 5.8594
22 attn_k 916.79 0.0000 18.9033 0.1791 0.5414 99.98% 5120 11.5694 93.89% 5.8594
38 ffn_down 905.56 0.0059 75.8273 0.0655 0.7782 100.00% 13824 11.5526 83.99% 2.0255
19 attn_q 879.58 0.0100 28.6687 0.1718 0.8143 100.00% 5120 10.9550 88.91% 6.0547
19 attn_v 879.58 0.0100 28.6687 0.1718 0.8143 100.00% 5120 10.9550 88.91% 6.0547
19 attn_k 879.58 0.0100 28.6687 0.1718 0.8143 100.00% 5120 10.9550 88.91% 6.0547
36 ffn_up 870.19 0.0086 1.1614 0.1700 0.0388 100.00% 5120 12.2979 99.81% 38.4766
36 ffn_gate 870.19 0.0086 1.1614 0.1700 0.0388 100.00% 5120 12.2979 99.81% 38.4766
37 ffn_up 866.00 0.0098 1.3722 0.1691 0.0456 100.00% 5120 12.2901 99.74% 40.2344
37 ffn_gate 866.00 0.0098 1.3722 0.1691 0.0456 100.00% 5120 12.2901 99.74% 40.2344
21 attn_k 865.62 0.0092 22.5825 0.1691 0.7082 100.00% 5120 11.1497 90.49% 6.0547
21 attn_q 865.62 0.0092 22.5825 0.1691 0.7082 100.00% 5120 11.1497 90.49% 6.0547
21 attn_v 865.62 0.0092 22.5825 0.1691 0.7082 100.00% 5120 11.1497 90.49% 6.0547
13 attn_k 863.66 0.0136 41.3031 0.1687 1.1620 100.00% 5120 10.2387 83.09% 5.6641
13 attn_q 863.66 0.0136 41.3031 0.1687 1.1620 100.00% 5120 10.2387 83.09% 5.6641
13 attn_v 863.66 0.0136 41.3031 0.1687 1.1620 100.00% 5120 10.2387 83.09% 5.6641
3 ffn_down 863.54 0.0001 849.5108 0.0625 7.2252 100.00% 13824 0.2206 1.60% 0.0723
16 attn_v 860.58 0.0155 39.5863 0.1681 1.0040 100.00% 5120 10.5837 85.89% 6.0547
16 attn_q 860.58 0.0155 39.5863 0.1681 1.0040 100.00% 5120 10.5837 85.89% 6.0547
16 attn_k 860.58 0.0155 39.5863 0.1681 1.0040 100.00% 5120 10.5837 85.89% 6.0547
14 attn_q 859.59 0.0144 48.8121 0.1679 1.2058 100.00% 5120 10.1958 82.75% 5.4688
14 attn_v 859.59 0.0144 48.8121 0.1679 1.2058 100.00% 5120 10.1958 82.75% 5.4688
14 attn_k 859.59 0.0144 48.8121 0.1679 1.2058 100.00% 5120 10.1958 82.75% 5.4688
18 attn_k 843.95 0.0084 26.9360 0.1648 0.7675 100.00% 5120 10.9957 89.24% 6.0547
18 attn_v 843.95 0.0084 26.9360 0.1648 0.7675 100.00% 5120 10.9957 89.24% 6.0547
18 attn_q 843.95 0.0084 26.9360 0.1648 0.7675 100.00% 5120 10.9957 89.24% 6.0547
17 attn_k 842.77 0.0124 33.2876 0.1646 0.8841 100.00% 5120 10.7489 87.23% 5.8594
17 attn_v 842.77 0.0124 33.2876 0.1646 0.8841 100.00% 5120 10.7489 87.23% 5.8594
17 attn_q 842.77 0.0124 33.2876 0.1646 0.8841 100.00% 5120 10.7489 87.23% 5.8594
38 ffn_up 840.16 0.0088 2.6975 0.1641 0.0626 100.00% 5120 12.2701 99.58% 36.9141
38 ffn_gate 840.16 0.0088 2.6975 0.1641 0.0626 100.00% 5120 12.2701 99.58% 36.9141
35 ffn_up 835.32 0.0068 1.1382 0.1631 0.0333 100.00% 5120 12.3025 99.84% 40.2344
35 ffn_gate 835.32 0.0068 1.1382 0.1631 0.0333 100.00% 5120 12.3025 99.84% 40.2344
15 attn_q 820.47 0.0159 44.4388 0.1602 1.1185 100.00% 5120 10.2600 83.27% 5.2734
15 attn_v 820.47 0.0159 44.4388 0.1602 1.1185 100.00% 5120 10.2600 83.27% 5.2734
15 attn_k 820.47 0.0159 44.4388 0.1602 1.1185 100.00% 5120 10.2600 83.27% 5.2734
20 attn_k 810.73 0.0080 22.8515 0.1583 0.7303 100.00% 5120 10.9871 89.17% 6.0547
20 attn_v 810.73 0.0080 22.8515 0.1583 0.7303 100.00% 5120 10.9871 89.17% 6.0547
20 attn_q 810.73 0.0080 22.8515 0.1583 0.7303 100.00% 5120 10.9871 89.17% 6.0547
34 ffn_up 799.17 0.0067 1.0181 0.1561 0.0281 100.00% 5120 12.3064 99.87% 38.2812
34 ffn_gate 799.17 0.0067 1.0181 0.1561 0.0281 100.00% 5120 12.3064 99.87% 38.2812
12 attn_v 782.01 0.0126 46.9238 0.1527 1.2340 100.00% 5120 9.8808 80.19% 5.2734
12 attn_q 782.01 0.0126 46.9238 0.1527 1.2340 100.00% 5120 9.8808 80.19% 5.2734
12 attn_k 782.01 0.0126 46.9238 0.1527 1.2340 100.00% 5120 9.8808 80.19% 5.2734
33 ffn_up 764.58 0.0056 0.8259 0.1493 0.0239 100.00% 5120 12.3087 99.89% 46.4844
33 ffn_gate 764.58 0.0056 0.8259 0.1493 0.0239 100.00% 5120 12.3087 99.89% 46.4844
32 ffn_gate 736.26 0.0046 0.7709 0.1438 0.0227 100.00% 5120 12.3091 99.90% 45.8984
32 ffn_up 736.26 0.0046 0.7709 0.1438 0.0227 100.00% 5120 12.3091 99.90% 45.8984
10 attn_v 713.91 0.0092 39.3571 0.1394 1.0706 100.00% 5120 9.9807 81.00% 5.6641
10 attn_k 713.91 0.0092 39.3571 0.1394 1.0706 100.00% 5120 9.9807 81.00% 5.6641
10 attn_q 713.91 0.0092 39.3571 0.1394 1.0706 100.00% 5120 9.9807 81.00% 5.6641
9 attn_v 709.57 0.0059 35.1349 0.1386 0.9907 100.00% 5120 10.0564 81.61% 6.6406
9 attn_k 709.57 0.0059 35.1349 0.1386 0.9907 100.00% 5120 10.0564 81.61% 6.6406
9 attn_q 709.57 0.0059 35.1349 0.1386 0.9907 100.00% 5120 10.0564 81.61% 6.6406
31 ffn_gate 706.57 0.0035 0.5213 0.1380 0.0190 100.00% 5120 12.3114 99.91% 53.9062
31 ffn_up 706.57 0.0035 0.5213 0.1380 0.0190 100.00% 5120 12.3114 99.91% 53.9062
11 attn_k 695.69 0.0103 44.5534 0.1359 1.1356 100.00% 5120 9.7664 79.26% 5.4688
11 attn_q 695.69 0.0103 44.5534 0.1359 1.1356 100.00% 5120 9.7664 79.26% 5.4688
11 attn_v 695.69 0.0103 44.5534 0.1359 1.1356 100.00% 5120 9.7664 79.26% 5.4688
30 ffn_gate 678.07 0.0041 0.5778 0.1324 0.0203 100.00% 5120 12.3097 99.90% 47.6562
30 ffn_up 678.07 0.0041 0.5778 0.1324 0.0203 100.00% 5120 12.3097 99.90% 47.6562
39 ffn_gate 648.54 0.0191 5.6152 0.1267 0.0890 100.00% 5120 12.2396 99.33% 12.3047
39 ffn_up 648.54 0.0191 5.6152 0.1267 0.0890 100.00% 5120 12.2396 99.33% 12.3047
29 ffn_up 647.83 0.0048 0.4959 0.1265 0.0169 100.00% 5120 12.3115 99.92% 62.6953
29 ffn_gate 647.83 0.0048 0.4959 0.1265 0.0169 100.00% 5120 12.3115 99.92% 62.6953
28 ffn_up 621.34 0.0073 0.4593 0.1214 0.0171 100.00% 5120 12.3108 99.91% 59.5703
28 ffn_gate 621.34 0.0073 0.4593 0.1214 0.0171 100.00% 5120 12.3108 99.91% 59.5703
27 ffn_gate 596.51 0.0036 0.5035 0.1165 0.0176 100.00% 5120 12.3092 99.90% 63.4766
27 ffn_up 596.51 0.0036 0.5035 0.1165 0.0176 100.00% 5120 12.3092 99.90% 63.4766
8 attn_q 595.64 0.0067 34.9034 0.1163 0.8977 100.00% 5120 9.9023 80.36% 5.8594
8 attn_v 595.64 0.0067 34.9034 0.1163 0.8977 100.00% 5120 9.9023 80.36% 5.8594
8 attn_k 595.64 0.0067 34.9034 0.1163 0.8977 100.00% 5120 9.9023 80.36% 5.8594
37 ffn_down 592.02 0.0074 16.6926 0.0428 0.1790 100.00% 13824 12.6990 92.32% 25.3906
26 ffn_gate 568.09 0.0044 0.5478 0.1110 0.0182 100.00% 5120 12.3079 99.89% 53.3203
26 ffn_up 568.09 0.0044 0.5478 0.1110 0.0182 100.00% 5120 12.3079 99.89% 53.3203
25 ffn_gate 542.26 0.0052 0.5749 0.1059 0.0192 100.00% 5120 12.3055 99.87% 47.0703
25 ffn_up 542.26 0.0052 0.5749 0.1059 0.0192 100.00% 5120 12.3055 99.87% 47.0703
7 attn_k 536.38 0.0000 37.2838 0.1048 0.9200 99.98% 5120 9.3955 76.25% 6.6406
7 attn_q 536.38 0.0000 37.2838 0.1048 0.9200 99.98% 5120 9.3955 76.25% 6.6406
7 attn_v 536.38 0.0000 37.2838 0.1048 0.9200 99.98% 5120 9.3955 76.25% 6.6406
24 ffn_gate 513.76 0.0061 0.6509 0.1003 0.0216 100.00% 5120 12.3012 99.83% 37.5000
24 ffn_up 513.76 0.0061 0.6509 0.1003 0.0216 100.00% 5120 12.3012 99.83% 37.5000
6 attn_k 511.80 0.0000 34.5247 0.1000 0.7756 99.98% 5120 9.8035 79.56% 7.4219
6 attn_v 511.80 0.0000 34.5247 0.1000 0.7756 99.98% 5120 9.8035 79.56% 7.4219
6 attn_q 511.80 0.0000 34.5247 0.1000 0.7756 99.98% 5120 9.8035 79.56% 7.4219
36 ffn_down 493.83 0.0075 5.3032 0.0357 0.0743 100.00% 13824 13.0480 94.86% 44.4879
23 ffn_gate 488.15 0.0045 0.7809 0.0953 0.0255 100.00% 5120 12.2943 99.78% 17.9688
23 ffn_up 488.15 0.0045 0.7809 0.0953 0.0255 100.00% 5120 12.2943 99.78% 17.9688
22 ffn_up 461.78 0.0070 0.8592 0.0902 0.0298 100.00% 5120 12.2841 99.69% 12.8906
22 ffn_gate 461.78 0.0070 0.8592 0.0902 0.0298 100.00% 5120 12.2841 99.69% 12.8906
5 attn_k 461.03 0.0000 27.0042 0.0900 0.7100 99.96% 5120 9.4849 76.98% 8.9844
5 attn_v 461.03 0.0000 27.0042 0.0900 0.7100 99.96% 5120 9.4849 76.98% 8.9844
5 attn_q 461.03 0.0000 27.0042 0.0900 0.7100 99.96% 5120 9.4849 76.98% 8.9844
21 ffn_up 432.89 0.0068 1.0011 0.0845 0.0359 100.00% 5120 12.2675 99.56% 10.5469
21 ffn_gate 432.89 0.0068 1.0011 0.0845 0.0359 100.00% 5120 12.2675 99.56% 10.5469
4 attn_k 416.60 0.0000 25.1496 0.0814 0.6785 99.96% 5120 9.2580 75.13% 9.9609
4 attn_v 416.60 0.0000 25.1496 0.0814 0.6785 99.96% 5120 9.2580 75.13% 9.9609
4 attn_q 416.60 0.0000 25.1496 0.0814 0.6785 99.96% 5120 9.2580 75.13% 9.9609
35 ffn_down 411.85 0.0053 7.9751 0.0298 0.0819 100.00% 13824 13.0757 95.06% 28.2841
20 ffn_gate 403.55 0.0171 1.2925 0.0788 0.0435 100.00% 5120 12.2438 99.37% 8.7891
20 ffn_up 403.55 0.0171 1.2925 0.0788 0.0435 100.00% 5120 12.2438 99.37% 8.7891
19 ffn_gate 382.99 0.0103 1.2834 0.0748 0.0409 100.00% 5120 12.2452 99.38% 8.9844
19 ffn_up 382.99 0.0103 1.2834 0.0748 0.0409 100.00% 5120 12.2452 99.38% 8.9844
18 ffn_gate 360.11 0.0086 1.1621 0.0703 0.0419 100.00% 5120 12.2340 99.29% 9.1797
18 ffn_up 360.11 0.0086 1.1621 0.0703 0.0419 100.00% 5120 12.2340 99.29% 9.1797
34 ffn_down 343.68 0.0057 1.9176 0.0249 0.0342 100.00% 13824 13.3093 96.76% 43.4028
17 ffn_up 336.38 0.0122 1.4292 0.0657 0.0480 100.00% 5120 12.2045 99.05% 8.5938
17 ffn_gate 336.38 0.0122 1.4292 0.0657 0.0480 100.00% 5120 12.2045 99.05% 8.5938
16 ffn_gate 311.79 0.0122 1.7776 0.0609 0.0573 100.00% 5120 12.1552 98.65% 8.3984
16 ffn_up 311.79 0.0122 1.7776 0.0609 0.0573 100.00% 5120 12.1552 98.65% 8.3984
33 ffn_down 307.16 0.0097 7.3743 0.0222 0.0698 100.00% 13824 13.2318 96.20% 14.9740
15 ffn_up 288.24 0.0109 2.0467 0.0563 0.0615 100.00% 5120 12.1205 98.37% 8.0078
15 ffn_gate 288.24 0.0109 2.0467 0.0563 0.0615 100.00% 5120 12.1205 98.37% 8.0078
14 ffn_up 272.26 0.0103 2.6254 0.0532 0.0710 100.00% 5120 12.0645 97.91% 7.8125
14 ffn_gate 272.26 0.0103 2.6254 0.0532 0.0710 100.00% 5120 12.0645 97.91% 7.8125
32 ffn_down 270.24 0.0095 0.7403 0.0195 0.0193 100.00% 13824 13.4759 97.97% 46.8027
13 ffn_up 254.86 0.0113 2.6888 0.0498 0.0725 100.00% 5120 12.0363 97.68% 7.2266
13 ffn_gate 254.86 0.0113 2.6888 0.0498 0.0725 100.00% 5120 12.0363 97.68% 7.2266
31 ffn_down 250.66 0.0086 0.9231 0.0181 0.0188 100.00% 13824 13.4937 98.10% 43.7645
12 ffn_gate 239.95 0.0166 2.6666 0.0469 0.0752 100.00% 5120 11.9867 97.28% 7.2266
12 ffn_up 239.95 0.0166 2.6666 0.0469 0.0752 100.00% 5120 11.9867 97.28% 7.2266
30 ffn_down 237.44 0.0079 0.5803 0.0172 0.0149 100.00% 13824 13.5080 98.20% 50.7812
11 ffn_up 230.23 0.0148 2.8725 0.0450 0.0777 100.00% 5120 11.9567 97.04% 7.0312
11 ffn_gate 230.23 0.0148 2.8725 0.0450 0.0777 100.00% 5120 11.9567 97.04% 7.0312
29 ffn_down 227.64 0.0074 6.8119 0.0165 0.0593 100.00% 13824 13.3079 96.75% 7.5231
10 ffn_up 220.84 0.0059 2.3218 0.0431 0.0624 100.00% 5120 12.0437 97.74% 7.4219
10 ffn_gate 220.84 0.0059 2.3218 0.0431 0.0624 100.00% 5120 12.0437 97.74% 7.4219
39 attn_output 213.80 0.0049 1.7995 0.0418 0.0570 100.00% 5120 11.6992 94.95% 90.6250
3 attn_k 212.66 0.0000 17.1690 0.0415 0.4298 99.98% 5120 8.5517 69.40% 7.0312
3 attn_q 212.66 0.0000 17.1690 0.0415 0.4298 99.98% 5120 8.5517 69.40% 7.0312
3 attn_v 212.66 0.0000 17.1690 0.0415 0.4298 99.98% 5120 8.5517 69.40% 7.0312
9 ffn_gate 211.89 0.0064 1.9591 0.0414 0.0548 100.00% 5120 12.0596 97.87% 7.6172
9 ffn_up 211.89 0.0064 1.9591 0.0414 0.0548 100.00% 5120 12.0596 97.87% 7.6172
2 attn_v 211.81 0.0000 13.5470 0.0414 0.5105 99.86% 5120 7.5117 60.96% 5.0781
2 attn_q 211.81 0.0000 13.5470 0.0414 0.5105 99.86% 5120 7.5117 60.96% 5.0781
2 attn_k 211.81 0.0000 13.5470 0.0414 0.5105 99.86% 5120 7.5117 60.96% 5.0781
28 ffn_down 210.59 0.0071 0.7934 0.0152 0.0169 100.00% 13824 13.4661 97.90% 42.6794
27 ffn_down 204.54 0.0061 8.1876 0.0148 0.0705 100.00% 13824 13.2151 96.08% 4.0509
26 ffn_down 195.28 0.0058 3.9368 0.0141 0.0383 100.00% 13824 13.2929 96.64% 14.0336
8 ffn_gate 189.36 0.0115 1.6949 0.0370 0.0461 100.00% 5120 12.0880 98.10% 7.8125
8 ffn_up 189.36 0.0115 1.6949 0.0370 0.0461 100.00% 5120 12.0880 98.10% 7.8125
38 attn_output 185.57 0.0016 1.4583 0.0362 0.0547 100.00% 5120 11.5948 94.10% 53.1250
25 ffn_down 177.29 0.0051 0.8608 0.0128 0.0142 100.00% 13824 13.4412 97.72% 47.8877
24 ffn_down 167.83 0.0045 0.8385 0.0121 0.0184 100.00% 13824 13.3351 96.95% 32.1904
7 ffn_up 167.13 0.0085 1.2138 0.0326 0.0395 100.00% 5120 12.0921 98.13% 6.8359
7 ffn_gate 167.13 0.0085 1.2138 0.0326 0.0395 100.00% 5120 12.0921 98.13% 6.8359
23 ffn_down 161.22 0.0045 1.2035 0.0117 0.0192 100.00% 13824 13.3102 96.77% 31.1777
22 ffn_down 150.90 0.0038 0.8320 0.0109 0.0151 100.00% 13824 13.3489 97.05% 39.8582
1 attn_k 148.63 0.0000 22.4289 0.0290 0.5286 99.80% 5120 5.8192 47.23% 3.7109
1 attn_q 148.63 0.0000 22.4289 0.0290 0.5286 99.80% 5120 5.8192 47.23% 3.7109
1 attn_v 148.63 0.0000 22.4289 0.0290 0.5286 99.80% 5120 5.8192 47.23% 3.7109
21 ffn_down 147.96 0.0036 1.6641 0.0107 0.0245 100.00% 13824 13.1859 95.86% 19.8206
6 ffn_up 143.83 0.0134 0.7677 0.0281 0.0279 100.00% 5120 12.1471 98.58% 7.4219
6 ffn_gate 143.83 0.0134 0.7677 0.0281 0.0279 100.00% 5120 12.1471 98.58% 7.4219
37 attn_output 127.32 0.0007 1.2476 0.0249 0.0382 100.00% 5120 11.6690 94.70% 36.5234
36 attn_output 124.95 0.0022 0.7087 0.0244 0.0317 100.00% 5120 11.7572 95.42% 64.4531
20 ffn_down 119.81 0.0030 0.3580 0.0087 0.0095 100.00% 13824 13.4021 97.44% 53.0237
5 ffn_gate 114.26 0.0015 0.5836 0.0223 0.0180 100.00% 5120 12.1927 98.95% 8.2031
5 ffn_up 114.26 0.0015 0.5836 0.0223 0.0180 100.00% 5120 12.1927 98.95% 8.2031
19 ffn_down 110.82 0.0026 0.5981 0.0080 0.0117 100.00% 13824 13.3221 96.85% 37.1817
18 ffn_down 100.26 0.0026 1.6162 0.0073 0.0172 100.00% 13824 13.2686 96.46% 18.5185
17 ffn_down 91.33 0.0017 0.9219 0.0066 0.0102 100.00% 13824 13.3992 97.41% 30.8883
4 ffn_gate 87.21 0.0002 0.2963 0.0170 0.0101 100.00% 5120 12.2345 99.29% 10.5469
4 ffn_up 87.21 0.0002 0.2963 0.0170 0.0101 100.00% 5120 12.2345 99.29% 10.5469
16 ffn_down 83.68 0.0018 0.3795 0.0061 0.0068 100.00% 13824 13.4214 97.58% 46.2240
35 attn_output 80.93 0.0009 0.3628 0.0158 0.0178 100.00% 5120 11.8167 95.90% 67.3828
15 ffn_down 69.29 0.0015 0.4523 0.0050 0.0060 100.00% 13824 13.4392 97.70% 43.4028
34 attn_output 68.75 0.0018 0.3458 0.0134 0.0159 100.00% 5120 11.7593 95.43% 90.4297
3 ffn_gate 63.74 0.0000 0.9831 0.0124 0.0160 100.00% 5120 12.1360 98.49% 7.8125
3 ffn_up 63.74 0.0000 0.9831 0.0124 0.0160 100.00% 5120 12.1360 98.49% 7.8125
21 attn_output 63.53 0.0021 0.5559 0.0124 0.0145 100.00% 5120 11.8760 96.38% 53.7109
15 attn_output 63.25 0.0013 0.1506 0.0124 0.0118 100.00% 5120 11.9061 96.62% 81.6406
14 ffn_down 60.91 0.0014 0.3164 0.0044 0.0045 100.00% 13824 13.4907 98.08% 48.8281
32 attn_output 60.46 0.0005 0.4920 0.0118 0.0169 100.00% 5120 11.7173 95.09% 67.5781
14 attn_output 59.20 0.0033 0.2145 0.0116 0.0095 100.00% 5120 12.0477 97.77% 57.4219
31 attn_output 58.85 0.0005 0.4893 0.0115 0.0167 100.00% 5120 11.6401 94.47% 50.1953
16 attn_output 58.58 0.0012 0.1902 0.0114 0.0095 100.00% 5120 12.0063 97.44% 88.8672
17 attn_output 58.46 0.0005 0.2506 0.0114 0.0106 100.00% 5120 11.9494 96.98% 61.5234
33 attn_output 53.96 0.0014 0.2382 0.0105 0.0079 100.00% 5120 12.0467 97.77% 108.9844
24 attn_output 53.59 0.0005 0.5380 0.0105 0.0263 100.00% 5120 11.1589 90.56% 33.2031
13 ffn_down 53.16 0.0012 0.1572 0.0038 0.0035 100.00% 13824 13.5008 98.15% 50.1302
20 attn_output 52.53 0.0015 0.2461 0.0103 0.0114 100.00% 5120 11.8431 96.11% 75.1953
30 attn_output 50.85 0.0007 0.2020 0.0099 0.0085 100.00% 5120 11.9906 97.31% 95.5078
12 ffn_down 46.43 0.0004 0.0648 0.0034 0.0025 100.00% 13824 13.5358 98.41% 70.0231
11 ffn_down 44.24 0.0008 0.4759 0.0032 0.0049 100.00% 13824 13.4624 97.87% 23.6545
13 attn_output 43.56 0.0003 0.1377 0.0085 0.0073 100.00% 5120 11.9801 97.23% 63.0859
12 attn_output 43.40 0.0009 0.1860 0.0085 0.0078 100.00% 5120 11.9642 97.10% 72.8516
11 attn_output 42.74 0.0006 0.5558 0.0083 0.0176 100.00% 5120 11.4660 93.05% 50.1953
25 attn_output 42.61 0.0006 0.3259 0.0083 0.0095 100.00% 5120 11.8723 96.35% 69.9219
23 attn_output 42.58 0.0005 0.1831 0.0083 0.0095 100.00% 5120 11.7843 95.64% 62.6953
19 attn_output 42.16 0.0004 0.2335 0.0082 0.0076 100.00% 5120 12.0083 97.45% 41.7969
26 attn_output 41.73 0.0003 0.2064 0.0082 0.0076 100.00% 5120 11.9276 96.80% 79.4922
27 attn_output 41.03 0.0003 0.8884 0.0080 0.0141 100.00% 5120 11.8718 96.35% 25.7812
22 attn_output 40.76 0.0003 0.1580 0.0080 0.0071 100.00% 5120 11.8881 96.48% 99.6094
18 attn_output 40.68 0.0014 0.2471 0.0079 0.0069 100.00% 5120 12.0482 97.78% 57.2266
10 ffn_down 39.95 0.0006 0.1846 0.0029 0.0025 100.00% 13824 13.5468 98.49% 48.9728
2 ffn_up 38.98 0.0000 0.1812 0.0076 0.0036 100.00% 5120 12.2648 99.54% 7.4219
2 ffn_gate 38.98 0.0000 0.1812 0.0076 0.0036 100.00% 5120 12.2648 99.54% 7.4219
29 attn_output 38.72 0.0016 0.0977 0.0076 0.0053 100.00% 5120 12.0489 97.78% 130.2734
28 attn_output 38.28 0.0006 0.1802 0.0075 0.0064 100.00% 5120 11.9516 96.99% 131.0547
10 attn_output 36.31 0.0004 0.1589 0.0071 0.0085 100.00% 5120 11.7977 95.75% 60.7422
9 ffn_down 36.00 0.0006 0.7241 0.0026 0.0067 100.00% 13824 13.3678 97.19% 10.7784
8 ffn_down 30.51 0.0004 0.3576 0.0022 0.0042 100.00% 13824 13.3650 97.17% 20.4716
9 attn_output 25.89 0.0003 0.1683 0.0051 0.0074 100.00% 5120 11.6535 94.58% 51.5625
7 ffn_down 25.57 0.0002 0.3904 0.0018 0.0055 100.00% 13824 13.1784 95.81% 9.4763
6 ffn_down 18.29 0.0003 0.1456 0.0013 0.0018 100.00% 13824 13.4276 97.62% 35.3733
0 attn_q 18.29 0.0000 5.9196 0.0036 0.0950 94.32% 5120 4.4566 36.17% 4.8828
0 attn_k 18.29 0.0000 5.9196 0.0036 0.0950 94.32% 5120 4.4566 36.17% 4.8828
0 attn_v 18.29 0.0000 5.9196 0.0036 0.0950 94.32% 5120 4.4566 36.17% 4.8828
8 attn_output 17.56 0.0001 0.0978 0.0034 0.0039 100.00% 5120 11.8420 96.10% 55.8594
1 ffn_gate 17.11 0.0000 0.5277 0.0033 0.0083 100.00% 5120 11.9241 96.77% 5.0781
1 ffn_up 17.11 0.0000 0.5277 0.0033 0.0083 100.00% 5120 11.9241 96.77% 5.0781
7 attn_output 13.82 0.0001 0.0629 0.0027 0.0034 100.00% 5120 11.7857 95.65% 51.5625
5 ffn_down 12.69 0.0001 0.3858 0.0009 0.0034 100.00% 13824 13.2589 96.39% 7.2338
6 attn_output 9.60 0.0000 0.0566 0.0019 0.0026 100.00% 5120 11.6751 94.75% 54.8828
4 ffn_down 7.48 0.0001 0.0299 0.0005 0.0006 100.00% 13824 13.4405 97.71% 54.4705
0 ffn_gate 7.24 0.0000 0.3432 0.0014 0.0109 99.94% 5120 9.7065 78.77% 6.4453
0 ffn_up 7.24 0.0000 0.3432 0.0014 0.0109 99.94% 5120 9.7065 78.77% 6.4453
5 attn_output 6.31 0.0000 0.0573 0.0012 0.0018 100.00% 5120 11.7298 95.19% 33.3984
4 attn_output 4.28 0.0000 0.0411 0.0008 0.0016 100.00% 5120 11.5801 93.98% 32.4219
0 ffn_down 4.25 0.0000 3.6589 0.0003 0.0312 99.73% 13824 1.6508 12.00% 0.1447
3 attn_output 3.57 0.0000 0.0637 0.0007 0.0025 100.00% 5120 10.5307 85.46% 26.9531
2 ffn_down 2.67 0.0000 0.0087 0.0002 0.0002 100.00% 13824 13.3953 97.39% 44.5602
1 ffn_down 2.13 0.0000 0.6453 0.0002 0.0061 100.00% 13824 8.4307 61.29% 0.3617
2 attn_output 1.46 0.0000 0.0200 0.0003 0.0005 100.00% 5120 11.4702 93.09% 42.7734
1 attn_output 1.05 0.0000 0.0229 0.0002 0.0006 100.00% 5120 10.2723 83.37% 50.5859
0 attn_output 0.46 0.0000 0.0577 0.0001 0.0011 90.25% 5120 7.1328 57.89% 12.8906

Discussion

So I'm not sure how best to read these stats and interpret them yet.

Just to confirm, what you are calling "ZD Score" is calculated using the ZD concept from that paper, applied to the activations?
Anyway, just some observations. I didn't slice the data to look at the rest yet.

Fascinating stuff, hopefully I can dig in more later this week! Cheers! |
Fascinating stuff indeed @ubergarm, and apparently not without controversy 🙃

In a room full of PhDs, I'd be Howard Wolowitz 🤣 so, dear reader, please take everything that follows with the proverbial pinch of salt, and do not pull back from pointing out errors or gaps in my logic.

The notion of determining the importance of a specific tensor in a specific layer by somehow measuring the degree of transformation of the hidden states (be it with importance scores, cosine similarity, etc.) as the tokens "flow" from that layer to the next seems -intuitively- reasonable to me and, as a few have correctly pointed out, having access to the weights during those transformations will yield significantly better measurements. In my case however, and for the reasons explained above, I'm left with the next best option, which is the sum of the squared activations (imatrix importance scores) for specific tensors in specific layers. That's what I'm calling Σ(Bias).

I'm emphasising specific tensor & specific layer to signify that the stats should be used to compare between tensors of the same type only. In other words, concluding that one tensor type is more influential than a different tensor type just because its Σ(Bias) is higher would be a mistake.

To validate the hypothesis we of course need lots of tests, but so far, and based solely on layer-wise quantizing DeepSeek-R1-Distill-Qwen-7B, it seems to hold (approach and results in my previous comment 👆 and corresponding imatrix stats at the end 👇). Testing other models is needed, but so far so good.

I have indeed taken the paper's ZD concept and applied it to the activations. Their Z-score Distribution (a better name would be z-score density, IMO) is nothing more than the percentage of elements that have a z-score greater than 1 standard deviation from the mean. I haven't had a chance to really grok the relevance of this metric, but suspect that in combination with the normalized entropy it may give insights into whole layer scoring, but that's a (pruning) story for another day...

Computing statistics for imatrix-DeepSeek-R1-Distill-Qwen-7B-small.dat (197 tensors)
|
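To make the Entropy, E (norm) and ZD Score columns concrete, here is an illustrative sketch based on the descriptions above (per-value importance scores in, statistics out; the exact scaling and sidedness used by the PR's report may differ):

```cpp
// Illustrative sketch: ZD as the fraction of importance scores whose z-score
// exceeds 1 standard deviation above the mean (the report may use |z| and a
// different scaling), and the Shannon entropy of the normalized scores along
// with its normalized form E(norm) = entropy / log2(N).
#include <algorithm>
#include <cmath>
#include <vector>

struct ScoreStats { double zd; double entropy; double entropy_norm; };

ScoreStats score_stats(const std::vector<float> & scores) {
    const size_t n = scores.size();
    if (n < 2) return {0.0, 0.0, 0.0};

    double sum = 0.0, sum_sq = 0.0;
    for (float v : scores) { sum += v; sum_sq += (double) v * v; }
    const double mean  = sum / n;
    const double var   = std::max(0.0, sum_sq / n - mean * mean);
    const double sigma = std::sqrt(var);

    size_t above = 0;
    for (float v : scores) {
        if (sigma > 0.0 && (v - mean) / sigma > 1.0) ++above;
    }

    double entropy = 0.0;
    for (float v : scores) {
        const double p = sum > 0.0 ? v / sum : 0.0;
        if (p > 0.0) entropy -= p * std::log2(p);
    }

    return { (double) above / n, entropy, entropy / std::log2((double) n) };
}
```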
A new --show-statistics option generates a report highlighting which tensors/layers contribute the most in a model. The report is sorted from highest influence to lowest. The process computes the average value of the scores per tensor/layer and calculates their % contribution, exiting immediately after completion.

This PR can be used along with quantize: Handle user-defined quantization levels for additional tensors to do layer-wise quantization similar, but not quite the same, to the process described in Layer-Wise Quantization: A Pragmatic and Effective Method for Quantizing LLMs Beyond Integer Bit-Levels
Output example:
llama-imatrix --in-file imatrix-DeepSeek-R1-Distill-Llama-8B-small.dat --show-statistics