Llama 4 Scout hybrid quant experiment #13040
Replies: 3 comments 2 replies
-
Take two on the Llama 4 Scout quant experiment. I increased the sophistication of the layer quant control, allowing both the quant and the output type to be specified per layer. In the first try I only specified the quant, but the output type was frozen at Q3_K_M. With the new approach I get true Q3_K_M and Q2_K_M on the layers.

Code:
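(The patch itself isn't included here. Purely as an illustration of the idea, and not the actual code, a per-layer type map could look like the sketch below; the hook point in llama-quant.cpp, the helper names, and the single-type-per-layer simplification are assumptions, since the real Q3_K_M/Q2_K_M mixes vary the type per tensor within a layer.)

```cpp
// Sketch only: per-layer quant control via an explicit layer -> type map,
// plus separate overrides for the embedding and output tensors.
// The hook point inside llama-quant.cpp's tensor-type selection is assumed.
#include <cstdio>
#include <cstring>
#include "ggml.h"

// Parse the block index out of a tensor name like "blk.17.ffn_down.weight";
// returns -1 for non-block tensors (token embedding, output, norms).
static int block_index(const char * name) {
    int il = -1;
    return sscanf(name, "blk.%d.", &il) == 1 ? il : -1;
}

// Example map for a Q3_24_Q2_24-style scheme: layers 0..23 -> Q3_K,
// layers 24..47 -> Q2_K, token embedding -> Q3_K, output tensor -> Q5_K.
static ggml_type pick_type(const char * name, ggml_type fallback) {
    if (strcmp(name, "token_embd.weight") == 0) return GGML_TYPE_Q3_K;
    if (strcmp(name, "output.weight")     == 0) return GGML_TYPE_Q5_K;
    const int il = block_index(name);
    if (il < 0)  return fallback;        // norms etc. keep the default type
    if (il < 24) return GGML_TYPE_Q3_K;  // first half of the stack
    return GGML_TYPE_Q2_K;               // second half of the stack
}
```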
Description of quant: Q3_24_Q2_24_K_M_3_5 = Q3_K_M for layers 0..23, Q2_K_M for layers 24..47, embedding Q3_K_M, output Q5_K_M.

Results:
Conclusion:
-
Take three on the Llama 4 Scout quant experiment. In this test I cover all quants from Q3_K_S to Q6_K:
Description:

Comparison:
This quant is optimized for my 3x4070 RPC setup so it just barely offloads to the 3 GPUs (I kept adding Q3_K_M layers until I railed the VRAM with NGL=32). It could be improved by packing more Q3_K_M layers into the second block; however, I find the performance extremely good.

EDIT: I tweaked the layer quants and got better results, and uploaded the final version, which I call Q4_K_H, to HF here: https://huggingface.co/steampunque/Llama-4-Scout-17B-16E-Instruct-GGUF
-
Read up on Unsloth's dynamic quants...
-
I spent some time working on quantizing Llama 4 Scout. This was my first experience trying to get a model to work at quants below 4 bits. Due to the 108B model size it was necessary to go to Q2 or Q3 for this model to get usable TG on my setup (I currently have 3 4070s on a local LAN).
SUBJECTIVE RESULTS:
Q2_K : unusable. The model intermittently generated strange looking artifacts/nonsense words in the text. I tried stock Q2_K and also Q2_K with Q5_K embeddings and output. Both generated the intermittent nonsense words.
IQ3_XS : unusable, resulted in super slow generation around 5t/s even with speculation. It did not show artifacts or nonsense words.
Q3_K_S : unusable. Chinese characters appear where numbers should be in answering numeric questions.
Q3_K_M : excellent. No artifacts/nonsense words; the model is very smart and knowledgeable on prompts. It feels subjectively like the model woke up from a haze when moving to this quant. I backed off the output layer to Q5_K from Q6_K to save a small amount of space. The disadvantage was that only 31 out of 49 layers could offload to my 3 4070s.
Hybrid quant experiment: I modified llama-quant.cpp to force odd layers to a specified quant so I could quant to Q3_K_M on even layers and Q2_K on odd layers as follows:
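(The actual diff isn't shown above; the following is only a rough sketch of the odd/even override idea, with the hook point in llama-quant.cpp's type selection and the names assumed rather than taken from the patch.)

```cpp
// Sketch only, not the original patch: drop odd-numbered transformer blocks
// to Q2_K while leaving even blocks at the default choice (e.g. the Q3_K_M mix).
// Assumes GGUF tensor names of the form "blk.<N>.<...>".
#include <cstdio>
#include "ggml.h"

static ggml_type force_odd_layers(const char * tensor_name, ggml_type default_type) {
    int il = -1;
    if (sscanf(tensor_name, "blk.%d.", &il) == 1 && (il % 2) == 1) {
        return GGML_TYPE_Q2_K; // odd layers forced down to Q2_K
    }
    return default_type;       // even layers and non-block tensors untouched
}
```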
I called this quant Q3_Q2_K_M_3_5. I used Q3_K embed and Q5_K output. It produced a 48G gguf vs 51G for Q3_K_M.
Q3_Q2_K_M_3_5 : very good. No artifacts or strange words noticed, still smart in answering all prompts, and it boosted prompt processing by +10 t/s and token gen by +1 t/s over Q3_K_M (3 more layers offloaded). Subjectively, on creative tasks I think it generated less rich output though (I asked it to write some scifi stories in the styles of PKD, Niven, and Van Vogt, and Q3_K_M seemed noticeably richer than Q3_Q2_K_M_3_5).
I am using Llama 3.2 1B as a speculator, with dynamic translation between the vocabs of Scout and Llama 3.2 1B, together with my custom speculation code (a conceptual sketch of the vocab translation follows the table).

Summary table:

| Quant | PN | PP | TG | DN | DA |
|---|---|---|---|---|---|
| Q3_K_M_3_5 | 426 | 44.26 | 8.74 | 468 | 270 |
| Q3_Q2_K_M_3_5 | 418 | 56.93 | 9.34 | 498 | 252 |

- PN = generated tokens
- PP = prompt processing speed (t/s)
- TG = token generation speed (t/s)
- DN = drafted tokens
- DA = accepted draft tokens
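(Purely as a conceptual illustration of the vocab-translation step, and not the author's custom speculation code: the draft model's proposed tokens are detokenized to text, and that text is re-tokenized with the target model's vocab before verification. The toy vocabs and helper names below are invented for the example.)

```cpp
// Toy illustration of cross-vocab draft translation for speculative decoding.
// Not real llama.cpp API calls; the vocabs here are tiny stand-ins.
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

using Vocab = std::unordered_map<int, std::string>;

// Detokenize with the draft model's vocab: concatenate the text pieces.
static std::string detokenize(const std::vector<int> & toks, const Vocab & vocab) {
    std::string out;
    for (int t : toks) out += vocab.at(t);
    return out;
}

// Greedy longest-match tokenizer over the target vocab (toy stand-in for the
// real target tokenizer).
static std::vector<int> tokenize(const std::string & text, const Vocab & vocab) {
    std::vector<int> out;
    size_t pos = 0;
    while (pos < text.size()) {
        int    best_id  = -1;
        size_t best_len = 0;
        for (const auto & [id, piece] : vocab) {
            if (piece.size() > best_len && text.compare(pos, piece.size(), piece) == 0) {
                best_id  = id;
                best_len = piece.size();
            }
        }
        if (best_id < 0) break; // unknown span: a real tokenizer has byte fallback
        out.push_back(best_id);
        pos += best_len;
    }
    return out;
}

int main() {
    // Toy vocabs: same text, different ids and different segmentation.
    const Vocab draft_vocab  = {{1, "hel"}, {2, "lo "}, {3, "world"}};
    const Vocab target_vocab = {{10, "hello"}, {11, " world"}, {12, "hel"}, {13, "lo "}};

    const std::vector<int> draft_tokens = {1, 2, 3};            // draft model proposal
    const std::string text              = detokenize(draft_tokens, draft_vocab);
    const std::vector<int> target_tokens = tokenize(text, target_vocab);

    // target_tokens are what the target model would be asked to verify.
    for (int t : target_tokens) std::cout << t << ' ';
    std::cout << '\n';
}
```

A real implementation also has to deal with byte fallback, partial merges at the block boundary, and special tokens, which this toy ignores.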
Conclusion: For my setup I think Q3_K_M_3_5 is the best. Token gen is not that much faster with the Q3_Q2_K_M hybrid quant, perplexity took a pretty big percentage hit, and creative writing is not as good with the hybrid, but it does not generate artifacts or Chinese characters in place of numbers, so it would still be usable.