Support AWQ format models #1350

Merged
merged 2 commits into from
May 19, 2025

Conversation

guoqingbao
Contributor

@guoqingbao guoqingbao commented May 19, 2025

This PR adds support for AWQ quantization.

Usage:

Step 1: Convert the AWQ model to a Marlin-compatible format (only qzeros need to be converted)

 python3 scripts/convert_awq_marlin.py --src /home/Meta-Llama-3.1-8B-Instruct-AWQ-INT4/ --dst /home/Meta-Llama-3.1-8B-Instruct-AWQ-INT4-Marlin/ --bits 4

Step 2: Run the converted model

./mistralrs-server -i plain -m /home/Meta-Llama-3.1-8B-Instruct-AWQ-INT4-Marlin/

Summary by CodeRabbit

  • New Features

    • Added support for AWQ quantization, including zero-point handling and activation order, in 4-bit and 8-bit Marlin CUDA kernels.
    • Introduced a script to convert AWQ quantized models to a Marlin-compatible format.
    • Expanded documentation with instructions for running AWQ models.
  • Bug Fixes

    • Improved naming consistency for quantization parameters and configuration variants.
  • Refactor

    • Unified internal structures and APIs to support both GPTQ and AWQ quantization schemes.
  • Documentation

    • Updated quantization support information and usage instructions for AWQ models.


coderabbitai bot commented May 19, 2025

Walkthrough

This update introduces support for AWQ quantization alongside GPTQ in the Marlin kernel and the Rust quantization backend. It adds a Python script for converting AWQ models to a Marlin-compatible format, updates kernel logic and FFI interfaces for AWQ, and refactors Rust code to handle a unified GPTQ/AWQ variant with appropriate field and parameter changes throughout.

Changes

| File(s) | Change Summary |
|---|---|
| README.md | Updated instructions for running AWQ models, added conversion steps and clarified quantization support. |
| scripts/convert_awq_marlin.py | New script to convert AWQ quantized models to Marlin-compatible format, including tensor permutation and packing logic. |
| mistralrs-quant/kernels/marlin/marlin/marlin.cuh | Added constants, enum for quantization types, and refactored utility functions for expanded quantization support. |
| mistralrs-quant/kernels/marlin/marlin_kernel.cu | Extended Marlin CUDA kernel for AWQ support, added zero-point handling, new kernel launch logic, and AWQ repacking kernel. |
| mistralrs-quant/src/gptq/marlin_ffi.rs | Expanded and renamed FFI functions to support AWQ and GPTQ, added zero-point parameters, and unified stream handling. |
| mistralrs-quant/src/gptq/marlin_backend.rs | Refactored backend to support both AWQ and GPTQ via unified structs, added logic for AWQ-specific kernel and repack calls. |
| mistralrs-quant/src/gptq/gptq_cuda.rs | Refactored for AWQ support, renamed fields, added is_awq flag, and updated kernel/repack calls and tensor logic. |
| mistralrs-quant/src/gptq/gptq_cpu.rs | Updated to use unified GPTQ/AWQ variant, added is_awq handling for tensor shapes and optional fields. |
| mistralrs-quant/src/lib.rs | Renamed and extended quantization config enums to support both GPTQ and AWQ with new is_awq boolean. |
| mistralrs-quant/src/distributed/layers.rs | Updated pattern matching and error messages to handle unified GPTQ/AWQ variant. |
| mistralrs-quant/src/gptq/ffi.rs | Renamed function parameters for zero-points and scales to generic names. |
| mistralrs-quant/src/afq/mod.rs, mistralrs-quant/src/bitsandbytes/mod.rs, mistralrs-quant/src/blockwise_fp8/mod.rs, mistralrs-quant/src/fp8/mod.rs, mistralrs-quant/src/gguf/mod.rs, mistralrs-quant/src/hqq/mod.rs, mistralrs-quant/src/unquantized/mod.rs | Updated pattern matching to use unified GPTQ/AWQ variant in unreachable arms; no logic changes. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant AWQ_Model
    participant convert_awq_marlin.py
    participant Marlin_Server

    User->>convert_awq_marlin.py: Run conversion (src, dst, bits)
    convert_awq_marlin.py->>AWQ_Model: Read safetensors, permute/pack zero-points
    convert_awq_marlin.py->>User: Write Marlin-compatible files

    User->>Marlin_Server: Launch with converted model
    Marlin_Server->>Marlin_Server: Detect AWQ quantization
    Marlin_Server->>Marlin_Server: Use AWQ kernel logic (zero-points, scales)
sequenceDiagram
    participant RustBackend
    participant MarlinKernel
    participant CUDA

    RustBackend->>MarlinKernel: Call marlin_matmul(..., is_awq)
    MarlinKernel->>CUDA: Launch AWQ or GPTQ kernel (based on is_awq)
    CUDA-->>MarlinKernel: Matrix multiplication result
    MarlinKernel-->>RustBackend: Return output tensor

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 4338328 and bbdbd68.

📒 Files selected for processing (4)
  • mistralrs-core/src/speech_models/dia/mod.rs (1 hunks)
  • mistralrs-quant/src/gptq/gptq_cpu.rs (3 hunks)
  • mistralrs-quant/src/gptq/gptq_cuda.rs (17 hunks)
  • mistralrs-quant/src/lib.rs (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • mistralrs-quant/src/gptq/gptq_cpu.rs
  • mistralrs-quant/src/lib.rs
  • mistralrs-quant/src/gptq/gptq_cuda.rs
⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: Clippy
  • GitHub Check: Docs
  • GitHub Check: Test Suite (windows-latest, stable)
  • GitHub Check: Test Suite (ubuntu-latest, stable)
  • GitHub Check: Test Suite (macOS-latest, stable)
  • GitHub Check: Check (windows-latest, stable)
🔇 Additional comments (1)
mistralrs-core/src/speech_models/dia/mod.rs (1)

461-465: Great refactoring using match on comparison ordering!

This change improves the code by replacing an if-else chain with a more idiomatic Rust pattern matching on the ordering comparison. The match statement explicitly handles all three possible cases (Equal, Greater, Less), making the code more readable and maintainable. Even though the "Less" case doesn't require any action, explicitly accounting for it ensures that future developers understand all cases are considered.


Code Metrics Report
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                3           62           53            0            9
 Dockerfile              1           41           22           10            9
 JSON                   12          107          106            0            1
 Makefile                1            6            5            0            1
 Python                 84         3713         3163          140          410
 Shell                   1           63           26           18           19
 Plain Text              3         3723            0         2413         1310
 TOML                   19          557          512            6           39
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       3            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          205          178            1           26
 (Total)                            282          210           32           40
-------------------------------------------------------------------------------
 Markdown               55         5002            0         3812         1190
 |- BASH                 8          104          101            0            3
 |- JSON                 1           12           12            0            0
 |- Python               7          121          109            0           12
 |- Rust                22          757          634            1          122
 |- TOML                 2           75           63            0           12
 (Total)                           6071          919         3813         1339
-------------------------------------------------------------------------------
 Rust                  378       126695       113093         2588        11014
 |- Markdown           171         2145           29         1913          203
 (Total)                         128840       113122         4501        11217
===============================================================================
 Total                 562       139990       116999         8989        14002
===============================================================================


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🔭 Outside diff range comments (1)
mistralrs-quant/src/lib.rs (1)

201-208: 🛠️ Refactor suggestion

Return "awq" in name() when is_awq is true

QuantizedConfig::name() still hard-codes "gptq" for the unified GptqAwq variant.
Down-stream code, log messages and user-visible metadata will therefore mis-report AWQ models as GPTQ, which is confusing and can break feature flags or analytics that rely on the string value.

-            Self::GptqAwq { .. } => "gptq",
+            Self::GptqAwq { is_awq, .. } => {
+                if *is_awq { "awq" } else { "gptq" }
+            }
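
A minimal sketch of the suggested behaviour, assuming a heavily simplified stand-in for the real config enum. `QuantizedConfigSketch`, its `Other` variant, and the field set are hypothetical; only the `GptqAwq` variant, the `is_awq` flag, and the `"gptq"` / `"awq"` strings come from the comment above.

```rust
/// Hypothetical, simplified stand-in for the real quantization config enum.
#[allow(dead_code)]
#[derive(Debug)]
enum QuantizedConfigSketch {
    GptqAwq { bits: usize, group_size: usize, is_awq: bool },
    Other,
}

impl QuantizedConfigSketch {
    /// Report the scheme name, distinguishing AWQ from GPTQ inside the unified variant.
    fn name(&self) -> &'static str {
        match self {
            Self::GptqAwq { is_awq: true, .. } => "awq",
            Self::GptqAwq { is_awq: false, .. } => "gptq",
            Self::Other => "other",
        }
    }
}
```

Matching on the flag directly in the pattern keeps every arm a plain string literal, which is one way to avoid the nested `if` from the diff above.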
🧹 Nitpick comments (7)
README.md (1)

66-74: Add language specifiers to fenced code blocks

The instructions for AWQ format models are clear and detailed, but the code blocks should have language specifiers for proper syntax highlighting.

- ```
+ ```bash
  python3 scripts/convert_awq_marlin.py --src /home/Meta-Llama-3.1-8B-Instruct-AWQ-INT4/ --dst /home/Meta-Llama-3.1-8B-Instruct-AWQ-INT4-Marlin/ --bits 4
  ./mistralrs-server -i plain -m /home/Meta-Llama-3.1-8B-Instruct-AWQ-INT4-Marlin/
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

68-68: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

72-72: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

mistralrs-quant/src/gptq/gptq_cpu.rs (1)

96-96: Improve boolean expression readability with parentheses

The boolean condition is functionally correct but could benefit from explicit parentheses for better readability.

-        && (is_awq || !is_awq && vb.contains_tensor("g_idx"))
+        && (is_awq || (!is_awq && vb.contains_tensor("g_idx")))
🧰 Tools
🪛 GitHub Check: Check (macOS-latest, stable)

[failure] 96-96:
mismatched types

mistralrs-quant/src/lib.rs (1)

148-162: Simplify boolean assignment & guard against unknown strings

  1. is_awq can be set more idiomatically with the comparison result (m == "awq").
  2. Consider validating bits and group_size (e.g. non-zero, power-of-two) while you are in this branch; malformed configs will not surface until much later.
-                    is_awq: if m == "awq" { true } else { false },
+                    is_awq: m == "awq",
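
The two points above could be combined roughly as follows. This is only a sketch: `parse_quant_method`, `QuantMethodSketch`, and the error strings are hypothetical, and the accepted bit widths (4 and 8) are an assumption based on the 4-bit and 8-bit Marlin kernels mentioned in the PR summary.

```rust
/// Hypothetical parsed result; the real config carries more fields.
#[derive(Debug, PartialEq)]
struct QuantMethodSketch {
    bits: usize,
    group_size: usize,
    is_awq: bool,
}

/// Validate the raw config values early so malformed configs fail at load time.
fn parse_quant_method(
    method: &str,
    bits: usize,
    group_size: usize,
) -> Result<QuantMethodSketch, String> {
    // Reject unknown strings instead of silently treating them as GPTQ.
    let is_awq = match method {
        "awq" => true,
        "gptq" => false,
        other => return Err(format!("unknown quant method `{other}`")),
    };
    // Assumption: only 4-bit and 8-bit weights are supported.
    if bits != 4 && bits != 8 {
        return Err(format!("unsupported bit width {bits}"));
    }
    // `is_power_of_two` returns false for zero, so this also rejects 0.
    if !group_size.is_power_of_two() {
        return Err(format!("group_size {group_size} must be a non-zero power of two"));
    }
    Ok(QuantMethodSketch { bits, group_size, is_awq })
}
```

Returning a `Result` here means a bad `bits` or `group_size` surfaces when the config is parsed rather than inside a kernel launch.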
scripts/convert_awq_marlin.py (3)

121-154: Guard against overwriting existing destination files

assert args.dst != "" and not os.path.exists(args.dst) prevents pointing --dst at a pre-existing (but empty) directory.
If the folder exists yet is empty, the script should allow reuse to integrate with CI pipelines.

Consider:

if os.path.exists(args.dst) and os.listdir(args.dst):
    raise ValueError("--dst must be non-existent or empty.")

139-153: Inefficient re-allocation of tgt_dict in loop

transform_file() re-initialises tgt_dict for every shard but the variable from the outer scope is unused afterwards. Dropping the outer declaration removes dead code.

-    tgt_dict = {}
 ...
         tgt_dict = {}

155-163: Unused helper load_json

load_json() is defined but never invoked. Delete or wire it up to reduce maintenance burden.

mistralrs-quant/src/gptq/gptq_cuda.rs (1)

426-438: Unconditional perm / g_idx expectations mismatch with is_awq flag

For the AWQ path you deliberately skip loading g_idx and perm.
However, later in marlin_weight_repack(&qweight, &perm, …) the perm option is forwarded even for AWQ, and inside MarlinRepack it is passed straight to the GPU kernel – which expects it to be non-null only when has_perm == true.

Because perm is None, the host code emits a null pointer while the awq_marlin_repack_kernel is compiled with HAS_PERM=false; that’s safe, but still wastes a nullable parameter and adds cognitive overhead. Consider:

let perm = if is_awq { None } else { Some(perm_tensor) };
…
marlin_weight_repack(&qweight, perm.as_ref(), …)

and remove the perm argument from awq_marlin_repack entirely to avoid confusion.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between c116ce4 and 4338328.

📒 Files selected for processing (18)
  • README.md (2 hunks)
  • mistralrs-quant/kernels/marlin/marlin/marlin.cuh (2 hunks)
  • mistralrs-quant/kernels/marlin/marlin_kernel.cu (27 hunks)
  • mistralrs-quant/src/afq/mod.rs (1 hunks)
  • mistralrs-quant/src/bitsandbytes/mod.rs (1 hunks)
  • mistralrs-quant/src/blockwise_fp8/mod.rs (1 hunks)
  • mistralrs-quant/src/distributed/layers.rs (4 hunks)
  • mistralrs-quant/src/fp8/mod.rs (1 hunks)
  • mistralrs-quant/src/gguf/mod.rs (1 hunks)
  • mistralrs-quant/src/gptq/ffi.rs (4 hunks)
  • mistralrs-quant/src/gptq/gptq_cpu.rs (4 hunks)
  • mistralrs-quant/src/gptq/gptq_cuda.rs (17 hunks)
  • mistralrs-quant/src/gptq/marlin_backend.rs (9 hunks)
  • mistralrs-quant/src/gptq/marlin_ffi.rs (2 hunks)
  • mistralrs-quant/src/hqq/mod.rs (1 hunks)
  • mistralrs-quant/src/lib.rs (7 hunks)
  • mistralrs-quant/src/unquantized/mod.rs (1 hunks)
  • scripts/convert_awq_marlin.py (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
README.md

68-68: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


72-72: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.11.9)
scripts/convert_awq_marlin.py

12-12: Undefined name List

(F821)


15-15: Undefined name List

(F821)

🪛 GitHub Check: Check (macOS-latest, stable)
mistralrs-quant/src/gptq/gptq_cpu.rs

[failure] 119-119:
mismatched types


[failure] 96-96:
mismatched types


[failure] 146-146:
mismatched types

⏰ Context from checks skipped due to timeout of 90000ms (7)
  • GitHub Check: Clippy
  • GitHub Check: Docs
  • GitHub Check: Test Suite (macOS-latest, stable)
  • GitHub Check: Test Suite (ubuntu-latest, stable)
  • GitHub Check: Check (ubuntu-latest, stable)
  • GitHub Check: Check (windows-latest, stable)
  • GitHub Check: Test Suite (windows-latest, stable)
🔇 Additional comments (30)
mistralrs-quant/src/bitsandbytes/mod.rs (1)

208-210: Changes look correct for supporting AWQ format

This match arm update is part of the necessary refactoring to support both GPTQ and AWQ quantization. The renamed GptqAwq variant in the QuantMethodConfig enum enables unified handling of both quantization schemes.

mistralrs-quant/src/blockwise_fp8/mod.rs (1)

34-36: Match pattern updated correctly for AWQ support

This pattern match update correctly handles the renamed GptqAwq variant, maintaining consistency with the other quantization modules.

mistralrs-quant/src/gguf/mod.rs (1)

37-38: Match arm properly updated for AWQ compatibility

The pattern match update correctly handles the renamed GptqAwq variant, ensuring consistency across the codebase's quantization handling.

mistralrs-quant/src/hqq/mod.rs (1)

527-529: Pattern matching correctly updated for AWQ support

This change properly updates the match pattern to use the new GptqAwq variant, maintaining consistency with other quantization modules.

mistralrs-quant/src/fp8/mod.rs (1)

40-40: Pattern matching correctly updated for GptqAwq variant

The match pattern has been correctly updated to use the new GptqAwq variant that unifies GPTQ and AWQ quantization support.

mistralrs-quant/src/afq/mod.rs (1)

97-97: Pattern matching correctly updated for GptqAwq variant

The match pattern has been properly updated to reference the new unified GptqAwq variant instead of the previous Gptq variant.

mistralrs-quant/src/unquantized/mod.rs (1)

34-34: Pattern matching correctly updated for GptqAwq variant

The unreachable match arm pattern has been properly updated to use the new unified GptqAwq variant, maintaining consistency with the other modules.

README.md (1)

164-164: AWQ quantization method documentation

The AWQ quantization support is correctly documented with reference to the conversion script.

mistralrs-quant/src/distributed/layers.rs (6)

62-63: LGTM: GPTQ to GptqAwq rename for AWQ support

The enum variant rename from Gptq to GptqAwq correctly supports the new AWQ format alongside GPTQ in the tensor parallelism check.


74-76: LGTM: Pattern matching updated for combined GPTQ/AWQ support

The match arm correctly handles the renamed GptqAwq variant, maintaining existing functionality while supporting the new AWQ format.


272-273: LGTM: Consistent enum variant rename

This change maintains consistency with the other instances of the renamed enum variant.


278-278: LGTM: Updated error message to include AWQ

Error message correctly updated to reflect that both GPTQ and AWQ formats don't support tensor parallelism.


284-285: LGTM: Consistent pattern matching update

This change maintains consistency with earlier changes to use the renamed GptqAwq variant.


483-484: LGTM: Final GptqAwq pattern match updated

The ReplicatedLayer implementation also correctly uses the renamed enum variant.

mistralrs-quant/src/gptq/ffi.rs (4)

7-8: LGTM: Parameter names generalized for AWQ support

Parameter names have been appropriately renamed from GPTQ-specific (b_gptq_qzeros, b_gptq_scales) to more general ones (b_qzeros, b_scales) to support both GPTQ and AWQ quantization formats.


19-20: LGTM: Consistent parameter renaming

Parameter names consistently updated for the reconstruct_gptq function to align with the generalized naming convention.


32-33: LGTM: Consistent parameter renaming

Parameter names consistently updated for the gemm_half_q_half_cuda_part function.


47-48: LGTM: Consistent parameter renaming

Parameter names consistently updated for the gemm_half_q_half_alt function, maintaining consistent naming across all FFI functions.

mistralrs-quant/kernels/marlin/marlin/marlin.cuh (5)

21-21: LGTM: Added default_threads constant

Added constant defines the default number of threads to use for kernel execution.


26-26: LGTM: Repositioned max_thread_n constant

Constant repositioning maintains better organization of related constants.


31-31: LGTM: Added pipe_stages constant

New constant defines the number of pipeline stages for kernel execution.


33-33: LGTM: Replaced ceildiv with div_ceil

The function has been replaced with a constexpr implementation that can be used in both host and device code, with identical functionality.


118-125: LGTM: Added ScalarTypeID enum for format identification

This enum clearly differentiates between GPTQ (kU4B8, kU8B128) and AWQ (kU4, kU8) quantization formats, supporting the new AWQ functionality.
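
To illustrate how such an enum can drive format selection, here is a Rust-side sketch. The variant names mirror the ones listed above; the `ScalarTypeId` mirror type, the function `scalar_type_for`, and the mapping from `(is_awq, bits)` to a variant are assumptions made for illustration, not the actual kernel dispatch code.

```rust
/// Rust mirror of the variant names mentioned in the review (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ScalarTypeId {
    /// AWQ, 4-bit weights with explicit zero-points.
    U4,
    /// AWQ, 8-bit weights with explicit zero-points.
    U8,
    /// GPTQ, 4-bit weights (the `B8` suffix presumably denotes a fixed bias of 8).
    U4B8,
    /// GPTQ, 8-bit weights (the `B128` suffix presumably denotes a fixed bias of 128).
    U8B128,
}

/// Pick the scalar type from the quantization scheme and bit width.
/// Returns `None` for bit widths this sketch does not cover.
fn scalar_type_for(is_awq: bool, bits: u32) -> Option<ScalarTypeId> {
    match (is_awq, bits) {
        (true, 4) => Some(ScalarTypeId::U4),
        (true, 8) => Some(ScalarTypeId::U8),
        (false, 4) => Some(ScalarTypeId::U4B8),
        (false, 8) => Some(ScalarTypeId::U8B128),
        _ => None,
    }
}
```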

mistralrs-quant/src/gptq/gptq_cpu.rs (5)

17-19: LGTM: Updated error message for GptqAwq

Error message updated to match the renamed variant, while maintaining the existing functionality that GPTQ/AWQ methods are only supported on CUDA.


83-88: LGTM: Updated GptqAwq pattern matching with is_awq field

The pattern match correctly extracts the new is_awq flag from the configuration, which will be used to differentiate between GPTQ and AWQ formats.


103-109: LGTM: Added shape handling for GPTQ vs AWQ formats

The code correctly handles the different tensor shapes for GPTQ and AWQ formats:

  • GPTQ: quantized along rows (k/pack_factor, n)
  • AWQ: quantized along columns (k, n/pack_factor)
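
The two layouts above can be sketched as a small shape helper. It assumes weights are packed into 32-bit words, so that `pack_factor = 32 / bits`; the function name `packed_qweight_shape` is hypothetical, and `bits` is assumed to be a non-zero divisor of 32.

```rust
/// Shape of the packed `qweight` tensor for a logical (k, n) weight matrix.
/// Assumption: values are packed into 32-bit words, so pack_factor = 32 / bits.
fn packed_qweight_shape(k: usize, n: usize, bits: usize, is_awq: bool) -> (usize, usize) {
    let pack_factor = 32 / bits;
    if is_awq {
        // AWQ packs along columns: (k, n / pack_factor).
        (k, n / pack_factor)
    } else {
        // GPTQ packs along rows: (k / pack_factor, n).
        (k / pack_factor, n)
    }
}
```

For a 4096 x 4096 layer at 4 bits, `pack_factor` is 8, so the packed shape is (4096, 512) for AWQ and (512, 4096) for GPTQ.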

119-123: LGTM: Conditional g_idx loading based on format

The code correctly handles the g_idx tensor, which is only needed for GPTQ (not AWQ):

  • For AWQ: g_idx is set to None
  • For GPTQ: g_idx is loaded from the tensor
🧰 Tools
🪛 GitHub Check: Check (macOS-latest, stable)

[failure] 119-119:
mismatched types


136-147: LGTM: Updated QuantMethodConfig construction with renamed fields

The configuration is correctly constructed with:

  1. Renamed fields (qzeros instead of gptq_qzeros, scales instead of gptq_scales)
  2. The new is_awq flag to differentiate between formats
🧰 Tools
🪛 GitHub Check: Check (macOS-latest, stable)

[failure] 146-146:
mismatched types

mistralrs-quant/src/lib.rs (2)

210-222: Expose correct bit-width description for AWQ

get_bits_name() prints “<n> bits” for both GPTQ and AWQ. If, in future, AWQ supports mixed-precision (e.g. W4A8) the simple pluralisation will be ambiguous. You may want to delegate the stringification to the kernel/backend (or add an act_bits field).


706-709: Ensure gptq_linear handles is_awq correctly

Both linear_no_bias and linear route AWQ through gptq_linear(...). Double-check that gptq_linear branches on the new is_awq flag for:

  • kernel selection (marlin_awq_* vs marlin_gptq_*)
  • zero-point layout (qzeros vs None)
  • workspace sizing

If that logic is still pending you may hit wrong-result bugs at runtime.

Comment on lines +36 to 62
pub(crate) fn marlin_awq_4bit_f16(
    inputs: *const c_void,
    weight: *const i32,
    scales: *const c_void,
    zeros: *const c_void,
    out: *const c_void,
    m: i32,
    k: i32,
    n: i32,
    workspace: *const c_void,
    groupsize: i32,
    stream: i64,
);

pub(crate) fn marlin_awq_4bit_bf16(
    inputs: *const c_void,
    weight: *const i32,
    scales: *const c_void,
    zeros: *const c_void,
    out: *const c_void,
    m: i32,
    k: i32,
    n: i32,
    workspace: *const c_void,
    groupsize: i32,
    stream: i64,
);

⚠️ Potential issue

Mark output / workspace pointers as *mut c_void

The kernels write into out, workspace, and result. Declaring them *const c_void misrepresents the mutability contract and triggers unsafe aliasing warnings downstream.

-        out: *const c_void,
+        out: *mut c_void,
 ...
-        workspace: *const c_void,
+        workspace: *mut c_void,

(The same applies to gptq_marlin_repack / awq_marlin_repack’s result.)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 pub(crate) fn marlin_awq_4bit_f16(
     inputs: *const c_void,
     weight: *const i32,
     scales: *const c_void,
     zeros: *const c_void,
-    out: *const c_void,
+    out: *mut c_void,
     m: i32,
     k: i32,
     n: i32,
-    workspace: *const c_void,
+    workspace: *mut c_void,
     groupsize: i32,
     stream: i64,
 );
 pub(crate) fn marlin_awq_4bit_bf16(
     inputs: *const c_void,
     weight: *const i32,
     scales: *const c_void,
     zeros: *const c_void,
-    out: *const c_void,
+    out: *mut c_void,
     m: i32,
     k: i32,
     n: i32,
-    workspace: *const c_void,
+    workspace: *mut c_void,
     groupsize: i32,
     stream: i64,
 );
🤖 Prompt for AI Agents
In mistralrs-quant/src/gptq/marlin_ffi.rs between lines 36 and 62, the pointers
for output and workspace parameters are declared as *const c_void, which
incorrectly indicates they are immutable. Since the kernels write to out,
workspace, and result, change these pointer types to *mut c_void to correctly
represent their mutability and avoid unsafe aliasing warnings. Apply the same
change to the result pointer in gptq_marlin_repack and awq_marlin_repack
functions.

Comment on lines +8 to 20
 pub(crate) fn marlin_gptq_4bit_f16(
     inputs: *const c_void,
     weight: *const i32,
     scales: *const c_void,
     zeros: *const c_void,
     out: *const c_void,
     m: i32,
     k: i32,
     n: i32,
     workspace: *const c_void, //tensor with at least `n / 128 * max_par` entries that are all zero
     groupsize: i32,
-    stream: CUstream,
+    stream: i64,
 );

🛠️ Refactor suggestion

Use usize (or *mut c_void) instead of i64 for stream to avoid UB on non-64-bit targets

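CUstream and the comparable HIP / Metal handles are opaque pointers, so they are pointer-sized (usize).
An i64 only matches that width on 64-bit targets; on 32-bit platforms the declaration disagrees with a pointer-sized argument, and on any target the pointer-to-integer round trip falls outside Rust's strict provenance model.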

-        stream: i64,
+        stream: usize,

Apply the same change to every FFI declaration in this file.

🤖 Prompt for AI Agents
In mistralrs-quant/src/gptq/marlin_ffi.rs around lines 8 to 20, the `stream`
parameter is declared as `i64`, which can cause undefined behavior on non-64-bit
platforms because `CUstream` and similar objects are opaque pointers sized as
`usize`. Change the type of `stream` from `i64` to `usize` or `*mut c_void` to
correctly represent the pointer size and avoid truncation. Apply this type
change consistently to all FFI function declarations in this file that use
`stream`.

Comment on lines +4 to +10
import torch
import numpy
from safetensors.torch import load_file, save_file
import argparse
import os
import shutil


🛠️ Refactor suggestion

Import List for type hints

List is referenced in get_scale_perms() but not imported, causing NameError at runtime and Ruff F821.

-import torch
-import numpy
+from typing import List
+
+import torch
+import numpy
📝 Committable suggestion


Suggested change
+from typing import List
 import torch
 import numpy
 from safetensors.torch import load_file, save_file
 import argparse
 import os
 import shutil
🤖 Prompt for AI Agents
In scripts/convert_awq_marlin.py around lines 4 to 10, the List type hint is
used in the get_scale_perms() function but List is not imported, causing a
NameError at runtime. Fix this by importing List from the typing module at the
top of the file along with the other imports.

Comment on lines +75 to +86
let qzeros_ptr = if self.qzeros.is_some() {
let (qzeros, qzeros_l) = self.qzeros.as_ref().unwrap().storage_and_layout();
let qzeros = match &*qzeros {
Storage::Cuda(p) => p,
_ => candle::bail!("qzeros must be a cuda tensor"),
};
let qzeros_ = qzeros.as_cuda_slice::<i32>()?;
let qzeros_ = qzeros_.slice(qzeros_l.start_offset()..);
*qzeros_.device_ptr() as *const c_void
} else {
std::ptr::null() as *const c_void
};
⚠️ Potential issue

AWQ path may dereference a null qzeros pointer

When self.qzeros is None, qzeros_ptr is set to nullptr.
For GPTQ kernels that is fine (they ignore the argument), but the AWQ kernels invoked below (marlin_awq_*) expect valid zero-point data and unconditionally load from the pointer. Passing nullptr will lead to undefined behaviour or a bus error on the first memory read.

-        } else {
-            std::ptr::null() as *const c_void
+        } else {
+            candle::bail!("AWQ kernels require non-null qzeros tensor")
         };

Alternatively, guard earlier (e.g. in marlin_matmul) and refuse to run AWQ when qzeros.is_none().
Please cross-check the call-sites – currently gptq_linear constructs an AWQ/Marlin layer with qzeros = None.
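
One way to express the guard described above is sketched below. It is standalone: `resolve_qzeros_ptr` is a hypothetical helper, and a plain `&[i32]` slice stands in for the CUDA device buffer, so this shows only the decision logic and not the candle storage handling.

```rust
use std::ffi::c_void;

/// Resolve the zero-point pointer to hand to a kernel.
/// `is_awq` and `qzeros` mirror the fields discussed above; the slice is a
/// hypothetical stand-in for a device buffer.
fn resolve_qzeros_ptr(
    is_awq: bool,
    qzeros: Option<&[i32]>,
) -> Result<*const c_void, String> {
    match (is_awq, qzeros) {
        // Zero-points are present: pass their address through.
        (_, Some(z)) => Ok(z.as_ptr() as *const c_void),
        // Per the comment above, GPTQ kernels ignore the argument, so null is acceptable.
        (false, None) => Ok(std::ptr::null()),
        // AWQ without zero-points: fail before reaching the kernel.
        (true, None) => Err("AWQ kernels require a non-null qzeros tensor".to_string()),
    }
}

fn main() {
    let zeros = [0i32; 4];
    println!("{}", resolve_qzeros_ptr(true, Some(&zeros[..])).is_ok());
    println!("{:?}", resolve_qzeros_ptr(true, None));
}
```

Returning an error only for the AWQ case keeps the GPTQ path unchanged, which is the behaviour the comment asks to preserve.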

📝 Committable suggestion

Suggested change
 let qzeros_ptr = if self.qzeros.is_some() {
     let (qzeros, qzeros_l) = self.qzeros.as_ref().unwrap().storage_and_layout();
     let qzeros = match &*qzeros {
         Storage::Cuda(p) => p,
         _ => candle::bail!("qzeros must be a cuda tensor"),
     };
     let qzeros_ = qzeros.as_cuda_slice::<i32>()?;
     let qzeros_ = qzeros_.slice(qzeros_l.start_offset()..);
     *qzeros_.device_ptr() as *const c_void
 } else {
-    std::ptr::null() as *const c_void
+    candle::bail!("AWQ kernels require non-null qzeros tensor")
 };
🤖 Prompt for AI Agents
In mistralrs-quant/src/gptq/marlin_backend.rs around lines 75 to 86, the code
sets qzeros_ptr to nullptr when self.qzeros is None, which causes undefined
behavior in AWQ kernels that expect valid zero-point data. To fix this, add a
check before invoking AWQ kernels to ensure self.qzeros is Some and refuse to
run AWQ if it is None, or alternatively, modify marlin_matmul or the call-sites
like gptq_linear to prevent constructing AWQ/Marlin layers with qzeros set to
None. This prevents passing a null pointer to AWQ kernels and avoids potential
crashes.

Comment on lines +294 to 313
 let out = match (self.g_idx.as_ref(), self.qzeros.as_ref(), self.is_marlin) {
     (Some(g_idx), Some(qzeros), false) => self
         .gptq_gemm(
             reshaped_a,
             g_idx,
-            gptq_qzeros,
-            gptq_qzeros.dim(0)? as i32,
+            qzeros,
+            qzeros.dim(0)? as i32,
             self.use_exllama,
         )?
         .reshape(out_shape)?,
-    (_, _, true) => gptq_marlin_matmul(
+    (_, _, true) => marlin_matmul(
         a,
         &self.q_weight,
-        &self.gptq_scales,
+        &self.scales,
+        &self.qzeros,
         self.workspace.as_ref().context("Workspace required")?,
         self.bits,
+        self.is_awq,
     )?,
     _ => unreachable!(),
⚠️ Potential issue

qzeros lost for AWQ in Marlin format – will crash at runtime

gptq_linear builds the config for already-Marlin checkpoints with:

QuantMethodConfig::GptqAwq { qzeros: None, is_marlin: true, is_awq }

and forward() forwards that to marlin_matmul(...).

The Marlin AWQ kernels, however, require a non-null zero-point tensor (see awq_marlin_* signatures and kernel code). The combination is_awq == true && qzeros.is_none() therefore produces a null pointer that the CUDA kernel will dereference, resulting in an illegal-access crash.

Please ensure that:

  1. Converted AWQ-Marlin checkpoints ship their zero-points tensor, or
  2. gptq_linear loads and passes it (qzeros: Some(tensor)), or
  3. The kernel is updated to treat nullptr as a broadcast-zero value (currently it is not).

Failing to fix this will make all AWQ/Marlin models unusable.
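
For option 1, a converted checkpoint could be sanity-checked before loading. The sketch below assumes tensors follow the `<layer>.qweight` / `<layer>.qzeros` naming common in AWQ and GPTQ checkpoints. That naming and the helper `layers_missing_qzeros` are assumptions made for illustration, not something this PR defines.

```rust
use std::collections::BTreeSet;

/// Return the layer prefixes that have a `.qweight` tensor but no `.qzeros`.
/// The suffix names are an assumption based on common AWQ/GPTQ checkpoints.
fn layers_missing_qzeros(tensor_names: &[&str]) -> Vec<String> {
    let names: BTreeSet<&str> = tensor_names.iter().copied().collect();
    let mut missing = Vec::new();
    for name in &names {
        if let Some(prefix) = name.strip_suffix(".qweight") {
            let zeros = format!("{prefix}.qzeros");
            if !names.contains(zeros.as_str()) {
                missing.push(prefix.to_string());
            }
        }
    }
    missing
}

fn main() {
    // Hypothetical tensor names, as they might be listed from a safetensors file.
    let names = [
        "model.layers.0.q_proj.qweight",
        "model.layers.0.q_proj.scales",
    ];
    println!("{:?}", layers_missing_qzeros(&names));
}
```

A non-empty result would mean the AWQ path cannot be run safely for those layers.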

Comment on lines +1530 to 1576
 void marlin_matmul(const void* A, const void* B, void* scales, void* zeros, void* C, int prob_m, int prob_k,
                    int prob_n, void* workspace, int groupsize, int64_t stream_
 ) {

   int dev = 0;
   cudaStream_t stream = (cudaStream_t)stream_;
   int thread_k = -1;
   int thread_n = -1;
   int sms = -1;
   int max_par = 16;

   int tot_m = prob_m;
-  int tot_m_blocks = ceildiv(tot_m, 16);
+  int tot_m_blocks = div_ceil(tot_m, 16);
   int pad = 16 * tot_m_blocks - tot_m;

   bool has_act_order = false;
   bool is_k_full = true;
   if (sms == -1)
     cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, dev);

   int max_shared_mem = 0;
   cudaDeviceGetAttribute(&max_shared_mem,
                          cudaDevAttrMaxSharedMemoryPerBlockOptin, 0);
   CHECK(max_shared_mem > 0, "error");
   // Set thread config
-  thread_config_t th_config;
+  exec_config_t exec_cfg;
   if (thread_k != -1 && thread_n != -1) {
     // User-defined config
-    th_config = thread_config_t{thread_k, thread_n, USER_THREADS};
+    exec_cfg =
+        exec_config_t{4, thread_config_t{thread_k, thread_n, default_threads}};
   } else {
     // Auto config
-    th_config = determine_thread_config(prob_m, prob_n, prob_k);
+    exec_cfg =
+        determine_thread_config(prob_m, prob_n, prob_k, num_bits, groupsize,
+                                has_act_order, is_k_full, max_shared_mem);
   }

-  if (!is_valid_config(th_config, prob_m, prob_n, prob_k)) {
-    assert(false);
-  }
-
-  int num_threads = th_config.num_threads;
-  thread_k = th_config.thread_k;
-  thread_n = th_config.thread_n;
+  int num_threads = exec_cfg.tb_cfg.num_threads;
+  thread_k = exec_cfg.tb_cfg.thread_k;
+  thread_n = exec_cfg.tb_cfg.thread_n;

   int thread_k_blocks = thread_k / 16;
   int thread_n_blocks = thread_n / 16;
   int group_blocks = (groupsize == -1) ? -1 : groupsize / 16;
   int blocks = sms;
   int num_groups = prob_k / groupsize;

   if (prob_m == 0 || prob_n == 0 || prob_k == 0) {
⚠️ Potential issue

Potential division-by-zero / negative groups & hard-coded dev-id

  1. num_groups = prob_k / groupsize;
    When groupsize == -1 (column-wise scales) this divides by -1 and yields a negative group count (-prob_k), which later propagates to the kernel; groupsize == 0 would divide by zero. Guard for the sentinel value:
int num_groups = (groupsize == -1) ? 1 : prob_k / groupsize;
  2. int dev = 0; is hard-coded, but the stream passed in may belong to another device.
    Query the current device instead:
int dev;
cudaGetDevice(&dev);

or use cudaPointerGetAttributes on one of the input pointers.

Both issues can silently mis-configure execution on multi-GPU nodes or with column-wise quantisation.
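
The group-count guard can be sketched in isolation as follows. This is a host-side illustration written in Rust rather than CUDA. Returning an error when groupsize is zero, negative (other than -1), or does not divide prob_k is an extra assumption added here, not something the kernel code is known to do.

```rust
/// Number of scale groups along the K dimension.
/// Assumption: `groupsize == -1` means one group spanning all of K, and any
/// other value must be positive and divide `prob_k` evenly.
fn num_groups(prob_k: i32, groupsize: i32) -> Result<i32, String> {
    match groupsize {
        -1 => Ok(1),
        g if g > 0 && prob_k % g == 0 => Ok(prob_k / g),
        g => Err(format!("invalid groupsize {g} for prob_k {prob_k}")),
    }
}

fn main() {
    // The unguarded expression from the review produces a negative group count.
    println!("naive: {}", 4096 / -1);
    println!("guarded: {:?}", num_groups(4096, -1));
    println!("grouped: {:?}", num_groups(4096, 128));
}
```

Surfacing the invalid cases as errors on the host keeps a bad group count from ever reaching the kernel launch.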

🤖 Prompt for AI Agents
In mistralrs-quant/kernels/marlin/marlin_kernel.cu around lines 1530 to 1576,
fix the calculation of num_groups to avoid division by -1 by changing it to set
num_groups to 1 when groupsize is -1, and replace the hard-coded device ID (int
dev = 0) with a dynamic query of the current device using cudaGetDevice(&dev) to
ensure correct device context when using streams or multi-GPU setups.

Owner

@EricLBuehler left a comment

Thank you! Looks great.

@EricLBuehler merged commit ec43205 into EricLBuehler:master on May 19, 2025
13 checks passed
Jeadie added a commit to spiceai/mistral.rs that referenced this pull request Jul 14, 2025
* Fix handling of Metal fused attn head dims (EricLBuehler#1234)

* Fix handling of metal attn head dims

* Fix handling of gemma3 1b when images

* Tweak default for paged attn builder

* Support paged attn for vision model rust api (EricLBuehler#1235)

* [Breaking] Support setting HF cache path (EricLBuehler#1237)

* Add it internally

* Add the apis

* Support tool calling for DeepSeek models (EricLBuehler#1239)

* Support tool calling for deepseek models

* Format

* Fix deepseek

* Server image processing refactor and fixes (EricLBuehler#1244)

* Fix strict gemma3 case

* Accept multiple images in the content array

* Fix multiple images in one array ct

* Add it to the python api

* Typos

* Optimized CUDA RoPE kernels (EricLBuehler#1247)

* Add the kernels

* It works

* Works

* Builds

* Typo fix (add_speial_tokens to add_special_tokens) (EricLBuehler#1246)

* Fix typo

* Update mistralrs.pyi

* Fixes for UQFF + distributed layers (EricLBuehler#1250)

* Fixes for uqff + distributed layers

* Typo

* Automatic agentic search integration (`web_search_options`) (EricLBuehler#1243)

* Add the tool

* Actually search

* Clippy

* Sort of works

* Remove some debuggers

* tweak

* Add some rules

* Works great

* Tweak 'system' prompt

* Update mistralrs-core/src/search/mod.rs

Co-authored-by: Copilot <[email protected]>

* Typo

* Add it to all the apis

* Add bert model for similarity reranking

* Typos

* Early detection of tools

* Alias max_tokens -> max_completion_tokens too

* Customizable bert model

* Flip the enabler around

* Add docs

* Update readme

* Typo

---------

Co-authored-by: Copilot <[email protected]>

* Format kernels (EricLBuehler#1251)

* Update readme

* Update readme

* Remove test

* Add quantize guards for uqff deserialize (EricLBuehler#1252)

* Refactor cuBLASlt-related code (EricLBuehler#1253)

* Centralize cublaslt into mistralrs-quant

* Use cublaslt in unquant layer

* Use beautiful trait constants for simpler code

* Move tests

* Dispatch to unquant for cublaslt

* Dispatch to unquant for cublaslt

* Fix feature

* Add convert_to_gptq script

* Update deps, bump pyo3 version (EricLBuehler#1259)

* Faster cuda FP8 performance (EricLBuehler#1257)

* Avoid fp8 sync

* Fix dtype

* Rust 1.86 clippy (EricLBuehler#1260)

* Rust 1.86 clippy

* Clippy

* Refactor engine arch (EricLBuehler#1262)

* Refactor engine add_request

* Don't recompile regex

* Clippy

* Revamped LoRA support - removing the Ordering system! (EricLBuehler#1263)

* Play with varbuilder lifetimes

* Merge lora weights

* Clippy

* Lora works

* Support multiple loras

* Cleanup, remove adapter activation

* Complete merge

* Fast Metal-specific quantization method: AFQ (EricLBuehler#1264)

* Add mlx quantized kernels

* Add mlx quantized kernels

* Kernel launcher

* Add AFQ isq quant and dequant

* Some quantmethod things

* Begin to implement the qmm caller

* Clippy

* Much faster

* Cache kernels

* Docs

* Clippy

* Add it to uqff

* Support prequantized models from MLX (EricLBuehler#1265)

* Refactor quantizedconfig

* Support AFQ prequantized

* Update docs

* Update docs

* Automatic ISQ to select fastest & most accurate method (EricLBuehler#1266)

* Automatic isq

* typo

* Doc

* Improved usage metrics (EricLBuehler#1267)

* Fix cuda

* Bump tokio from 1.44.1 to 1.44.2 (EricLBuehler#1270)

Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.44.1 to 1.44.2.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](tokio-rs/tokio@tokio-1.44.1...tokio-1.44.2)

---
updated-dependencies:
- dependency-name: tokio
  dependency-version: 1.44.2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Gather MM ops in mistralrs-quant (EricLBuehler#1272)

* Update the caller

* Wire things up

* Broadcast for afq gathermm

* Broadcast for afq gathermm

* Clippy

* Improve performance of deepseek models

* Typo fix

* BincountOp not used

* Implement Llama 4! (EricLBuehler#1268)

* Implement Llama 4

* Implement the main changes for the text model

* Make chunked mask

* Wire things up

* Add some EP

* Initial sketch of inputs processor

* Runs

* Progress

* all reduce moes

* It works!

* Some cleanup

* Faster moe block

* Add device map

* Make chunked matrix

* Fully working now!

* Reactivate cublaslt

* Fix shared mlp cublaslt

* Refactor to packed experts

* Complete merge

* It is a normal model now

* Fixes

* Set device for moe

* ISQ fixes

* Much faster sort kernel

* Faster loading!

* Faster loading!

* Fp8 cpu copy ops in candle backend

* Add the vision model

* Add mmproj layer

* Actually merge the inputs

* Sketch most of the image processor

* Add the rest of the image processor

* Implement the whole processor

* Add the loader

* Some fixes

* A batch of fixes

* Some fixes

* tmp

* Actually support isq

* Ok it works a bit

* Fix norm device

* It works

* A bit cleaner

* Support residual tensors

* Remove text loader

* Implement the device mapping system

* Fix auto device map

* Add examples

* Add model card

* Typo

* Remove superfluous logging

* Fixes for Llama 4 UQFF loading (EricLBuehler#1275)

* Support sharding for UQFF (EricLBuehler#1276)

* Serialize sharded uqff files

* Loading

* Fix base64

* Fix bug for group-topk (group_limited_greedy) in deepseek models (EricLBuehler#1278)

* Support the DeepCoder model (EricLBuehler#1279)

* Add faq for metal not found

* Improved PagedAttn scheduling accuracy (EricLBuehler#1282)

* Scheduler ops by reference

* Ensure scheduler gets correct prompts

* Fix cuda build for copy_blocks

* Fixes for scheduling image seqs with pagedattn (EricLBuehler#1283)

* update to llguidance 0.7.16 (EricLBuehler#1284)

* update llguidance to 0.7.16 from crates.io; use ParserFactory

* add lark_llg.py example

* use new llguidance::Matcher APIs

* rework spec-decoding with llg

* more work on spec sampling

* check for parser stop

* fix clippy

* remove unneeded rollback

* update build_llg_factory to return Result

* Update dependencies (EricLBuehler#1286)

* Much faster image inputs processing (EricLBuehler#1289)

* Add more SDPA head dims for much faster SigLIP (EricLBuehler#1290)

* More sdpa head dims, faster vision models

* Move nonzero to above for faster metal synch

* Doc

* Update valid head dims

* Show throughput in interactive mode (EricLBuehler#1291)

* Update interactive mode throughput stats

* Accurate prompt t/s

* Accurate prompt t/s for usage

* Unify bitwise operations (EricLBuehler#1288)

* Unify bitwise ops

* Tests pass

* Fix cuda build

* Clippy

* Multimodal prefix caching support! (EricLBuehler#1298)

* Initial progress

* Support vision prefix caching

* Update docs

* Add multimodal data abstraction

* Interactive mode improvements (EricLBuehler#1299)

* More ergonomic image url parsing

* Add option to clear

* Add the Qwen 3 and Qwen 3 MoE models! (EricLBuehler#1285)

* Add qwen3 model

* Add enable_thinking

* Add initial qwen3 moe

* Add the moe model

* Format

* Fix order of norm

* Fix expert shapes

* Fix reverse

* Fix norm device for isq

* Fix nonzero when no nonzero

* Moe model runs

* Working qwen3 moe

* Add metal fp8 blockwise dequant

* Clean

* Typo

* Enable tool calling

* Streamlined ux

* Add some examples

* Add docs

* Fix dead link

* Remove interactive mode max_len

* Update QWEN3.md

* Hotfix for vision mode clear

* Revamped and streaming web search support (EricLBuehler#1301)

* Streaming web search

* Refactor a bit

* More refactoring

* Add some logging, parallelize some things

* Allow url

* Suppress warning, allow multi-turn searching

* Batch compute_similarities

* Cap content len

* Typos

* Doc

* Handle vision messages or different tool call prefixes (EricLBuehler#1302)

* Fix cuda

* Tune web search budget

* Simplify prefix cacher (EricLBuehler#1305)

* Use rustyline to handle non-ascii in interactive mode (EricLBuehler#1306)

io::stdin().read_line() cannot handle non-ASCII input, which caused a
crash when using backspace to delete non-ASCII characters.

Introduce rustyline to the interactive mode to solve the problem. It can
also bring more editing features in the future.

Close EricLBuehler#1140

* Add more tools for automatic search (EricLBuehler#1307)

* Add interactive mode history

* Add a website extraction tool

* Pass toks by reference

* Optimize prompt chunking

* Fix CPU hogging in interactive mode (EricLBuehler#1309)

The log enabler should be checked after the sleep rather than in a busy
loop.

Since interactive mode always disables the token speed logger, this loop
always consumed 100% of a CPU.

* Add Metal precompilation support  (EricLBuehler#1311)

* Add metal precompilation for paged attn

* Add for mistralrs-quant

* Better constructor

* Dont always build

* Fix name for paged attn rebuild

* Reduce thrashing of Metal autorelease (EricLBuehler#1313)

* Reduce calls to autorelease

* Optimize clone_in_cache

* Refactor float8

* make `AdapterPaths` and `LoraAdapterPaths` public (EricLBuehler#1314)

Make `AdapterPaths` and `LoraAdapterPaths` public so `LocalModelPaths`
can be constructed outside of `mistralrs-core`.

* Refactor KV cache manager (EricLBuehler#1315)

* Refactor kv cache

* Refactor caches

* Fix some overflows

* Add `Audio` and `Speech` model categories (EricLBuehler#1317)

* add `Audio` to `ModelCategory`

* add `Speech` to `ModelCategory`

* fix to go back to PartialEq having an exhaustiveness check

* Remove has_conv2d from vision model API (EricLBuehler#1318)

* Unified/automatic flash attention enabler (EricLBuehler#1319)

* Remove from sdpa params

* Fix errors

* No warnings

* Log

* Clippy

* Fix cublaslt 4d mask (EricLBuehler#1320)

* Fix cublaslt 4d mask

* Clippy

* Keep caches on gpu

* Qwen VL models fixes (EricLBuehler#1322)

* Add some defaults

* Fix

* Fix one thing

* 2.5 vl works

* Use caching again

* Fix v2

* Move index inside loop

* Offset in ropeidx

* Default support for vision prefix caching is false

* Fixes for all vision models (EricLBuehler#1323)

* Fix phi input processor?

* Fix phi input processor

* Handle no_prefix_cache from pipeline

* Phi models confirmed 👍

* Fixed for phi inputs processors

* Fixed for phi4

* Llama 3 confirmed 😀

* Mistral 3 confirmed 😃

* Idefics 2/3 fixes

* Some fixes

* Remove unsafety

* Improved+faster LRU prefix cacher (EricLBuehler#1321)

* Show TTFT

* Use LRU prefix cacher

* Faster prefix cacher

* Inplace ISQ support and default to mmap (EricLBuehler#1277)

* Initial impl of immediate isq

* Immediate isq -> !loading_isq

* Varbuilder utils always using mmap!

* Log

* Add for packed experts

* Afq without copy

* Clarify

* Clippy

* Apple immediate isq

* Better logic for loading_isq

* Support showing ttft

* Rename

* Shared quantize guard

* Parallel progress bar

* Parallel loading for progress bars

* Actual ISQ support

* Conditional parallelism for NiceProgressBar

* Use conditional iterator

* Warn once

* Predicate for applying immediate isq

* Allow parallel

* Remove debug print

* Remove debug print

* Remove debug print

* Fix typos (EricLBuehler#1329)

* Fix Idefics 3 arch chat templating (EricLBuehler#1330)

* Update inputs merger

* Fix

* Better warning

* Better warning

* Better warning

* Nonzero ahead of time

* No f32

* Clippy

* Optimize get_logprobs

* Fix packed experts

* Update masking

* Use Sdpa in idefics3

* QuantMethod in idefics3 vision

* Remove a .contiguous

* Remove two space from PR comment (EricLBuehler#1331)

* Add automatic vision loader type (EricLBuehler#1332)

* Add automatic vision loader

* Remove references to --arch

* Update examples

* Add the Dia 1.6b TTS model! (EricLBuehler#1304)

* Add loading

* Add rope, mlp, most of attn

* Add encoder + encoder layer, decoder layer forwards

* Add decoder forwards

* Add prepare_audio_prompt

* prepare_generation mostly done

* Add a proper dia kvcache

* Add most of decoder_step

* Add the sampler

* Add the generation loop

* Wire things up

* Add speech pipeline

* Fixes

* Loads

* Some fixes

* f32

* Some progress

* Ok it runs up to dac decoding

* Add dac part loading

* Loads and runs at least

* Remove encodec

* Debugging

* Debugging

* Huh

* Complete merge

* Interactive

* Confirmed dac works at least

* Looks like encoder works

* Much progress

* Hmm

* Sampling

* Almost there

* Sampler

* Sampler

* Bf16 support

* Response

* Use it in interactive mode

* Fix oneshot

* Add openai api

* Add openai api

* Refactor loading

* Use naive sdpa for inplace

* Factor out

* Clippy

* Clippy

* Config

* Refactor config

* Metal clippy

* Fix t/s

* ISQ support

* Some fixes, nits

* Fix cuda

* Clippy

* Inhibit cublaslt for cuda

* Add server example

* Add python example

* Add rust api

* Add docs

* Update config.toml

* Fix .pyi

* Update readme

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* update `llguidance` to `0.7.20` (EricLBuehler#1334)

Update `llguidance` from `0.7.16` to `0.7.20` so that it has guidance-ai/llguidance#172 which is a fix for building on GCC 15.

* Add model category <> messages check (EricLBuehler#1335)

* Verify model category matches the messages

* Add vision chat

* Fixes

* Add element-wise normalization check (EricLBuehler#1340)

* Fix streaming example print statement (EricLBuehler#1339)

* Fix normalization formula in comment (EricLBuehler#1338)

* Fix image_to_pixels to handle non-RGB images (EricLBuehler#1337)

* Fix typo in expect messages (EricLBuehler#1342)

* Don't use mmap on cuda (EricLBuehler#1336)

* No mmap on cuda

* Simplify streaming tool call logic

* Remove debug

* Support AWQ format models (EricLBuehler#1350)

* Support AWQ format models

* Clippy fix

* Fix uqff dummy layer ISQ application (EricLBuehler#1351)

* Disable immediate isq if write_uqff (EricLBuehler#1352)

* Fixes for UQFF loading on CUDA, ISQ pack factor (EricLBuehler#1354)

* Fix logic for uqff on cuda

* Updated pack_factor

* Refactor Option references for model paths (EricLBuehler#1347)

* refactor: use Option refs in model path helpers

* Format

* Add a script for server benchmarking (EricLBuehler#1355)

* Serde alias

* Fix

* Update for tie_word_embeddings

* Print running/waiting

* 30 users

* Update num_users

* Update dummy paged attn

* Optimized Metal qmv_fast path (EricLBuehler#1356)

* Compile with lto

* Tweak profiles

* New, fast sampler for Metal! (EricLBuehler#1327)

* Show TTFT

* Use LRU prefix cacher

* Faster prefix cacher

* A bit of gpu sampling

* Minp but cpu for now

* Metal fast cumsum impl

* Sampling with fast topp kernel

* Hmm not perfect

* Add metal sort kernels

* Tmp

* Add single block sort

* Add most of multi block sort, just need copy op

* Add copy kernels

* Expose kernels

* Add a test

* Ok it works

* Structure things

* Add caching

* Rename

* Cpu is default

* CUDA case

* Topk

* Refactor Option references for model paths (EricLBuehler#1347)

* refactor: use Option refs in model path helpers

* Format

* Add a script for server benchmarking (EricLBuehler#1355)

* Serde alias

* Fix

* Update for tie_word_embeddings

* Print running/waiting

* 30 users

* Update num_users

* Update dummy paged attn

* Optimized Metal qmv_fast path (EricLBuehler#1356)

* Compile with lto

* Tweak profiles

* Fix topk

* Penalties

* Add logits processor, clippy fixes

* Fix chat port

* Remove warning

* Fix chat port

* Fix metal parallel sampling (EricLBuehler#1357)

* Cpu if parallel for now

* Tweak bench script

* Add immediate isq predicates for qwen3 (EricLBuehler#1358)

* Add immediate isq predicates for qwen3

* Fix parsing of "parse_isq_value" dependent on device

* Typo

* Fix gemma3 logging

* Regressions fixes (EricLBuehler#1359)

* Fix regression for mmap

* Revert EricLBuehler#1321

* Refactored matching_cache impl

* Clippy

* Revamped and smaller readme (EricLBuehler#1360)

* Expandable detail sections

* Refactor using derivative model

* Tweak quick examples

* Update llama

* Update llama

* Supported accelerators is a table

* Update installation guides

* Tweak apis

* Remove --port in quick examples

* Add demo gif

* Add gif in readme

* Update demo gif

* Update demo gif

* Update demo gif

* Add gif in readme

* Add gif in readme

* Add a web chat app! (EricLBuehler#1362)

* Initial

* Markdown

* Copy code

* Add model loading sidebar

* Support vision models

* Tweak isq

* Links go to another page

* Clear when switch model

* Fix html tags

* Add image support!

* More than one image

* Fix

* Improved textarea

* Tab for switching between vision and text

* No paged attn for now

* Prettier format

* Multiple models at once

* Better switching, clearing ability

* Mobile support

* Inline markdown parser

* Update examples

* Typos

* Support specifying isq

* Fix mobile

* Fixes

* Fix button on mobile

* Image height is capped

* Thumbnail

* Fix rotating kv cache edge case

* Add drag and drop for images

* Small things

* Sidebar is frozen now

* Better listener

* Add readme

* Tweak readme

* Add chat history support to web chat app (EricLBuehler#1363)

* Add chat history

* Support renaming

* Start immediately with new chat

* Add timestamp

* Prettier chat list

* Style

* Delete chat

* Fix copy button

* Fix markdown rendering

* Store things in cache

* Store things in cache

* Refactor web chat, fix multichat image restore (EricLBuehler#1364)

* Fix multichat image restoration.

* Clippy

* Refactor

* Refactor frontend

* Fix repeated immediate isq init (EricLBuehler#1365)

* Add images_ref

* Add debug impl

* Fix the bug

* Tweak style of buttons

* Add a spinner

* Move spinner

* Tweak emoji

* Add gif

* Tweak initial gif

* Include vision tower tensors in Mistral3 UQFF (EricLBuehler#1366)

* Fix mistral 3 uqff residual tensors for vision

* Rolling shard creation for uqff files (EricLBuehler#1367)

* Fix occasional instability during isq of afq (EricLBuehler#1368)

* Fix instability during isq of afq

* Clippy

* Fix web chat installation

* Support web chat file uploading (EricLBuehler#1370)

* Web chat fixes

* Fix thumbnail in message, reuse blank chat

* Add file uploading support

* Fix scroll

* Allowed extensions

* Preserve files as literals

* Support multiple clients

* Add a stop button

* New cache dir

* New cache dir

* Fix

* Refactor

* Update readme

* Tweak drag-and-drop css

* Add speech generation support to the web chat! (EricLBuehler#1373)

* Initial speech gen support for web chat

* Tweak ui

* Update docs

* Prefix caching for PagedAttention! (EricLBuehler#1369)

* Exposing some things for logical token blocks

* Prefix cache manager has the scheduler

* Refactor

* Get logical and physical blocks into the prefix cacher

* Hash and cache

* Pass physical block prefill

* Allocation of prefilled block tables

* Temp

* Dont always use 2

* Hmm

* Hmm

* It mostly works

* Increment refcount

* Support images!

* Add to dummy paged attn

* Fix some clippy

* Clippy

* More checks

* Include EricLBuehler#1371, closes EricLBuehler#1371

* Typos

* Update docs

* Metal PagedAttention accuracy improvements (EricLBuehler#1374)

* Fix subtle bug

* Fix half sum bug

* Format metal paged attention

* Handle images in paged attn scheduler (EricLBuehler#1375)

* Include schemas needed for chatcompletions endpoint (EricLBuehler#1353)

* EricLBuehler#1326: WIP include schemas needed for chat completions endpoint

 Conflicts:
	Cargo.lock
	mistralrs-server/src/main.rs

* EricLBuehler#1326: WIP define utoipa as a workspace dep since core and server both need it

* EricLBuehler#1326: first draft of handling schemas that use Either

* EricLBuehler#1326: first draft of handling schema for Grammar

* EricLBuehler#1326: Add in other endpoints to API docs.

* EricLBuehler#1326: Adjust code comments

* EricLBuehler#1326: Implement coderabbitai suggestions

- EricLBuehler#1353 (review)
- EricLBuehler#1353 (comment)

* Fix constraints with metal sampler

* Revert EricLBuehler#1375

* Fix case where prefix cacher returns no toks (EricLBuehler#1377)

* Fix AFQ UQFF serialization

* Faster UQFF serialization (EricLBuehler#1379)

* Faster UQFF serialization

* Fix uqff gemma3

* Improve gemma3 auto loader names

* UQFF creation for AFQ on CPU support (EricLBuehler#1380)

* Add afq cpu quantize/dequantize

* Clippy

* Improved device for afq quantize

* Improved dtype handling for cpu afq (de)quantize

* Improved generate_uqff_card

* Add fused CPU attention kernel! (EricLBuehler#1382)

* Working

* Fix warnings

* Allow mask

* Support bf16, f16

* Handle striding

* Parallelized

* Add initial vector flash attn

* Avoid repeated allocations

* Tiled kv

* Apply some clippy

* Some small fixes

* Chunked vec_dot

* Clippy

* Use T::zero

* Refactor attention backends (EricLBuehler#1384)

* Refactor attention code

* Refactor attention code

* Move into backends

* Set macOS thread affinity for CPU attn (EricLBuehler#1385)

* Use lazylock

* Format

* Fix metal warn build

* Faster Qwen 3 MoE support on Metal (EricLBuehler#1387)

* Fix load

* Use afq gather qmm

* Well it runs

* It works

* Polish

* Fast and slow options

* Remove quantized.rs

* Polish some more

* Refactor

* Add isq

* Update load in parallel

* Support fp8

* Refactor for FusedExperts

* Clippy

* Handle pack factor when loading prequantized models

* Use f32 only in moe

* Avoid using f32 so much

* Avoid using f32 so much

* Fix PagedAttention block leaks (EricLBuehler#1388)

* Warn and ignore if ignored

* Fix a block allocation leak

* Update bench.py

* Fix double free in block engine

* Do not apply ISQ if loading a prequantized model

* Fix cuda build again (EricLBuehler#1389)

* Fix cuda build

* Fix

* Format

* Fixes for cuda docker

* Update dockerfiles

* Bump version to 0.6.0 (EricLBuehler#1390)

* Bump version to 0.6.0

* Remove lower_level api

* Make a static dir

* Update deps

* Fix routing for static handler in web chat

* Fewer .contiguous calls for qwen3 moe (EricLBuehler#1391)

* Allow speech models to accept batched inputs (EricLBuehler#1393)

* Allow speech models to accept batched inputs

* Clippy

* Ring distributed backend for heterogeneous TP (EricLBuehler#1238)

* Begin work on ring distributed backend for Metal

* Add the actual ring functionality

* It loads and kind of runs

* It works

* Optimize buffer allocation

* Avoid copy

* It works

* Add allgather

* Fix load

* Ping-pong

* Small things

* Add config json

* Allow different ip address

* Read config once

* Read config when appropriate

* Replicate requests

* Small fix

* Fix small compat with openai

* Clippy

* Update docs

* Add deepseek tool calling chat template

* Add auto loader for vision/text detection! (EricLBuehler#1402)

* Add auto loader for vision/text detection

* Build fixes

* Add model loader

* Update docs

* Format

* Create Mistral.rs Server Core Lib: `mistralrs-server-core` (EricLBuehler#1346)

* First draft of exposing mistral server routes as lib

* make arg struct fields pub

* Take base path so utoipa swagger route can properly redirect

* Expose swagger routes and make it configurable

* Add base path option for swagger docs

* More work on modularizing mistralrs server

* Sync fork (+1 squashed commit)
Squashed commits:
[169ae9e] Sync fork

* Adjust fn params to use refs / individual params instead of args

* Start breaking down controller actions into smaller pieces

* Continue refactoring

* Make mods pub so they can be used outside crate

* Allow chat completion streamer to take a callback so that you can get the complete response when finished

WIP (+3 squashed commits)
Squashed commits:
[0061d87] WIP
[c484d56] WIP
[16f8a60] WIP

* Sync fork

* Adjust callback type

* Remove throughput_log arg that was removed in 26afcc3

* Implement defaults for Args (and use for Clap)

* Small code formatting tweaks

* Rename callback to match SSE event and code clean up

* Sync fork

* WIP: first very rough draft of server core builder. Doesn't meet parity with old functional approach yet (slower / unstable?).

* Clean up (+4 squashed commits)
Squashed commits:
[e1cff387] Sync fork
[d8301025] WIP debugging
[1ea9f8c8] Sync fork
[4fe28cf5] WIP: debug function

* WIP server core builders

* Code clean up

* Add on_chunk callback

* Code clean up

* First draft of creating version of mistral-server that uses server-core

Code clean up (+1 squashed commit)
Squashed commits:
[adea1693]

* Sync fork

* Add helper methods to builder to make optional args more ergonomic (since .build validates params)

* Start adding docs

* Start cleaning up crates deps

* Example commit of mistral-server with implementing server-core

* Start addressing CodeRabbit feedback

* Fix comment typo

* Tweak doc blocks

* - Update type alias naming for clarity (MistralRs instead of Mistral)
- CodeRabbit, don't use eprintln for lib (use trace)
- Allow buffer size to be passed in and default to Constant
- Allow router body limit to be passed in and default to Constant
- Update doc examples

* Typo

* Address CodeRabbitAI feedback

* Support linear rope for llama3 (EricLBuehler#1408)

* Hotfix for loading

* Fix vllama4 uqff loading (EricLBuehler#1409)

* Fix vllama4 uqff loading

* Fix regex

* Fix regex

* Maybe a fix

* Gracefully handle receiver disconnects (EricLBuehler#1410)

* Handle receiver disconnects

* Format

* Fix Qwen3 MoE device mapping irregularities (EricLBuehler#1411)

* Fix bias

* Fix lm_head packing case

* Account for gate

* Fix head dim

* Fix interactive mode URL parsing (EricLBuehler#1412)

* fix url regex in vision interactive mode

* Fix regex

* Clippy

* Refactor auto device map (EricLBuehler#1413)

* Refactor auto device map

* Refactor a bit more

* Clippy

* Enable runtime sampling tweaks in interactive mode (EricLBuehler#1414)

* Document runtime sampling commands

* Fix readme

* Tweak

* Bounds checking

* Tweak temp bounds

* Send streaming tokens every time

* Gumbel sampling for fast sampler (EricLBuehler#1416)

* Improved handling for initialize_logging

* Improved CPU flash attention accuracy & performance (EricLBuehler#1417)

* Downcast correctly

* Operate internally in f32

* Avoid some casts and striding

* Prefetch

* Provide chat_templates to container users (EricLBuehler#1419)

Models often ship without chat templates, which then have to be
mapped from the source repository into a container so that
mistralrs-server can access them.

Copy the templates from the build tree into the root of the image
to permit use via `--chat-template /chat_templates/something.json`

TODO:
  With the increase in quantized models and support for other
formats, the initial benchmark run during model load can be used
to qualify/select existing chat templates embedded into the binary
for models which do not come with any (including the output of the
functional failures in each test, so that users can correctly modify
the templates already provided to suit the model being loaded).

Co-authored-by: RageLtMan <rageltman [at] sempervictus>

* Faster cpu flash attn (EricLBuehler#1418)

* Faster cpu flash attn

* Prefetch

* Clippy

* Add some tests

* Add softcap tests

* Fix test_parse_image_url test

* Update tests

* Update tests

* Web search improvements (bm25, web chat) (EricLBuehler#1420)

* Fix web search blocking case

* Web search support in web chat

* Tweak ui

* Support fallback to bm25

* Clippy

* Reinject descriptions

* Properly handle consecutive searches (EricLBuehler#1421)

* Update extraction tool reinjection

* Looped

* Update docs (EricLBuehler#1422)

- lib.rs: clean up example var names and match logging change from EricLBuehler@201d6be
- server_builder: fix typo
- READMEs: link to crate docs

* Better tool call detection logic (EricLBuehler#1424)

* Add web search hook callbacks (EricLBuehler#1426)

* feat: add customizable search hook

* Move to builder

* Update docs

* Fix CUDA context switching, bind thread on CudaStorage drop (EricLBuehler#1428)

* Add CUDA context helper and use in Llama forward

* No flashparams?

* working

* Tweak

* Update to use dep

* conditionally build flash attention inputs (EricLBuehler#1429)

* Add AGENTS.md (EricLBuehler#1430)

* Support Qwen3 GGUF model (EricLBuehler#1432)

* Support Qwen3 GGUF model

* Clippy fix

* cargo fmt

* Improved paged attn prefix caching (EricLBuehler#1434)

* Improved paged attn prefix caching

* Disable

* Clippy

* Temporary fix for qwen3 gguf tokenizer (EricLBuehler#1433)

* Temporary fix for qwen3 gguf tokenizer

* Typo fix

* Add tool callback support (EricLBuehler#1427)

* Add tool callback support

* Fixes

* Support named tool callbacks

* Update examples

* Update docs

* Clippy

* Centralize crate dependencies (EricLBuehler#1438)

* chore: centralize dependencies

* Format

* Fix bug in tokenizer created with gguf metadata (EricLBuehler#1440)

* Fix bug in tokenizer created with gguf metadata

* Clippy fix

* Update deps (EricLBuehler#1441)

* Small things

* Update deps

* Update deps

* Update breaking changes

* Doc fixes (EricLBuehler#1442)

* Mention uqff_maker

* Downgrade rustyline 16.0.0 -> 15.0.0 (EricLBuehler#1444)

* Add max_completion_tokens alias for server (EricLBuehler#1451)

* Audio input support (Phi 4 multimodal) (EricLBuehler#1448)

* Deps

* Add conformer

* Nemo loading

* Position embeds

* Load t5 attn bias

* Attn and feed forward

* Add conv module and glu pointwise

* Implement relative attn bias

* Add the forward methods

* Add encoder embedding

* Fix oproj

* Some loading

* Conformer loads!

* Fully loading speech stack

* Merger

* Dont need that

* First pass at audio processing

* Read samples

* Optional

* Small loading fix

* Runs but not correct yet

* Improved audio processing?

* Works with this

* Fix t5 attn bias

* It works!

* Comment

* Use some other crates

* Clippy

* Allow bf16 on metal

* Add prefix_audio

* Remove unused

* Typo

* User specified

* Add audio url parsing

* AudioProjectionMode -> InputMode

* Audio prefix caching

* Fix bug in audio prefix caching

* Support both at the same time!

* Tweak logging

* Support stereo

* Add mistralrs-audio

* Support batching

* Add server and rust api example

* Add python api

* Fix add_multimodal_message

* Fix unfold for conformer

* Streaming example

* Add web chat support

* Add modalities registry

* Fix offline cache issue for gguf models (EricLBuehler#1452)

* Add MCP server endpoints (EricLBuehler#1453)

* feat(server): add MCP server support

* Add mcp docs

* Add handle_list_tools_request

* Better launch, tool handling

* Tmp state

* Ok works

* Handle modalities

* Update docs

* Add ping

* Tweak temperature bounds, args

* MCP documentation pass (EricLBuehler#1455)

* Fix table

* Update mcp docs

* Improve readme header

* Improve readme header

* Integrate an MCP client (EricLBuehler#1456)

* Add builtin mcp client

* Use async loader

* Add headers

* Handle sse

* More flexible search request

* Add tool callbacks with tools, for mcp

* Add bearer token support

* Add websocket support

* Update docs

* Add python api

* Clippy

* Add http api, docs

* Tests pass

* Make these configs actually work

* Add docs

* Make mistralrs-mcp

* Refactor examples

* Update examples

* Add defaults

* Add defaults

* Add defaults

* Update docs

* Improved docs

* Add -y to npx usages

* Even better examples

* Update generate_wheels

* Update generate_wheels

* Update generate_wheels

* Fix Dockerfile.cuda-all

* Improve automatic tool call (EricLBuehler#1460)

* Improved auto tool call

* Add logging

* chore: `Dockerfile.cuda-all` configurable threads (EricLBuehler#1458)

* chore: `Dockerfile.cuda-all` - Merge `RUN` for `apt-get install` (EricLBuehler#1459)

* Add fallback definition for isnan (EricLBuehler#1463)

* chore: `Dockerfile` - Drop runtime rayon thread ENV (EricLBuehler#1465)

* chore: Dockerfile - Remove rayon threads env

* chore: Dockerfile - Improve formatting for `apt-get`

* Remove duplicate calls for api_dir_list (EricLBuehler#1474)

* Remove duplicate calls for api_dir_list

* Support local cache for api_dir_list

* Fix home folder for metal

* Capitalized

* Fix transient pyo3 dep (EricLBuehler#1478)

Co-authored-by: Eric Buehler <[email protected]>

* Fix objc dep with non macos (EricLBuehler#1480)

* Fix phi 3/4 + nccl issue (EricLBuehler#1481)

* Fix log

* Fix n kv heads

* Fix phi3.5 moe (EricLBuehler#1482)

* Fix phi3.5 moe accum device

* Fix again

* Fix again

* Support GLM4 model! (EricLBuehler#1437)

* Support GLM4 model

* Mention GLM4 model in ReadMe

* glm4 type hint

* Typo fix

* Fix unsupported chat_template function

* Clippy fix

* Refactor distributed backend (EricLBuehler#1484)

* Refactor distributed backend, check power of 2

* Fix compilation

* Cap metal paged attn kv allocation (EricLBuehler#1485)

* Better paged attn metal cap (EricLBuehler#1486)

* Better paged attn metal cap

* Small fix

* Comment

* Small fix

* Refactor

* Server core: consolidate and unify route handlers and API surface (EricLBuehler#1423)

* Start working on consolidating completion and chat_completion underlying implementations

* Move response channel to util mod for now (since it's used with streaming and non streaming)

* More work on consolidating completions and chat completions

* More WIP consolidation of server core handlers

* More WIP consolidation of server core handlers

* More WIP consolidation of server core handlers

* Update docs and restrict completion core visibility

* CodeRabbit feedback: remove logprobs warn from route handler since parse request also checks this

* Use consistent var name for completions mod

* Make route handler modules public API consistent (same fn names, etc.) and provide proxy fn that wrap core fns so core mod doesn't have to be pub
Make lib.rs example compile checked and update example

* Code formatting

* Typo

* Sync fork

* Sync fork

* Docs example fix

* Support qwen3 gguf (EricLBuehler#1488)

* Add qwen3 gguf

* Template fixup

* Make bos/eos token IDs optional (EricLBuehler#1493)

* Remove python deps from CUDA dockerfiles (EricLBuehler#1487)

* Handle noncontiguous v in naive_sdpa (EricLBuehler#1499)

Co-authored-by: Eric Buehler <[email protected]>

* Server Core: refactor Paged Attention configuration (EricLBuehler#1500)

* Use StorageModePrivate for Metal PA kv cache (EricLBuehler#1506)

* Fix OpenAI stream: emit field in tool-call deltas for schema compliance (EricLBuehler#1507)

* FP8 KV-cache quantization for PagedAttention (EricLBuehler#1400)

* Add most of paged attn kv quant

* It builds a bit

* All the functionality at least

* Small fix

* Add a scale

* Fix bf16 usage

* Make k_v_scale optional

* Collector

* Tweak collection

* Refactor

* Add to apis

* Add cuda impl

* Fix compilation

* Fixes

* Handle ENABLE_FP8

* Format

* Tweak

* Fix scaled_convert usage

* Fix cache_t size

* Fixed scale collection

* Actual fix

* Fix fp8 for CC<8

* Fix the usual String != &str bit (EricLBuehler#1483)

Co-authored-by: RageLtMan <rageltman [at] sempervictus>

* Handle USE_FP8 for cuda

* Fix cuda warn

* Add readme

* Saturating sub in sequence state

---------

Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: RageLtMan <[email protected]>
Co-authored-by: Brennan Kinney <[email protected]>
Co-authored-by: Guoqing Bao <[email protected]>
Co-authored-by: Matthew Haynes <[email protected]>

* Validate model name in OpenAI API (EricLBuehler#1509)

* Validate model name in openai api

* Add docs, allow 'ignore'

* Updated examples for EricLBuehler#1509

* Fix mcp import in doc string (EricLBuehler#1510)

* Add multi-model support! (EricLBuehler#1512)

* Refactor MistralRs

* Working multi-model!

* Add multi-model docs initially

* Update mistralrs-pyo3, mistralrs-bench, mistralrs

* Update apis for consistency

* API tweaks

* Logging tweaks

* Add examples, tweak cli

* Clearer pipeline id

* Fix config key semantics

* Format and clippy

* Tweak logging, fix example

* Clippy refactor

* Update examples

* Remove unused multi model docs

* Replace 'ignore' with 'default'

* Update docs

* Add stars label to readme (EricLBuehler#1513)

* Add CLAUDE.md

* Handle base_model.model case in lora (EricLBuehler#1514)

* Add thread_local! for engine-specific const/static (EricLBuehler#1517)

* Fix MCP doc test (EricLBuehler#1511)

* Allow disabling metal precompilation (EricLBuehler#1518)

* Allow disabling metal precompilation

* Simple preprocessor

* Simple docs

---------

Co-authored-by: Eric Buehler <[email protected]>

* Rust 1.88 clippy (EricLBuehler#1522)

* Rust 1.88 clippy

* Format

* Fix cuda warnings (EricLBuehler#1526)

* Avoid panic decoding tokens on error (EricLBuehler#1527)

* Split Marlin and Paged Attention kernels for faster build (EricLBuehler#1525)

* Split Marlin and Paged Attention kernels for faster build

* Typo fix

* chore: update llguidance (EricLBuehler#1535)

* chore: update llguidance

* chore: remove unused import

* Add the SmolLM3 model! (EricLBuehler#1501)

* Add model

* Update loader

* Fix llama config usage

* Docs

* Fix config no_rope_layers

* Fix tie_word_embeddings default

* Add chat template

* Embed the chat templates

* Fix embedding template

* enable_thinking default true

* Update examples

* XML tools for smollm3

* Add smollm3 docs

* Fix openai examples

* Clippy

---------

Co-authored-by: Eric Buehler <[email protected]>

* Add full Gemma 3n support! (EricLBuehler#1519)

* Add initial

* Loading for text model

* Add ple embeddings

* Add altup, laurel block

* Update rmsnorm

* Add mlp

* Update attn norm application

* Currently no kv shared

* Wire it up

* It runs

* Fix bf16

* Fix scaled embd

* Fixes for mean

* tmp

* Attn confirmed

* Fix target_magnitude

* Add shared kv

* Ok it works

* Remove npy

* Fix streaming

* Remove warnings

* Remove paged attn

* Refactor rope

* Add immediate isq

* Add vision & mproj

* Update image processor

* Vision merge runs, not correct

* Remove

* Add mobilenet v5

* Add multimodal vision embedding

* Fix load

* runs

* Fix gamma

* Works but just not vision tower

* It works!!

* Tweak

* Fix warnings

* Move vision tower

* Fix warn

* Update cache manager things

* Refactor

* Add audio model, it loads

* Add audio processing

* It runs at least

* tmp

* A bit better

* Audio works!!!!

* Fused attn in vision

* Clippy

* Update audio runner

* Optimized audio model

* Remove unused things

* Fix inputs processor bug

* Remove comments

* Clippy

* Small optimizations

* Format

* Correctly register modalities

* Add docs

* Update readme

* Runs there

* Fixed padding from Blaizzy/mlx-vlm#410

* Add better checks

* Fix sdpa n_kv_groups

* Vision encoder works!

* Rotate image

* Clippy

* Fix cuda loading

* Updated device mapper

* Fix overflow

* Fix dtype errors

* Refactor image/audio embeddings

* Fix metal

* Fix dtype mismatch

* Audio processing fixes

* Audio processing fixes

* Works

* Audio is good

* Fix boi/eoi too

* Embed the chat templates

* Better embedding accuracy in non f32

* More f32

* Support bf16 on metal

* Add more ISQ

* Fixed device map

* Clippy

* Gemma3n no paged attn

* Fix saturating sub

* Faster rmsnorm

* Use sdpa for vision model

* Fix ple bug

* Fix name

* Fix multiaudio

* Add matformer config loading

* Add docs

* Add support for matformer in auto device mapper

* Update docs

* Typos

* Tweak

* Tweak

* Fix multidevice

* Fix gemma3n text model auto device map

* Fix dims3

* Fix auto device map vision

* Non-metal keeps PLE on cpu

* Complete merge

* Vision dtype f16 -> f32

* Fix metal nm device

* Fix uqff

* Typos

* Reference uqff

* Fix tests

* Fix sequence length check (EricLBuehler#1546)

* update candle version (EricLBuehler#1545)

Co-authored-by: AlpineVibrations <[email protected]>

* add ios target to metal deps (EricLBuehler#1548)

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: edwko <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Guoqing Bao <[email protected]>
Co-authored-by: Michał Moskal <[email protected]>
Co-authored-by: Chen Mulong <[email protected]>
Co-authored-by: Steph Wolski <[email protected]>
Co-authored-by: omahs <[email protected]>
Co-authored-by: Viktor Szépe <[email protected]>
Co-authored-by: Matthew Haynes <[email protected]>
Co-authored-by: RageLtMan <[email protected]>
Co-authored-by: Brennan Kinney <[email protected]>
Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: Sbargaoui <[email protected]>
Co-authored-by: Gaétan Lepage <[email protected]>
Co-authored-by: Ammar Elsabe <[email protected]>
Co-authored-by: luke <[email protected]>
Co-authored-by: AlpineVibrations <[email protected]>
Co-authored-by: Michael Tissen <[email protected]>