Fast sampler #1327

Merged

merged 33 commits into from
May 21, 2025

Conversation

EricLBuehler (Owner) commented May 11, 2025

  • Top k

  • Top p

  • Min p

  • Frequency penalty

  • Presence penalty

  • (?) DRY penalty

  • Metal argsort

    • single block sort
    • multi block sort
  • CUDA argsort

    • Use CUB
  • CPU

Summary by CodeRabbit

  • New Features

    • Added fast, GPU-accelerated sorting and cumulative sum (scan) operations for tensors on Metal devices, enabling high-performance argsort, sort, and cumsum along arbitrary axes.
    • Introduced benchmarking script for load testing language model servers and reporting throughput statistics.
  • Improvements

    • Enhanced quantization support: quantization configuration is now recognized using both "quantization_config" and "quantization" keys across all supported models, and is properly applied to language model head layers.
    • Scheduler and engine now provide real-time logging of running and waiting sequence counts for improved monitoring.
    • Optimized sampling and tensor device handling for improved performance on Metal backends.
    • Expanded and refactored Metal GPU kernel suite to support efficient scan, sort, and copy operations for a wide range of data types.
  • Bug Fixes

    • Improved option handling and deserialization for quantization configuration fields in model configs.
  • Style

    • Code formatting and style improvements in CUDA and Python scripts for better readability and consistency.
  • Documentation

    • Enhanced inline documentation and test coverage for new tensor operations.


coderabbitai bot commented May 11, 2025

Walkthrough

This update introduces extensive improvements to the quantization and Metal backend infrastructure, including new GPU kernels for scan, sort, and copy operations, and exposes new fast tensor-based sorting and cumulative sum operations at the Rust level. The model initialization logic is refactored to propagate quantization configuration to language model head layers, and serde aliasing is added for quantization fields across numerous model configs. Scheduler interfaces are updated to integrate interval logging of running and waiting sequences. Additional code style and refactoring changes improve clarity, consistency, and maintainability throughout the codebase.
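
A minimal sketch of the serde aliasing pattern described above (the struct and the QuantConfig type are illustrative placeholders, not the exact definitions from the model configs in this PR):

use serde::Deserialize;

// Placeholder for the real quantization config type used by the models.
#[derive(Debug, Deserialize)]
struct QuantConfig {
    bits: usize,
}

#[derive(Debug, Deserialize)]
struct ModelConfig {
    // Accept both the "quantization_config" and the shorter "quantization" key
    // when deserializing a model's config.json.
    #[serde(alias = "quantization")]
    quantization_config: Option<QuantConfig>,
    // ... other fields elided
}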

Changes

File(s) Change Summary
Cargo.toml Adds [profile.release] and [profile.dev] sections with LTO and optimization settings.
examples/server/chat.py Changes OpenAI client base URL from port 1234 to 8080.
scripts/bench.py Adds a new async benchmarking script for load testing a language model server.
scripts/convert_awq_marlin.py Code style and formatting improvements; no logic changes.
mistralrs-core/src/engine/logger.rs Adds atomic counters and setter methods for running/waiting sequence logging in IntervalLogger.
mistralrs-core/src/engine/mod.rs Makes IntervalLogger public and updates scheduler call to accept a logger reference.
mistralrs-core/src/dummy_paged_attention/scheduler.rs, mistralrs-core/src/paged_attention/scheduler.rs, mistralrs-core/src/scheduler/default_scheduler.rs, mistralrs-core/src/scheduler/mod.rs Updates all scheduler schedule methods and trait to accept a logger reference and log running/waiting counts.
mistralrs-core/src/kv_cache/rotating_cache.rs, mistralrs-core/src/kv_cache/single_cache.rs Changes all_data methods to return Option<&Tensor> instead of &Option<Tensor>.
mistralrs-core/src/models/* (multiple files) Adds serde alias "quantization" to quantization config fields; propagates quantization config to lm_head layer construction.
mistralrs-core/src/vision_models/* (multiple files) Adds serde alias "quantization" to quantization config fields and passes quantization config to lm_head layer.
mistralrs-core/src/pipeline/mod.rs Removes to_device from ForwardInputsResult and simplifies logits collection logic.
mistralrs-core/src/pipeline/ggml.rs, mistralrs-core/src/pipeline/gguf.rs, mistralrs-core/src/pipeline/normal.rs, mistralrs-core/src/pipeline/vision.rs Refactors chat template path extraction to use local variables for clarity.
mistralrs-core/src/pipeline/macros.rs Updates macros to pass Option<&T> instead of &Option<T> for optional parameters.
mistralrs-core/src/pipeline/paths.rs Changes function signatures to use Option<&T> for optional parameters and updates internal logic accordingly.
mistralrs-core/src/sampler.rs Adds sample_fast method for fast tensor-based sampling; uses it under the metal feature.
mistralrs-quant/build.rs Expands Metal source/header file list and updates build script to handle all headers in compilation steps.
mistralrs-quant/src/lib.rs, mistralrs-quant/src/utils/mod.rs Re-exports new operators: CumSumOp and SortOp.
mistralrs-quant/src/utils/ops.rs Adds fast tensor-based Sort, ArgSort, and CumSum operations with Metal backend support and tests.
mistralrs-quant/src/safetensors.rs Simplifies dtype handling in tensor loading logic.
mistralrs-quant/src/metal_kernels/utils.rs Adds helpers for grid and block dimension calculations for Metal kernels.
mistralrs-quant/src/metal_kernels/utils.metal Adds extensive numeric, indexing, and SIMD utility templates for Metal kernels.
mistralrs-quant/src/metal_kernels/bitwise.metal Adjusts parameter indentation for style consistency.
mistralrs-quant/src/metal_kernels/quantized.metal Removes unused utility templates and optimizes pointer arithmetic in qmv_fast_impl.
mistralrs-quant/src/metal_kernels/bf16.metal New: Adds bfloat16 type support and arithmetic for Metal kernels.
mistralrs-quant/src/metal_kernels/scan.metal, mistralrs-quant/src/metal_kernels/scan_impl.metal New: Adds GPU scan (prefix sum/product/max/min/logaddexp) kernel instantiations and implementations for Metal.
mistralrs-quant/src/metal_kernels/sort.metal, mistralrs-quant/src/metal_kernels/sort_impl.metal New: Adds GPU block and multi-block sorting kernel instantiations and implementations for Metal.
mistralrs-quant/src/metal_kernels/copy.metal, mistralrs-quant/src/metal_kernels/copy_impl.metal New: Adds comprehensive GPU copy kernel instantiations and implementations for all type combinations and dimensionalities.
mistralrs-quant/src/metal_kernels/mod.rs Adds Rust-side dispatch logic for Metal scan and sort kernels, including caching and kernel launch helpers.
mistralrs-quant/kernels/marlin/marlin_kernel.cu Code style and formatting improvements; no logic changes.

Sequence Diagram(s)

sequenceDiagram
    participant Scheduler
    participant Logger
    participant Engine

    Engine->>Scheduler: schedule(&Logger)
    Scheduler->>Logger: set_num_running(running_count)
    Scheduler->>Logger: set_num_waiting(waiting_count)
    Scheduler-->>Engine: SchedulerOutput
sequenceDiagram
    participant Tensor
    participant MetalBackend
    participant User

    User->>Tensor: .fast_sort_asc(axis)
    Tensor->>MetalBackend: launch sort kernel
    MetalBackend-->>Tensor: sorted tensor
    Tensor-->>User: sorted tensor
sequenceDiagram
    participant Tensor
    participant MetalBackend
    participant User

    User->>Tensor: .fast_cumsum(axis)
    Tensor->>MetalBackend: launch scan kernel
    MetalBackend-->>Tensor: cumsum tensor
    Tensor-->>User: cumsum tensor
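
For reference, a minimal usage sketch of the new ops via the re-exported traits (assumes candle_core tensors, a Metal device, and the fast_sort_asc / fast_cumsum method names shown in the diagrams above; the axis argument type is an assumption):

use candle_core::{Device, Tensor};
use mistralrs_quant::{CumSumOp, SortOp};

fn main() -> candle_core::Result<()> {
    let dev = Device::new_metal(0)?;
    let xs = Tensor::new(&[3f32, 1., 4., 1., 5.], &dev)?;
    // Fast tensor-based sort and cumulative sum along axis 0.
    let sorted = xs.fast_sort_asc(0)?;
    let prefix = xs.fast_cumsum(0)?;
    println!("sorted: {sorted}\nprefix sums: {prefix}");
    Ok(())
}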

Poem

In fields of code where kernels grow,
Metal bunnies sort and scan below.
Quantization configs hop with glee,
Through serde aliases, wild and free!
Schedulers now log their queue,
And sampling's faster, thanks to you.
🐇✨—the code hops on, renewed!


@EricLBuehler marked this pull request as draft May 11, 2025 13:45

github-actions bot commented May 11, 2025

Code Metrics Report
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                3           62           53            0            9
 Dockerfile              1           41           22           10            9
 JSON                   12          107          106            0            1
 Makefile                1            6            5            0            1
 Python                 86         4042         3413          156          473
 Shell                   1           63           26           18           19
 Plain Text              3         3723            0         2413         1310
 TOML                   19          565          518            6           41
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       3            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          205          178            1           26
 (Total)                            282          210           32           40
-------------------------------------------------------------------------------
 Markdown               55         5012            0         3822         1190
 |- BASH                 8          104          101            0            3
 |- JSON                 1           12           12            0            0
 |- Python               7          121          109            0           12
 |- Rust                22          757          634            1          122
 |- TOML                 2           75           63            0           12
 (Total)                           6081          919         3823         1339
-------------------------------------------------------------------------------
 Rust                  378       126908       113288         2593        11027
 |- Markdown           171         2145           29         1913          203
 (Total)                         129053       113317         4506        11230
===============================================================================
 Total                 564       140550       117450         9020        14080
===============================================================================

@coderabbitai bot left a comment

Actionable comments posted: 17

🧹 Nitpick comments (11)
mistralrs-server/src/interactive_mode.rs (1)

242-245: Start TTFT timer before send for accuracy (optional)

The timer is initialised after the request is sent.
If queueing on the Tokio channel becomes significant, TTFT will be underestimated.
Consider moving let start_ttft = Instant::now(); immediately before sender.send(req).await?;.

mistralrs-quant/src/metal_kernels/mod.rs (1)

1176-1199: Unused variable _bm and potential overflow in grid calc

let _bm = 32; is never used – remove or employ it.

Additionally, multiplying tmp_grid_dims.{width|height} by stride_blocks can overflow u64
for very large tensors. Use checked_mul and fall back to the alternate dimension on overflow.
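
A hedged sketch of that guard (the MTLSize-style width/height fields and the stride_blocks integer type are assumptions based on the snippet above):

// Scale the grid by stride_blocks without silently wrapping on overflow.
let (width, height) = match tmp_grid_dims.width.checked_mul(stride_blocks as u64) {
    Some(w) => (w, tmp_grid_dims.height),
    // Fall back to scaling the other dimension if the width would overflow.
    None => (
        tmp_grid_dims.width,
        tmp_grid_dims
            .height
            .checked_mul(stride_blocks as u64)
            .expect("grid dimensions overflow u64"),
    ),
};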

mistralrs-core/src/sampler.rs (1)

353-366: Top-K threshold broadcasting may mis-align for batched logits

unsqueeze(0) produces a shape [1, vocab].
For batched inputs [batch, vocab] this works, but for a higher-rank tensor the broadcast is ambiguous.
Use unsqueeze(D::Minus1) (or keep the original dim count via expand) to guarantee alignment with the last dimension irrespective of rank.
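
A small shape-only sketch of the difference (assumes candle_core; the shapes are chosen purely for illustration):

use candle_core::{DType, Device, Tensor, D};

fn main() -> candle_core::Result<()> {
    // Per-row thresholds for logits of shape [batch=2, beams=3, vocab=8].
    let thresholds = Tensor::zeros((2, 3), DType::F32, &Device::Cpu)?;
    // [1, 2, 3]: lines up with the *leading* axes, not the vocab axis.
    println!("{:?}", thresholds.unsqueeze(0)?.shape());
    // [2, 3, 1]: broadcasts along the last (vocab) dimension regardless of rank.
    println!("{:?}", thresholds.unsqueeze(D::Minus1)?.shape());
    Ok(())
}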

mistralrs-quant/src/utils/ops.rs (2)

1011-1020: Wrong error message and unreachable match arm

The fallback arm returns
Err(Error::UnsupportedDTypeForOp(DType::F32, "cumsum"))
even for many other dtypes, making debugging harder.

Replace with the actual s1.dtype():

-            _ => Err(Error::UnsupportedDTypeForOp(DType::F32, "cumsum")),
+            _ => Err(Error::UnsupportedDTypeForOp(s1.dtype(), "cumsum")),

904-909: Naming & visibility

CumSum is private but the trait CumSumOp relies on users calling fast_cumsum[_config].
Consider making the struct pub(crate) and prefixing internal helpers with _ to clarify intent.

mistralrs-quant/src/metal_kernels/bf16.metal (1)

10-13: Duplicate typedef of bfloat16_t

Lines 10 and 12 both declare typedef bfloat bfloat16_t;.
Remove the second to avoid “redefinition” warnings.

mistralrs-quant/src/metal_kernels/scan.metal (1)

52-64: Macro explosion may hit Metal compiler limits.

The 8×9 explicit instantiate_scan_helper(...) calls generate 70+ specialized kernels per op (sum/prod/…) and per layout.
Several Apple-silicon tool-chains start failing once the number of functions in a single .metal file approaches ~1,000 because the internal IR hits the 64 Ki symbol table limit.
Consider splitting large groups into separate translation units, or gating instantiation with #ifdefs that match the actual runtime usage set, to keep compile time and binary size reasonable.

mistralrs-core/src/prefix_cacher.rs (1)

110-152: Potentially expensive full-vector clone in hot path.

seq.normal_cache().to_vec() and seq.image_hashes().map(|v| v.to_vec()) allocate and copy on every call even when the cache entry for the sequence already exists.
You can avoid the allocation for updates by checking if nb.cache.is_none() before cloning:

let (data, img_hashes) = if nb.cache.is_none() {
    (seq.normal_cache().to_vec(),
     seq.image_hashes().map(|v| v.to_vec()))
} else {
    (Vec::new(), None) // nothing needed – will be overwritten below
};
mistralrs-quant/src/metal_kernels/utils.metal (2)

12-16: Duplicate typedef – remove the extra declaration.

typedef bfloat bfloat16_t; appears twice back-to-back which may trigger a
“redefinition” warning on stricter Metal compilers.

-typedef bfloat bfloat16_t;
-typedef bfloat bfloat16_t;
+typedef bfloat bfloat16_t;  // single definition is enough

1151-1163: Logical functors return T but evaluate to bool.

LogicalAnd / LogicalOr compute x && y / x || y yet return a value of
type T. For non-boolean T (e.g. float, int, half) this relies on
implicit conversion from bool and silently narrows the result to 0 or 1.
Returning bool clarifies semantics and avoids accidental use in arithmetic
code where a full-precision T is expected.

-template <typename T> T operator()(T x, T y) { return x && y; }
+template <typename T> bool operator()(T x, T y) { return x && y; }

Apply similarly to LogicalOr.

mistralrs-quant/src/metal_kernels/scan_impl.metal (1)

121-140: CumLogaddexp needs numerical stabilisation

LogAddExp for large negative inputs can underflow to -∞, breaking the
scan when the tensor contains a wide dynamic range. Consider the classic
stable implementation:

U operator()(U a, T b) {
    U m = max(a, static_cast<U>(b));
    return m + log(exp(a - m) + exp(static_cast<U>(b) - m));
}

If LogAddExp{} already implements this, document it; otherwise replace the
call with a numerically stable version.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0b540ea and 089b617.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (16)
  • mistralrs-core/Cargo.toml (1 hunks)
  • mistralrs-core/src/pipeline/mod.rs (1 hunks)
  • mistralrs-core/src/prefix_cacher.rs (2 hunks)
  • mistralrs-core/src/sampler.rs (4 hunks)
  • mistralrs-quant/build.rs (3 hunks)
  • mistralrs-quant/src/lib.rs (1 hunks)
  • mistralrs-quant/src/metal_kernels/bf16.metal (1 hunks)
  • mistralrs-quant/src/metal_kernels/mod.rs (2 hunks)
  • mistralrs-quant/src/metal_kernels/quantized.metal (0 hunks)
  • mistralrs-quant/src/metal_kernels/scan.metal (1 hunks)
  • mistralrs-quant/src/metal_kernels/scan_impl.metal (1 hunks)
  • mistralrs-quant/src/metal_kernels/utils.metal (2 hunks)
  • mistralrs-quant/src/metal_kernels/utils.rs (1 hunks)
  • mistralrs-quant/src/utils/mod.rs (1 hunks)
  • mistralrs-quant/src/utils/ops.rs (2 hunks)
  • mistralrs-server/src/interactive_mode.rs (6 hunks)
💤 Files with no reviewable changes (1)
  • mistralrs-quant/src/metal_kernels/quantized.metal
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Test Suite (macOS-latest, stable)
  • GitHub Check: Docs
  • GitHub Check: Test Suite (ubuntu-latest, stable)
  • GitHub Check: Test Suite (windows-latest, stable)
  • GitHub Check: Check (macOS-latest, stable)
  • GitHub Check: Clippy
  • GitHub Check: Check (windows-latest, stable)
  • GitHub Check: Check (ubuntu-latest, stable)
🔇 Additional comments (7)
mistralrs-core/Cargo.toml (1)

92-94: Good additions for the new radix trie implementation.

The three new dependencies added (hashbrown, parking_lot, and ahash) are appropriate choices for implementing the concurrent radix trie with LRU eviction mechanism. These libraries are well-maintained, provide good performance characteristics, and are commonly used together for high-performance concurrent data structures.

mistralrs-quant/src/utils/mod.rs (1)

8-8: Appropriate public export for the new CumSumOp trait.

The CumSumOp trait is correctly added to the list of publicly exposed operations from the module. This change aligns with the module's existing pattern of re-exporting operation traits.

mistralrs-quant/src/lib.rs (1)

59-59: Consistent crate-level re-export of the new operation.

The CumSumOp trait is properly added to the crate-level public exports alongside other operation traits, making it accessible to users of the library without requiring them to import from specific submodules.

mistralrs-core/src/pipeline/mod.rs (1)

423-427:

❓ Verification inconclusive

Simplified logits processing by removing explicit CPU device conversion.

The code has been simplified by removing the .to_device(&Device::Cpu) call and the associated error handling, which may improve performance by avoiding unnecessary device transfers. However, there's an inconsistency since the similar code in the PagedAttention branch (lines 564-570) still performs this conversion.

Please verify that operations downstream don't require the tensors to be on the CPU device. The inconsistency between this branch and the PagedAttention branch suggests there might be specific scenarios where device placement matters.


🏁 Script executed:

#!/bin/bash
# Find where the logits are used in sample_causal_gen to check if they require CPU device
rg -p "fn sample_causal_gen.*?\{\s*.*?\}" --multiline mistralrs-core/src/

Length of output: 75


I’ll inspect the surrounding code to locate both branches and any remaining .to_device(&Device::Cpu) calls.

#!/bin/bash
set -euxo pipefail

# 1. Show the logits processing snippet (around lines 423–427)
sed -n '350,450p' mistralrs-core/src/pipeline/mod.rs

# 2. Show the PagedAttention branch (around lines 564–570)
sed -n '550,650p' mistralrs-core/src/pipeline/mod.rs

# 3. Find all remaining `.to_device(&Device::Cpu)` occurrences
rg -n ".to_device\s*\(\s*&Device::Cpu\s*\)" -n mistralrs-core/src/pipeline/mod.rs

To pinpoint the discrepancy and verify if CPU placement is still used elsewhere, let’s dump both branches and search for any remaining .to_device(&Device::Cpu) calls:

#!/bin/bash
set -euxo pipefail

# 1. Show logits processing around lines 420–430
sed -n '410,440p' mistralrs-core/src/pipeline/mod.rs

# 2. Show PagedAttention branch around lines 560–580
sed -n '550,590p' mistralrs-core/src/pipeline/mod.rs

# 3. List all occurrences of to_device(&Device::Cpu)
rg -n "to_device\s*\(\s*&Device::Cpu\s*\)" mistralrs-core/src/pipeline/mod.rs || true

🏁 Script executed:

#!/usr/bin/env bash
set -euxo pipefail

# 1. Show logits processing around lines 420–430
sed -n '410,440p' mistralrs-core/src/pipeline/mod.rs

# 2. Show PagedAttention branch around lines 560–580
sed -n '550,590p' mistralrs-core/src/pipeline/mod.rs

# 3. List all occurrences of to_device(&Device::Cpu)
rg -n "to_device\s*\(\s*&Device::Cpu\s*\)" mistralrs-core/src/pipeline/mod.rs || true

Length of output: 3679


Inconsistent .to_device(&Device::Cpu) usage in logits processing

The normal CausalGeneration path (around lines 423–427) now drops the .to_device(&Device::Cpu) call, but the PagedAttention branch (line 568) still performs it—and there are other conversions at lines 384 and 540. Please verify whether downstream routines (e.g. sample_causal_gen) require CPU-resident tensors. Then unify both branches:

• Remove .to_device(&Device::Cpu) in the PagedAttention branch (and at lines 384/540) if GPU tensors are acceptable
• Otherwise, re-introduce the CPU conversion in the simplified branch

Files/locations to review:

  • mistralrs-core/src/pipeline/mod.rs: lines 423–427
  • mistralrs-core/src/pipeline/mod.rs: line 568
  • mistralrs-core/src/pipeline/mod.rs: lines 384, 540
mistralrs-quant/build.rs (1)

170-174: Compiling header files may duplicate symbols

Header sources are now passed to the Metal compiler as standalone translation units:

for src in HEADER_SOURCES {
    println!("cargo:rerun-if-changed=src/metal_kernels/{src}.metal");
}

and later:

for metal_file in HEADER_SOURCES {
    compile_air_cmd.arg(sources.join(format!("{metal_file}.metal")));
}

If these headers define templated kernels or helper functions without static linkage
they may be emitted twice, producing duplicate-symbol errors at link time.

Consider adding them with -I/-include instead of compiling them directly.
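
A hedged sketch of that alternative, reusing the command builder quoted above (KERNEL_SOURCES is a hypothetical list of the non-header kernel files; it also assumes the Metal compiler accepts a clang-style -I include path):

// Expose the headers via an include path instead of compiling them directly.
compile_air_cmd.arg("-I").arg(&sources);
// KERNEL_SOURCES: hypothetical list of the non-header .metal kernel files.
for metal_file in KERNEL_SOURCES {
    compile_air_cmd.arg(sources.join(format!("{metal_file}.metal")));
}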

mistralrs-quant/src/metal_kernels/bf16.metal (1)

47-56: Heavy use of C++17 type-traits may not compile under Metal

Metal shading language roughly aligns with C++14; helpers such as is_convertible_v, is_same_v and variable templates are C++17.
Unless you compile with -std=c++17-metal (not universally available), this section will fail.

If wide portability is desired, fall back to std::is_same<…, bool>::value + enable_if_t.

mistralrs-core/src/prefix_cacher.rs (1)

155-182: evict_caches can dereference a None head – guard early.

If current_on_device > n_on_device gets out of sync (e.g. manual current_on_device mutation or future parallelisation), lru_head might be None, leading to an infinite loop (while condition never falsifies and head is None).
Add an assert or early-return before the loop:

if self.lru_head.is_none() {
    debug_assert_eq!(self.current_on_device, 0, "Inconsistent LRU list");
    break;
}

Comment on lines 303 to 305
if let Some(ttft) = first_token_duration {
    println!("Time to first token: {:.2?}s", ttft.as_secs_f32());
}

⚠️ Potential issue

Invalid format string – code will not compile

println!("Time to first token: {:.2?}s", ttft.as_secs_f32());

{:?} is the Debug formatter, which cannot be combined with a precision specifier (.2).
The compiler will emit:

format argument must not be followed by `?`

Replace the debug specifier with the normal Display formatter:

-println!("Time to first token: {:.2?}s", ttft.as_secs_f32());
+println!("Time to first token: {:.2}s", ttft.as_secs_f32());

Apply the same change in the vision branch (lines 550-552).

Comment on lines +1151 to +1165
let size = shape[axis];
let stride = strides[size];
let _bm = 32;
let bn = 32;
let stride_blocks = stride.div_ceil(bn);

encoder.set_buffer(0, Some(xs), xs_offset as u64);
encoder.set_buffer(1, Some(output), 0);

encoder.set_bytes(
    2,
    std::mem::size_of::<usize>() as u64,
    &size as *const usize as *const _,
);
encoder.set_bytes(

⚠️ Potential issue

Array-index bug: using size instead of axis

let stride = strides[size];

size is the length of the scan dimension, not its index.
Accessing strides[size] will:

  1. Panic with index out of bounds when size ≥ strides.len().
  2. Read an unrelated stride when the scan dimension length happens to be a valid index.

Fix:

-        let stride = strides[size];
+        let stride = strides[axis];

Add a unit test with a non-trivial stride to catch this class of error.
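
A hedged sketch of such a test (assumes the CumSumOp trait from mistralrs-quant, a Metal device, and that fast_cumsum takes the axis index; scanning axis 0 of a (2, 3) tensor gives the scan axis a stride of 3, which would trip the strides[size] lookup):

#[test]
fn cumsum_non_innermost_axis_metal() -> candle_core::Result<()> {
    use candle_core::{Device, Tensor};
    use mistralrs_quant::CumSumOp;

    let dev = Device::new_metal(0)?;
    let xs = Tensor::new(&[[1f32, 2., 3.], [4., 5., 6.]], &dev)?;
    // Cumulative sum down the columns; expected values worked out by hand.
    let got = xs.fast_cumsum(0)?.to_vec2::<f32>()?;
    assert_eq!(got, vec![vec![1., 2., 3.], vec![5., 7., 9.]]);
    Ok(())
}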

Comment on lines +59 to +67
if divisor > 1 {
    if grid_x % divisor == 0 {
        grid_x /= divisor;
        divisor = 1;
    } else if grid_y % divisor == 0 {
        grid_y /= divisor;
        divisor = 1;
    }
}

🛠️ Refactor suggestion

Remaining divisor not validated

After the loop divisor may still be > 1, meaning the requested factorisation failed
silently. This produces wrong grid sizes and hard-to-trace GPU crashes.

     if divisor > 1 {
         if grid_x % divisor == 0 {
             grid_x /= divisor;
             divisor = 1;
         } else if grid_y % divisor == 0 {
             grid_y /= divisor;
             divisor = 1;
         }
     }
+
+    // Ensure the divisor was fully factored out
+    assert_eq!(divisor, 1, "Unable to factor divisor from shape");

Comment on lines +1080 to +1106
}

let type_name = match ty {
    DType::F32 => "float32",
    DType::BF16 => "bfloat16",
    DType::F16 => "float16",
    DType::U8 => "uint8",
    DType::I16 => "int16",
    DType::I32 => "int32",
    DType::I64 => "int64",
    other => {
        return Err(MetalKernelError::DTypeMismatch {
            expected: vec![
                DType::F32,
                DType::F16,
                DType::BF16,
                DType::U8,
                DType::I16,
                DType::I32,
                DType::I64,
            ],
            got: other,
        })
    }
};
name.push_str(&format!("{type_name}_{type_name}"));


🛠️ Refactor suggestion

LogAddExp on integer dtypes is undefined

ScanType::LogAddExp is appended without restricting to floating-point types.
Metal kernels will likely perform illegal operations on u8/i16/i32/i64.

Guard the match:

if matches!(op, ScanType::LogAddExp) && !ty.is_float() {
    return Err(MetalKernelError::DTypeMismatch {});
}

or restrict expected accordingly.

Comment on lines +53 to +57
static constexpr constant U init = static_cast<U>(1.0f);

template <typename T> U operator()(U a, T b) { return a * b; }

U simd_scan_impl(U x) { return simd_prefix_inclusive_product(x); }

🛠️ Refactor suggestion

CumProd::init uses a float literal – breaks for integral/half types

static_cast<U>(1.0f) is safe only when U is a floating-point type.
For int, uint, char, half, etc., this introduces an implicit
float-to-int conversion which is (a) UB for bool, (b) a silent narrowing
conversion for integers, and (c) breaks constexpr evaluation on half types
prior to macOS 14.

-  static constexpr constant U init = static_cast<U>(1.0f);
+  static constexpr constant U init = static_cast<U>(1);

Comment on lines +83 to +88
template <typename U> struct CumMax {
  static constexpr constant U init = Limits<U>::min;

  template <typename T> U operator()(U a, T b) { return (a >= b) ? a : b; }

  U simd_scan(U x) {

⚠️ Potential issue

Initial value for CumMax is wrong for floating point – should be lowest() not min()

Limits<U>::min maps to the smallest positive value for floats
(FLT_MIN), not the most negative. Using it makes every max scan return
positive numbers even when the input contains negatives.

-  static constexpr constant U init = Limits<U>::min;
+  static constexpr constant U init = Limits<U>::lowest;   // requires Limits helper, or use numeric_limits<U>::lowest()

Do the same for CumMin with Limits<U>::max → Limits<U>::highest.

Comment on lines +245 to +253
if ((offset + N_READS) < axis_size) {
  load_unsafe<T, U, N_READS, reverse>(values,
                                      in + axis_size - offset - N_READS);
} else {
  load_safe<T, U, N_READS, reverse>(values,
                                    in + axis_size - offset - N_READS,
                                    offset, axis_size, Op::init);
}
} else {

⚠️ Potential issue

Possible pointer underflow for reverse loads

When offset >= axis_size, the expression
in + axis_size - offset - N_READS wraps around (unsigned underflow) and
produces a huge positive address → undefined behaviour / GPU fault.
The guard if ((offset + N_READS) < axis_size) is not sufficient once
offset == axis_size - N_READS because the else branch still applies the
same arithmetic.

Fix by clamping before pointer math:

-        load_safe<T, U, N_READS, reverse>(values,
-                                          in + axis_size - offset - N_READS,
+        size_t base = (offset + N_READS > axis_size)
+                          ? 0
+                          : axis_size - offset - N_READS;
+        load_safe<T, U, N_READS, reverse>(values,
+                                          in + base,
                                           offset, axis_size, Op::init);

Comment on lines +225 to +226
threadgroup U simdgroup_sums[32];


⚠️ Potential issue

simdgroup_sums[32] can overflow when lsize.x > 1024

simd_groups = lsize.x / 32. Metal allows up to 1024 threads per
thread-group on recent GPUs, but future devices (and Apple Silicon’s
M-series in compute mode) may expose 1536.
If lsize.x == 1536, simd_groups == 48 → out-of-bounds write.

-  threadgroup U simdgroup_sums[32];
+  threadgroup U simdgroup_sums[simd_groups];

(or threadgroup_array<U, MAX_SIMD_GROUPS> with a static assert).


Comment on lines +358 to +360
constexpr int BN = 32;
constexpr int BN_pad = 32 + 16 / sizeof(U);
constexpr int n_simds = BN / N_READS;

🛠️ Refactor suggestion

BN_pad mis-aligns shared memory for 64-bit types

BN_pad = 32 + 16 / sizeof(U) yields 34 for 64-bit data, which breaks
128-bit alignment required for double/ulong vectorised accesses and
causes bank conflicts on A-series GPUs.

-  constexpr int BN_pad = 32 + 16 / sizeof(U);
+  constexpr int BN_pad = 32 + (16 / sizeof(U));        // keep, *then*
+  static_assert((BN_pad % (16 / sizeof(U))) == 0, "BN_pad must align to 128-bit");

Alternatively compute round_up(BN, 16/sizeof(U)).


EricLBuehler and others added 7 commits May 21, 2025 12:38
@EricLBuehler marked this pull request as ready for review May 21, 2025 16:54
@coderabbitai bot left a comment

Actionable comments posted: 10

♻️ Duplicate comments (13)
mistralrs-core/src/sampler.rs (3)

417-418: ⚠️ Potential issue

sample_fast drops probabilistic behaviour

The implementation always takes argmax of the filtered distribution; no weighted/multinomial sampling is performed. This makes the sampling deterministic even with non-zero temperature, which is inconsistent with the standard sampling behavior where temperature controls the randomness.

Consider reinstating true multinomial sampling on the masked distribution:

-        let next_token = probs.argmax(D::Minus1)?.to_scalar::<u32>()?;
+        // Sample from the distribution if temperature is set, otherwise use argmax
+        let next_token = if self.temperature.is_some() {
+            // Convert to CPU for sampling with WeightedIndex
+            let probs_vec: Vec<f32> = probs.to_vec1()?;
+            let distr = WeightedIndex::new(&probs_vec)?;
+            let mut rng_lock = rng.lock().expect("could not lock rng mutex");
+            distr.sample(&mut *rng_lock) as u32
+        } else {
+            probs.argmax(D::Minus1)?.to_scalar::<u32>()?
+        };

464-468: ⚠️ Potential issue

Incorrect log probability calculation

The implementation has two issues with log probabilities:

  1. When return_logprobs is false, it hardcodes logprob to 1.0 (line 467), which gives log(prob) = 0, breaking accuracy.
  2. When return_logprobs is true, it doesn't compute the actual log probability of the selected token.

This will cause incorrect log probability values to be returned, affecting applications that rely on accurate probability tracking.

Apply this fix to correctly compute the log probability of the selected token:

-            let logprob = result.last().map(|res| res.logprob).unwrap_or(1.);
+            // Find the actual logprob of the selected token
+            let logprob = result
+                .iter()
+                .find(|res| res.token == next_token)
+                .map(|res| res.logprob)
+                .unwrap_or_else(|| {
+                    // If token not in top-k, compute its logprob directly
+                    probs.i(next_token as i64)?.to_scalar::<f32>()?.log10()
+                });

-        } else {
-            (None, 1.)
+        } else {
+            // Always compute actual logprob even when not returning top logprobs
+            let prob = probs.i(next_token as i64)?.to_scalar::<f32>()?;
+            (None, prob.log10())

389-390: 🛠️ Refactor suggestion

Missing probability re-normalization after masking

The implementation masks probabilities below thresholds to zero but doesn't re-normalize the resulting distribution. This is fine for argmax but would be incorrect for multinomial sampling, as sampling requires a properly normalized probability distribution.

Add re-normalization after each masking operation:

        probs = mask_topk.where_cond(&probs, &Tensor::zeros_like(&probs)?)?;
+        // Re-normalize probabilities to sum to 1
+        let probs_sum = probs.sum_all(true)?;
+        probs = (probs / probs_sum)?;

        // Similar changes needed after Top-P and Min-P masking

Also applies to: 406-407, 414-415

mistralrs-quant/src/metal_kernels/utils.rs (1)

59-67: ⚠️ Potential issue

Remaining divisor not validated

After the loop divisor may still be > 1, meaning the requested factorization failed silently. This produces wrong grid sizes and hard-to-trace GPU crashes.

     if divisor > 1 {
         if grid_x % divisor == 0 {
             grid_x /= divisor;
             divisor = 1;
         } else if grid_y % divisor == 0 {
             grid_y /= divisor;
             divisor = 1;
         }
     }
+
+    // Ensure the divisor was fully factored out
+    assert_eq!(divisor, 1, "Unable to factor divisor from shape");
mistralrs-quant/src/metal_kernels/mod.rs (2)

1113-1120: LogAddExp accepts integer dtypes – undefined behaviour

ScanType::LogAddExp is currently allowed for every DType in the match below.
The Metal implementation relies on floating-point math; passing u8/i32/… will generate incorrect results or trigger invalid-op exceptions on GPU.

Guard the call before building the kernel name:

if matches!(op, ScanType::LogAddExp) && !ty.is_float() {
    return Err(MetalKernelError::DTypeMismatch {
        expected: vec![DType::F32, DType::F16, DType::BF16],
        got: ty,
    });
}

(or restrict the type_name match arm to float dtypes only).

Also applies to: 1121-1142


1189-1194: ⚠️ Potential issue

Stride-lookup still indexes with length instead of axis

stride = strides[size]; repeats the off-by-one/ OOB bug previously raised – size is the length of the scan-axis, not its position.
This will (a) read the wrong stride for most tensors and (b) panic when size ≥ strides.len().

-        let stride = strides[size];
+        let stride = strides[axis];

Please add a regression test with a non-unit stride to prevent future re-introductions.

mistralrs-quant/src/utils/ops.rs (1)

1322-1336: ⚠️ Potential issue

CPU CumSum still assumes the scan axis is the innermost – same bug as flagged earlier

The inner loops step through the buffer with
let base = block * axis_len; and then index input[base + j].
This is only correct when the axis has a stride of 1 (i.e. it is the last / innermost dimension).
For any other contiguous tensor (e.g. shape (4, 5) and axis = 0), elements that belong to different columns are interleaved in memory, so the current algorithm mixes rows and produces wrong results.

The earlier review already highlighted this; the implementation has not changed.

Suggested fixes:

  1. Compute the true stride of the axis from l1.stride()[axis] and use it when walking the tensor, or
  2. Call .contiguous() / .permute() to move the axis to the last position before performing the scan.

Please add a regression test on a 2-D tensor with axis = 0 and axis = 1 to verify correctness.
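
A hedged sketch of such a regression test for the CPU path (assumes the CumSumOp trait and that fast_cumsum takes the axis index; expected values computed by hand):

#[test]
fn cpu_cumsum_both_axes() -> candle_core::Result<()> {
    use candle_core::{Device, Tensor};
    use mistralrs_quant::CumSumOp;

    let xs = Tensor::new(&[[1f32, 2., 3.], [4., 5., 6.]], &Device::Cpu)?;
    // Axis 1 (innermost): each row is scanned independently.
    assert_eq!(
        xs.fast_cumsum(1)?.to_vec2::<f32>()?,
        vec![vec![1., 3., 6.], vec![4., 9., 15.]]
    );
    // Axis 0 (stride != 1): columns must not be interleaved.
    assert_eq!(
        xs.fast_cumsum(0)?.to_vec2::<f32>()?,
        vec![vec![1., 2., 3.], vec![5., 7., 9.]]
    );
    Ok(())
}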

mistralrs-quant/src/metal_kernels/scan_impl.metal (5)

53-55: ⚠️ Potential issue

CumProd::init uses a float literal – breaks for integral/half types

static_cast<U>(1.0f) converts through float, which is a narrowing / UB for some integer and half types and prevents constexpr evaluation.
Please use an integer literal instead:

-  static constexpr constant U init = static_cast<U>(1.0f);
+  static constexpr constant U init = static_cast<U>(1);

84-86: ⚠️ Potential issue

Initial value for CumMax / CumMin is wrong for floats

Limits<U>::min and Limits<U>::max return the smallest positive and largest positive values for floating-point types.
Use lowest() / highest() (or numeric_limits<U>::lowest()) to obtain the true extrema; otherwise negative inputs are mishandled.


245-253: ⚠️ Potential issue

Pointer underflow when reverse == true and offset ≥ axis_size

in + axis_size - offset - N_READS is evaluated before the bounds check in the else branch.
When offset ≥ axis_size this wraps around and produces an invalid device address → undefined behaviour / potential GPU fault.

Refer to the previous review’s suggested fix that computes a clamped base before doing pointer arithmetic.


415-417: ⚠️ Potential issue

Incorrect memory flag in simdgroup_barrier

simdgroup_barrier only accepts mem_flags::mem_none; passing mem_threadgroup is rejected by the MSL compiler on macOS 14 and causes driver warnings on iOS 17.

-    simdgroup_barrier(mem_flags::mem_threadgroup);
+    simdgroup_barrier(mem_flags::mem_none);

223-226: ⚠️ Potential issue

Possible overflow: simdgroup_sums[32] is too small for large thread-groups

simd_groups = lsize.x / 32, yet the scratch buffer is hard-coded to 32 entries.
If a future GPU (or a debug build) launches more than 1024 threads per thread-group, the array is written out of bounds.

-threadgroup U simdgroup_sums[32];
+threadgroup U simdgroup_sums[simd_groups];
+static_assert(simd_groups <= 32, "Unexpectedly large simd_group count");

(or allocate MAX_SIMD_GROUPS with an assertion).

mistralrs-quant/src/metal_kernels/utils.metal (1)

988-994: Bitwise '&' used instead of logical '&&' in template constraints.

enable_if_t<metal::is_integral_v<T> & !metal::is_signed_v<T>, T> uses & which happens to work but is semantically incorrect; && expresses intent and avoids subtle type-promotion surprises.

-  metal::enable_if_t<metal::is_integral_v<T> & !metal::is_signed_v<T>, T>
+  metal::enable_if_t<metal::is_integral_v<T> && !metal::is_signed_v<T>, T>
-  metal::enable_if_t<metal::is_integral_v<T> & metal::is_signed_v<T>, T>
+  metal::enable_if_t<metal::is_integral_v<T> && metal::is_signed_v<T>, T>
🧹 Nitpick comments (10)
scripts/bench.py (4)

1-4: Improve script documentation with usage instructions

These commented commands provide execution examples but lack context. Consider converting them to proper docstring documentation with explanations of what each command does.

-# cargo run --release --features metal '--' --port 1234 --isq 8 --paged-attn --max-seqs 1000 plain -m ../hf_models/llama3.2_3b --max-seq-len 131072
-# cargo run --release --features metal '--' --port 1234 --paged-attn --max-seqs 1000 plain -m mlx-community/Mistral-7B-Instruct-v0.3-4bit --max-seq-len 131072
-# ./llama-server -m ../gguf_models/Llama-3.2-3B-Instruct-Q8_0.gguf
-# mlx_lm.server --model mlx-community/Mistral-7B-Instruct-v0.3-4bit --port 8080
+"""
+Benchmark script for asynchronous load testing against a local language model server.
+
+Example server startup commands:
+
+# mistral.rs with metal backend
+cargo run --release --features metal '--' --port 1234 --isq 8 --paged-attn --max-seqs 1000 plain -m ../hf_models/llama3.2_3b --max-seq-len 131072
+cargo run --release --features metal '--' --port 1234 --paged-attn --max-seqs 1000 plain -m mlx-community/Mistral-7B-Instruct-v0.3-4bit --max-seq-len 131072
+
+# llama.cpp server
+./llama-server -m ../gguf_models/Llama-3.2-3B-Instruct-Q8_0.gguf
+
+# MLX server
+mlx_lm.server --model mlx-community/Mistral-7B-Instruct-v0.3-4bit --port 8080
+"""

13-15: Make benchmark parameters configurable

Constants are hardcoded, limiting script flexibility. Consider making these configurable via command-line arguments.

-NUM_USERS = 8
-REQUESTS_PER_USER = 8
-PORT = 1234
+import argparse
+
+# Default configuration
+DEFAULT_NUM_USERS = 8
+DEFAULT_REQUESTS_PER_USER = 8
+DEFAULT_PORT = 1234
+
+# Parse command line arguments
+def parse_args():
+    parser = argparse.ArgumentParser(description="Benchmark a local LLM server with concurrent requests")
+    parser.add_argument("--users", type=int, default=DEFAULT_NUM_USERS, 
+                        help=f"Number of concurrent users (default: {DEFAULT_NUM_USERS})")
+    parser.add_argument("--requests", type=int, default=DEFAULT_REQUESTS_PER_USER,
+                        help=f"Number of requests per user (default: {DEFAULT_REQUESTS_PER_USER})")
+    parser.add_argument("--port", type=int, default=DEFAULT_PORT,
+                        help=f"Server port (default: {DEFAULT_PORT})")
+    return parser.parse_args()

42-43: Make OpenAI client configuration more flexible

The client is hardcoded with fixed values. Consider making the base URL configurable.

-# Use the async-capable client
-client = AsyncOpenAI(api_key="foobar", base_url=f"http://localhost:{PORT}/v1/")
+def create_client(port):
+    """Create an async OpenAI client configured for the local server."""
+    return AsyncOpenAI(
+        api_key="foobar",  # Dummy API key for local server
+        base_url=f"http://localhost:{port}/v1/"
+    )

81-102: Enhance main function with more detailed metrics

The current implementation provides basic metrics. Consider adding more detailed statistics like percentiles and throughput over time.

 async def main() -> None:
     """
     Computes and prints overall average request time, total requests, and average T/s.
     """
-    system_prompt = None  # "You are a helpful assistant."
-    user_message = "Say hello!"
+    args = parse_args()
+    system_prompt = None  # "You are a helpful assistant."
+    user_message = "Say hello!"
+    client = create_client(args.port)
 
-    tasks = [user_task(client, system_prompt, user_message) for _ in range(NUM_USERS)]
+    print(f"Starting benchmark with {args.users} users, {args.requests} requests per user...")
+    start_time = time.perf_counter()
+    
+    tasks = [user_task(client, system_prompt, user_message) for _ in range(args.users)]
     all_results_nested = await asyncio.gather(*tasks)
     all_results = [item for sublist in all_results_nested for item in sublist]
+    
+    total_time = time.perf_counter() - start_time
 
     total_requests = len(all_results)
-    total_time = sum(elapsed for _, elapsed, _ in all_results)
+    request_times = [elapsed for _, elapsed, _ in all_results if _ is not None]
+    successful_requests = len(request_times)
+    avg_time = sum(request_times) / successful_requests if successful_requests else 0.0
     total_tokens = sum(tokens for _, _, tokens in all_results)
-    avg_time = total_time / total_requests if total_requests else 0.0
-    avg_tps = total_tokens / total_time if total_time > 0 else 0.0
+    overall_tps = total_tokens / total_time if total_time > 0 else 0.0
+    
+    # Calculate percentiles if we have results
+    if request_times:
+        request_times.sort()
+        p50 = request_times[len(request_times) // 2]
+        p90 = request_times[int(len(request_times) * 0.9)]
+        p99 = request_times[int(len(request_times) * 0.99)]
+    else:
+        p50 = p90 = p99 = 0
 
     print(f"Total requests: {total_requests}")
+    print(f"Successful requests: {successful_requests}")
+    print(f"Success rate: {successful_requests/total_requests*100:.2f}%")
     print(f"Average request time: {avg_time:.2f}s")
+    print(f"Percentiles: p50={p50:.2f}s, p90={p90:.2f}s, p99={p99:.2f}s")
     print(f"Total tokens: {total_tokens}")
-    print(f"Average tokens per second (T/s): {avg_tps:.2f}")
+    print(f"Overall tokens per second: {overall_tps:.2f} T/s")
+    print(f"Total benchmark time: {total_time:.2f}s")
mistralrs-quant/src/metal_kernels/copy.metal (1)

5-17: Macro naming explosion – consider normalising generated kernel identifiers

The two macro layers concatenate tname twice (instantiate_copy_same(itname ##itname, …) and instantiate_copy_all(itname ##… )).
For several invocations this produces identifiers such as gg1_copyfloat16float16, g1_copybool_bool_, etc.

These very long (and sometimes duplicated) names are legal, but:

  • They become cumbersome to read in build logs and GPU profiling tools.
  • They increase the probability of hitting the 1024-byte symbol limit on some platforms.
  • Future greps / nm searches become noisier.

If keeping the verbose name is not a hard requirement, you could spare a few characters:

- instantiate_copy_same(itname ##itname, itype)
+ instantiate_copy_same(itname, itype)

and likewise drop the double prefix in instantiate_copy_all.
No functional change – purely a QoL / maintainability tweak.

Also applies to: 37-51

mistralrs-quant/src/metal_kernels/quantized.metal (1)

1614-1623: Minor: remove redundant thread qualifier

Declaring wl_ptrs/… as thread const device uint8_t * is legal, but thread is implicit for local variables; omitting it shortens the type without changing semantics:

-thread const device uint8_t *wl_ptrs[results_per_simdgroup];
+const device uint8_t *wl_ptrs[results_per_simdgroup];

Same for sl_ptrs, bl_ptrs.

mistralrs-core/src/pipeline/paths.rs (1)

511-544: Avoid cloning large fallback strings unnecessarily

chat_template_fallback.cloned() creates an owned String even though the result is only used for a read.
Borrowing is sufficient and avoids an allocation:

-match chat_template_fallback.cloned() {
-    Some(t) => { /* uses t */ }
+match chat_template_fallback {
+    Some(t) => { /* uses *t */ }

A tiny optimisation, but worth it in hot-start scenarios where templates are large.

mistralrs-quant/src/metal_kernels/scan.metal (1)

65-77: Inconsistent kernel naming prod_bool__bool_

The generated host-name prod_bool__bool_ contains a double underscore and a trailing underscore, unlike the other instantiations (e.g. prod_uint8_uint8).
If the Rust side ever tries to build the kernel name programmatically (prod_bool_bool), it will miss this variant.

Confirm that the extra underscores are intentional; otherwise rename to prod_bool_bool.

mistralrs-quant/src/metal_kernels/sort_impl.metal (1)

26-29: Limits<T>::max as sentinel breaks descending inputs for floating-point types

The sentinel init = Limits<T>::max is FLT_MAX/DBL_MAX, not +∞.
For LessThan this is acceptable, but if the comparison order is ever flipped (e.g. descending sort), the sentinel will incorrectly dominate the result set.

Consider using numeric_limits<T>::infinity() or specialising for float types.

mistralrs-quant/src/metal_kernels/copy_impl.metal (1)

211-214: Pointer arithmetic on device memory defeats alias analysis

src += src_offset; dst += dst_offset;

Modifying the raw pointer hides the original base address from the compiler, restricting optimisation opportunities and sometimes violating Metal’s “do not form derived pointers” guidance.

Consider keeping the bases immutable and offsetting in the index math instead:

- src += src_offset;
- dst += dst_offset;
- dst[idx.y] = src[idx.x];
+ dst[dst_offset + idx.y] = src[src_offset + idx.x];

This is also consistent with the fixed-stride variants above.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 089b617 and dd59c70.

📒 Files selected for processing (65)
  • Cargo.toml (1 hunks)
  • examples/server/chat.py (1 hunks)
  • mistralrs-core/src/dummy_paged_attention/scheduler.rs (5 hunks)
  • mistralrs-core/src/engine/logger.rs (5 hunks)
  • mistralrs-core/src/engine/mod.rs (2 hunks)
  • mistralrs-core/src/kv_cache/rotating_cache.rs (1 hunks)
  • mistralrs-core/src/kv_cache/single_cache.rs (1 hunks)
  • mistralrs-core/src/models/deepseek2.rs (2 hunks)
  • mistralrs-core/src/models/deepseek3.rs (2 hunks)
  • mistralrs-core/src/models/gemma.rs (1 hunks)
  • mistralrs-core/src/models/gemma2.rs (1 hunks)
  • mistralrs-core/src/models/llama.rs (2 hunks)
  • mistralrs-core/src/models/mistral.rs (2 hunks)
  • mistralrs-core/src/models/mixtral.rs (2 hunks)
  • mistralrs-core/src/models/phi2.rs (2 hunks)
  • mistralrs-core/src/models/phi3.rs (1 hunks)
  • mistralrs-core/src/models/phi3_5_moe.rs (2 hunks)
  • mistralrs-core/src/models/qwen2.rs (2 hunks)
  • mistralrs-core/src/models/qwen3.rs (2 hunks)
  • mistralrs-core/src/models/qwen3_moe.rs (2 hunks)
  • mistralrs-core/src/models/starcoder2.rs (1 hunks)
  • mistralrs-core/src/paged_attention/scheduler.rs (5 hunks)
  • mistralrs-core/src/pipeline/ggml.rs (1 hunks)
  • mistralrs-core/src/pipeline/gguf.rs (1 hunks)
  • mistralrs-core/src/pipeline/macros.rs (2 hunks)
  • mistralrs-core/src/pipeline/mod.rs (2 hunks)
  • mistralrs-core/src/pipeline/normal.rs (1 hunks)
  • mistralrs-core/src/pipeline/paths.rs (5 hunks)
  • mistralrs-core/src/pipeline/vision.rs (1 hunks)
  • mistralrs-core/src/sampler.rs (4 hunks)
  • mistralrs-core/src/scheduler/default_scheduler.rs (7 hunks)
  • mistralrs-core/src/scheduler/mod.rs (2 hunks)
  • mistralrs-core/src/vision_models/gemma3/config.rs (1 hunks)
  • mistralrs-core/src/vision_models/gemma3/text.rs (1 hunks)
  • mistralrs-core/src/vision_models/llama4/config.rs (1 hunks)
  • mistralrs-core/src/vision_models/llama4/text.rs (1 hunks)
  • mistralrs-core/src/vision_models/mllama/config.rs (1 hunks)
  • mistralrs-core/src/vision_models/mllama/text.rs (1 hunks)
  • mistralrs-core/src/vision_models/phi3/mod.rs (2 hunks)
  • mistralrs-core/src/vision_models/phi4/config.rs (1 hunks)
  • mistralrs-core/src/vision_models/phi4/mod.rs (1 hunks)
  • mistralrs-core/src/vision_models/qwen2_5_vl/config.rs (1 hunks)
  • mistralrs-core/src/vision_models/qwen2_5_vl/text.rs (1 hunks)
  • mistralrs-core/src/vision_models/qwen2vl/config.rs (1 hunks)
  • mistralrs-core/src/vision_models/qwen2vl/text.rs (1 hunks)
  • mistralrs-quant/build.rs (3 hunks)
  • mistralrs-quant/kernels/marlin/marlin_kernel.cu (47 hunks)
  • mistralrs-quant/src/lib.rs (1 hunks)
  • mistralrs-quant/src/metal_kernels/bf16.metal (1 hunks)
  • mistralrs-quant/src/metal_kernels/bitwise.metal (1 hunks)
  • mistralrs-quant/src/metal_kernels/copy.metal (1 hunks)
  • mistralrs-quant/src/metal_kernels/copy_impl.metal (1 hunks)
  • mistralrs-quant/src/metal_kernels/mod.rs (2 hunks)
  • mistralrs-quant/src/metal_kernels/quantized.metal (2 hunks)
  • mistralrs-quant/src/metal_kernels/scan.metal (1 hunks)
  • mistralrs-quant/src/metal_kernels/scan_impl.metal (1 hunks)
  • mistralrs-quant/src/metal_kernels/sort.metal (1 hunks)
  • mistralrs-quant/src/metal_kernels/sort_impl.metal (1 hunks)
  • mistralrs-quant/src/metal_kernels/utils.metal (2 hunks)
  • mistralrs-quant/src/metal_kernels/utils.rs (1 hunks)
  • mistralrs-quant/src/safetensors.rs (1 hunks)
  • mistralrs-quant/src/utils/mod.rs (1 hunks)
  • mistralrs-quant/src/utils/ops.rs (5 hunks)
  • scripts/bench.py (1 hunks)
  • scripts/convert_awq_marlin.py (8 hunks)
✅ Files skipped from review due to trivial changes (20)
  • examples/server/chat.py
  • mistralrs-quant/src/metal_kernels/bitwise.metal
  • mistralrs-core/src/models/phi3.rs
  • mistralrs-core/src/vision_models/gemma3/config.rs
  • mistralrs-core/src/pipeline/vision.rs
  • mistralrs-core/src/vision_models/qwen2vl/config.rs
  • mistralrs-core/src/models/gemma.rs
  • mistralrs-core/src/models/gemma2.rs
  • mistralrs-core/src/vision_models/mllama/config.rs
  • mistralrs-core/src/vision_models/llama4/config.rs
  • mistralrs-core/src/models/starcoder2.rs
  • mistralrs-core/src/vision_models/phi4/config.rs
  • mistralrs-core/src/vision_models/qwen2_5_vl/config.rs
  • Cargo.toml
  • mistralrs-quant/src/lib.rs
  • mistralrs-core/src/pipeline/ggml.rs
  • mistralrs-core/src/pipeline/gguf.rs
  • mistralrs-core/src/pipeline/normal.rs
  • mistralrs-quant/kernels/marlin/marlin_kernel.cu
  • mistralrs-quant/src/metal_kernels/sort.metal
🚧 Files skipped from review as they are similar to previous changes (4)
  • mistralrs-core/src/pipeline/mod.rs
  • mistralrs-quant/build.rs
  • mistralrs-quant/src/utils/mod.rs
  • mistralrs-quant/src/metal_kernels/bf16.metal
🧰 Additional context used
🧬 Code Graph Analysis (6)
mistralrs-core/src/kv_cache/rotating_cache.rs (1)
mistralrs-core/src/kv_cache/single_cache.rs (1)
  • all_data (41-43)
mistralrs-core/src/dummy_paged_attention/scheduler.rs (3)
mistralrs-core/src/scheduler/mod.rs (1)
  • schedule (55-55)
mistralrs-core/src/scheduler/default_scheduler.rs (2)
  • schedule (206-301)
  • schedule (311-315)
mistralrs-core/src/paged_attention/scheduler.rs (2)
  • schedule (67-239)
  • schedule (369-373)
mistralrs-core/src/pipeline/macros.rs (1)
mistralrs-core/src/pipeline/paths.rs (1)
  • get_xlora_paths (55-307)
mistralrs-core/src/sampler.rs (6)
mistralrs-core/src/pipeline/mod.rs (7)
  • logits (433-436)
  • logits (443-455)
  • logits (465-480)
  • logits (581-584)
  • logits (591-601)
  • logits (611-626)
  • tokenizer (153-153)
mistralrs-core/src/sequence.rs (1)
  • return_logprobs (785-787)
mistralrs-core/src/pipeline/ggml.rs (1)
  • tokenizer (500-502)
mistralrs-core/src/pipeline/gguf.rs (1)
  • tokenizer (661-663)
mistralrs-core/src/pipeline/normal.rs (1)
  • tokenizer (960-962)
mistralrs-core/src/pipeline/vision.rs (1)
  • tokenizer (819-821)
mistralrs-core/src/kv_cache/single_cache.rs (1)
mistralrs-core/src/kv_cache/rotating_cache.rs (1)
  • all_data (47-49)
mistralrs-core/src/pipeline/paths.rs (2)
mistralrs-core/src/model_loader.rs (3)
  • quantized_filename (337-340)
  • quantized_filename (362-365)
  • quantized_filename (395-398)
mistralrs-core/src/toml_selector.rs (3)
  • quantized_filename (726-729)
  • quantized_filename (753-756)
  • quantized_filename (786-789)
🪛 Ruff (0.11.9)
scripts/convert_awq_marlin.py

13-13: Undefined name List

(F821)


16-16: Undefined name List

(F821)

⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: Test Suite (macOS-latest, stable)
  • GitHub Check: Test Suite (windows-latest, stable)
  • GitHub Check: Docs
  • GitHub Check: Test Suite (ubuntu-latest, stable)
  • GitHub Check: Check (windows-latest, stable)
🔇 Additional comments (74)
mistralrs-core/src/kv_cache/rotating_cache.rs (1)

47-48: Improved API ergonomics with better return type

The change from returning &Option<Tensor> to Option<&Tensor> is a significant improvement. Using self.all_data.as_ref() is the idiomatic Rust approach for this conversion, making the API more intuitive for callers who can now directly use the option without additional dereferencing.

mistralrs-core/src/kv_cache/single_cache.rs (1)

41-42: Improved API ergonomics with better return type

This change correctly transforms the return type from &Option<Tensor> to Option<&Tensor>, making the API more ergonomic. The implementation using self.all_data.as_ref() follows Rust best practices. This change maintains consistency with the parallel modification in rotating_cache.rs.
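As a minimal illustration of the pattern being praised here (the types are stand-ins, not the real cache):

struct Cache {
    all_data: Option<Vec<f32>>, // stand-in for Option<Tensor>
}

impl Cache {
    // `as_ref()` turns `&Option<T>` into `Option<&T>`, so callers can chain
    // `map`, `unwrap_or`, etc. without first matching through a reference.
    fn all_data(&self) -> Option<&Vec<f32>> {
        self.all_data.as_ref()
    }
}

fn main() {
    let c = Cache { all_data: Some(vec![1.0, 2.0]) };
    assert_eq!(c.all_data().map(|d| d.len()).unwrap_or(0), 2);
}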

scripts/convert_awq_marlin.py (5)

79-81: Improved function signature format.

The reformatted function signature for marlin_zero_points is now more readable with parameters on separate lines.


102-104: Improved function signature format.

The reformatted function signature for awq_to_marlin_zero_points is now more readable with parameters on separate lines.


56-60: Improved assertion message formatting.

The assertion message is now properly enclosed in parentheses, which aligns with Python style best practices for multi-line statements.


198-206: Consistent assertion message formatting.

All assertion messages are now properly enclosed in parentheses, which improves code consistency and readability.


177-191: Improved argument parser formatting.

The argument parser configuration is now more consistently formatted with clear indentation, making it easier to read.

scripts/bench.py (2)

18-40: Consider simplifying the log_response function

The function is comprehensive but verbose. For benchmarking purposes, a more concise logging approach might be sufficient.


104-105: 🛠️ Refactor suggestion

Update script execution to use the new argument parsing

Update the main entry point to use the new command-line argument parsing.

 if __name__ == "__main__":
     asyncio.run(main())

Likely an incorrect or invalid review comment.

mistralrs-core/src/vision_models/qwen2vl/text.rs (1)

395-396: Good update to use the model's quantization configuration

Updating the lm_head initialization to use the model's quantization configuration ensures consistent quantization handling across all model components. This change properly propagates the configuration instead of using a hardcoded None.

mistralrs-core/src/vision_models/phi4/mod.rs (1)

412-413: Good update to use the model's quantization configuration

Updating the lm_head initialization to use the model's quantization configuration ensures consistent quantization handling across all model components. This alignment with other model parts improves consistency in the quantization behavior.

mistralrs-core/src/vision_models/mllama/text.rs (1)

580-581: Good update to use the model's quantization configuration

Updating the lm_head initialization to use the model's quantization configuration ensures consistent quantization handling across all model components. This change correctly propagates the quantization settings to all relevant layers.

mistralrs-core/src/vision_models/gemma3/text.rs (1)

494-494: Improved quantization support for language model head.

This change properly propagates the quantization configuration to the language model head, ensuring consistent quantization behavior across all model components. Previously, the language model head was always initialized with None for quantization, preventing it from being quantized even when other layers were.

mistralrs-core/src/vision_models/llama4/text.rs (1)

614-614: Enabled quantization for language model head.

This change properly propagates the quantization configuration to the language model head, ensuring consistent quantization behavior across all model components. This aligns with similar changes across multiple model implementations in the codebase.

mistralrs-core/src/vision_models/qwen2_5_vl/text.rs (1)

399-399: Applied consistent quantization to language model head.

This change ensures the language model head properly receives the model's quantization configuration rather than hardcoding None. This allows the head layer to be quantized consistently with the rest of the model.

mistralrs-core/src/models/phi3_5_moe.rs (2)

49-50: Added serde alias for improved config compatibility.

The serde alias "quantization" allows the model to be loaded from JSON configurations that use either "quantization_config" or "quantization" as the field name, providing backward compatibility with different model serialization formats.
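For reference, a minimal sketch of the attribute in question, assuming serde (with the derive feature) and serde_json are available; the field and value types are illustrative:

use serde::Deserialize;

#[derive(Deserialize)]
struct Config {
    // Accepts either "quantization_config" or "quantization" as the JSON key.
    #[serde(alias = "quantization")]
    quantization_config: Option<String>, // stand-in for the real config type
}

fn main() -> Result<(), serde_json::Error> {
    let a: Config = serde_json::from_str(r#"{ "quantization_config": "awq" }"#)?;
    let b: Config = serde_json::from_str(r#"{ "quantization": "awq" }"#)?;
    assert_eq!(a.quantization_config, b.quantization_config);
    Ok(())
}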


661-661: Propagated quantization config to language model head.

This change enables proper quantization of the language model head using the same configuration as the rest of the model, ensuring consistent quantization behavior.

mistralrs-core/src/models/mixtral.rs (2)

48-49: LGTM: Added serde alias "quantization" for the quantization_config field.

This enhances deserialization flexibility by allowing model configs to use either "quantization_config" or "quantization" key.


584-584: LGTM: Now propagating quantization config to language model head.

This change enables consistent quantization by passing the configuration to the LM head instead of hardcoded None.

mistralrs-core/src/models/qwen3.rs (2)

59-60: LGTM: Added serde alias "quantization" for the quantization_config field.

This enhances deserialization flexibility by allowing model configs to use either "quantization_config" or "quantization" key.


492-492: LGTM: Now propagating quantization config to language model head.

This change enables consistent quantization by passing the configuration to the LM head instead of hardcoded None.

mistralrs-core/src/models/llama.rs (2)

47-48: LGTM: Added serde alias "quantization" for the quantization_config field.

This enhances deserialization flexibility by allowing model configs to use either "quantization_config" or "quantization" key.


385-385: LGTM: Now propagating quantization config to language model head.

This change enables consistent quantization by passing the configuration to the LM head instead of hardcoded None.

mistralrs-core/src/models/deepseek2.rs (2)

96-97: LGTM: Added serde alias "quantization" for the quantization_config field.

This enhances deserialization flexibility by allowing model configs to use either "quantization_config" or "quantization" key.


792-792: LGTM: Now propagating quantization config to language model head.

This change enables consistent quantization by passing the configuration to the LM head instead of hardcoded None.

mistralrs-quant/src/safetensors.rs (1)

209-209: Code simplification improves readability.

Simplified the tensor loading logic by directly delegating to the underlying load implementation with the provided dtype. This removes an unnecessary intermediate step where a tensor was first loaded and then potentially converted to a requested dtype.

mistralrs-core/src/vision_models/phi3/mod.rs (2)

78-79: Adds serialization flexibility with serde alias.

Adding the alias = "quantization" attribute allows the model to deserialize configuration that uses either quantization_config or quantization as the field name. This improves compatibility with different model file formats.


1030-1031: Propagates quantization configuration to language model head.

The language model head now receives the proper quantization configuration instead of None. This ensures consistent quantization behavior across all model components, including the output layer.

mistralrs-core/src/engine/mod.rs (2)

17-17: Makes IntervalLogger publicly accessible.

Changed the IntervalLogger import to be public, enabling its use outside the engine module. This supports the integration of logging into other components like the scheduler.


175-175: Passes logger to scheduler for enhanced metrics tracking.

Updated to pass the engine's logger to the scheduler, enabling the scheduler to update metrics like number of running and waiting sequences. This change aligns with the updated scheduler interface.

mistralrs-core/src/scheduler/mod.rs (2)

9-9: Adds import for IntervalLogger.

Imported IntervalLogger from the engine module to support the updated scheduler interface.


55-55: Updates Scheduler trait to include logging capability.

The schedule method now requires an IntervalLogger reference, enabling all scheduler implementations to update metrics about running and waiting sequences. This enhances observability of the scheduling process.
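A hedged sketch of the new shape of the trait (simplified signatures and types, not the exact mistralrs-core definitions):

struct IntervalLogger; // internals sketched in the logger example further down

impl IntervalLogger {
    fn set_num_running(&self, _n: usize) {}
    fn set_num_waiting(&self, _n: usize) {}
}

struct SchedulerOutput;

trait Scheduler {
    // The logger reference lets every implementation report its queue sizes
    // at the points where scheduling decisions are made.
    fn schedule(&mut self, logger: &IntervalLogger) -> SchedulerOutput;
}

struct DefaultScheduler {
    running: Vec<u64>, // sequence ids, illustrative
    waiting: Vec<u64>,
}

impl Scheduler for DefaultScheduler {
    fn schedule(&mut self, logger: &IntervalLogger) -> SchedulerOutput {
        // ...scheduling decisions elided...
        logger.set_num_running(self.running.len());
        logger.set_num_waiting(self.waiting.len());
        SchedulerOutput
    }
}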

mistralrs-core/src/models/qwen3_moe.rs (2)

57-58: Added serde alias for backward compatibility.

The #[serde(alias = "quantization")] attribute allows the field to be deserialized from JSON/YAML with either "quantization_config" or "quantization" as the key, improving backward compatibility with existing model configurations.


695-696: Propagating quantization config to language model head.

The change ensures that quantization settings are consistently applied to the language model head (lm_head), where previously a hardcoded None was used. This allows for proper quantization across all model components.

mistralrs-core/src/models/qwen2.rs (2)

43-44: Added serde alias for backward compatibility.

The #[serde(alias = "quantization")] attribute allows deserializing the quantization configuration from either "quantization_config" or "quantization" field names, enhancing compatibility with various model configuration formats.


423-424: Propagating quantization config to language model head.

This change replaces a hardcoded None with &cfg.quantization_config, ensuring that the language model head properly respects the quantization settings provided to the model.

mistralrs-core/src/models/deepseek3.rs (2)

96-97: Added serde alias for backward compatibility.

The #[serde(alias = "quantization")] attribute enables deserialization from either "quantization_config" or "quantization" field names, improving compatibility with different model configuration formats.


845-846: Propagating quantization config to language model head.

This change ensures that the language model head (lm_head) consistently uses the same quantization configuration as the rest of the model components, replacing a hardcoded None reference.

mistralrs-core/src/engine/logger.rs (7)

15-16: Added tracking for running and waiting sequences.

These new atomic counters enable monitoring the number of sequences in different states in the scheduler, which is valuable for debugging and performance monitoring.


26-28: Initialized atomic counters for sequence tracking.

Proper initialization of the atomic counters for tracking running and waiting sequences, consistent with the initialization pattern used for other counters.


33-34: Added thread-local clones of atomic counters.

These thread-local clones enable the background logging thread to safely access the sequence state counters without risking thread safety issues.


46-47: Reading scheduler statistics in the logging thread.

This code loads the current values of running and waiting sequences to be displayed in periodic log messages, providing runtime visibility into scheduler state.


51-52: Enhanced logging output with scheduler information.

The log format now includes the count of running and waiting sequences, providing runtime visibility into the scheduler's workload and potential bottlenecks.


64-65: Added fields to struct for tracking sequence counts.

These fields complete the implementation of sequence tracking by storing the atomic counters in the IntervalLogger struct.


86-92: Added public setters for scheduler metrics.

These methods allow the scheduler to update the number of running and waiting sequences, completing the integration between the scheduler and logging components.
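Putting the logger pieces together, a simplified sketch of how the counters and setters might be wired (an illustration only; the real logger also tracks throughput and more):

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

pub struct IntervalLogger {
    num_running: Arc<AtomicUsize>,
    num_waiting: Arc<AtomicUsize>,
}

impl IntervalLogger {
    pub fn new(interval: Duration) -> Self {
        let num_running = Arc::new(AtomicUsize::new(0));
        let num_waiting = Arc::new(AtomicUsize::new(0));

        // Clones handed to the background reporting thread.
        let running = num_running.clone();
        let waiting = num_waiting.clone();
        thread::spawn(move || loop {
            thread::sleep(interval);
            println!(
                "{} running, {} waiting",
                running.load(Ordering::Relaxed),
                waiting.load(Ordering::Relaxed)
            );
        });

        Self { num_running, num_waiting }
    }

    // Setters the scheduler calls at its decision points.
    pub fn set_num_running(&self, n: usize) {
        self.num_running.store(n, Ordering::Relaxed);
    }

    pub fn set_num_waiting(&self, n: usize) {
        self.num_waiting.store(n, Ordering::Relaxed);
    }
}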

mistralrs-core/src/models/mistral.rs (2)

45-45: Adds serde deserialization compatibility for legacy field name.

Adding the alias = "quantization" attribute allows the field to be deserialized from JSON that uses either "quantization_config" or the legacy "quantization" name, improving backward compatibility.


451-452: Enables quantization for the language model head layer.

Now the lm_head initialization properly uses the model's quantization configuration instead of passing None. This change ensures consistent quantization behavior across all model components.

mistralrs-core/src/models/phi2.rs (2)

54-55: Adds serde deserialization compatibility for legacy field name.

Adding the alias = "quantization" attribute allows the field to be deserialized from JSON that uses either "quantization_config" or the legacy "quantization" name, improving backward compatibility.


530-531: Enables quantization for the language model head layer.

Now the lm_head initialization properly uses the model's quantization configuration instead of passing None. This change ensures consistent quantization behavior across all model components.

mistralrs-core/src/pipeline/macros.rs (4)

99-101: Updates argument passing style to match function signatures.

Changed from passing references to Options (&Option<T>) to passing optional references (Option<&T>) using .as_ref(), aligning with updated function signatures in paths.rs.


107-112: Updates xlora-related parameters to use optional references.

Changed from passing references to Options to passing optional references using .as_ref(), creating a more idiomatic API design and matching updated function signatures.


286-288: Updates argument passing style for model paths in GGUF context.

Changed from passing Option<T> directly to wrapping values in Some(&T), ensuring type compatibility with updated function signatures expecting Option<&T> parameters.


295-300: Updates xlora-related parameter passing in GGUF context.

Changed from passing Options directly to using .as_ref() for consistent argument passing style that aligns with the updated function signatures in paths.rs.

mistralrs-core/src/paged_attention/scheduler.rs (5)

20-21: Adds IntervalLogger import for enhanced monitoring.

Added import for the IntervalLogger that will be used to track and report scheduler metrics.


67-67: Updates scheduler interface to integrate logging.

The schedule method now accepts a reference to an IntervalLogger parameter to enable reporting of scheduler metrics.


123-125: Adds logging for scheduler status after promotion phase.

After sequences are either scheduled or ignored during promotion from waiting to running, the logger is updated with current counts of running and waiting sequences, enhancing observability.


230-232: Adds logging for scheduler status after main scheduling logic.

The logger is updated with the final counts of running and waiting sequences after all scheduling decisions have been made, providing visibility into the scheduler's state.


369-372: Updates Scheduler trait implementation to propagate logger.

The trait implementation now passes the logger through to the concrete implementation, maintaining consistent logging behavior.

mistralrs-core/src/dummy_paged_attention/scheduler.rs (1)

67-67: LGTM! Consistent logger integration

The IntervalLogger integration is well-implemented, with appropriate updates at key points in the scheduling pipeline. This change aligns perfectly with the pattern established in other scheduler implementations.

Also applies to: 123-124, 230-231, 369-371

mistralrs-core/src/scheduler/default_scheduler.rs (1)

206-206: LGTM! Thorough logger integration

The IntervalLogger integration is well-implemented with updates at all key decision points in the scheduling pipeline. The logger calls are strategically placed to ensure accurate reporting of running and waiting sequence counts throughout the execution path.

Also applies to: 218-219, 233-234, 248-249, 284-286, 311-313

mistralrs-core/src/sampler.rs (1)

794-802: LGTM! Feature-gated fast sampler

The conditional return based on the metal feature flag is a good approach. This properly preserves the original sampling path for non-metal builds while enabling the optimized path for metal, avoiding the unreachable code issue that would occur with an unconditional return.
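The pattern being endorsed is roughly the following hedged sketch (function names are placeholders, not the real sampler API):

fn sample(logits: &[f32]) -> usize {
    #[cfg(feature = "metal")]
    {
        // Optimized GPU-assisted path; returning here means the code below is
        // compiled out rather than left behind as unreachable.
        return fast_sample_metal(logits);
    }

    #[cfg(not(feature = "metal"))]
    {
        slow_sample(logits)
    }
}

#[cfg(feature = "metal")]
fn fast_sample_metal(logits: &[f32]) -> usize {
    // The GPU top-k/top-p pipeline would live here; argmax as a stand-in.
    slow_sample(logits)
}

fn slow_sample(logits: &[f32]) -> usize {
    // Plain argmax as a stand-in for the full sampling pipeline.
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
        .unwrap_or(0)
}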

mistralrs-quant/src/metal_kernels/utils.rs (1)

95-137: LGTM! Well-implemented thread-group sizing function

The get_block_dims function provides efficient 3D thread-group sizing with power-of-two thread counts while respecting dimension extents. This is a good port of MLX's approach and will help optimize kernel launch configurations.
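For readers unfamiliar with the MLX heuristic, a hedged restatement of the idea in plain Rust (not the exact utils.rs code): grant another factor of two to whichever axis can still use it, one bit at a time, until the 1024-thread (2^10) budget is spent.

/// Sketch of an MLX-style block-dimension heuristic.
fn get_block_dims(dim0: usize, dim1: usize, dim2: usize) -> (usize, usize, usize) {
    let dims = [dim0, dim1, dim2];
    let mut pows = [0u32; 3];
    let mut total_bits = 0u32;

    loop {
        let before = total_bits;
        for i in 0..3 {
            // Give axis `i` another factor of two only if its extent allows it.
            if total_bits < 10 && dims[i] >= (1usize << (pows[i] + 1)) {
                pows[i] += 1;
                total_bits += 1;
            }
        }
        // Stop when no axis accepted another bit or the 1024-thread budget is spent.
        if before == total_bits || total_bits == 10 {
            break;
        }
    }

    (1 << pows[0], 1 << pows[1], 1 << pows[2])
}

fn main() {
    // e.g. a (4096, 7, 1) problem gets a wide x dimension and a small y dimension.
    assert_eq!(get_block_dims(4096, 7, 1), (256, 4, 1));
}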

mistralrs-quant/src/metal_kernels/copy.metal (1)

52-63: LGTM – exhaustive instantiation achieved

The final block instantiates the kernels for every primitive/bfloat16/half type across the three common group sizes.
No gaps spotted, order is consistent with the rest of the code-base.

mistralrs-quant/src/metal_kernels/quantized.metal (1)

1598-1644: Pointer-increment logic: verify overflow on partial K blocks

The new cache-friendly loop assumes that in_vec_size is an exact multiple of block_size (values_per_thread * SIMD_SIZE).
If this is not guaranteed, the final iteration will read past the end of x, w, scales, and biases.

The slow path (qmv_impl) handles the “tail” with load_vector_safe, but qmv_fast_impl does not.

Please double-check callers to ensure the fast kernel is only launched with aligned sizes, or add a guarded tail loop similar to the reference implementation.
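A small, hypothetical host-side guard of the kind being asked for (the kernel names and dispatch shape are assumptions, not the real launcher code):

// Only take the fast kernel when the reduction length divides evenly into
// full blocks; otherwise fall back to the kernel with a guarded tail.
fn select_qmv_kernel(in_vec_size: usize, values_per_thread: usize, simd_size: usize) -> &'static str {
    let block_size = values_per_thread * simd_size;
    if in_vec_size % block_size == 0 {
        "qmv_fast" // assumes every SIMD group reads whole blocks only
    } else {
        "qmv" // slow path that loads the tail safely
    }
}

fn main() {
    assert_eq!(select_qmv_kernel(4096, 8, 32), "qmv_fast"); // 4096 % 256 == 0
    assert_eq!(select_qmv_kernel(4100, 8, 32), "qmv");      // tail of 4 elements
}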

mistralrs-quant/src/metal_kernels/utils.metal (10)

6-7: Good addition of #pragma once directive.

This is a standard practice to prevent multiple inclusions of this header file, which is especially important for utility headers that might be included in multiple translation units.


616-622: Clean template struct implementation for work distribution.

The WorkPerThread struct elegantly scales work based on data type size, which is a good practice for optimizing GPU workloads.


627-672: Well-structured type limits implementation.

The Limits struct and its specializations provide a clean abstraction for accessing numeric limits of different types, with proper handling for both integral and floating-point types.


678-767: Comprehensive indexing utilities for array traversal.

The indexing utilities provide efficient functions for converting linear indices to memory locations for arrays of different dimensions, which is crucial for GPU kernel performance.
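A hedged restatement of the core idea behind such elem_to_loc-style helpers, in plain Rust rather than MSL: map a linear element index to a strided offset by peeling the innermost (last) dimension first, matching a row-major linear index.

fn elem_to_loc(mut elem: usize, shape: &[usize], strides: &[usize]) -> usize {
    let mut loc = 0;
    for (dim, stride) in shape.iter().zip(strides).rev() {
        loc += (elem % dim) * stride; // position along this axis times its stride
        elem /= dim;                  // move on to the next-outer axis
    }
    loc
}

fn main() {
    // A 2x3 matrix stored transposed (column-major): shape [2, 3], strides [1, 2].
    // Linear element 4 is (row 1, col 1), which lives at offset 1*1 + 1*2 = 3.
    assert_eq!(elem_to_loc(4, &[2, 3], &[1, 2]), 3);
}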


773-847: Well-designed recursive template for multi-dimensional iteration.

The LoopedElemToLoc template provides an elegant solution for traversing multi-dimensional arrays, with specialized implementations for different dimensionalities.


869-897: Numerically stable calculation utilities.

The ceildiv and log1p implementations follow best practices for numerical stability, including proper handling of edge cases.
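For illustration, one common formulation of those two helpers in Rust (an assumption about the general approach; the Metal code may differ in details):

/// Ceiling division that avoids the overflow in `(n + d - 1) / d` when `n + d` would wrap.
fn ceildiv(n: u64, d: u64) -> u64 {
    n / d + u64::from(n % d != 0)
}

/// Numerically stable log(1 + x) for small |x|: rescale by the rounding error
/// committed when forming `1 + x`, instead of calling ln(1 + x) directly.
fn log1p_stable(x: f32) -> f32 {
    let xp1 = 1.0 + x;
    if xp1 == 1.0 {
        x // 1 + x rounded back to 1, so log(1 + x) is approximately x
    } else if xp1.is_infinite() {
        xp1 // overflow for very large x
    } else {
        x * xp1.ln() / (xp1 - 1.0)
    }
}

fn main() {
    assert_eq!(ceildiv(10, 3), 4);
    let tiny = 1e-10_f32;
    // A naive (1.0 + tiny).ln() rounds to 0.0; the stable version keeps ~tiny.
    assert!((log1p_stable(tiny) - tiny).abs() < 1e-16);
}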


903-956: Effective SIMD shuffle operations for 64-bit types and booleans.

These implementations properly wrap Metal's SIMD intrinsics, extending them to handle 64-bit integers and boolean types.


957-1087: Comprehensive set of templated operator functors.

These functors provide a clean, uniform interface for arithmetic, comparison, and logical operations, with appropriate specializations for different types.


1090-1107: Efficient power implementation for integral types.

The Power functor's specialization for integral types uses the efficient binary exponentiation algorithm (also known as exponentiation by squaring), which is optimal for computing integer powers.
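Binary exponentiation in a sentence: square the base while halving the exponent, multiplying into the result whenever the current low bit is set, for O(log n) multiplies instead of O(n). A small Rust illustration of the algorithm (not the Metal functor itself):

fn ipow(mut base: u64, mut exp: u32) -> u64 {
    let mut result = 1u64;
    while exp > 0 {
        if exp & 1 == 1 {
            result = result.wrapping_mul(base); // multiply in when the low bit is set
        }
        base = base.wrapping_mul(base); // square the base
        exp >>= 1;                      // advance to the next bit
    }
    result
}

fn main() {
    assert_eq!(ipow(3, 5), 243);   // 3^5
    assert_eq!(ipow(2, 10), 1024); // 2^10 in ~4 multiplies
}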


1153-1159: Useful kernel instantiation macro.

The instantiate_kernel macro provides a convenient way to instantiate templated Metal kernels with specified host names and template parameters.

Comment on lines 13 to 16
scale_perm: List[int] = []
for i in range(8):
    scale_perm.extend([i + 8 * j for j in range(8)])
scale_perm_single: List[int] = []

⚠️ Potential issue

Missing import for List from the typing module.

The code uses List type annotations, but the typing module that provides List isn't imported.

Add the following import at the top of the file:

import os
import shutil
+from typing import List
🧰 Tools
🪛 Ruff (0.11.9)

13-13: Undefined name List

(F821)


16-16: Undefined name List

(F821)

🤖 Prompt for AI Agents
In scripts/convert_awq_marlin.py around lines 13 to 16, the code uses the List
type annotation but does not import List from the typing module. To fix this,
add the import statement "from typing import List" at the top of the file to
ensure the List type is recognized.

Comment on lines +46 to +62
async def timed_chat(client: AsyncOpenAI, messages):
    """
    Send one chat completion request and return (completion, elapsed_seconds, completion_tokens).
    """
    start = time.perf_counter()
    completion = await client.chat.completions.create(
        model="mlx-community/Mistral-7B-Instruct-v0.3-4bit",
        messages=messages,
        max_tokens=256,
        frequency_penalty=1.0,
        top_p=0.1,
        temperature=0,
    )
    elapsed = time.perf_counter() - start
    # Safely get number of completion tokens, default to 0 if missing
    completion_tokens = getattr(completion.usage, "completion_tokens", 0)
    return completion, elapsed, completion_tokens

🛠️ Refactor suggestion

Parameterize the timed_chat function

Model name and generation parameters are hardcoded, limiting the benchmark's flexibility for testing different configurations.

-async def timed_chat(client: AsyncOpenAI, messages):
+async def timed_chat(
+    client: AsyncOpenAI, 
+    messages, 
+    model_name="mlx-community/Mistral-7B-Instruct-v0.3-4bit",
+    max_tokens=256,
+    frequency_penalty=1.0,
+    top_p=0.1,
+    temperature=0
+):
     """
     Send one chat completion request and return (completion, elapsed_seconds, completion_tokens).
     """
     start = time.perf_counter()
     completion = await client.chat.completions.create(
-        model="mlx-community/Mistral-7B-Instruct-v0.3-4bit",
+        model=model_name,
         messages=messages,
-        max_tokens=256,
-        frequency_penalty=1.0,
-        top_p=0.1,
-        temperature=0,
+        max_tokens=max_tokens,
+        frequency_penalty=frequency_penalty,
+        top_p=top_p,
+        temperature=temperature,
     )
     elapsed = time.perf_counter() - start
     # Safely get number of completion tokens, default to 0 if missing
     completion_tokens = getattr(completion.usage, "completion_tokens", 0)
     return completion, elapsed, completion_tokens
🤖 Prompt for AI Agents
In scripts/bench.py around lines 46 to 62, the timed_chat function has hardcoded
model name and generation parameters, reducing flexibility. Modify the function
signature to accept model name and generation parameters as arguments with
default values, then use these parameters inside the function instead of
hardcoded values. This will allow benchmarking with different configurations
without changing the function code.

Comment on lines +65 to +78
async def user_task(client: AsyncOpenAI, system_prompt: str, user_message: str):
    """
    Returns list of (completion, elapsed_seconds, completion_tokens).
    """
    results = []
    base_messages = []
    if system_prompt:
        base_messages.append({"role": "system", "content": system_prompt})

    for _ in range(REQUESTS_PER_USER):
        messages = base_messages + [{"role": "user", "content": user_message}]
        completion, elapsed, completion_tokens = await timed_chat(client, messages)
        results.append((completion, elapsed, completion_tokens))
    return results

🛠️ Refactor suggestion

Add error handling to user_task

The function lacks error handling, which could lead to unhandled exceptions if requests fail.

 async def user_task(client: AsyncOpenAI, system_prompt: str, user_message: str):
     """
     Returns list of (completion, elapsed_seconds, completion_tokens).
     """
     results = []
     base_messages = []
     if system_prompt:
         base_messages.append({"role": "system", "content": system_prompt})
 
     for _ in range(REQUESTS_PER_USER):
-        messages = base_messages + [{"role": "user", "content": user_message}]
-        completion, elapsed, completion_tokens = await timed_chat(client, messages)
-        results.append((completion, elapsed, completion_tokens))
+        try:
+            messages = base_messages + [{"role": "user", "content": user_message}]
+            completion, elapsed, completion_tokens = await timed_chat(client, messages)
+            results.append((completion, elapsed, completion_tokens))
+        except Exception as e:
+            print(f"Error during request: {e}")
+            # Add a failed result with zero tokens and high elapsed time to indicate failure
+            results.append((None, 10.0, 0))
     return results
🤖 Prompt for AI Agents
In scripts/bench.py around lines 65 to 78, the user_task function does not
handle errors, risking unhandled exceptions during requests. Add try-except
blocks around the await timed_chat call to catch exceptions, log or handle them
appropriately, and ensure the function continues or fails gracefully without
crashing.

Comment on lines +1578 to 1580
uint simd_gid [[simdgroup_index_in_simdgroup]],
uint simd_lid [[thread_index_in_simdgroup]]) {
constexpr int power_of_2_bits = (bits & (bits - 1)) == 0;

⚠️ Potential issue

Attribute typo breaks compilation

simdgroup_index_in_simdgroup is not a valid Metal shading language attribute – the correct attribute for “SIMD-group index inside the current thread-group” is simdgroup_index_in_threadgroup.

This line will make the entire Metal library fail to compile on Apple tools ≥ Xcode 9.

- uint simd_gid [[simdgroup_index_in_simdgroup]],
+ uint simd_gid [[simdgroup_index_in_threadgroup]],

Please adjust here and in any other newly-added kernels to restore successful compilation.

🤖 Prompt for AI Agents
In mistralrs-quant/src/metal_kernels/quantized.metal at lines 1578 to 1580, the
attribute simdgroup_index_in_simdgroup is incorrect and causes compilation
failure. Replace simdgroup_index_in_simdgroup with the correct attribute
simdgroup_index_in_threadgroup to fix the compilation error. Also, check any
other new kernel code for the same typo and correct it similarly.

Comment on lines +318 to 323
match quantized_filename {
    Some(names) => {
-       let id = quantized_model_id.as_ref().unwrap();
+       let id = quantized_model_id.unwrap();
        let mut files = Vec::new();

        for name in names {

⚠️ Potential issue

unwrap() may panic when only filenames are supplied

Inside get_model_paths the Some(names) arm unconditionally unwraps quantized_model_id:

let id = quantized_model_id.unwrap();

If the caller provides an explicit quantized_filename list but forgets the corresponding quantized_model_id, the process will panic instead of returning an error.

Consider returning an anyhow::Error instead:

- let id = quantized_model_id.unwrap();
+ let id = quantized_model_id
+     .ok_or_else(|| anyhow::anyhow!("`quantized_model_id` must be supplied when `quantized_filename` is provided"))?;

This keeps the function panic-free and surfaces configuration errors gracefully.

🤖 Prompt for AI Agents
In mistralrs-core/src/pipeline/paths.rs around lines 318 to 323, the code
unconditionally unwraps quantized_model_id inside the Some(names) match arm,
which can cause a panic if quantized_model_id is None. Modify the code to check
if quantized_model_id is Some before unwrapping; if it is None, return an
anyhow::Error indicating the missing quantized_model_id instead of panicking.
This change will make the function panic-free and handle configuration errors
gracefully.

Comment on lines +1091 to +1104
impl SortOp for Tensor {
    fn fast_argsort_asc<D: Dim>(&self, axis: D) -> Result<Tensor> {
        if self.device().is_cpu() || self.device().is_cuda() {
            return self.arg_sort_last_dim(true);
        }
        self.apply_op1_no_bwd(&ArgSort {
            axis: axis.to_index(self.shape(), "argsort")?,
        })
    }

    fn fast_sort_asc<D: Dim>(&self, axis: D) -> Result<Tensor> {
        if self.device().is_cpu() || self.device().is_cuda() {
            return Ok(self.sort_last_dim(true)?.0);
        }

🛠️ Refactor suggestion

⚠️ Potential issue

fast_argsort_asc / fast_sort_asc ignore the supplied axis when running on CPU/CUDA

Both helpers immediately fall back to arg_sort_last_dim / sort_last_dim, which always act on the last dimension, even if the caller asked for another axis.
This yields silently-wrong results for any multi-dimensional tensor whose last dimension is not the sort axis.

-        if self.device().is_cpu() || self.device().is_cuda() {
-            return self.arg_sort_last_dim(true);
-        }
+        if self.device().is_cpu() || self.device().is_cuda() {
+            // Candle already provides arg_sort()/sort() that respect an axis.
+            // Preserve the public behaviour by forwarding the user-requested axis.
+            let axis = axis.to_index(self.shape(), "argsort")?;
+            return self.arg_sort(axis, true);          // or `self.sort(axis, true)?`
+        }

(The same change is required for fast_sort_asc a few lines below).

Please ensure the accompanying unit-tests cover non-trailing axes so the issue is caught automatically in the future.

🤖 Prompt for AI Agents
In mistralrs-quant/src/utils/ops.rs around lines 1091 to 1104, the methods
fast_argsort_asc and fast_sort_asc ignore the supplied axis parameter when
running on CPU or CUDA by always sorting along the last dimension. To fix this,
modify these methods to use the provided axis for sorting instead of defaulting
to the last dimension, ensuring the axis is converted properly using
axis.to_index with the tensor shape. Additionally, update or add unit tests to
cover sorting on non-last axes to prevent this issue from recurring.

Comment on lines +143 to +156
  auto idx = elem_to_loc_2_nd<IdxT>({N * index.x, index.y, index.z}, src_shape,
                                    src_strides, dst_strides, ndim);
  if (N == 1) {
    dst[idx.y] = static_cast<U>(src[idx.x]);
    return;
  }
  IdxT src_xstride = src_strides[ndim - 1];
  IdxT dst_xstride = dst_strides[ndim - 1];
  auto xshape = src_shape[ndim - 1];
  for (int i = 0; i < N && (int(N * index.x) + i) < xshape; ++i) {
    dst[idx.y] = static_cast<U>(src[idx.x]);
    idx.x += src_xstride;
    idx.y += dst_xstride;
  }

⚠️ Potential issue

Missing bounds check in strided gather→gather copies

copy_gg writes

dst[idx.y] = static_cast<U>(src[idx.x]);

inside a loop where both idx.x and idx.y are manually incremented.
Neither value is compared against the logical src/dst sizes, so a malformed shape/stride pair can lead to OOB writes and kernel crashes.

Unless you have stronger invariants on the Rust side, protect the loop:

if (idx.x < src_len && idx.y < dst_len) { … }

Even a single device_assert would greatly improve debuggability.

🤖 Prompt for AI Agents
In mistralrs-quant/src/metal_kernels/copy_impl.metal around lines 143 to 156,
the loop writing to dst and reading from src increments idx.x and idx.y without
bounds checks, risking out-of-bounds memory access. To fix this, add conditional
checks inside the loop to ensure idx.x is less than the source length and idx.y
is less than the destination length before performing the copy. Optionally,
include device_assert statements to catch violations during debugging. This will
prevent OOB writes and improve kernel stability.

Comment on lines +8 to +10
  for (int i = 0; i < N && (index + i) < size; ++i) {
    dst[index + i] = static_cast<U>(src[0]);
  }

🛠️ Refactor suggestion

Avoid signed/unsigned comparisons to prevent subtle out-of-bounds bugs

All inner loops use int i while the limit expressions (index, size, xshape) are uint or int64_t.
Because of the usual arithmetic conversions the compiler silently casts operands back and forth, which can bite once size ≥ 2³¹ or when negative intermediates are introduced.

A minimal change keeps the code safe and helps the optimiser:

-for (int i = 0; i < N && (index + i) < size; ++i) {
+for (uint i = 0u; i < uint(N) && (index + i) < size; ++i) {

Please repeat the same change in every loop highlighted above (copy_s*, copy_v*, copy_g, copy_gg, copy_gg_dynamic).
(The template parameter N is still int; only the loop counter needs to be unsigned.)

Also applies to: 18-20, 29-31, 40-42, 96-99, 152-156, 222-226

🤖 Prompt for AI Agents
In mistralrs-quant/src/metal_kernels/copy_impl.metal at lines 8 to 10 and
similarly at lines 18-20, 29-31, 40-42, 96-99, 152-156, and 222-226, the loop
counter variable is declared as a signed int while the loop limit variables are
unsigned or int64_t, causing unsafe signed/unsigned comparisons. Change the loop
counter type from int to an appropriate unsigned integer type to match the limit
variables, ensuring safe comparisons and preventing potential out-of-bounds
errors. Apply this change consistently to all loops mentioned in the comment.

Comment on lines +85 to +88
  auto src_idx = elem_to_loc<IdxT>({N * index.x, index.y, index.z}, src_shape,
                                   src_strides, ndim);
  if (N == 1) {
    IdxT dst_idx =

🛠️ Refactor suggestion

Brace-initialiser may fail to compile on some Metal toolchains

elem_to_loc is called with {N * index.x, index.y, index.z} which is not guaranteed to deduce to a uint3/int3 in MSL.
Several versions of Xcode/metal-clang reject this syntax.

Be explicit to avoid portability issues:

-auto src_idx = elem_to_loc<IdxT>({N * index.x, index.y, index.z},
+auto src_idx = elem_to_loc<IdxT>(uint3(N * index.x, index.y, index.z),
                                  src_shape, src_strides, ndim);

Also applies to: 95-97

🤖 Prompt for AI Agents
In mistralrs-quant/src/metal_kernels/copy_impl.metal around lines 85 to 88 and
also lines 95 to 97, the call to elem_to_loc uses a brace-initializer {N *
index.x, index.y, index.z} which may not compile on some Metal toolchains due to
type deduction issues. To fix this, explicitly construct a uint3 or int3
(matching the expected type) with these values instead of using a
brace-initializer, ensuring compatibility across Metal versions.

Comment on lines +27 to +32
                      uint2 grid_dim [[threads_per_grid]]) {
  auto offset = N * (index.x + grid_dim.x * int64_t(index.y));
  for (int i = 0; i < N && (offset + i) < size; ++i) {
    dst[offset + i] = static_cast<U>(src[0]);
  }
}

⚠️ Potential issue

Potential 32-bit overflow in 2-D kernels

offset is computed as

auto offset = N * (index.x + grid_dim.x * int64_t(index.y));

index.x and grid_dim.x are uint, so the addition is performed in 32-bit before it is promoted to int64_t by auto.
With large grids (≥ 65 536 × 65 536) this silently wraps and corrupts memory.

Safer version:

-auto offset = N * (index.x + grid_dim.x * int64_t(index.y));
+auto offset = IdxT(N) * (IdxT(index.x) + IdxT(grid_dim.x) * IdxT(index.y));

where IdxT is the same 64-bit type used by your helpers.

Also applies to: 39-42

🤖 Prompt for AI Agents
In mistralrs-quant/src/metal_kernels/copy_impl.metal around lines 27 to 32, the
calculation of offset uses 32-bit arithmetic for the addition of index.x and
grid_dim.x * index.y before converting to 64-bit, which can cause overflow with
large grid sizes. To fix this, cast index.x and grid_dim.x to the 64-bit integer
type (IdxT) before performing the addition and multiplication to ensure all
arithmetic is done in 64-bit and prevent overflow. Apply the same fix to the
similar code block at lines 39 to 42.

@EricLBuehler EricLBuehler merged commit a97de2b into master May 21, 2025
13 checks passed
@EricLBuehler EricLBuehler deleted the fast_sampler branch May 21, 2025 17:12
@EricLBuehler EricLBuehler restored the fast_sampler branch May 22, 2025 01:06
@EricLBuehler EricLBuehler deleted the fast_sampler branch May 22, 2025 01:06
Jeadie added a commit to spiceai/mistral.rs that referenced this pull request Jul 14, 2025
* Fix handling of Metal fused attn head dims (EricLBuehler#1234)

* Fix handling of metal attn head dims

* Fix handling of gemma3 1b when images

* Tweak default for paged attn builder

* Support paged attn for vision model rust api (EricLBuehler#1235)

* [Breaking] Support setting HF cache path (EricLBuehler#1237)

* Add it internally

* Add the apis

* Support tool calling for DeepSeek models (EricLBuehler#1239)

* Support tool calling for deepseek models

* Format

* Fix deepseek

* Server image processing refactor and fixes (EricLBuehler#1244)

* Fix strict gemma3 case

* Accept multiple images in the content array

* Fix multiple images in one array ct

* Add it to the python api

* Typos

* Optimized CUDA RoPE kernels (EricLBuehler#1247)

* Add the kernels

* It works

* Works

* Buulds

* Typo fix (add_speial_tokens to add_special_tokens) (EricLBuehler#1246)

* Fix typo

* Update mistralrs.pyi

* Fixes for UQFF + distributed layers (EricLBuehler#1250)

* Fixes for uqff + distributed layers

* Typo

* Automatic agentic search integration (`web_search_options`) (EricLBuehler#1243)

* Add the tool

* Actually search

* Clippy

* Sort of works

* Remove some debuggers

* tweak

* Add some rules

* Works great

* Tweak 'system' prompt

* Update mistralrs-core/src/search/mod.rs

Co-authored-by: Copilot <[email protected]>

* Typo

* Add it to all the apis

* Add bert model for similarity reranking

* Typos

* Early detection of tools

* Alias max_tokens -> max_completion_tokens too

* Customizable bert model

* Flip the enabler around

* Add docs

* Update readme

* Typo

---------

Co-authored-by: Copilot <[email protected]>

* Format kernels (EricLBuehler#1251)

* Update readme

* Update readme

* Remove test

* Add quantize guards for uqff deserialize (EricLBuehler#1252)

* Refactor cuBLASlt-related code (EricLBuehler#1253)

* Centralize cublaslt into mistralrs-quant

* Use cublaslt in unquant layer

* Use beautiful trait constants for simpler code

* Move tests

* Dispatch to unquant for cublaslt

* Dispatch to unquant for cublaslt

* Fix feature

* Add convert_to_gptq script

* Update deps, bump pyo3 version (EricLBuehler#1259)

* Faster cuda FP8 performance (EricLBuehler#1257)

* Avoid fp8 sync

* Fix dtype

* Rust 1.86 clippy (EricLBuehler#1260)

* Rust 1.86 clippy

* Clippy

* Refactor engine arch (EricLBuehler#1262)

* Refactor engine add_request

* Don't recompile regex

* Clippy

* Revamped LoRA support - removing the Ordering system! (EricLBuehler#1263)

* Play with varbuilder lifetimes

* Merge lora weights

* Clippy

* Lora works

* Support multiple loras

* Cleanup, remove adapter activation

* Complete merge

* Fast Metal-specific quantization method: AFQ (EricLBuehler#1264)

* Add mlx quantized kernels

* Add mlx quantized kernels

* Kernel launcher

* Add AFQ isq quant and dequant

* Some quantmethod things

* Begin to implement the qmm caller

* Clippy

* Much faster

* Cache kernels

* Docs

* Clippy

* Add it to uqff

* Support prequantized models from MLX (EricLBuehler#1265)

* Refactor quantizedconfig

* Support AFQ prequantized

* Update docs

* Update docs

* Automatic ISQ to select fastest & most accurate method (EricLBuehler#1266)

* Automatic isq

* typo

* Doc

* Improved usage metrics (EricLBuehler#1267)

* Fix cuda

* Bump tokio from 1.44.1 to 1.44.2 (EricLBuehler#1270)

Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.44.1 to 1.44.2.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](tokio-rs/tokio@tokio-1.44.1...tokio-1.44.2)

---
updated-dependencies:
- dependency-name: tokio
  dependency-version: 1.44.2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Gather MM ops in mistralrs-quant (EricLBuehler#1272)

* Update the caller

* Wire things up

* Broadcase for afq gathermm

* Broadcase for afq gathermm

* Clippy

* Improve performance of deepseek models

* Typo fix

* BincountOp not used

* Implement Llama 4! (EricLBuehler#1268)

* Implement Llama 4

* Implement the main changes for the text model

* Make chunked mask

* Wire things up

* Add some EP

* Initial sketch of inputs processor

* Runs

* Progress

* all reduce moes

* It works!

* Some cleanup

* Faster moe block

* Add device map

* Make chunked matrix

* Fully working now!

* Reactivate cublaslt

* Fix shared mlp cublaslt

* Refactor to packed experts

* Complete merge

* It is a normal model now

* Fixes

* Set device for moe

* ISQ fixes

* Much faster sort kernel

* Faster loading!

* Faster loading!

* Fp8 cpu copy ops in candle backend

* Add the vision model

* Add mmproj layer

* Actually merge the inputs

* Sketch most of the image processor

* Add the rest of the image processor

* Implement the whole processor

* Add the loader

* Some fixes

* A batch of fixes

* Some fixes

* tmp

* Actually support isq

* Ok it works a bit

* Fix norm device

* It works

* A bit cleaner

* Support residul tensors

* Remove text loader

* Implement the device mapping system

* Fix auto device map

* Add examples

* Add model card

* Typo

* Remove superflous logging

* Fixes for Llama 4 UQFF loading (EricLBuehler#1275)

* Support sharding for UQFF (EricLBuehler#1276)

* Serialize sharded uqff files

* Loading

* Fix base64

* Fix bug for group-topk (group_limited_greedy) in deepseek models (EricLBuehler#1278)

* Support the DeepCoder model (EricLBuehler#1279)

* Add faq for metal not found

* Improved PagedAttn scheduling accuracy (EricLBuehler#1282)

* Scheduler ops by reference

* Ensure scheduler gets correct prompts

* Fix cuda build for copy_blocks

* Fixes for scheduling image seqs with pagedattn (EricLBuehler#1283)

* update to llguidance 0.7.16 (EricLBuehler#1284)

* update llguidance to 0.7.16 from crates.io; use ParserFactory

* add lark_llg.py example

* use new llguidance::Matcher APIs

* rework spec-decoding with llg

* more work on spec sampling

* check for parser stop

* fix clippy

* remove unneeded rollback

* update build_llg_factory to return Result

* Update dependencies (EricLBuehler#1286)

* Much faster image inputs processing (EricLBuehler#1289)

* Add more SDPA head dims for much faster SigLIP (EricLBuehler#1290)

* More sdpa head dims, faster vision models

* Move nonzero to above for faster metal synch

* Doc

* Update valid head dims

* Show throughput in interactive mode (EricLBuehler#1291)

* Update interactive mode throughput stats

* Accurate prompt t/s

* Accurate prompt t/s for usage

* Unify bitwise operations (EricLBuehler#1288)

* Unify bitwise ops

* Tests pass

* Fix cuda build

* Clippy

* Multimodal prefix caching support! (EricLBuehler#1298)

* Initial progress

* Support vision prefix caching

* Update docs

* Add multimodal data abstraction

* Interactive mode improvements (EricLBuehler#1299)

* More ergonomic image url parsing

* Add option to clear

* Add the Qwen 3 and Qwen 3 MoE models! (EricLBuehler#1285)

* Add qwen3 model

* Add enable_thinking

* Add initial qwen3 moe

* Add the moe model

* Format

* Fix order of norm

* Fix expert shapes

* Fix reverse

* Fix norm device for isq

* Fix nonzero when no nonzero

* Moe model runs

* Working qwen3 moe

* Add metal fp8 blockwise dequant

* Clean

* Typo

* Enable tool calling

* Streamlined ux

* Add some examples

* Add docs

* Fix dead link

* Remove interactive mode max_len

* Update QWEN3.md

* Hotfix for vision mode clear

* Revamped and streaming web search support (EricLBuehler#1301)

* Streaming web search

* Refactor a bit

* More refactoring

* Add some logging, parallelize some things

* Allow url

* Suppress warning, allow multi-turn searching

* Batch compute_similarities

* Cap content len

* Typos

* Doc

* Handle vision messages or different tool call prefixes (EricLBuehler#1302)

* Fix cuda

* Tune web search budget

* Simplify prefix cacher (EricLBuehler#1305)

* Use rustyline to handle non-ascii in interactive mode (EricLBuehler#1306)

The io::stdin().read_line() cannot handle non-ascii input, which caused
crash when use backspace to delete non-ascii characters.

Introduce rustyline to the interactive mode to solve the problem. Plus
it can bring more editing features in the future.

Close EricLBuehler#1140

* Add more tools for automatic search (EricLBuehler#1307)

* Add interactive mode history

* Add a website extraction tool

* Pass toks by reference

* Optimize prompt chunking

* Fix CPU hogging in interactive mode (EricLBuehler#1309)

The log enabler should be checked after the sleep instead of a busy
loop checking.

Since the interactive mode always disables the token speed logger, 100%
CPU was taken by this loop always.

* Add Metal precompilation support  (EricLBuehler#1311)

* Add metal precompilation for paged attn

* Add for mistralrs-quant

* Better constructor

* Dont always build

* Fix name for paged attn rebuild

* Reduce thrashing of Metal autorelease (EricLBuehler#1313)

* Reduce calls to autorelease

* Optimize clone_in_cache

* Refactor float8

* make `AdapterPaths` and `LoraAdapterPaths` public (EricLBuehler#1314)

Make `AdapterPaths` and `LoraAdapterPaths` public so `LocalModelPaths`
can be constructed outside of `mistralrs-core`.

* Refactor KV cache manager (EricLBuehler#1315)

* Refactor kv cache

* Refactor caches

* Fix some overflows

* Add `Audio` and `Speech` model categories (EricLBuehler#1317)

* add `Audio` to `ModelCategory`

* add `Speech` to `ModelCategory`

* fix to go back to PartialEq having an exhaustiveness check

* Remove has_conv2d from vision model API (EricLBuehler#1318)

* Unified/automatic flash attention enabler (EricLBuehler#1319)

* Remove from sdpa params

* Fix errors

* No warnings

* Log

* Clippy

* Fix cublaslt 4d mask (EricLBuehler#1320)

* Fix cublaslt 4d mask

* Clippy

* Keep caches on gpu

* Qwen VL models fixes (EricLBuehler#1322)

* Add some defaults

* Fix

* Fix one thing

* 2.5 vl works

* Use caching again

* Fix v2

* Move index inside loop

* Offset in ropeidx

* Default support for vision prefix caching is false

* Fixes for all vision models (EricLBuehler#1323)

* Fix phi input processor?

* Fix phi input processor

* Handle no_prefix_cache from pipeline

* Phi models confirmed 👍

* Fixed for phi inputs processors

* Fixed for phi4

* Llama 3 confirmed 😀

* Mistral 3 confirmed 😃

* Idefics 2/3 fixes

* Some fixes

* Remove unsafety

* Improved+faster LRU prefix cacher (EricLBuehler#1321)

* Show TTFT

* Use LRU prefix cacher

* Faster prefix cacher

* Inplace ISQ support and default to mmap (EricLBuehler#1277)

* Initial impl of immediate isq

* Immediate isq -> !loading_isq

* Varbuilder utils always using mmap!

* Log

* Add for packed experts

* Afq without copy

* Clarify

* Clippy

* Apply immediate isq

* Better logic for loading_isq

* Support showing ttft

* Rename

* Shared quantize guard

* Parallel progress bar

* Parallel loading for progress bars

* Actual ISQ support

* Conditional parallelism for NiceProgressBar

* Use conditional iterator

* Warn once

* Predicate for applying immediate isq

* Allow parallel

* Remove debug print

* Remove debug print

* Remove debug print

* Fix typos (EricLBuehler#1329)

* Fix Idefics 3 arch chat templating (EricLBuehler#1330)

* Update inputs merger

* Fix

* Better warning

* Better warning

* Better warning

* Nonzero ahead of time

* No f32

* Clippy

* Optimize get_logprobs

* Fix packed experts

* Update masking

* Use Sdpa in idefics3

* QuantMethod in idefics3 vision

* Remove a .contiguous

* Remove two space from PR comment (EricLBuehler#1331)

* Add automatic vision loader type (EricLBuehler#1332)

* Add automatic vision loader

* Remove references to --arch

* Update examples

* Add the Dia 1.6b TTS model! (EricLBuehler#1304)

* Add loading

* Add rope, mlp, most of attn

* Add encoder + encoder layer, decoder layer forwards

* Add decoder forwards

* Add prepare_audio_prompt

* prepare_generation mostly done

* Add a proper dia kvcache

* Add most of decoder_step

* Add the sampler

* Add the generation loop

* Wire things up

* Add speech pipeline

* Fixes

* Loads

* Some fixes

* f32

* Some progress

* Ok it runs upto dac decoding

* Add dac part loading

* Loads and runs at least

* Remove encodec

* Debugging

* Debugging

* Huh

* Complete merge

* Interactive

* Confirmed dac works at least

* Looks like encoder works

* Much progress

* Hmm

* Sampling

* Almost there

* Sampler

* Sampler

* Bf16 support

* Response

* Use it in interactive mode

* Fix oneshot

* Add openai api

* Add openai api

* Refactor loading

* Use naive sdpa for inplace

* Factor out

* Clippy

* Clippy

* Config

* Refactor config

* Metal clippy

* Fix t/s

* ISQ support

* Some fixes, nits

* Fix cuda

* Clippy

* Inhibit cublaslt for cuda

* Add server example

* Add python example

* Add rust api

* Add docs

* Update config.toml

* Fix .pyi

* Update readme

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* update `llguidance` to `0.7.20` (EricLBuehler#1334)

Update `llguidance` from `0.7.16` to `0.7.20` so that it includes guidance-ai/llguidance#172, which fixes building on GCC 15.

* Add model category <> messages check (EricLBuehler#1335)

* Verify model category matches the messages

* Add vision chat

* Fixes

* Add element-wise normalization check (EricLBuehler#1340)

* Fix streaming example print statement (EricLBuehler#1339)

* Fix normalization formula in comment (EricLBuehler#1338)

* Fix image_to_pixels to handle non-RGB images (EricLBuehler#1337)

* Fix typo in expect messages (EricLBuehler#1342)

* Don't use mmap on cuda (EricLBuehler#1336)

* No mmap on cuda

* Simplify streaming tool call logic

* Remove debug

* Support AWQ format models (EricLBuehler#1350)

* Support AWQ format models

* Clippy fix

* Fix uqff dummy layer ISQ application (EricLBuehler#1351)

* Disable immediate isq if write_uqff (EricLBuehler#1352)

* Fixes for UQFF loading on CUDA, ISQ pack factor (EricLBuehler#1354)

* Fix logic for uqff on cuda

* Updated pack_factor

* Refactor Option references for model paths (EricLBuehler#1347)

* refactor: use Option refs in model path helpers

* Format

* Add a script for server benchmarking (EricLBuehler#1355)

* Serde alias

* Fix

* Update for tie_word_embeddings

* Print running/waiting

* 30 users

* Update num_users

* Update dummy paged attn

* Optimized Metal qmv_fast path (EricLBuehler#1356)

* Compile with lto

* Tweak profiles

* New, fast sampler for Metal! (EricLBuehler#1327)

* Show TTFT

* Use LRU prefix cacher

* Faster prefix cacher

* A bit of gpu sampling

* Minp but cpu for now

* Metal fast cumsum impl

* Sampling with fast topp kernel

* Hmm not perfect

* Add metal sort kernels

* Tmp

* Add single block sort

* Add most of multi block sort, just need copy op

* Add copy kernels

* Expose kernels

* Add a test

* Ok it works

* Structure things

* Add caching

* Rename

* Cpu is default

* CUDA case

* Topk
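
To make the sort/cumsum commits above concrete, here is an illustrative CPU version of the top-p path those Metal kernels accelerate (argsort descending, running cumulative sum, cut once the mass exceeds `p`); this is a sketch of the technique, not the kernel code:

```rust
// Illustrative CPU version of the top-p path the Metal argsort/cumsum kernels
// accelerate: argsort descending, cumulative sum, cut off once mass > p.
fn top_p_filter(probs: &[f32], p: f32) -> Vec<usize> {
    // argsort indices by descending probability
    let mut idx: Vec<usize> = (0..probs.len()).collect();
    idx.sort_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap());

    // inclusive cumulative sum over the sorted probabilities
    let mut kept = Vec::new();
    let mut cum = 0.0f32;
    for &i in &idx {
        kept.push(i);
        cum += probs[i];
        if cum > p {
            break; // nucleus found: smallest set with mass > p
        }
    }
    kept
}

fn main() {
    let probs = [0.4f32, 0.3, 0.2, 0.05, 0.05];
    println!("kept token ids: {:?}", top_p_filter(&probs, 0.9));
}
```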

* Fix topk

* Penalties

* Add logits processor, clippy fixes

* Fix chat port

* Remove warning

* Fix chat port

* Fix metal parallel sampling (EricLBuehler#1357)

* Cpu if parallel for now

* Tweak bench script

* Add immediate isq predicates for qwen3 (EricLBuehler#1358)

* Add immediate isq predicates for qwen3

* Fix parsing of "parse_isq_value" depedent of device

* Typo

* Fix gemma3 logging

* Regressions fixes (EricLBuehler#1359)

* Fix regression for mmap

* Revert EricLBuehler#1321

* Refactored matching_cache impl

* Clippy

* Revamped and smaller readme (EricLBuehler#1360)

* Expandable detail sections

* Refactor using derivative model

* Tweak quick examples

* Update llama

* Update llama

* Supported accelerators is a table

* Update installation guides

* Tweak apis

* Remove --port in quick examples

* Add demo gif

* Add gif in readme

* Update demo gif

* Update demo gif

* Update demo gif

* Add gif in readme

* Add gif in readme

* Add a web chat app! (EricLBuehler#1362)

* Initial

* Markdown

* Copy code

* Add model loading sidebar

* Support vision models

* Tweak isq

* Links go to another page

* Clear when switch model

* Fix html tags

* Add image support!

* More then one images

* Fix

* Improved textarea

* Tab for switching between vision and text

* No paged attn for now

* Prettier format

* Multiple models at once

* Better switching, clearing ability

* Mobile support

* Inline markdown parser

* Update examples

* Typos

* Support specifying isq

* Fix mobile

* Fixes

* Fix button on mobile

* Image height is capped

* Thumbnail

* Fix rotating kv cache edge case

* Add drag and drop for images

* Small things

* Sidebar is frozen now

* Better listener

* Add readme

* Tweak readme

* Add chat history support to web chat app (EricLBuehler#1363)

* Add chat history

* Support renaming

* Start immediately with new chat

* Add timestamp

* Prettier chat list

* Style

* Delete chat

* Fix copy button

* Fix markdown rendering

* Store things in cache

* Store things in cache

* Refactor web chat, fix multichat image restore (EricLBuehler#1364)

* Fix multichat image restoration.

* Clippy

* Refactor

* Refactor frontent

* Fix repeated immediate isq init (EricLBuehler#1365)

* Add images_ref

* Add debug impl

* Fix the bug

* Tweak style of buttons

* Add a spinner

* Move spinner

* Tweak emoji

* Add gif

* Tweak initial gif

* Include vision tower tensors in Mistral3 UQFF (EricLBuehler#1366)

* Fix mistral 3 uqff residual tensors for vision

* Rolling shard creation for uqff files (EricLBuehler#1367)

* Fix occasional unstability during isq of afq (EricLBuehler#1368)

* Fix unstability during isq of afq

* Clippy

* Fix web chat installation

* Support web chat file uploading (EricLBuehler#1370)

* Web chat fixes

* Fix thumbnail in message, reuse blank chat

* Add file uploading support

* Fix scroll

* Allowed extensions

* Preserve files as literals

* Support multiple clients

* Add a stop button

* New cache dir

* New cache dir

* Fix

* Refactor

* Update readme

* Tweak drag-and-drop css

* Add speech generation support to the web chat! (EricLBuehler#1373)

* Initial speech gen support for web chat

* Tweak ui

* Update docs

* Prefix caching for PagedAttention! (EricLBuehler#1369)

* Exposing some things for logical token blocks

* Prefix cache manager has the scheduler

* Refactor

* Get logical and physical blocks into the prefix cacher

* Hash and cache

* Pass physical block prefill

* Allocation of prefilled block tables

* Temp

* Dont always use 2

* Hmm

* Hmm

* It mostly works

* Increment refcount

* Support images!

* Add to dummy paged attn

* Fix some clippy

* Clippy

* More checks

* Include EricLBuehler#1371, closes EricLBuehler#1371

* Typos

* Update docs
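
A rough standalone sketch of the prefix-caching idea in the commits above, with illustrative types rather than the mistral.rs implementation: hash each full logical block of token ids (chained with the previous block's hash) and reuse the cached physical KV block on a hit, bumping its refcount.

```rust
// Illustrative only: hash full logical blocks of token ids, chained with the
// previous block's hash, and reuse the physical KV block when already cached.
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

const BLOCK_SIZE: usize = 16;

fn block_hash(prev: u64, tokens: &[u32]) -> u64 {
    let mut h = DefaultHasher::new();
    prev.hash(&mut h);
    tokens.hash(&mut h);
    h.finish()
}

fn main() {
    // hash -> (physical block id, refcount)
    let mut cache: HashMap<u64, (usize, usize)> = HashMap::new();
    let prompt: Vec<u32> = (0..64).collect();

    let mut prev = 0u64;
    let mut next_free_block = 0usize;
    for chunk in prompt.chunks(BLOCK_SIZE).filter(|c| c.len() == BLOCK_SIZE) {
        prev = block_hash(prev, chunk);
        let entry = cache.entry(prev).or_insert_with(|| {
            let id = next_free_block;
            next_free_block += 1;
            (id, 0) // newly allocated physical block
        });
        entry.1 += 1; // increment refcount on (re)use
        println!("logical block -> physical {} (refs {})", entry.0, entry.1);
    }
}
```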

* Metal PagedAttention accuracy improvements (EricLBuehler#1374)

* Fix subtle bug

* Fix half sum bug

* Format metal paged attention

* Handle images in paged attn scheduler (EricLBuehler#1375)

* Include schemas needed for chatcompletions endpoint (EricLBuehler#1353)

* EricLBuehler#1326: WIP include schemas needed for chat completions endpoint

 Conflicts:
	Cargo.lock
	mistralrs-server/src/main.rs

* EricLBuehler#1326: WIP define utoipa as a workspace dep since core and server both need it

* EricLBuehler#1326: first draft of handling schemas that use Either

* EricLBuehler#1326: first draft of handling schema for Grammar

* EricLBuehler#1326: Add in other endpoints to API docs.

* EricLBuehler#1326: Adjust code comments

* EricLBuehler#1326: Implement coderabbitai suggestions

- EricLBuehler#1353 (review)
- EricLBuehler#1353 (comment)

* Fix constraints with metal sampler

* Revert EricLBuehler#1375

* Fix case where prefix cacher returns no toks (EricLBuehler#1377)

* Fix AFQ UQFF serialization

* Faster UQFF serialization (EricLBuehler#1379)

* Faster UQFF serialization

* Fix uqff gemma3

* Improve gemma3 auto loader names

* UQFF creation for AFQ on CPU support (EricLBuehler#1380)

* Add afq cpu quantize/dequantize

* Clippy

* Improved device for afq quantize

* Improved dtype handling for cpu afq (de)quantize

* Improved generate_uqff_card

* Add fused CPU attention kernel! (EricLBuehler#1382)

* Working

* Fix warnings

* Allow mask

* Support bf16, f16

* Handle striding

* Parallelized

* Add initial vector flash attn

* Avoid repeated allocations

* Tiled kv

* Apply some clippy

* Some small fixes

* Chunked vec_dot

* Clippy

* Use T::zero
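
For orientation, a naive single-query attention in f32 showing what the fused kernel computes, softmax(q·Kᵀ/√d)·V; the actual kernel adds masking, bf16/f16 support, striding, tiling, chunked dot products, and parallelism:

```rust
// Naive single-query attention in f32, shown only to illustrate the math the
// fused CPU kernel implements.
fn attend(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let d = q.len() as f32;
    let scale = 1.0 / d.sqrt();

    // scores = q · k_i / sqrt(d)
    let mut scores: Vec<f32> = keys
        .iter()
        .map(|k| q.iter().zip(k).map(|(a, b)| a * b).sum::<f32>() * scale)
        .collect();

    // numerically stable softmax
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut denom = 0.0;
    for s in scores.iter_mut() {
        *s = (*s - max).exp();
        denom += *s;
    }

    // weighted sum of the value vectors
    let mut out = vec![0.0; values[0].len()];
    for (w, v) in scores.iter().zip(values) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += (w / denom) * x;
        }
    }
    out
}

fn main() {
    let q = vec![0.1, 0.2, 0.3, 0.4];
    let keys = vec![vec![0.1; 4], vec![0.5; 4]];
    let values = vec![vec![1.0; 4], vec![2.0; 4]];
    println!("{:?}", attend(&q, &keys, &values));
}
```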

* Refactor attention backends (EricLBuehler#1384)

* Refactor attention code

* Refactor attention code

* Move into backends

* Set macOS thread affinity for CPU attn (EricLBuehler#1385)

* Use lazylock

* Format

* Fix metal warn build

* Faster Qwen 3 MoE support on Metal (EricLBuehler#1387)

* Fix load

* Use afq gather qmm

* Well it runs

* It works

* Polish

* Fast and slow options

* Remove quantized.rs

* Polish some more

* Refactor

* Add isq

* Update load in parallel

* Support fp8

* Refactor for FusedExperts

* Clippy

* Handle pack factor when loading prequantized models

* Use f32 only in moe

* Avoid using f32 so much

* Avoid using f32 so much

* Fix PagedAttention block leaks (EricLBuehler#1388)

* Warn and ignore if ignored

* Fix a block allocation leak

* Update bench.py

* Fix double free in block engine

* Do not apply ISQ if loading a prequantized model

* Fix cuda build again (EricLBuehler#1389)

* Fix cuda build

* Fix

* Format

* Fixes for cuda docker

* Update dockerfiles

* Bump version to 0.6.0 (EricLBuehler#1390)

* Bump version to 0.6.0

* Remove lower_level api

* Make a static dir

* Update deps

* Fix routing for static handler in web chat

* Fewer .contiguous calls for qwen3 moe (EricLBuehler#1391)

* Allow speech models to accept batched inputs (EricLBuehler#1393)

* Allow speech models to accept batched inputs

* Clippy

* Ring distributed backend for heterogeneous TP (EricLBuehler#1238)

* Begin work on ring distributed backend for Metal

* Add the actual ring functionality

* It loads and kind of runs

* It works

* Optimize buffer allocation

* Avoid copy

* It works

* Add allgather

* Fix load

* Ping-pong

* Small things

* Add config json

* Allow different ip address

* Read config once

* Read config when appropriate

* Replicate requests

* Small fix

* Fix small compat with openai

* Clippy

* Update docs
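
A local simulation of the ring allgather schedule such a backend relies on (illustrative only; the real backend moves chunks between devices over the network): with N peers in a ring, each peer forwards one chunk per step, and after N−1 steps every peer holds all N chunks.

```rust
// Simulated ring allgather: at step s, rank r forwards the chunk originally
// owned by (r - s) mod n to its neighbour (r + 1) mod n.
fn ring_allgather(chunks: &[Vec<f32>]) -> Vec<Vec<Vec<f32>>> {
    let n = chunks.len();
    // gathered[rank][src] is Some(chunk) once `rank` has received chunk `src`.
    let mut gathered: Vec<Vec<Option<Vec<f32>>>> = (0..n)
        .map(|rank| {
            let mut row = vec![None; n];
            row[rank] = Some(chunks[rank].clone());
            row
        })
        .collect();

    for step in 0..n - 1 {
        for rank in 0..n {
            let src = (rank + n - step) % n;
            let dst = (rank + 1) % n;
            let chunk = gathered[rank][src].clone().expect("chunk present");
            gathered[dst][src] = Some(chunk);
        }
    }

    gathered
        .into_iter()
        .map(|row| row.into_iter().map(Option::unwrap).collect())
        .collect()
}

fn main() {
    let chunks = vec![vec![0.0], vec![1.0], vec![2.0], vec![3.0]];
    let all = ring_allgather(&chunks);
    println!("peer 2 now holds: {:?}", all[2]);
}
```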

* Add deepseek tool calling chat template

* Add auto loader for vision/text detection! (EricLBuehler#1402)

* Add auto loader for vision/text detection

* Build fixes

* Add model loader

* Update docs

* Format

* Create Mistral.rs Server Core Lib: `mistralrs-server-core` (EricLBuehler#1346)

* First draft of exposing mistral server routes as lib

* make arg struct fields pub

* Take base path so utoipa swagger route can properly redirect

* Expose swagger routes and make it configurable

* Add base path option for swagger docs

* More work on modularizing mistralrs server

* Sync fork (+1 squashed commit)
Squashed commits:
[169ae9e] Sync fork

* Adjust fn params to use refs / individual params instead of args

* Start breaking down controller actions into smaller pieces

* Continue refactoring

* Make mods pub so they can be used outside crate

* Allow chat completion streamer to take a callback so that you can get the complete response when finished

WIP (+3 squashed commits)
Squashed commits:
[0061d87] WIP
[c484d56] WIP
[16f8a60] WIP
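
As a purely illustrative shape for the callback-based streamer described a few commits above (hypothetical names, not the mistralrs-server-core API): forward each chunk as it arrives and hand the accumulated text to a completion callback at the end.

```rust
// Hypothetical sketch of "streamer takes a callback": chunks are forwarded as
// they arrive and the full response is handed to `on_done` when finished.
struct Streamer<F: FnMut(&str), G: FnOnce(String)> {
    on_chunk: F,
    on_done: Option<G>,
    buffer: String,
}

impl<F: FnMut(&str), G: FnOnce(String)> Streamer<F, G> {
    fn new(on_chunk: F, on_done: G) -> Self {
        Self { on_chunk, on_done: Some(on_done), buffer: String::new() }
    }
    fn push(&mut self, chunk: &str) {
        (self.on_chunk)(chunk);       // stream the delta to the client
        self.buffer.push_str(chunk);  // and keep it for the final callback
    }
    fn finish(mut self) {
        if let Some(done) = self.on_done.take() {
            done(std::mem::take(&mut self.buffer)); // complete response
        }
    }
}

fn main() {
    let mut s = Streamer::new(
        |c| print!("{c}"),
        |full| println!("\n[done: {} chars]", full.len()),
    );
    s.push("Hello, ");
    s.push("world!");
    s.finish();
}
```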

* Sync fork

* Adjust callback type

* Remove throughput_log arg that was removed in 26afcc3

* Implement defaults for Args (and use for Clap)

* Small code formatting tweaks

* Rename callback to match SSE event and code clean up

* Sync fork

* WIP: first very rough draft of server core builder. Doesn't meet parity with old functional approach yet (slower / unstable?).

* Clean up (+4 squashed commits)
Squashed commits:
[e1cff387] Sync fork
[d8301025] WIP debugging
[1ea9f8c8] Sync fork
[4fe28cf5] WIP: debug function

* WIP server core builders

* Code clean up

* Add on_chunk callback

* Code clean up

* First draft of creating version of mistral-server that uses server-core

Code clean up (+1 squashed commit)
Squashed commits:
[adea1693]

* Sync fork

* Add helper methods to builder to make optional args more ergonomic (since .build validates params)

* Start adding docs

* Start cleaning up crates deps

* Example commit of mistral-server with implementing server-core

* Start addressing CodeRabbit feedback

* Fix comment typo

* Tweak doc blocks

* - Update type alias naming for clarity (MistralRs instead of Mistral)
- CodeRabbit, don't use eprintln for lib (use trace)
- Allow buffer size to be passed in and default to Constant
- Allow router body limit to be passed in and default to Constant
- Update doc examples

* Typo

* Address CodeRabbitAI feedback

* Support linear rope for llama3 (EricLBuehler#1408)

* Hotfix for loading

* Fix vllama4 uqff loading (EricLBuehler#1409)

* Fix vllama4 uqff loading

* Fix regex

* Fix regex

* Maybe a fix

* Gracefully handle receiver disconnects (EricLBuehler#1410)

* Handle receiver disconnects

* Format

* Fix Qwen3 MoE device mapping irregularities (EricLBuehler#1411)

* Fix bias

* Fix lm_head packing case

* Account for gate

* Fix head dim

* Fix interactive mode URL parsing (EricLBuehler#1412)

* fix url regex in vision interactive mode

* Fix regex

* Clippy

* Refactor auto device map (EricLBuehler#1413)

* Refactor auto device map

* Refactor a bit more

* Clippy

* Enable runtime sampling tweaks in interactive mode (EricLBuehler#1414)

* Document runtime sampling commands

* Fix readme

* Tweak

* Bounds checking

* Tweak temp bounds

* Send streaming tokens every time

* Gumbel sampling for fast sampler (EricLBuehler#1416)
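
Gumbel-max sampling draws a categorical sample without building a cumulative distribution: add independent Gumbel noise −ln(−ln u) to each logit and take the argmax, which maps well onto a GPU sampler. A minimal CPU sketch (assuming the `rand` crate for uniform draws):

```rust
// Gumbel-max trick: argmax(logit_i + g_i), g_i = -ln(-ln(u_i)), u_i ~ U(0,1),
// is an exact sample from softmax(logits). Assumes the `rand` crate.
use rand::Rng;

fn gumbel_sample(logits: &[f32], rng: &mut impl Rng) -> usize {
    logits
        .iter()
        .enumerate()
        .map(|(i, &l)| {
            let u: f32 = rng.gen_range(f32::EPSILON..1.0);
            (i, l - (-u.ln()).ln())
        })
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}

fn main() {
    let mut rng = rand::thread_rng();
    let logits = [2.0f32, 1.0, 0.1];
    let counts = (0..10_000).fold([0usize; 3], |mut c, _| {
        c[gumbel_sample(&logits, &mut rng)] += 1;
        c
    });
    println!("empirical counts: {counts:?}"); // roughly proportional to softmax
}
```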

* Improved handling for initialize_logging

* Improved CPU flash attention accuracy & performance (EricLBuehler#1417)

* Downcast correctly

* Operate internally in f32

* Avoid some casts and striding

* Prefetch
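
A small illustration of why operating internally in f32 matters for half-precision inputs (assuming the `half` crate only for the f16 type and conversions): an accumulator rounded back to f16 after every add stalls, while an f32 accumulator stays accurate.

```rust
// An accumulator rounded to f16 after every add stalls once the increments
// fall below half a ulp; accumulating in f32 and downcasting once stays
// accurate. Assumes the `half` crate only for the f16 type/conversions.
use half::f16;

fn main() {
    let xs: Vec<f16> = std::iter::repeat(f16::from_f32(0.01)).take(10_000).collect();

    // f16 accumulator: round to f16 after every addition.
    let mut acc_f16 = f16::from_f32(0.0);
    for x in &xs {
        acc_f16 = f16::from_f32(acc_f16.to_f32() + x.to_f32());
    }

    // f32 accumulator: convert once per element, downcast only at the end.
    let acc_f32: f32 = xs.iter().map(|x| x.to_f32()).sum();

    println!("f16 accumulator: {}", acc_f16.to_f32()); // stalls far below 100
    println!("f32 accumulator: {acc_f32}");            // close to 100 (0.01 is inexact in f16)
}
```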

* Provide chat_templates to container users (EricLBuehler#1419)

Models often come without chat templates, requiring them to be mapped
from the source repository into a container for access by the
mistralrs-server.

Copy the templates from the build tree into the root of the image
to permit use via `--chat-template /chat_templates/something.json`

TODO:
  With the increase in quantized models and support for other
formats, the initial benchmark run during model load can be used
to qualify/select existing chat templates embedded into the binary
for models that do not come with any (including the output of the
functional failures in each test, allowing users to adapt the
templates already provided to suit the model being loaded).

Co-authored-by: RageLtMan <rageltman [at] sempervictus>

* Faster cpu flash attn (EricLBuehler#1418)

* Faster cpu flash attn

* Prefetch

* Clippy

* Add some tests

* Add softcap tests

* Fix test_parse_image_url test

* Update tests

* Update tests

* Web search improvements (bm25, web chat) (EricLBuehler#1420)

* Fix web search blocking case

* Web search support in web chat

* Tweak ui

* Support fallback to bm25

* Clippy

* Reinject descriptions

* Propely handle consecutive searches (EricLBuehler#1421)

* Update extraction tool reinjection

* Looped

* Update docs (EricLBuehler#1422)

- lib.rs: clean up example var names and match logging change from EricLBuehler@201d6be
- server_builder: fix typo
- READMEs: link to crate docs

* Better tool call detection logic (EricLBuehler#1424)

* Add web search hook callbacks (EricLBuehler#1426)

* feat: add customizable search hook

* Move to builder

* Update docs

* Fix CUDA context switching, bind thread on CudaStorage drop (EricLBuehler#1428)

* Add CUDA context helper and use in Llama forward

* No flashparams?

* working

* Tweak

* Update to use dep

* conditionally build flash attention inputs (EricLBuehler#1429)

* Add AGENTS.md (EricLBuehler#1430)

* Support Qwen3 GGUF model (EricLBuehler#1432)

* Support QWen3 GGUF model

* Clippy fix

* cargo fmt

* Improved paged attn prefix caching (EricLBuehler#1434)

* Improved paged attn prefix caching

* Disable

* Clippy

* Temporary fix for qwen3 gguf tokenizer (EricLBuehler#1433)

* Temporary fix for qwen3 gguf tokenizer

* Typo fix

* Add tool callback support (EricLBuehler#1427)

* Add tool callback support

* Fixes

* Support named tool callbacks

* Update examples

* Update docs

* Clippy

* Centralize crate dependencies (EricLBuehler#1438)

* chore: centralize dependencies

* Format

* Fix bug in tokenizer created with gguf metadata (EricLBuehler#1440)

* Fix bug in tokenizer created with gguf metadata

* Clippy fix

* Update deps (EricLBuehler#1441)

* Small things

* Update deps

* Update deps

* Update breaking changes

* Doc fixes (EricLBuehler#1442)

* Mention uqff_maker

* Downgrade rustyline 16.0.0 -> 15.0.0 (EricLBuehler#1444)

* Add max_completion_tokens alias for server (EricLBuehler#1451)

* Audio input support (Phi 4 multimodal) (EricLBuehler#1448)

* Deps

* Add conformer

* Nemo loading

* Position embeds

* Load t5 attn bias

* Attn and feed forward

* Add conv module and glu pointwise

* Implement relative attn bias

* Add the forward methods

* Add encoder embedding

* Fix oproj

* Some loading

* Conformer loads!

* Fully loading speech stack

* Merger

* Dont need that

* First pass at audio processing

* Read samples

* Optional

* Small loading fix

* Runs but not correct yet

* Improved audio processing?

* Works with this

* Fix t5 attn bias

* It works!

* Comment

* Use some other crates

* Clippy

* Allow bf16 on metal

* Add prefix_audio

* Remove unused

* Typo

* User specified

* Add audio url parsing

* AudioProjectionMode -> InputMode

* Audio prefix caching

* Fix bug in audio prefix caching

* Support both at the same time!

* Tweak logging

* Support stereo

* Add mistralrs-audio

* Support batching

* Add server and rust api example

* Add python api

* Fix add_multimodal_message

* Fix unfold for conformer

* Streaming example

* Add web chat support

* Add modalities registry

* Fix offline cache issue for gguf models (EricLBuehler#1452)

* Add MCP server endpoints (EricLBuehler#1453)

* feat(server): add MCP server support

* Add mcp docs

* Add handle_list_tools_request

* Better launch, tool handling

* Tmp state

* Ok works

* Handle modalities

* Update docs

* Add ping

* Tweak temperature bounds, args

* MCP documentation pass (EricLBuehler#1455)

* Fix table

* Update mcp docs

* Improve readme header

* Improve readme header

* Integrate an MCP client (EricLBuehler#1456)

* Add builtin mcp client

* Use async loader

* Add headers

* Handle sse

* More flexible search request

* Add tool callbacks with tools, for mcp

* Add bearer token support

* Add websocket support

* Update docs

* Add python api

* Clippy

* Add http api, docs

* Tests pass

* Make these configs actually work

* Add docs

* Make mistralrs-mcp

* Refactor examples

* Update examples

* Add defaults

* Add defaults

* Add defaults

* Update docs

* Improved docs

* Add -y to npx usages

* Even better examples

* Update generate_wheels

* Update generate_wheels

* Update generate_wheels

* Fix Dockerfile.cuda-all

* Improve automatic tool call (EricLBuehler#1460)

* Improved auto tool call

* Add logging

* chore: `Dockerfile.cuda-all` configurable threads (EricLBuehler#1458)

* chore: `Dockerfile.cuda-all` - Merge `RUN` for `apt-get install` (EricLBuehler#1459)

* Add fallback definition for isnan (EricLBuehler#1463)

* chore: `Dockerfile` - Drop runtime rayon thread ENV (EricLBuehler#1465)

* chore: Dockerfile - Remove rayon threads env

* chore: Dockerfile - Improve formatting for `apt-get`

* Remove duplicate calls for api_dir_list (EricLBuehler#1474)

* Remove duplicate calls for api_dir_list

* Support local cache for api_dir_list

* Fix home folder for metal

* Capitalized

* Fix transient pyo3 dep (EricLBuehler#1478)

Co-authored-by: Eric Buehler <[email protected]>

* Fix objc dep with non macos (EricLBuehler#1480)

* Fix phi 3/4 + nccl issue (EricLBuehler#1481)

* Fix log

* Fix n kv heads

* Fix phi3.5 moe (EricLBuehler#1482)

* Fix phi3.5 moe accum device

* Fix again

* Fix again

* Support GLM4 model! (EricLBuehler#1437)

* Support GLM4 model

* Mention GLM4 model in ReadMe

* glm4 type hint

* Typo fix

* Fix unsupported chat_template function

* Clippy fix

* Refactor distributed backend (EricLBuehler#1484)

* Refactor distributed backend, check power of 2

* Fix compilation

* Cap metal paged attn kv allocation (EricLBuehler#1485)

* Better paged attn metal cap (EricLBuehler#1486)

* Better paged attn metal cap

* Small fix

* Comment

* Small fix

* Refactor

* Server core: consolidate and unify route handlers and API surface (EricLBuehler#1423)

* Start working on consolidating completion and chat_completion underlying implementations

* Move response channel to util mod for now (since it's used with streaming and non streaming)

* More work on consolidating completions and chat completions

* More WIP consolidation of server core handlers

* More WIP consolidation of server core handlers

* More WIP consolidation of server core handlers

* Update docs and restrict completion core visibility

* CodeRabbit feedback: remove logprobs warn from route handler since parse request also checks this

* Use consistent var name for completions mod

* Make route handler modules' public API consistent (same fn names, etc.) and provide proxy fns that wrap core fns so the core mod doesn't have to be pub
Make the lib.rs example compile-checked and update the example

* Code formatting

* Typo

* Sync fork

* Sync fork

* Docs example fix

* Support qwen3 gguf (EricLBuehler#1488)

* Add qwen3 gguf

* Template fixup

* Make bos/eos token IDs optional (EricLBuehler#1493)

* Remove python deps from CUDA dockerfiles (EricLBuehler#1487)

* Handle noncontiguous v in naive_sdpa (EricLBuehler#1499)

Co-authored-by: Eric Buehler <[email protected]>

* Server Core: refactor Paged Attention configuration (EricLBuehler#1500)

* Use StorageModePrivate for Metal PA kv cache (EricLBuehler#1506)

* Fix OpenAI stream: emit field in tool-call deltas for schema compliance (EricLBuehler#1507)

* FP8 KV-cache quantization for PagedAttention (EricLBuehler#1400)

* Add most of paged attn kv quant

* It builds a bit

* All the functionality at least

* Small fix

* Add a scale

* Fix bf16 usage

* Make k_v_scale optional

* Collector

* Tweak collection

* Refactor

* Add to apis

* Add cuda impl

* Fix compilation

* Fixes

* Handle ENABLE_FP8

* Format

* Tweak

* Fix scaled_convert usage

* Fix cache_t size

* Fixed scale collection

* Actual fix

* Fix fp8 for CC<8

* Fix the usual String != &str bit (EricLBuehler#1483)

Co-authored-by: RageLtMan <rageltman [at] sempervictus>

* Handle USE_FP8 for cuda

* Fix cuda warn

* Add readme

* Saturating sub in sequence state
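
For context on the saturating-sub changes: plain `-` on unsigned sequence counters panics in debug builds (and wraps in release) on underflow, whereas `saturating_sub` clamps at zero, which is what sequence-state bookkeeping wants. A tiny illustration:

```rust
// Why saturating_sub: unsigned subtraction underflows when the subtrahend is
// larger, but sequence-state counters should just clamp at zero.
fn main() {
    let prompt_len: usize = 3;
    let cached_tokens: usize = 5; // e.g. more tokens already cached than the prompt length

    // `prompt_len - cached_tokens` would panic in debug builds (underflow).
    let remaining = prompt_len.saturating_sub(cached_tokens);
    assert_eq!(remaining, 0);
    println!("tokens left to prefill: {remaining}");
}
```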

---------

Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: RageLtMan <[email protected]>
Co-authored-by: Brennan Kinney <[email protected]>
Co-authored-by: Guoqing Bao <[email protected]>
Co-authored-by: Matthew Haynes <[email protected]>

* Validate model name in OpenAI API (EricLBuehler#1509)

* Validate model name in openai api

* Add docs, allow 'ignore'

* Updated examples for EricLBuehler#1509

* Fix mcp import in doc string (EricLBuehler#1510)

* Add multi-model support! (EricLBuehler#1512)

* Refactor MistralRs

* Working multi-model!

* Add mutli-model docs initially

* Update mistralrs-pyo3, mistralrs-bench, mistralrs

* Update apis for consistency

* API tweaks

* Logging tweaks

* Add examples, tweak cli

* Clearer pipeline id

* Fix config key semantics

* Format and clippy

* Tweak logging, fix example

* Clippy refactor

* Update examples

* Remove unused multi model docs

* Replace 'ignore' with 'default'

* Update docs

* Add stars label to readme (EricLBuehler#1513)

* Add CLAUDE.md

* Handle base_model.model case in lora (EricLBuehler#1514)

* Add thread_local! for engine-specific const/static (EricLBuehler#1517)

* Fix MCP doc test (EricLBuehler#1511)

* Allow disabling metal precompilation (EricLBuehler#1518)

* Allow disabling metal precompilation

* Simple preprocessor

* Simple docs

---------

Co-authored-by: Eric Buehler <[email protected]>

* Rust 1.88 clippy (EricLBuehler#1522)

* Rust 1.88 clippy

* Format

* Fix cuda warnings (EricLBuehler#1526)

* Avoid panic decoding tokens on error (EricLBuehler#1527)

* Split Marlin and Paged Attention kernels for faster build (EricLBuehler#1525)

* Split Marlin and Paged Attention kernels for faster build

* Typo fix

* chore: update llguidance (EricLBuehler#1535)

* chore: update llguidance

* chore: remove unused import

* Add the SmolLM3 model! (EricLBuehler#1501)

* Add model

* Update loader

* Fix llama config usage

* Docs

* Fix config no_rope_layers

* Fix tie_word_embeddings default

* Add chat template

* Embed the chat templates

* Fix embedding template

* enable_thinking default true

* Update examples

* XML tools for smollm3

* Add smollm3 docs

* Fix openai examples

* Clippy

---------

Co-authored-by: Eric Buehler <[email protected]>

* Add full Gemma 3n support! (EricLBuehler#1519)

* Add initial

* Loading for text model

* Add ple embeddings

* Add altup, laurel block

* Update rmsnorm

* Add mlp

* Update attn norm application

* Currently no kv shared

* Wire it up

* It runs

* Fix bf16

* Fix scaled embd

* Fixes for mean

* tmp

* Attn confirmed

* Fix target_magnitude

* Add shared kv

* Ok it works

* Remove npy

* Fix streaming

* Remove warnings

* Remove paged attn

* Refactor rope

* Add immediate isq

* Add vision & mproj

* Update image processor

* Vision merge runs, not correct

* Remove

* Add mobilenet v5

* Add multimodal vision embedding

* Fix load

* runs

* Fix gamma

* Works but just not vision tower

* It works!!

* Tweak

* Fix warnings

* Move vision tower

* Fix warn

* Update cache manager things

* Refactor

* Add audio model, it loads

* Add audio processing

* It runs at least

* tmp

* A bit better

* Audio works!!!!

* Fused attn in vision

* Clippy

* Update audio runner

* Optimized audio model

* Remove unused things

* Fix inputs processor bug

* Remove comments

* Clippy

* Small optimizations

* Format

* Correctly register modalities

* Add docs

* Update readme

* Runs there

* Fixed padding from Blaizzy/mlx-vlm#410

* Add better checks

* Fix sdpa n_kv_groups

* Vision encoder works!

* Rotate image

* Clippy

* Fix cuda loading

* Updated device mapper

* Fix overflow

* Fix dtype errors

* Refactor image/audio embeddings

* Fix metal

* Fix dtype mismatch

* Audio processing fixes

* Audio processing fixes

* Works

* Audio is good

* Fix boi/eoi too

* Embed the chat templates

* Better embedding accuracy in non f32

* More f32

* Support bf16 on metal

* Add more ISQ

* Fixed device map

* Clippy

* Gemma3n no paged attn

* Fix saturating sub

* Faster rmsnorm

* Use sdpa for vision model

* Fix ple bug

* Fix name

* Fix multiaudio

* Add matformer config loading

* Add docs

* Add support for matformer in auto device mapper

* Update docs

* Typos

* Tweak

* Tweak

* Fix multidevice

* Fix gemma3n text model auto device map

* Fix dims3

* Fix auto device map vision

* Non-metal keeps PLE on cpu

* Complete merge

* Vision dtype f16 -> f32

* Fix metal nm device

* Fix uqff

* Typos

* Reference uqff

* Fix tests

* Fix sequence length check (EricLBuehler#1546)

* update candle version (EricLBuehler#1545)

Co-authored-by: AlpineVibrations <[email protected]>

* add ios target to metal deps (EricLBuehler#1548)

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: edwko <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Guoqing Bao <[email protected]>
Co-authored-by: Michał Moskal <[email protected]>
Co-authored-by: Chen Mulong <[email protected]>
Co-authored-by: Steph Wolski <[email protected]>
Co-authored-by: omahs <[email protected]>
Co-authored-by: Viktor Szépe <[email protected]>
Co-authored-by: Matthew Haynes <[email protected]>
Co-authored-by: RageLtMan <[email protected]>
Co-authored-by: Brennan Kinney <[email protected]>
Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: Sbargaoui <[email protected]>
Co-authored-by: Gaétan Lepage <[email protected]>
Co-authored-by: Ammar Elsabe <[email protected]>
Co-authored-by: luke <[email protected]>
Co-authored-by: AlpineVibrations <[email protected]>
Co-authored-by: Michael Tissen <[email protected]>