
Add the conformer backbone (phi4mm audio) #1448


Merged
merged 49 commits on Jun 9, 2025

Conversation

EricLBuehler (Owner) commented Jun 7, 2025

Summary by CodeRabbit

  • New Features

    • Introduced a Conformer audio encoder (housed under the vision model suite) with configurable attention, feed-forward, convolutional, and positional encoding components.
    • Added batch normalization and 1D convolution layer support.
    • Implemented convolutional subsampling and both absolute and relative positional embeddings.
    • Enabled modular configuration and integration of Conformer-based models within the vision model suite.
    • Added audio embedding support for the Phi4 multimodal model, including configurable projection modes and integration with the Conformer encoder.
    • Expanded model configuration to include audio processing and embedding parameters for enhanced multimodal capabilities.
    • Extended multimodal model to merge and process image and audio embeddings with attention masking.
    • Enhanced input processing to detect and extract audio features from audio tokens, integrating audio embeddings and attention masks into the multimodal pipeline.
    • Updated prompt prefixing to support audio tokens alongside image tokens, renaming the trait to MultimodalPromptPrefixer and adding audio prefixing methods.
    • Added support for audio inputs throughout the system including request handling, sequence management, interactive mode, and message building.
    • Improved caching and matching logic to consider audio hashes alongside tokens and images.
    • Added audio loading from URLs, file paths, and data URLs with format support and error handling.
    • Extended the web chat interface and API with audio upload, preview, and playback capabilities.
    • Provided new Python and Rust examples demonstrating multimodal chat with audio and image inputs.
  • Chores

    • Added several new dependencies to support audio and signal processing functionalities.
    • Adjusted logging to suppress verbose output from the audio decoding library.


coderabbitai bot commented Jun 7, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

This update introduces a Conformer audio encoder, housed under the vision models module, to the codebase. It adds configuration, encoder, convolutional subsampling, and positional embedding modules, along with new dependencies for audio and signal processing. The encoder implements attention, feed-forward, and convolutional layers, supporting advanced features like relative attention bias and chunked sequence processing. Additionally, audio embedding support for the Phi4 multimodal model is added, integrating the Conformer encoder and new configuration options. The audio processing pipeline is enhanced to extract log-mel spectrogram features and integrate audio embeddings alongside image embeddings. The pipeline and prefixer traits are extended to support multimodal prefixing with audio.
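
For context, the log-mel features mentioned above reduce to a standard recipe: windowed frame, FFT power spectrum, triangular mel filterbank, then log. A self-contained sketch of that idea (illustrative only; the PR's actual pipeline uses rustfft and related crates, and a naive O(n^2) DFT stands in here):

fn hz_to_mel(f: f32) -> f32 {
    2595.0 * (1.0 + f / 700.0).log10()
}

fn mel_to_hz(m: f32) -> f32 {
    700.0 * (10f32.powf(m / 2595.0) - 1.0)
}

// Power spectrum of one frame via a naive DFT.
fn power_spectrum(frame: &[f32]) -> Vec<f32> {
    let n = frame.len();
    (0..=n / 2)
        .map(|k| {
            let (mut re, mut im) = (0.0f32, 0.0f32);
            for (t, &x) in frame.iter().enumerate() {
                let ang = -2.0 * std::f32::consts::PI * (k * t) as f32 / n as f32;
                re += x * ang.cos();
                im += x * ang.sin();
            }
            re * re + im * im
        })
        .collect()
}

// Log-mel energies for one frame: triangular filters spaced evenly in mel.
fn log_mel(frame: &[f32], sample_rate: f32, n_mels: usize) -> Vec<f32> {
    let spec = power_spectrum(frame);
    let n = frame.len();
    let mel_max = hz_to_mel(sample_rate / 2.0);
    let edges: Vec<f32> = (0..n_mels + 2)
        .map(|i| mel_to_hz(mel_max * i as f32 / (n_mels + 1) as f32))
        .collect();
    (0..n_mels)
        .map(|m| {
            let (lo, mid, hi) = (edges[m], edges[m + 1], edges[m + 2]);
            let energy: f32 = spec
                .iter()
                .enumerate()
                .map(|(k, &p)| {
                    let f = k as f32 * sample_rate / n as f32;
                    let w = if (lo..=mid).contains(&f) {
                        (f - lo) / (mid - lo).max(1e-6)
                    } else if (mid..=hi).contains(&f) {
                        (hi - f) / (hi - mid).max(1e-6)
                    } else {
                        0.0
                    };
                    w * p
                })
                .sum();
            (energy + 1e-10).ln()
        })
        .collect()
}

fn main() {
    // One 25 ms frame (400 samples at 16 kHz) of a 440 Hz tone -> 80 log-mel bins.
    let frame: Vec<f32> = (0..400)
        .map(|t| (2.0 * std::f32::consts::PI * 440.0 * t as f32 / 16000.0).sin())
        .collect();
    assert_eq!(log_mel(&frame, 16000.0, 80).len(), 80);
}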

Changes

File(s) Change Summary
Cargo.toml, mistralrs-core/Cargo.toml Added new workspace member mistralrs-audio; added dependencies rubato, rustfft, hound, apodize, and symphonia with audio features.
mistralrs-audio/* New crate providing audio utilities: audio reading, decoding, resampling, and mel spectrogram feature extraction.
mistralrs-core/src/layers.rs Added constructors for batch normalization and 1D convolution layers; extended imports.
mistralrs-core/src/vision_models/conformer/config.rs Introduced configuration structs and logic for the Conformer encoder, including default handling and finalization.
mistralrs-core/src/vision_models/conformer/encoder.rs Implemented the Conformer encoder: attention, feed-forward, convolution, chunking, and supporting tests.
mistralrs-core/src/vision_models/conformer/mod.rs Declared and exposed submodules: config, encoder, nemo, pos_embed.
mistralrs-core/src/vision_models/conformer/nemo.rs Added NemoConvSubsampling: convolutional subsampling module with mask handling.
mistralrs-core/src/vision_models/conformer/pos_embed.rs Added absolute and T5-style relative positional embedding modules.
mistralrs-core/src/vision_models/mod.rs Exposed the new conformer module at the vision model crate level.
mistralrs-core/src/vision_models/phi4/audio_embedding.rs Added audio embedding module integrating Conformer encoder and projection layers for speech and vision modes.
mistralrs-core/src/vision_models/phi4/config.rs Added audio embedding and processor configuration structs; integrated Conformer encoder config into Phi4MMConfig.
mistralrs-core/src/vision_models/phi4/mm_embedding.rs Extended Phi4MMImageAudioEmbedding to optionally include audio embedding initialized from config.
mistralrs-core/src/vision_models/phi4/mod.rs Added audio_embedding module declaration and extended forward method to support audio embeddings.
mistralrs-core/src/vision_models/phi4/inputs_processor.rs Added audio processing fields and methods to extract log-mel spectrogram features; integrated audio embeddings and masks into model inputs.
mistralrs-core/src/vision_models/preprocessor_config.rs Added optional audio processing configuration fields: audio compression rate, downsample rate, and feature stride.
mistralrs-core/src/lib.rs Replaced export of VisionPromptPrefixer with MultimodalPromptPrefixer; added export of AudioInput.
mistralrs-core/src/pipeline/loaders/vision_loaders.rs Replaced VisionPromptPrefixer with MultimodalPromptPrefixer in trait and implementations; added audio prefixing method.
mistralrs-core/src/pipeline/mod.rs Renamed VisionPromptPrefixer to MultimodalPromptPrefixer; added prefix_audio method; updated enum variant type.
mistralrs-core/src/pipeline/vision.rs Updated VisionPipeline prefixer field type to MultimodalPromptPrefixer.
mistralrs-core/src/request.rs Added audios field to RequestMessage::VisionChat; introduced AudioInput struct with WAV reading and bytes decoding support.
mistralrs-core/src/sequence.rs Added SequenceAudios struct and extended MultimodalData and Sequence structs to support audio inputs.
mistralrs-core/src/engine/add_request.rs Propagated audios field in VisionChat request handling.
mistralrs-core/src/pipeline/amoe.rs Added an argument None to Sequence::new_waiting call (minor fix).
mistralrs-core/src/prefix_cacher.rs Extended prefix cache to store and match audio hashes alongside tokens and image hashes.
mistralrs-core/src/utils/debug.rs Added log level override to suppress info-level logs from the symphonia crate.
mistralrs-server-core/src/util.rs Added async function to parse audio URLs or data URLs and load audio data into AudioInput.
mistralrs-server-core/src/chat_completion.rs Added empty audios vector initialization in vision chat request parsing.
mistralrs-pyo3/src/lib.rs Added empty audios vector initialization in vision chat request creation.
mistralrs-server/src/interactive_mode.rs Extended vision interactive mode to detect and process audio URLs; load audio inputs and include in requests.
mistralrs/src/messages.rs Extended VisionMessages and RequestBuilder to support audio messages with prefixing; updated message extraction.
examples/python/custom_tool_call.py Formatting and whitespace improvements only; no logic changes.
.typos.toml Added "seperable" and "Seperable" to ignored typo identifiers.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Encoder
    participant NemoConvSubsampling
    participant PositionalEncoding
    participant EncoderLayer
    participant Attention
    participant FeedForward
    participant ConvModule

    User->>Encoder: forward(xs, mask)
    Encoder->>NemoConvSubsampling: forward(xs, mask)
    NemoConvSubsampling-->>Encoder: xs_sub, mask_sub
    Encoder->>PositionalEncoding: forward(xs_sub)
    PositionalEncoding-->>Encoder: xs_pos
    loop For each EncoderLayer
        Encoder->>EncoderLayer: forward(xs_pos, mask_sub, rel_bias)
        EncoderLayer->>FeedForward: forward(xs_pos)
        FeedForward-->>EncoderLayer: xs_ff
        EncoderLayer->>Attention: forward(xs_ff, mask_sub, rel_bias)
        Attention-->>EncoderLayer: xs_attn
        EncoderLayer->>ConvModule: forward(xs_attn)
        ConvModule-->>EncoderLayer: xs_conv
        EncoderLayer-->>Encoder: xs_out
    end
    Encoder-->>User: output, updated_mask
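A minimal compilable stand-in for the flow in the diagram above; every type and signature here is a placeholder, not the PR's actual API:

type Tensor = Vec<f32>;

struct Subsampler;
struct PosEncoding;
struct EncoderLayer;

impl Subsampler {
    // Convolutional subsampling shortens both the features and the mask.
    fn forward(&self, xs: Tensor, mask: Tensor) -> (Tensor, Tensor) {
        (xs, mask)
    }
}

impl PosEncoding {
    fn forward(&self, xs: Tensor) -> Tensor {
        xs
    }
}

impl EncoderLayer {
    // Feed-forward -> attention (with relative bias) -> conv module, residuals throughout.
    fn forward(&self, xs: Tensor, _mask: &Tensor) -> Tensor {
        xs
    }
}

fn encoder_forward(
    sub: &Subsampler,
    pos: &PosEncoding,
    layers: &[EncoderLayer],
    xs: Tensor,
    mask: Tensor,
) -> (Tensor, Tensor) {
    let (xs, mask) = sub.forward(xs, mask);
    let mut xs = pos.forward(xs);
    for layer in layers {
        xs = layer.forward(xs, &mask);
    }
    (xs, mask)
}

fn main() {
    let (out, _mask) =
        encoder_forward(&Subsampler, &PosEncoding, &[EncoderLayer], vec![0.0; 8], vec![1.0; 8]);
    assert_eq!(out.len(), 8);
}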

Suggested labels

codex

Poem

A Conformer hops into the code,
With layers deep and modules owed.
Attention’s sharp, convolutions stride,
Positional signals by its side.
Now vision learns with chunk and mask,
And audio joins the task!
A rabbit’s work—no simple task! 🐇✨


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 167f1b5 and dd9031b.

📒 Files selected for processing (29)
  • README.md (1 hunks)
  • docs/PHI4MM.md (2 hunks)
  • examples/python/phi4mm_audio.py (1 hunks)
  • examples/server/phi4mm_audio.py (1 hunks)
  • mistralrs-core/src/lib.rs (4 hunks)
  • mistralrs-core/src/pipeline/diffusion.rs (2 hunks)
  • mistralrs-core/src/pipeline/ggml.rs (2 hunks)
  • mistralrs-core/src/pipeline/gguf.rs (2 hunks)
  • mistralrs-core/src/pipeline/loaders/vision_loaders.rs (30 hunks)
  • mistralrs-core/src/pipeline/mod.rs (5 hunks)
  • mistralrs-core/src/pipeline/normal.rs (2 hunks)
  • mistralrs-core/src/pipeline/speech.rs (2 hunks)
  • mistralrs-core/src/pipeline/vision.rs (3 hunks)
  • mistralrs-core/src/utils/progress.rs (1 hunks)
  • mistralrs-core/src/vision_models/conformer/encoder.rs (1 hunks)
  • mistralrs-pyo3/README.md (1 hunks)
  • mistralrs-pyo3/src/lib.rs (6 hunks)
  • mistralrs-pyo3/src/util.rs (2 hunks)
  • mistralrs-server-core/src/chat_completion.rs (6 hunks)
  • mistralrs-web-chat/README.md (1 hunks)
  • mistralrs-web-chat/src/handlers/api.rs (1 hunks)
  • mistralrs-web-chat/src/handlers/websocket.rs (10 hunks)
  • mistralrs-web-chat/src/main.rs (2 hunks)
  • mistralrs-web-chat/static/index.html (1 hunks)
  • mistralrs-web-chat/static/js/ui.js (1 hunks)
  • mistralrs-web-chat/static/js/utils.js (2 hunks)
  • mistralrs-web-chat/static/js/websocket.js (4 hunks)
  • mistralrs/examples/phi4mm_audio/main.rs (1 hunks)
  • mistralrs/src/messages.rs (12 hunks)

github-actions bot commented Jun 7, 2025

Code Metrics Report
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                3           62           53            0            9
 CSS                     1          473          408           14           51
 Dockerfile              1           42           23           10            9
 HTML                    1           73           61            4            8
 JavaScript              7         1248          936          174          138
 JSON                   14          123          122            0            1
 Makefile                1            6            5            0            1
 Python                 88         4164         3518          161          485
 Shell                   1           63           26           18           19
 Plain Text              3         3723            0         2413         1310
 TOML                   21          765          703           11           51
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       3            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          205          178            1           26
 (Total)                            282          210           32           40
-------------------------------------------------------------------------------
 Markdown               60         5245            0         4009         1236
 |- BASH                11          123          117            2            4
 |- JSON                 2           42           42            0            0
 |- Python               7          121          109            0           12
 |- Rust                22          757          634            1          122
 |- TOML                 2           75           63            0           12
 (Total)                           6363          965         4012         1386
-------------------------------------------------------------------------------
 Rust                  377       132662       118026         2922        11714
 |- Markdown           176         2993           25         2660          308
 (Total)                         135655       118051         5582        12022
===============================================================================
 Total                 583       148670       123900         9738        15032
===============================================================================


coderabbitai bot left a comment


Actionable comments posted: 9

🧹 Nitpick comments (6)
mistralrs-core/src/vision_models/conformer/pos_embed.rs (2)

49-51: Provide more informative error message

Include the actual sequence length and maximum supported length in the error message for better debugging.

-if xs.dim(1)? >= self.pe.dim(1)? {
-    candle_core::bail!("Need to recompute positional embeds");
-}
+if xs.dim(1)? > self.pe.dim(1)? {
+    candle_core::bail!(
+        "Sequence length {} exceeds maximum positional encoding length {}",
+        xs.dim(1)?,
+        self.pe.dim(1)?
+    );
+}

102-106: Clarify unimplemented bucketing path

The bucketing path is not implemented. Consider adding a more descriptive error message or implementing the feature if needed.

-unimplemented!("require skip_bucketing");
+unimplemented!("Bucketing for relative attention bias is not yet implemented. Set skip_bucketing=true to use direct indexing.");
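For reference, the bucketing this unimplemented path refers to is the T5 scheme: exact buckets for small offsets, log-spaced buckets for larger ones. A runnable sketch adapted from the T5 paper's description (not this PR's code):

fn relative_position_bucket(rel: i64, num_buckets: i64, max_distance: i64, bidirectional: bool) -> i64 {
    let (mut bucket, n, num_buckets) = if bidirectional {
        (if rel > 0 { num_buckets / 2 } else { 0 }, rel.abs(), num_buckets / 2)
    } else {
        (0, (-rel).max(0), num_buckets)
    };
    let max_exact = num_buckets / 2;
    if n < max_exact {
        bucket += n; // one exact bucket per small offset
    } else {
        // log-spaced buckets for larger offsets, clipped to the last bucket
        let large = max_exact as f64
            + (n as f64 / max_exact as f64).ln() / (max_distance as f64 / max_exact as f64).ln()
                * (num_buckets - max_exact) as f64;
        bucket += (large as i64).min(num_buckets - 1);
    }
    bucket
}

fn main() {
    assert_eq!(relative_position_bucket(5, 32, 128, true), 21); // 16 + 5: exact range
    assert_eq!(relative_position_bucket(-5, 32, 128, true), 5);
    assert!(relative_position_bucket(100, 32, 128, true) < 32); // clipped log bucket
}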
mistralrs-core/src/vision_models/conformer/nemo.rs (2)

32-32: Remove unused variables

These variables are defined but never used in the code.

-let ceil_mode = false;
-let left_padding = (kernel_size - 1) / 2;
-let right_padding = (kernel_size - 1) / 2;
+let padding = (kernel_size - 1) / 2;

Then update the Conv2dConfig to use padding instead of left_padding.

Also applies to: 37-38


126-145: Add documentation for length calculation

This helper method would benefit from documentation explaining the convolution output size formula.

Add a doc comment above the method:

/// Calculates the output length after applying convolution layers.
/// 
/// Uses the formula: output_length = floor((input_length + padding - kernel_size) / stride + 1)
/// Applied recursively for `repeat_num` layers.
fn calc_length(
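A runnable sketch of that formula with a simplified signature (not the PR's exact code); integer division gives the floor:

fn calc_length(mut len: usize, padding: usize, kernel_size: usize, stride: usize, repeat_num: usize) -> usize {
    for _ in 0..repeat_num {
        len = (len + padding - kernel_size) / stride + 1;
    }
    len
}

fn main() {
    // 100 frames through two stride-2 layers (kernel 3, total padding 2): 100 -> 50 -> 25.
    assert_eq!(calc_length(100, 2, 3, 2, 2), 25);
}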
mistralrs-core/src/vision_models/conformer/encoder.rs (2)

571-571: Make max_seq_len configurable

The maximum sequence length is hardcoded to 500. Consider making this configurable through the encoder configuration for flexibility.

-        let max_seq_len = 500;
+        let max_seq_len = self.cfg.max_seq_len.unwrap_or(500);

This would require adding a max_seq_len: Option<usize> field to the ConformerEncoderConfig struct and storing the config in the Encoder struct.


625-625: Avoid unnecessary clone for short sequences

The clone operation is unnecessary when the sequence doesn't need unfolding.

-        return Ok(xs_pad.clone());
+        return Ok(xs_pad.shallow_clone());

Using shallow_clone avoids copying the underlying data and only creates a new tensor view.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2532852 and 3853cfe.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (8)
  • mistralrs-core/Cargo.toml (1 hunks)
  • mistralrs-core/src/layers.rs (3 hunks)
  • mistralrs-core/src/vision_models/conformer/config.rs (1 hunks)
  • mistralrs-core/src/vision_models/conformer/encoder.rs (1 hunks)
  • mistralrs-core/src/vision_models/conformer/mod.rs (1 hunks)
  • mistralrs-core/src/vision_models/conformer/nemo.rs (1 hunks)
  • mistralrs-core/src/vision_models/conformer/pos_embed.rs (1 hunks)
  • mistralrs-core/src/vision_models/mod.rs (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
mistralrs-core/src/vision_models/conformer/encoder.rs (4)
mistralrs-core/src/diffusion_models/flux/model.rs (1)
  • attention (79-84)
mistralrs-core/src/vision_models/conformer/pos_embed.rs (4)
  • new (15-46)
  • new (65-84)
  • forward (48-54)
  • forward (86-117)
mistralrs-core/src/vision_models/conformer/nemo.rs (2)
  • new (18-124)
  • forward (147-181)
mistralrs-core/src/layers.rs (31)
  • new (209-212)
  • new (251-256)
  • new (280-286)
  • new (530-561)
  • new (981-997)
  • new (1058-1074)
  • new (1274-1301)
  • new (1456-1466)
  • new (1585-1599)
  • new (1617-1637)
  • new (1666-1680)
  • new (1765-1791)
  • new (1990-2025)
  • linear (178-182)
  • forward (239-241)
  • forward (264-270)
  • forward (288-290)
  • forward (579-654)
  • forward (963-970)
  • forward (1039-1048)
  • forward (1115-1124)
  • forward (1303-1338)
  • forward (1484-1534)
  • forward (1601-1608)
  • forward (1639-1656)
  • forward (1740-1754)
  • forward (1821-1891)
  • layer_norm (56-69)
  • linear_b (189-200)
  • conv1d (149-162)
  • batch_norm (71-97)
🪛 GitHub Check: Typos
mistralrs-core/src/vision_models/conformer/config.rs

[warning] 93-93:
"seperable" should be "separable".


[warning] 14-14:
"seperable" should be "separable".

mistralrs-core/src/vision_models/conformer/encoder.rs

[warning] 337-337:
"seperable" should be "separable".


[warning] 334-334:
"seperable" should be "separable".


[warning] 333-333:
"seperable" should be "separable".


[warning] 329-329:
"Seperable" should be "Separable".


[warning] 301-301:
"Seperable" should be "Separable".


[warning] 184-184:
"seperable" should be "separable".


[warning] 169-169:
"Seperable" should be "Separable".


[warning] 164-164:
"Seperable" should be "Separable".

⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: Docs
  • GitHub Check: Check (macOS-latest, stable)
  • GitHub Check: Check (windows-latest, stable)
  • GitHub Check: Test Suite (macOS-latest, stable)
  • GitHub Check: Test Suite (ubuntu-latest, stable)
  • GitHub Check: Test Suite (windows-latest, stable)
🔇 Additional comments (8)
mistralrs-core/Cargo.toml (1)

97-103:

✅ Verification successful

Verify dependency versions and consider workspace management.

The new audio processing dependencies use hardcoded versions rather than workspace-managed versions like other dependencies in this file. Consider:

  1. Moving these to the workspace Cargo.toml for consistency
  2. Verifying these are the latest stable versions and checking for security advisories

🏁 Script executed:

#!/bin/bash
# Description: Check versions and security advisories for the new audio dependencies

echo "=== Checking latest versions on crates.io ==="
for crate in mel_spec dasp rubato rustfft symphonia hound cpal; do
    echo "Checking $crate..."
    curl -s "https://crates.io/api/v1/crates/$crate" | jq -r ".crate.max_stable_version // .crate.max_version"
done

echo -e "\n=== Checking for known security advisories ==="
# Check RustSec advisory database for these crates
for crate in mel_spec dasp rubato rustfft symphonia hound cpal; do
    echo "Checking advisories for $crate..."
    curl -s "https://raw.githubusercontent.com/RustSec/advisory-db/main/crates/$crate/RUSTSEC-*" 2>/dev/null || echo "No advisories found for $crate"
done

Length of output: 3551


Dependency versions up-to-date; no known security advisories

All seven audio crates in mistralrs-core/Cargo.toml (lines 97–103) match the latest stable releases on crates.io and have no entries in the RustSec advisory database:

• mel_spec 0.3.3
• dasp 0.11.0
• rubato 0.16.2
• rustfft 6.3.0
• symphonia 0.5.4
• hound 3.5.1
• cpal 0.15.3

If these dependencies are used by multiple workspace members, you may wish to centralize their version definitions in the top-level Cargo.toml for consistency. Otherwise, keeping them pinned locally here is acceptable.

mistralrs-core/src/vision_models/mod.rs (1)

6-6: LGTM!

The module declaration follows the existing pattern and is correctly placed among other vision model modules.

mistralrs-core/src/vision_models/conformer/mod.rs (1)

1-4: LGTM! Clean module organization.

The submodule declarations follow standard Rust patterns and logically organize the Conformer implementation components (config, encoder, nemo convolutional subsampling, and positional embeddings).

mistralrs-core/src/layers.rs (3)

10-11: LGTM! Appropriate import additions.

The new imports for BatchNorm, BatchNormConfig, Conv1d, and Conv1dConfig are correctly added to support the new layer construction functions.


71-97: LGTM! Well-implemented batch normalization function.

The batch_norm function correctly:

  • Validates epsilon to prevent negative values
  • Handles both affine and non-affine configurations
  • Loads running statistics from the variable builder
  • Follows the same pattern as other layer construction functions in this file
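For readers unfamiliar with inference-time batch norm, the math behind those bullets is y = (x - mean) / sqrt(var + eps) * weight + bias over the running statistics; a plain-Rust sketch (not the candle-based implementation in layers.rs):

fn batch_norm_1d(x: &[f32], mean: f32, var: f32, eps: f32, weight: f32, bias: f32) -> Vec<f32> {
    x.iter()
        .map(|&v| (v - mean) / (var + eps).sqrt() * weight + bias)
        .collect()
}

fn main() {
    // Running mean 2, variance 1: inputs [1, 2, 3] normalize to roughly [-1, 0, 1].
    let y = batch_norm_1d(&[1.0, 2.0, 3.0], 2.0, 1.0, 1e-5, 1.0, 0.0);
    assert!((y[0] + 1.0).abs() < 1e-3 && y[1].abs() < 1e-3 && (y[2] - 1.0).abs() < 1e-3);
}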

149-176: LGTM! Consistent 1D convolution implementations.

The conv1d and conv1d_no_bias functions follow the established patterns from the existing conv2d functions:

  • Correct weight tensor shape calculation with group support
  • Proper bias handling (with/without variants)
  • Consistent parameter naming and structure
mistralrs-core/src/vision_models/conformer/config.rs (1)

128-141: Good use of sentinel values pattern

The implementation correctly uses sentinel values to detect and replace uninitialized configuration fields. This ensures the NemoConvConfig is properly initialized with encoder-level defaults.
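A minimal sketch of that sentinel pattern, with assumed field names (the real struct has more fields): usize::MAX marks "not set by the user", and finalize() fills in encoder-level defaults.

struct NemoConvConfig {
    subsampling_factor: usize, // usize::MAX == uninitialized
    feat_in: usize,            // usize::MAX == uninitialized
}

impl NemoConvConfig {
    fn finalize(&mut self, encoder_subsampling_factor: usize, encoder_feat_in: usize) {
        if self.subsampling_factor == usize::MAX {
            self.subsampling_factor = encoder_subsampling_factor;
        }
        if self.feat_in == usize::MAX {
            self.feat_in = encoder_feat_in;
        }
    }
}

fn main() {
    let mut cfg = NemoConvConfig { subsampling_factor: usize::MAX, feat_in: 64 };
    cfg.finalize(8, 80);
    assert_eq!((cfg.subsampling_factor, cfg.feat_in), (8, 64)); // only the sentinel is replaced
}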

mistralrs-core/src/vision_models/conformer/encoder.rs (1)

116-121:

❓ Verification inconclusive

Potential dimension mismatch in output projection

The output projection expects input dimension embed_dim / attention_group_size, but the attention weights have dimension embed_dim after reshaping. This will cause a dimension mismatch when attention_group_size > 1.

Run the following script to verify if attention_group_size is used with values > 1 in the codebase:


🏁 Script executed:

#!/bin/bash
# Description: Check for attention_group_size usage in config files or code
rg -A 5 -B 5 "attention_group_size" --type rust

Length of output: 5452


I’d like to inspect the full Attention::forward implementation to confirm how attn_weights is shaped and passed into o_proj. Could you run:

rg -n "pub fn forward" -C5 mistralrs-core/src/vision_models/conformer/encoder.rs

and share the surrounding 10–15 lines?

        let bias_idx = if self.symmetric {
            bias_idx.abs()?
        } else {
            (bias_idx + (self.bias_values.embeddings().dim(0)? as f64 / 2.))?

🛠️ Refactor suggestion

Use integer division for bias offset calculation

The offset calculation should use integer division to avoid potential precision issues.

-(bias_idx + (self.bias_values.embeddings().dim(0)? as f64 / 2.))?
+(bias_idx + (self.bias_values.embeddings().dim(0)? / 2) as i64)?
🤖 Prompt for AI Agents
In mistralrs-core/src/vision_models/conformer/pos_embed.rs at line 111, the bias
offset calculation currently uses floating-point division which can cause
precision issues. Change the division to integer division by ensuring the
divisor and dividend are integers before dividing, so the offset is calculated
as an integer value rather than a float.
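
Putting the snippet and the suggestion together, the indexing logic is: |i - j| when symmetric, otherwise (i - j) shifted by half the bias table length, using integer division. A scalar sketch (the real code operates on tensors):

fn bias_index(i: i64, j: i64, table_len: i64, symmetric: bool) -> i64 {
    let rel = i - j;
    if symmetric {
        rel.abs()
    } else {
        rel + table_len / 2
    }
}

fn main() {
    assert_eq!(bias_index(3, 5, 8, true), 2);  // |3 - 5|
    assert_eq!(bias_index(3, 5, 8, false), 2); // -2 + 4
    assert_eq!(bias_index(5, 3, 8, false), 6); //  2 + 4
}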

Comment on lines +19 to +23
        if cfg.subsampling_factor % 2 != 0 {
            candle_core::bail!("Sampling factor should be a multiple of 2!");
        }

        let sampling_num = (cfg.subsampling_factor as f32).log2() as usize;

⚠️ Potential issue

Validate subsampling factor is a power of 2

The code uses log2 which requires the subsampling factor to be a power of 2, but the validation only checks if it's even.

-if cfg.subsampling_factor % 2 != 0 {
-    candle_core::bail!("Sampling factor should be a multiple of 2!");
-}
+if !cfg.subsampling_factor.is_power_of_two() {
+    candle_core::bail!("Subsampling factor must be a power of 2, got {}", cfg.subsampling_factor);
+}
🤖 Prompt for AI Agents
In mistralrs-core/src/vision_models/conformer/nemo.rs around lines 19 to 23, the
validation currently only checks if subsampling_factor is even, but since log2
is used later, subsampling_factor must be a power of 2. Update the validation to
check that subsampling_factor is a power of 2 by verifying that
subsampling_factor & (subsampling_factor - 1) == 0, and if not, return an error
indicating the subsampling factor must be a power of 2.
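
A quick demonstration of why the even-only check is insufficient: 6 passes it, but log2 truncates and the reconstructed factor is wrong.

fn main() {
    let factor: usize = 6;
    assert!(factor % 2 == 0);                            // passes the even-only check
    let sampling_num = (factor as f32).log2() as usize;  // truncates 2.58.. to 2
    assert_eq!(1usize << sampling_num, 4);               // 2^2 = 4, not 6: wrong layer count
    assert!(!factor.is_power_of_two());                  // the check the review proposes
}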

serde_default_fn!(bool, default_batch_norm, false);
serde_default_fn!(usize, default_ext_pw_out_channel, 0);
serde_default_fn!(usize, default_ext_pw_kernel_size, 1);
serde_default_fn!(usize, default_depthwise_seperable_out_channel, 256);

⚠️ Potential issue

Fix typo: "seperable" should be "separable"

-serde_default_fn!(usize, default_depthwise_seperable_out_channel, 256);
+serde_default_fn!(usize, default_depthwise_separable_out_channel, 256);
-#[serde(default = "default_depthwise_seperable_out_channel")]
-pub depthwise_seperable_out_channel: usize,
+#[serde(default = "default_depthwise_separable_out_channel")]
+pub depthwise_separable_out_channel: usize,

Also applies to: 93-93

🧰 Tools
🪛 GitHub Check: Typos

[warning] 14-14:
"seperable" should be "separable".

🤖 Prompt for AI Agents
In mistralrs-core/src/vision_models/conformer/config.rs at lines 14 and 93,
there is a typo in the function name "default_depthwise_seperable_out_channel"
where "seperable" should be corrected to "separable". Rename the function to
"default_depthwise_separable_out_channel" consistently in both locations to fix
the typo.

Comment on lines +169 to +170
impl DepthWiseSeperableConv1d {
    fn new(cfg: &ConformerEncoderConfig, padding: usize, vb: ShardedVarBuilder) -> Result<Self> {

⚠️ Potential issue

Fix typos in function and variable names

Multiple occurrences of the typo need to be fixed.

-impl DepthWiseSeperableConv1d {
+impl DepthWiseSeparableConv1d {
-    fn new(cfg: &ConformerEncoderConfig, padding: usize, vb: ShardedVarBuilder) -> Result<Self> {
+    fn new(cfg: &ConformerEncoderConfig, padding: usize, vb: ShardedVarBuilder) -> Result<Self> {
-    dw_sep_conv_1d: DepthWiseSeperableConv1d,
+    dw_sep_conv_1d: DepthWiseSeparableConv1d,
-        let dw_sep_conv_1d = DepthWiseSeperableConv1d::new(cfg, padding, vb.pp("dw_sep_conv_1d"))?;
+        let dw_sep_conv_1d = DepthWiseSeparableConv1d::new(cfg, padding, vb.pp("dw_sep_conv_1d"))?;

Also applies to: 301-301, 329-329

🧰 Tools
🪛 GitHub Check: Typos

[warning] 169-169:
"Seperable" should be "Separable".

🤖 Prompt for AI Agents
In mistralrs-core/src/vision_models/conformer/encoder.rs at lines 169-170, 301,
and 329, correct the typo in the struct name and any related function or
variable names from "DepthWiseSeperableConv1d" to "DepthWiseSeparableConv1d".
Update all occurrences consistently to fix the spelling error.

vb.pp("dw_conv"),
)?;

let pw_conv = if cfg.depthwise_seperable_out_channel != 0 {

⚠️ Potential issue

Fix typos in configuration field references

The configuration field names contain typos.

-        let pw_conv = if cfg.depthwise_seperable_out_channel != 0 {
+        let pw_conv = if cfg.depthwise_separable_out_channel != 0 {
-        let ln2 = if cfg.depthwise_seperable_out_channel != 0
-            && cfg.attention_dim != cfg.depthwise_seperable_out_channel
+        let ln2 = if cfg.depthwise_separable_out_channel != 0
+            && cfg.attention_dim != cfg.depthwise_separable_out_channel
-                cfg.depthwise_seperable_out_channel,
+                cfg.depthwise_separable_out_channel,

Also applies to: 333-335

🧰 Tools
🪛 GitHub Check: Typos

[warning] 184-184:
"seperable" should be "separable".

🤖 Prompt for AI Agents
In mistralrs-core/src/vision_models/conformer/encoder.rs at lines 184 and
333-335, the configuration field names contain typos such as
"depthwise_seperable_out_channel". Correct these typos by replacing "seperable"
with the correct spelling "separable" in all occurrences to ensure proper field
referencing.

}
}

struct DepthWiseSeperableConv1d {

⚠️ Potential issue

Fix typo in struct name

The struct name contains a typo.

-struct DepthWiseSeperableConv1d {
+struct DepthWiseSeparableConv1d {
🧰 Tools
🪛 GitHub Check: Typos

[warning] 164-164:
"Seperable" should be "Separable".

🤖 Prompt for AI Agents
In mistralrs-core/src/vision_models/conformer/encoder.rs at line 164, the struct
name DepthWiseSeperableConv1d contains a typo. Rename the struct to
DepthWiseSeparableConv1d to correct the spelling.

            None
        };

        assert_eq!(cfg.linear_glu_in_convm, false);

🛠️ Refactor suggestion

Improve assertion with descriptive error message

The assertion should provide a clear error message explaining why this configuration is not supported.

-        assert_eq!(cfg.linear_glu_in_convm, false);
+        assert_eq!(
+            cfg.linear_glu_in_convm, false,
+            "linear_glu_in_convm is not supported in ConvModule"
+        );
🤖 Prompt for AI Agents
In mistralrs-core/src/vision_models/conformer/encoder.rs at line 396, the
assertion checking that cfg.linear_glu_in_convm is false lacks a descriptive
error message. Update the assert_eq! macro to include a clear error message
explaining that enabling linear_glu_in_convm is not supported, so that if the
assertion fails, the reason is immediately clear.


coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (4)
mistralrs-core/src/vision_models/conformer/encoder.rs (4)

164-164: Fix typos in struct and variable names

The struct name and related references contain spelling errors.

-struct DepthWiseSeperableConv1d {
+struct DepthWiseSeparableConv1d {
-impl DepthWiseSeperableConv1d {
+impl DepthWiseSeparableConv1d {
-    dw_sep_conv_1d: DepthWiseSeperableConv1d,
+    dw_sep_conv_1d: DepthWiseSeparableConv1d,
-        let dw_sep_conv_1d = DepthWiseSeperableConv1d::new(cfg, padding, vb.pp("dw_sep_conv_1d"))?;
+        let dw_sep_conv_1d = DepthWiseSeparableConv1d::new(cfg, padding, vb.pp("dw_sep_conv_1d"))?;

Also applies to: 169-170, 301-301, 329-329

🧰 Tools
🪛 GitHub Check: Typos

[warning] 164-164:
"Seperable" should be "Separable".


184-184: Fix typos in configuration field references

The configuration field names contain spelling errors.

-        let pw_conv = if cfg.depthwise_seperable_out_channel != 0 {
+        let pw_conv = if cfg.depthwise_separable_out_channel != 0 {
-        let ln2 = if cfg.depthwise_seperable_out_channel != 0
-            && cfg.attention_dim != cfg.depthwise_seperable_out_channel
+        let ln2 = if cfg.depthwise_separable_out_channel != 0
+            && cfg.attention_dim != cfg.depthwise_separable_out_channel
        {
            Some(layers::linear(
-                cfg.depthwise_seperable_out_channel,
+                cfg.depthwise_separable_out_channel,

Also applies to: 333-337

🧰 Tools
🪛 GitHub Check: Typos

[warning] 184-184:
"seperable" should be "separable".


273-276: Fix incorrect kernel size reference in causal padding removal

The code uses the wrong kernel size field for causal padding removal.

-        if self.cfg.causal && self.cfg.kernel_size > 1 {
+        if self.cfg.causal && self.cfg.ext_pw_kernel_size > 1 {
             let seq_len = x.dim(2)?;
-            x = x.i((.., .., ..(seq_len - (self.cfg.kernel_size - 1))))?;
+            x = x.i((.., .., ..(seq_len - (self.cfg.ext_pw_kernel_size - 1))))?;
         }

396-396: Add descriptive error message to assertion

The assertion should explain why this configuration is not supported.

-        assert_eq!(cfg.linear_glu_in_convm, false);
+        assert_eq!(
+            cfg.linear_glu_in_convm, false,
+            "linear_glu_in_convm is not supported in ConvModule"
+        );
🧹 Nitpick comments (1)
mistralrs-core/src/vision_models/conformer/encoder.rs (1)

609-611: Consider optimizing padding operation

Using pad_with_zeros creates a new tensor. For better memory efficiency, consider using tensor slicing or views when possible.

For sequences that require padding, you might want to track the original sequence length and use masking instead of explicit padding, which could improve memory efficiency for large batches.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3853cfe and d94134e.

📒 Files selected for processing (1)
  • mistralrs-core/src/vision_models/conformer/encoder.rs (1 hunks)
🧰 Additional context used
🪛 GitHub Check: Typos
mistralrs-core/src/vision_models/conformer/encoder.rs

[warning] 337-337:
"seperable" should be "separable".


[warning] 334-334:
"seperable" should be "separable".


[warning] 333-333:
"seperable" should be "separable".


[warning] 329-329:
"Seperable" should be "Separable".


[warning] 301-301:
"Seperable" should be "Separable".


[warning] 184-184:
"seperable" should be "separable".


[warning] 169-169:
"Seperable" should be "Separable".


[warning] 164-164:
"Seperable" should be "Separable".

⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: Test Suite (macOS-latest, stable)
  • GitHub Check: Check (windows-latest, stable)
  • GitHub Check: Check (macOS-latest, stable)
  • GitHub Check: Test Suite (windows-latest, stable)
  • GitHub Check: Test Suite (ubuntu-latest, stable)
🔇 Additional comments (1)
mistralrs-core/src/vision_models/conformer/encoder.rs (1)

644-709: Well-implemented tensor unfolding with comprehensive tests!

The unfold_tensor function correctly handles sequence chunking with proper dimension manipulation, and the test cases cover various scenarios including edge cases.
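
For orientation, a candle-based sketch of the unfolding idea, assuming a (batch, time, dim) layout and zero-padding time to a multiple of the chunk size (hypothetical helper, not the PR's unfold_tensor):

use candle_core::{DType, Device, Result, Tensor};

// Chunk (batch, time, dim) into (batch * n_chunks, chunk, dim); short
// sequences pass through unchanged.
fn unfold_sketch(xs: &Tensor, chunk: usize) -> Result<Tensor> {
    let (b, t, d) = xs.dims3()?;
    if t <= chunk {
        return Ok(xs.clone());
    }
    let pad = (chunk - t % chunk) % chunk;
    let xs = if pad > 0 {
        xs.pad_with_zeros(1, 0, pad)?
    } else {
        xs.clone()
    };
    let n_chunks = (t + pad) / chunk;
    xs.reshape((b * n_chunks, chunk, d))
}

fn main() -> Result<()> {
    let xs = Tensor::zeros((2, 7, 4), DType::F32, &Device::Cpu)?;
    let out = unfold_sketch(&xs, 3)?; // time 7 pads to 9 -> (2 * 3, 3, 4)
    assert_eq!(out.dims(), &[6, 3, 4]);
    Ok(())
}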

Comment on lines 30 to 66
fn new(cfg: &ConformerEncoderConfig, vb: ShardedVarBuilder) -> Result<Self> {
    let embed_dim = cfg.attention_dim;
    let num_heads = cfg.attention_heads;
    let head_dim = embed_dim / num_heads;
    let scale = 1.0 / (head_dim as f32).sqrt();

    let q_proj = mistralrs_quant::linear(embed_dim, embed_dim, &None, vb.pp("q_proj"))?;
    let k_proj = mistralrs_quant::linear(
        embed_dim,
        embed_dim / cfg.attention_group_size,
        &None,
        vb.pp("k_proj"),
    )?;
    let v_proj = mistralrs_quant::linear(
        embed_dim,
        embed_dim / cfg.attention_group_size,
        &None,
        vb.pp("v_proj"),
    )?;
    let o_proj = mistralrs_quant::linear(
        embed_dim / cfg.attention_group_size,
        embed_dim,
        &None,
        vb.pp("out_proj"),
    )?;

    Ok(Self {
        embed_dim,
        num_heads,
        head_dim,
        scale,
        q_proj,
        k_proj,
        v_proj,
        o_proj,
    })
}

🛠️ Refactor suggestion

Add validation for attention group size

The code assumes attention_group_size evenly divides embed_dim, but doesn't validate this assumption.

Add validation in the constructor:

 fn new(cfg: &ConformerEncoderConfig, vb: ShardedVarBuilder) -> Result<Self> {
     let embed_dim = cfg.attention_dim;
     let num_heads = cfg.attention_heads;
+    
+    // Validate attention group size
+    if embed_dim % cfg.attention_group_size != 0 {
+        candle_core::bail!(
+            "attention_dim ({}) must be divisible by attention_group_size ({})",
+            embed_dim,
+            cfg.attention_group_size
+        );
+    }
+    
     let head_dim = embed_dim / num_heads;
     let scale = 1.0 / (head_dim as f32).sqrt();
🤖 Prompt for AI Agents
In mistralrs-core/src/vision_models/conformer/encoder.rs around lines 30 to 66,
the constructor assumes that attention_group_size evenly divides embed_dim but
does not validate this. Add a check at the start of the new function to ensure
embed_dim % attention_group_size == 0, and return an error if this condition is
not met to prevent invalid configurations.


coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (7)
mistralrs-core/src/vision_models/conformer/nemo.rs (1)

21-25: Validate subsampling factor is a power of 2

The code uses log2 which requires the subsampling factor to be a power of 2, but the validation only checks if it's even.

-if cfg.subsampling_factor % 2 != 0 {
-    candle_core::bail!("Sampling factor should be a multiple of 2!");
-}
+if !cfg.subsampling_factor.is_power_of_two() {
+    candle_core::bail!("Subsampling factor must be a power of 2, got {}", cfg.subsampling_factor);
+}
mistralrs-core/src/vision_models/conformer/config.rs (1)

14-14: Fix typo: "seperable" should be "separable"

This typo needs to be fixed in both the function name and field name.

Also applies to: 93-94

🧰 Tools
🪛 GitHub Check: Typos

[warning] 14-14:
"seperable" should be "separable".

mistralrs-core/src/vision_models/conformer/encoder.rs (5)

30-66: Add validation for attention group size

The constructor should validate that attention_dim is divisible by attention_group_size to prevent runtime errors.


164-164: Fix typos in struct and function names

"Seperable" should be "Separable" throughout the implementation.

Also applies to: 169-170, 184-184

🧰 Tools
🪛 GitHub Check: Typos

[warning] 164-164:
"Seperable" should be "Separable".


273-276: Fix incorrect kernel size reference in causal padding removal

The code should use self.cfg.ext_pw_kernel_size instead of self.cfg.kernel_size to match the convolution configuration.


329-329: Fix typos in configuration field references

"seperable" should be "separable" in all field references.

Also applies to: 333-335, 337-337

🧰 Tools
🪛 GitHub Check: Typos

[warning] 329-329:
"Seperable" should be "Separable".


396-396: Improve assertion with descriptive error message

The assertion should include a clear error message explaining why linear_glu_in_convm is not supported.

🧹 Nitpick comments (5)
mistralrs-core/src/vision_models/phi4/audio_embedding.rs (3)

23-23: Fix typo in comment

Static analysis correctly identified a typo.

-    /// If vision + speech or only vision (not sure why that is necesary though)
+    /// If vision + speech or only vision (not sure why that is necessary though)
🧰 Tools
🪛 GitHub Check: Typos

[warning] 23-23:
"necesary" should be "necessary".


46-47: Remove commented code

Commented-out code should be removed to maintain code cleanliness.

-        // let audio_dim_in = conformer_config.input_size;

54-54: Consider making MLP depth configurable

The depth is hard-coded to 2, which might limit flexibility for different use cases.

Consider adding a depth field to Phi4MMAudioEmbedConfig to make this configurable:

pub struct Phi4MMAudioEmbedConfig {
    pub n_embd: Option<usize>,
    pub compression_rate: usize,
    pub downsample_rate: usize,
+   pub projection_depth: Option<usize>,
    pub embedding_cls: String,
    pub projection_cls: String,
}

Then use: let depth = audio_embd_config.projection_depth.unwrap_or(2);

mistralrs-core/src/vision_models/conformer/nemo.rs (1)

41-45: Remove commented code

Commented-out code should be removed to maintain code cleanliness.

-        // let max_cache_len = if cfg.is_causal {
-        //     cfg.subsampling_factor+1
-        // } else {
-        //     0
-        // };
mistralrs-core/src/vision_models/conformer/encoder.rs (1)

556-559: Consider improving assertion clarity

The assertion checking for relative attention bias configuration could be more explicit:

-        assert!(cfg
-            .relative_attention_bias_args
-            .as_ref()
-            .is_some_and(|x| x.tp == "t5"));
+        assert!(
+            cfg.relative_attention_bias_args
+                .as_ref()
+                .is_some_and(|x| x.tp == "t5"),
+            "ConformerEncoder requires relative_attention_bias_args with type 't5'"
+        );
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d94134e and 3907feb.

📒 Files selected for processing (8)
  • mistralrs-core/src/vision_models/conformer/config.rs (1 hunks)
  • mistralrs-core/src/vision_models/conformer/encoder.rs (1 hunks)
  • mistralrs-core/src/vision_models/conformer/nemo.rs (1 hunks)
  • mistralrs-core/src/vision_models/conformer/pos_embed.rs (1 hunks)
  • mistralrs-core/src/vision_models/phi4/audio_embedding.rs (1 hunks)
  • mistralrs-core/src/vision_models/phi4/config.rs (4 hunks)
  • mistralrs-core/src/vision_models/phi4/mm_embedding.rs (2 hunks)
  • mistralrs-core/src/vision_models/phi4/mod.rs (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • mistralrs-core/src/vision_models/phi4/mod.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • mistralrs-core/src/vision_models/conformer/pos_embed.rs
🧰 Additional context used
🧬 Code Graph Analysis (2)
mistralrs-core/src/vision_models/phi4/audio_embedding.rs (2)
mistralrs-core/src/vision_models/conformer/encoder.rs (8)
  • new (30-66)
  • new (132-148)
  • new (170-202)
  • new (222-265)
  • new (310-411)
  • new (473-489)
  • new (526-531)
  • new (548-587)
mistralrs-core/src/utils/unvarbuilder.rs (1)
  • pp (128-130)
mistralrs-core/src/vision_models/conformer/encoder.rs (3)
mistralrs-core/src/vision_models/conformer/nemo.rs (2)
  • new (20-126)
  • forward (149-183)
mistralrs-core/src/vision_models/conformer/pos_embed.rs (4)
  • new (15-60)
  • new (79-98)
  • forward (62-68)
  • forward (100-131)
mistralrs-core/src/layers.rs (31)
  • new (209-212)
  • new (251-256)
  • new (280-286)
  • new (530-561)
  • new (981-997)
  • new (1058-1074)
  • new (1274-1301)
  • new (1456-1466)
  • new (1585-1599)
  • new (1617-1637)
  • new (1666-1680)
  • new (1765-1791)
  • new (1990-2025)
  • linear (178-182)
  • forward (239-241)
  • forward (264-270)
  • forward (288-290)
  • forward (579-654)
  • forward (963-970)
  • forward (1039-1048)
  • forward (1115-1124)
  • forward (1303-1338)
  • forward (1484-1534)
  • forward (1601-1608)
  • forward (1639-1656)
  • forward (1740-1754)
  • forward (1821-1891)
  • layer_norm (56-69)
  • linear_b (189-200)
  • conv1d (149-162)
  • batch_norm (71-97)
🪛 GitHub Check: Typos
mistralrs-core/src/vision_models/phi4/audio_embedding.rs

[warning] 23-23:
"necesary" should be "necessary".

mistralrs-core/src/vision_models/conformer/config.rs

[warning] 14-14:
"seperable" should be "separable".

mistralrs-core/src/vision_models/conformer/encoder.rs

[warning] 337-337:
"seperable" should be "separable".


[warning] 334-334:
"seperable" should be "separable".


[warning] 333-333:
"seperable" should be "separable".


[warning] 329-329:
"Seperable" should be "Separable".


[warning] 301-301:
"Seperable" should be "Separable".


[warning] 184-184:
"seperable" should be "separable".


[warning] 169-169:
"Seperable" should be "Separable".


[warning] 164-164:
"Seperable" should be "Separable".

⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: Clippy
  • GitHub Check: Test Suite (windows-latest, stable)
  • GitHub Check: Test Suite (macOS-latest, stable)
  • GitHub Check: Test Suite (ubuntu-latest, stable)
🔇 Additional comments (16)
mistralrs-core/src/vision_models/phi4/mm_embedding.rs (3)

7-7: LGTM: Proper import addition

The import correctly follows the existing pattern and brings in the necessary AudioEmbedding type.


12-12: LGTM: Consistent pattern for audio embedding integration

The audio embedding field and initialization follow the same pattern as the image embedding, maintaining consistency in the codebase structure.

Also applies to: 34-42, 46-46


52-79: Audio embedding is not utilized in the forward method

The forward method only processes image embeddings but doesn't handle the audio embedding that was added to the struct. This suggests incomplete integration.

Should the forward method also process audio embeddings, or is audio processing handled elsewhere in the pipeline? If audio embeddings should be processed here, the method needs to be extended to handle audio inputs similar to how it handles image inputs.

mistralrs-core/src/vision_models/phi4/config.rs (3)

6-9: LGTM: Proper imports for audio configuration

The import additions correctly bring in the necessary types, including the ConformerEncoderConfig dependency.


31-38: LGTM: Well-structured audio embedding configuration

The Phi4MMAudioEmbedConfig struct provides the necessary parameters for audio embedding configuration and is properly integrated into the embedding layer config.

Also applies to: 43-43


52-56: LGTM: Audio processor configuration integration

The Phi4MMAudioConfig struct properly encapsulates the Conformer encoder configuration and is well-integrated into the main Phi4MM config.

Also applies to: 86-86

mistralrs-core/src/vision_models/phi4/audio_embedding.rs (1)

40-44: LGTM: Proper validation and encoder initialization

The validation ensures the required "cascades" audio processor is present, and the encoder initialization follows the established patterns.

mistralrs-core/src/vision_models/conformer/nemo.rs (3)

36-37: LGTM: Proper configuration assertions

The assertions ensure the implementation handles only the supported configuration (non-causal, dw_striding), which is good defensive programming.


47-125: LGTM: Well-structured convolutional layer construction

The layer construction logic properly implements depthwise separable convolutions with appropriate configuration and follows the expected patterns for neural network module initialization.


149-183: LGTM: Comprehensive forward implementation with mask handling

The forward method properly handles tensor transformations, applies the convolutional layers, and correctly computes updated masks for the subsampled output. The mask computation logic correctly scales the lengths according to the subsampling factor.
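
A small sketch of the mask side of that logic, assuming per-row valid lengths scaled by floor division (hypothetical helper, not the PR's code):

use candle_core::{Device, Result, Tensor};

// Rebuild a boolean frame mask from per-row subsampled lengths.
fn lengths_to_mask(lengths: &[usize], max_len: usize, dev: &Device) -> Result<Tensor> {
    let flat: Vec<u8> = lengths
        .iter()
        .flat_map(|&l| (0..max_len).map(move |t| u8::from(t < l)))
        .collect();
    Tensor::from_vec(flat, (lengths.len(), max_len), dev)
}

fn main() -> Result<()> {
    // 10 and 7 input frames at subsampling factor 4 -> lengths 2 and 1.
    let mask = lengths_to_mask(&[10 / 4, 7 / 4], 3, &Device::Cpu)?;
    assert_eq!(mask.to_vec2::<u8>()?, vec![vec![1, 1, 0], vec![1, 0, 0]]);
    Ok(())
}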

mistralrs-core/src/vision_models/conformer/config.rs (1)

129-141: Implementation looks good!

The sentinel value approach using usize::MAX to detect uninitialized fields works correctly. While using Option<usize> might be more idiomatic, the current implementation is functional and clear enough with the context.

mistralrs-core/src/vision_models/conformer/encoder.rs (5)

124-162: Well-implemented feed-forward module!

The GLU-based feed-forward implementation correctly follows the Conformer architecture pattern with proper normalization and gating.


463-518: Excellent Conformer layer implementation!

The encoder layer correctly implements the Conformer block architecture with proper 0.5 scaling for feed-forward modules and appropriate residual connections throughout.


520-537: Clean normalization implementation!

The encoder embedding correctly normalizes inputs using global statistics.


589-641: Well-designed encoder forward pass!

The implementation correctly handles:

  • Input normalization and subsampling
  • Sequence unfolding for long inputs (>500 tokens)
  • Positional encoding and relative attention bias
  • Proper restoration of original sequence shape

The chunking strategy for long sequences is particularly well thought out.


644-709: Excellent utility function with comprehensive tests!

The unfold_tensor function correctly implements sequence chunking with proper dimension handling, and the unit tests provide good coverage of various input scenarios.

Comment on lines +59 to +87
            let mut layers_for_speech: Vec<Arc<dyn Module + Send + Sync>> =
                vec![Arc::new(layers::linear(
                    audio_dim_out * linear_downsample_rate,
                    dim_projection,
                    embedding_cls_vb.pp("speech").pp(0),
                )?)];
            for i in 1..depth {
                layers_for_speech.push(Arc::new(Activation::Gelu));
                layers_for_speech.push(Arc::new(layers::linear(
                    dim_projection,
                    dim_projection,
                    embedding_cls_vb.pp("speech").pp(i + 1),
                )?));
            }

            let mut layers_for_vision: Vec<Arc<dyn Module + Send + Sync>> =
                vec![Arc::new(layers::linear(
                    audio_dim_out * linear_downsample_rate,
                    dim_projection,
                    embedding_cls_vb.pp("vision").pp(0),
                )?)];
            for i in 1..depth {
                layers_for_vision.push(Arc::new(Activation::Gelu));
                layers_for_vision.push(Arc::new(layers::linear(
                    dim_projection,
                    dim_projection,
                    embedding_cls_vb.pp("vision").pp(i + 1),
                )?));
            }

🛠️ Refactor suggestion

Eliminate code duplication in MLP creation

The MLP creation logic is duplicated between speech and vision modes, violating the DRY principle.

Extract the MLP creation into a helper function:

+fn create_mlp_layers(
+    input_dim: usize,
+    output_dim: usize,
+    depth: usize,
+    vb: ShardedVarBuilder,
+) -> Result<Vec<Arc<dyn Module + Send + Sync>>> {
+    let mut layers: Vec<Arc<dyn Module + Send + Sync>> = vec![Arc::new(layers::linear(
+        input_dim,
+        output_dim,
+        vb.pp(0),
+    )?)];
+    
+    for i in 1..depth {
+        layers.push(Arc::new(Activation::Gelu));
+        layers.push(Arc::new(layers::linear(
+            output_dim,
+            output_dim,
+            vb.pp(i + 1),
+        )?));
+    }
+    
+    Ok(layers)
+}

-            let mut layers_for_speech: Vec<Arc<dyn Module + Send + Sync>> =
-                vec![Arc::new(layers::linear(
-                    audio_dim_out * linear_downsample_rate,
-                    dim_projection,
-                    embedding_cls_vb.pp("speech").pp(0),
-                )?)];
-            for i in 1..depth {
-                layers_for_speech.push(Arc::new(Activation::Gelu));
-                layers_for_speech.push(Arc::new(layers::linear(
-                    dim_projection,
-                    dim_projection,
-                    embedding_cls_vb.pp("speech").pp(i + 1),
-                )?));
-            }
-
-            let mut layers_for_vision: Vec<Arc<dyn Module + Send + Sync>> =
-                vec![Arc::new(layers::linear(
-                    audio_dim_out * linear_downsample_rate,
-                    dim_projection,
-                    embedding_cls_vb.pp("vision").pp(0),
-                )?)];
-            for i in 1..depth {
-                layers_for_vision.push(Arc::new(Activation::Gelu));
-                layers_for_vision.push(Arc::new(layers::linear(
-                    dim_projection,
-                    dim_projection,
-                    embedding_cls_vb.pp("vision").pp(i + 1),
-                )?));
-            }
+            let layers_for_speech = create_mlp_layers(
+                audio_dim_out * linear_downsample_rate,
+                dim_projection,
+                depth,
+                embedding_cls_vb.pp("speech"),
+            )?;
+            
+            let layers_for_vision = create_mlp_layers(
+                audio_dim_out * linear_downsample_rate,
+                dim_projection,
+                depth,
+                embedding_cls_vb.pp("vision"),
+            )?;
🤖 Prompt for AI Agents
In mistralrs-core/src/vision_models/phi4/audio_embedding.rs between lines 59 and
87, the MLP creation code for speech and vision is duplicated. To fix this,
extract the repeated logic into a helper function that takes parameters like the
mode ("speech" or "vision") and returns the constructed Vec of Arc<dyn Module +
Send + Sync>. Then replace the duplicated code with calls to this helper
function for each mode.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (7)
mistralrs-core/src/vision_models/conformer/nemo.rs (1)

19-23: ⚠️ Potential issue

Validate subsampling factor is a power of 2

The validation only checks if the factor is even, but log2 requires it to be a power of 2.
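
A minimal sketch of the stricter check, using the standard library's is_power_of_two (field name taken from the snippet quoted below):

if !cfg.subsampling_factor.is_power_of_two() {
    // A power-of-2 check subsumes the evenness check and makes log2 exact.
    candle_core::bail!(
        "subsampling_factor must be a power of 2, got {}",
        cfg.subsampling_factor
    );
}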

mistralrs-core/src/vision_models/phi4/audio_embedding.rs (1)

62-90: 🛠️ Refactor suggestion

Eliminate code duplication in MLP creation

The MLP creation logic is duplicated between speech and vision modes.

mistralrs-core/src/vision_models/conformer/encoder.rs (5)

164-164: Fix typo in struct name

The struct name contains a typo - "Seperable" should be "Separable".

Also applies to: 169-170

🧰 Tools
🪛 GitHub Check: Typos

[warning] 164-164:
"Seperable" should be "Separable".


273-276: Fix incorrect kernel size reference in causal padding removal

The code uses self.cfg.kernel_size but should use self.cfg.ext_pw_kernel_size to match the convolution configuration.


301-301: Fix typo in field and function names

The field name and function call contain typos - "Seperable" should be "Separable".

Also applies to: 329-329

🧰 Tools
🪛 GitHub Check: Typos

[warning] 301-301:
"Seperable" should be "Separable".


333-335: Fix typos in configuration field references

The configuration field names contain typos - "seperable" should be "separable".

🧰 Tools
🪛 GitHub Check: Typos

[warning] 334-334:
"seperable" should be "separable".


[warning] 333-333:
"seperable" should be "separable".


396-396: Improve assertion with descriptive error message

The assertion should provide a clear error message explaining why this configuration is not supported.

🧹 Nitpick comments (5)
mistralrs-core/src/vision_models/conformer/nemo.rs (1)

37-41: Remove commented code.

Dead code should be removed rather than commented out.

-// let max_cache_len = if cfg.is_causal {
-//     cfg.subsampling_factor+1
-// } else {
-//     0
-// };
mistralrs-core/src/vision_models/phi4/audio_embedding.rs (2)

23-23: Fix typo in comment.

-/// If vision + speech or only vision (not sure why that is necesary though)
+/// If vision + speech or only vision (not sure why that is necessary though)
🧰 Tools
🪛 GitHub Check: Typos

[warning] 23-23:
"necesary" should be "necessary".


17-17: Use integer type for token ID.

Token IDs should be integers for type safety and clarity.

-pub(super) const AUDIO_SPECIAL_TOKEN_ID: f64 = 200011.;
+pub(super) const AUDIO_SPECIAL_TOKEN_ID: u32 = 200011;

You'll also need to update the comparison at line 150:

-let positions = input_ids.eq(AUDIO_SPECIAL_TOKEN_ID)?.nonzero()?;
+let positions = input_ids.eq(AUDIO_SPECIAL_TOKEN_ID as f64)?.nonzero()?;
mistralrs-core/src/vision_models/conformer/encoder.rs (2)

418-421: Remove duplicated causal padding logic

This causal padding removal logic duplicates what's already done in GLUPointWiseConv::forward(). Since the GLU module already handles causal padding for ext_pw_kernel_size, this code appears redundant.

Consider removing this duplicated logic:

-        if self.cfg.causal && self.cfg.ext_pw_kernel_size > 1 {
-            let seq_len = x.dim(1)?;
-            x = x.i((.., ..(seq_len - (self.cfg.ext_pw_kernel_size - 1)), ..))?;
-        }

556-559: Simplify complex assertion for better readability

The nested assertion with is_some_and is hard to read and doesn't provide a clear error message.

Consider simplifying:

-        assert!(cfg
-            .relative_attention_bias_args
-            .as_ref()
-            .is_some_and(|x| x.tp == "t5"));
+        let bias_args = cfg.relative_attention_bias_args.as_ref()
+            .ok_or_else(|| candle_core::Error::Msg("relative_attention_bias_args is required".to_string()))?;
+        if bias_args.tp != "t5" {
+            candle_core::bail!("Only 't5' relative attention bias type is supported, got: {}", bias_args.tp);
+        }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3907feb and 8e10e52.

📒 Files selected for processing (6)
  • mistralrs-core/src/vision_models/conformer/encoder.rs (1 hunks)
  • mistralrs-core/src/vision_models/conformer/nemo.rs (1 hunks)
  • mistralrs-core/src/vision_models/phi4/audio_embedding.rs (1 hunks)
  • mistralrs-core/src/vision_models/phi4/inputs_processor.rs (2 hunks)
  • mistralrs-core/src/vision_models/phi4/mm_embedding.rs (3 hunks)
  • mistralrs-core/src/vision_models/phi4/mod.rs (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • mistralrs-core/src/vision_models/phi4/mm_embedding.rs
  • mistralrs-core/src/vision_models/phi4/mod.rs
🧰 Additional context used
🧬 Code Graph Analysis (2)
mistralrs-core/src/vision_models/phi4/audio_embedding.rs (3)
mistralrs-core/src/vision_models/phi4/mm_embedding.rs (2)
  • new (25-57)
  • forward (59-116)
mistralrs-core/src/vision_models/conformer/encoder.rs (16)
  • new (30-66)
  • new (132-148)
  • new (170-202)
  • new (222-265)
  • new (310-411)
  • new (473-489)
  • new (526-531)
  • new (548-587)
  • forward (68-121)
  • forward (150-161)
  • forward (204-211)
  • forward (267-293)
  • forward (413-460)
  • forward (490-517)
  • forward (533-536)
  • forward (589-641)
mistralrs-core/src/utils/unvarbuilder.rs (1)
  • pp (128-130)
mistralrs-core/src/vision_models/conformer/encoder.rs (5)
mistralrs-core/src/diffusion_models/flux/model.rs (1)
  • attention (79-84)
mistralrs-core/src/vision_models/conformer/nemo.rs (2)
  • new (18-119)
  • forward (142-176)
mistralrs-core/src/vision_models/phi4/audio_embedding.rs (2)
  • new (35-102)
  • forward (131-214)
mistralrs-core/src/vision_models/conformer/pos_embed.rs (4)
  • new (15-60)
  • new (79-98)
  • forward (62-68)
  • forward (100-131)
mistralrs-core/src/layers.rs (29)
  • new (209-212)
  • new (251-256)
  • new (280-286)
  • new (530-561)
  • new (981-997)
  • new (1058-1074)
  • new (1274-1301)
  • new (1456-1466)
  • new (1585-1599)
  • new (1617-1637)
  • new (1666-1680)
  • new (1765-1791)
  • linear (178-182)
  • forward (239-241)
  • forward (264-270)
  • forward (288-290)
  • forward (579-654)
  • forward (963-970)
  • forward (1039-1048)
  • forward (1115-1124)
  • forward (1303-1338)
  • forward (1484-1534)
  • forward (1601-1608)
  • forward (1639-1656)
  • forward (1740-1754)
  • layer_norm (56-69)
  • linear_b (189-200)
  • conv1d (149-162)
  • batch_norm (71-97)
🪛 GitHub Check: Typos
mistralrs-core/src/vision_models/phi4/audio_embedding.rs

[warning] 23-23:
"necesary" should be "necessary".

mistralrs-core/src/vision_models/conformer/encoder.rs

[warning] 337-337:
"seperable" should be "separable".


[warning] 334-334:
"seperable" should be "separable".


[warning] 333-333:
"seperable" should be "separable".


[warning] 329-329:
"Seperable" should be "Separable".


[warning] 301-301:
"Seperable" should be "Separable".


[warning] 184-184:
"seperable" should be "separable".


[warning] 169-169:
"Seperable" should be "Separable".


[warning] 164-164:
"Seperable" should be "Separable".

⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Test Suite (ubuntu-latest, stable)
  • GitHub Check: Test Suite (macOS-latest, stable)
  • GitHub Check: Test Suite (windows-latest, stable)
🔇 Additional comments (1)
mistralrs-core/src/vision_models/phi4/inputs_processor.rs (1)

209-211: Audio fields properly added as placeholders.

The addition of audio embedding fields follows the existing pattern and correctly initializes them to None. The TODO comments clearly indicate these are intentional placeholders for future implementation.

Also applies to: 335-337

Comment on lines 32 to 33
assert_eq!(cfg.is_causal, false);
assert_eq!(cfg.subsampling, "dw_striding");

⚠️ Potential issue

Replace assertions with proper error handling.

Using assert_eq! for runtime validation will cause panics. Convert to proper error handling for better error messages.

-assert_eq!(cfg.is_causal, false);
-assert_eq!(cfg.subsampling, "dw_striding");
+if cfg.is_causal {
+    candle_core::bail!("Causal mode is not supported");
+}
+if cfg.subsampling != "dw_striding" {
+    candle_core::bail!("Only 'dw_striding' subsampling is supported, got '{}'", cfg.subsampling);
+}
🤖 Prompt for AI Agents
In mistralrs-core/src/vision_models/conformer/nemo.rs at lines 32 to 33, replace
the assert_eq! macros used for runtime validation with proper error handling.
Instead of panicking on mismatch, check the conditions and return or propagate
an error with a clear message indicating which condition failed, ensuring the
program handles invalid states gracefully without crashing.

Comment on lines +157 to +165
let input_embeds = if input_embeds.device().same_device(&target_device)
|| input_embeds.dtype() != target_dtype
{
input_embeds
.to_device(&target_device)?
.to_dtype(target_dtype)?
} else {
input_embeds.clone()
};

⚠️ Potential issue

Fix device/dtype conversion logic error.

The condition logic is inverted - it checks if devices are the same when it should check if they're different.

-let input_embeds = if input_embeds.device().same_device(&target_device)
-    || input_embeds.dtype() != target_dtype
-{
+let input_embeds = if !input_embeds.device().same_device(&target_device)
+    || input_embeds.dtype() != target_dtype
+{
🤖 Prompt for AI Agents
In mistralrs-core/src/vision_models/phi4/audio_embedding.rs around lines 157 to
165, the condition incorrectly checks if the input_embeds device is the same as
the target_device to decide on conversion, but it should check if they are
different. Update the condition to trigger device and dtype conversion only when
the devices differ or the dtype is not the target dtype; otherwise, clone
input_embeds without conversion.

Comment on lines 186 to 190
let mut audio_sets = Vec::new();
for (i, size) in audio_embed_sizes.into_iter().enumerate() {
audio_sets.push(audio_set_tensor.i((i, size, ..))?);
}
let merged_audio_set_tensor = Tensor::cat(&audio_sets, 0)?;

⚠️ Potential issue

Fix incorrect tensor slicing logic.

The current code incorrectly indexes at position `size` instead of slicing a range of length `size`, and it does not track the cumulative start position of each segment.

-let mut audio_sets = Vec::new();
-for (i, size) in audio_embed_sizes.into_iter().enumerate() {
-    audio_sets.push(audio_set_tensor.i((i, size, ..))?);
-}
-let merged_audio_set_tensor = Tensor::cat(&audio_sets, 0)?;
+let mut audio_sets = Vec::new();
+let mut start = 0;
+for size in audio_embed_sizes.into_iter() {
+    audio_sets.push(audio_set_tensor.i((0, start..start + size, ..))?);
+    start += size;
+}
+let merged_audio_set_tensor = Tensor::cat(&audio_sets, 1)?;

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In mistralrs-core/src/vision_models/phi4/audio_embedding.rs around lines 186 to
190, the code incorrectly slices the tensor using a single index 'size' instead
of a range, and does not account for cumulative positions. Fix this by tracking
the cumulative start index for each slice and slicing the tensor over the range
from the cumulative start to start plus size for each audio embedding, ensuring
correct segments are extracted before concatenation.


impl ConformerEncoder {
pub fn new(mut cfg: ConformerEncoderConfig, vb: ShardedVarBuilder) -> Result<Self> {
assert_eq!(cfg.input_layer, "nemo_conv");

⚠️ Potential issue

Replace assert with proper error handling

Using assert_eq! for input validation will cause a panic instead of returning a proper error. This prevents graceful error handling by callers.

Replace with proper error handling:

-        assert_eq!(cfg.input_layer, "nemo_conv");
+        if cfg.input_layer != "nemo_conv" {
+            candle_core::bail!("Only 'nemo_conv' input layer is supported, got: {}", cfg.input_layer);
+        }
🤖 Prompt for AI Agents
In mistralrs-core/src/vision_models/conformer/encoder.rs at line 549, replace
the assert_eq! macro used for input validation with proper error handling that
returns a Result or an error type instead of panicking. Modify the code to check
if cfg.input_layer equals "nemo_conv" and if not, return an appropriate error to
allow callers to handle the failure gracefully.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🔭 Outside diff range comments (1)
mistralrs-core/src/vision_models/phi4/mm_embedding.rs (1)

118-127: 💡 Verification agent

🧩 Analysis chain

Verify if audio embeddings should be included in residual tensors.

The residual_tensors method only includes image embeddings but not audio embeddings. Please verify if this is intentional or if audio embeddings should also be included.


🏁 Script executed:

#!/bin/bash
# Check if other embedding types include residual tensors
ast-grep --pattern 'fn residual_tensors($$$) {
  $$$
}'

Length of output: 57


🏁 Script executed:

#!/bin/bash
# List all implementations of residual_tensors in the repo
rg "residual_tensors" -n

# Find every occurrence of audio_embed to see where it’s defined or used
rg "audio_embed" -n

Length of output: 19848


🏁 Script executed:

#!/bin/bash
# Print the first 200 lines of mm_embedding.rs to inspect audio_embed fields and residual_tensors
sed -n '1,200p' mistralrs-core/src/vision_models/phi4/mm_embedding.rs

Length of output: 4261


🏁 Script executed:

#!/bin/bash
# Check if audio_embedding implements residual_tensors method
rg "fn residual_tensors" -n mistralrs-core/src/vision_models/phi4/audio_embedding.rs || echo "No residual_tensors in audio_embedding.rs"

Length of output: 179


Implement and include audio residual_tensors

The AudioEmbedding type doesn’t provide a residual_tensors method, so Phi4MMImageAudioEmbedding::residual_tensors can only collect the image branch. To support saving audio-branch parameters, you need to:

  • In mistralrs-core/src/vision_models/phi4/audio_embedding.rs, add:
    impl AudioEmbedding {
        pub fn residual_tensors(&self) -> Vec<(String, Tensor)> {
            let uvb = UnVarBuilder::new();
        // collect all audio-embedding vars, e.g.
            // uvb.pp("proj").extend(self.projection.residual_tensors());
            uvb.to_safetensors()
        }
    }
  • In mistralrs-core/src/vision_models/phi4/mm_embedding.rs, update residual_tensors:
         pub fn residual_tensors(&self) -> Vec<(String, Tensor)> {
             let uvb = UnVarBuilder::new();
    
             if let Some(image_embed) = &self.image_embed {
                 uvb.pp("image_embed").extend(image_embed.residual_tensors());
             }
    +        if let Some(audio_embed) = &self.audio_embed {
    +            uvb.pp("audio_embed").extend(audio_embed.residual_tensors());
    +        }
    
             uvb.to_safetensors()
         }
♻️ Duplicate comments (7)
mistralrs-core/src/vision_models/phi4/audio_embedding.rs (3)

62-90: 🛠️ Refactor suggestion

Eliminate code duplication in MLP creation.

The MLP creation logic is duplicated between speech and vision modes, violating the DRY principle.


155-163: ⚠️ Potential issue

Fix device/dtype conversion logic error.

The condition logic is inverted - it checks if devices are the same when it should check if they're different.


186-187: ⚠️ Potential issue

Fix incorrect tensor slicing logic.

The current code incorrectly uses batch index i to slice the tensor. The audio_set_tensor should be sliced along the sequence dimension to extract segments of the specified sizes.

mistralrs-core/src/vision_models/conformer/encoder.rs (4)

165-170: Fix typos in struct and implementation names

🧰 Tools
🪛 GitHub Check: Typos

[warning] 170-170:
"Seperable" should be "Separable".


[warning] 165-165:
"Seperable" should be "Separable".


185-185: Fix typos in configuration field references

Also applies to: 328-332

🧰 Tools
🪛 GitHub Check: Typos

[warning] 185-185:
"seperable" should be "separable".


391-391: Improve assertion with descriptive error message


544-544: Replace assert with proper error handling

🧹 Nitpick comments (3)
mistralrs-core/src/vision_models/phi4/mm_embedding.rs (1)

96-96: Rename variable for clarity.

The variable name non_image_position_mask is misleading since it specifically represents audio token positions.

-        let non_image_position_mask = input_ids.eq(AUDIO_SPECIAL_TOKEN_ID)?;
+        let audio_position_mask = input_ids.eq(AUDIO_SPECIAL_TOKEN_ID)?;
mistralrs-core/src/vision_models/phi4/inputs_processor.rs (1)

391-428: Replace placeholder audio loading with proper implementation.

The load_dummy_audio function loads from a hardcoded file "dummy_audio.wav", which appears to be placeholder code. This should be replaced with proper audio loading logic that handles actual audio inputs from sequences.

Would you like me to help implement proper audio loading that extracts audio data from the input sequences?

mistralrs-core/src/vision_models/phi4/audio_embedding.rs (1)

23-23: Fix typo in comment.

-    /// If vision + speech or only vision (not sure why that is necesary though)
+    /// If vision + speech or only vision (not sure why that is necessary though)
🧰 Tools
🪛 GitHub Check: Typos

[warning] 23-23:
"necesary" should be "necessary".

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d7ce884 and d6f4e99.

📒 Files selected for processing (6)
  • mistralrs-core/src/vision_models/conformer/encoder.rs (1 hunks)
  • mistralrs-core/src/vision_models/conformer/pos_embed.rs (1 hunks)
  • mistralrs-core/src/vision_models/phi4/audio_embedding.rs (1 hunks)
  • mistralrs-core/src/vision_models/phi4/inputs_processor.rs (10 hunks)
  • mistralrs-core/src/vision_models/phi4/mm_embedding.rs (3 hunks)
  • mistralrs-core/src/vision_models/preprocessor_config.rs (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • mistralrs-core/src/vision_models/preprocessor_config.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • mistralrs-core/src/vision_models/conformer/pos_embed.rs
🧰 Additional context used
🧬 Code Graph Analysis (1)
mistralrs-core/src/vision_models/conformer/encoder.rs (4)
mistralrs-core/src/diffusion_models/flux/model.rs (1)
  • attention (79-84)
mistralrs-core/src/vision_models/conformer/pos_embed.rs (4)
  • new (15-60)
  • new (79-98)
  • forward (62-68)
  • forward (100-131)
mistralrs-core/src/vision_models/conformer/nemo.rs (2)
  • new (18-119)
  • forward (142-176)
mistralrs-core/src/layers.rs (29)
  • new (209-212)
  • new (251-256)
  • new (280-286)
  • new (530-561)
  • new (981-997)
  • new (1058-1074)
  • new (1274-1301)
  • new (1456-1466)
  • new (1585-1599)
  • new (1617-1637)
  • new (1666-1680)
  • new (1765-1791)
  • linear (178-182)
  • forward (239-241)
  • forward (264-270)
  • forward (288-290)
  • forward (579-654)
  • forward (963-970)
  • forward (1039-1048)
  • forward (1115-1124)
  • forward (1303-1338)
  • forward (1484-1534)
  • forward (1601-1608)
  • forward (1639-1656)
  • forward (1740-1754)
  • layer_norm (56-69)
  • linear_b (189-200)
  • conv1d (149-162)
  • batch_norm (71-97)
🪛 GitHub Check: Typos
mistralrs-core/src/vision_models/phi4/audio_embedding.rs

[warning] 23-23:
"necesary" should be "necessary".

mistralrs-core/src/vision_models/conformer/encoder.rs

[warning] 332-332:
"seperable" should be "separable".


[warning] 329-329:
"seperable" should be "separable".


[warning] 328-328:
"seperable" should be "separable".


[warning] 324-324:
"Seperable" should be "Separable".


[warning] 296-296:
"Seperable" should be "Separable".


[warning] 185-185:
"seperable" should be "separable".


[warning] 170-170:
"Seperable" should be "Separable".


[warning] 165-165:
"Seperable" should be "Separable".

⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Test Suite (ubuntu-latest, stable)
  • GitHub Check: Test Suite (windows-latest, stable)
  • GitHub Check: Test Suite (macOS-latest, stable)

Comment on lines +199 to +200
let mut linear_index =
((positions_transposed_0 * (hs_l * hs_b) as f64)? + positions_transposed_1)?;

⚠️ Potential issue

Fix incorrect linear index calculation.

The linear index calculation is incorrect. For a tensor of shape (b, l, d) reshaped to (b*l, d), the correct linear index for position (batch, seq_pos) should be batch * l + seq_pos.

-        let mut linear_index =
-            ((positions_transposed_0 * (hs_l * hs_b) as f64)? + positions_transposed_1)?;
+        let mut linear_index =
+            ((positions_transposed_0 * hs_l as f64)? + positions_transposed_1)?;
🤖 Prompt for AI Agents
In mistralrs-core/src/vision_models/phi4/audio_embedding.rs at lines 199-200,
the linear index calculation is incorrect for reshaping a tensor from (b, l, d)
to (b*l, d). Replace the current formula with the correct one: linear_index =
batch * l + seq_pos, where batch and seq_pos correspond to
positions_transposed_0 and positions_transposed_1 respectively, and l is the
sequence length dimension.

Comment on lines +521 to +526
fn new(vb: ShardedVarBuilder) -> Result<Self> {
Ok(Self {
global_invstd: vb.get_unchecked("global_invstd")?,
global_mean: vb.get_unchecked("global_mean")?,
})
}

⚠️ Potential issue

Replace get_unchecked with safe tensor retrieval

Using get_unchecked bypasses safety checks and could cause runtime panics. Consider using the regular get method for safer error handling.

Apply this diff to use safe tensor retrieval:

 fn new(vb: ShardedVarBuilder) -> Result<Self> {
     Ok(Self {
-        global_invstd: vb.get_unchecked("global_invstd")?,
-        global_mean: vb.get_unchecked("global_mean")?,
+        global_invstd: vb.get((1,), "global_invstd")?,
+        global_mean: vb.get((1,), "global_mean")?,
     })
 }

Note: You'll need to verify the correct tensor shapes for these parameters.

🤖 Prompt for AI Agents
In mistralrs-core/src/vision_models/conformer/encoder.rs around lines 521 to
526, replace the use of `get_unchecked` with the safe `get` method when
retrieving tensors from `vb`. This change avoids bypassing safety checks and
prevents potential runtime panics. Update the code to call
`vb.get("global_invstd")?` and `vb.get("global_mean")?` instead, and ensure you
verify and handle the expected tensor shapes appropriately after retrieval.

Comment on lines +551 to +554
assert!(cfg
.relative_attention_bias_args
.as_ref()
.is_some_and(|x| x.tp == "t5"));

⚠️ Potential issue

Replace assert with proper error handling for bias validation

Using assert! for validation will panic instead of returning a proper error.

Replace with proper error handling:

-        assert!(cfg
-            .relative_attention_bias_args
-            .as_ref()
-            .is_some_and(|x| x.tp == "t5"));
+        if !cfg
+            .relative_attention_bias_args
+            .as_ref()
+            .is_some_and(|x| x.tp == "t5")
+        {
+            candle_core::bail!("ConformerEncoder requires relative_attention_bias_args with type 't5'");
+        }
🤖 Prompt for AI Agents
In mistralrs-core/src/vision_models/conformer/encoder.rs around lines 551 to
554, replace the assert! macro used for validating relative_attention_bias_args
with proper error handling that returns a Result or an error type instead of
panicking. Check if relative_attention_bias_args is Some and its tp field equals
"t5"; if not, return an appropriate error to handle the validation failure
gracefully.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (10)
mistralrs-core/src/vision_models/conformer/encoder.rs (6)

32-68: Add validation for attention group size.

The code assumes attention_group_size evenly divides embed_dim, but doesn't validate this assumption.
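
A hedged sketch of the missing validation (field names follow the comment; the exact divisibility target in the PR may differ):

if cfg.attention_group_size == 0 || cfg.embed_dim % cfg.attention_group_size != 0 {
    candle_core::bail!(
        "attention_group_size ({}) must be non-zero and evenly divide embed_dim ({})",
        cfg.attention_group_size,
        cfg.embed_dim
    );
}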


164-164: Fix typos in struct and variable names.

Multiple occurrences of "Seperable" should be "Separable", and "seperable" should be "separable" in configuration field references.

Also applies to: 169-170, 184-184, 293-293, 301-301, 321-321, 325-329

🧰 Tools
🪛 GitHub Check: Typos

[warning] 164-164:
"Seperable" should be "Separable".


384-384: Improve assertion with descriptive error message.

The assertion should provide a clear error message explaining why this configuration is not supported.


514-519: Replace get_unchecked with safe tensor retrieval.

Using get_unchecked bypasses safety checks and could cause runtime panics. Consider using the regular get method for safer error handling.


538-538: Replace assert with proper error handling.

Using assert_eq! for input validation will cause a panic instead of returning a proper error.


545-548: Replace assert with proper error handling for bias validation.

Using assert! for validation will panic instead of returning a proper error.

mistralrs-core/src/vision_models/phi4/inputs_processor.rs (4)

70-82: Add validation for required audio configuration fields.

The code uses .expect() which will panic if these fields are missing. Consider providing default values or returning a proper error.


142-142: Remove hardcoded has_audio value.

The has_audio variable is hardcoded to true, which will incorrectly assume all inputs have audio. This should be determined dynamically based on the presence of audio tokens.
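
One possible shape for the dynamic check (a sketch; the sequence accessor and the cast of the token-id constant are assumptions about the surrounding code):

// Only flag audio when at least one sequence actually contains the
// audio special token.
let has_audio = input_seqs
    .iter()
    .any(|seq| seq.get_toks().contains(&(AUDIO_SPECIAL_TOKEN_ID as u32)));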


399-419: Remove hardcoded dummy audio loading.

The method loads a hardcoded "dummy_audio.wav" file, which appears to be placeholder code. This should load actual audio data from the input sequences.


744-745: Replace dummy audio with actual sequence audio data.

Each sequence that has audio tokens loads the same dummy audio file instead of processing actual audio data associated with the sequence.

🧹 Nitpick comments (1)
mistralrs-core/src/vision_models/phi4/inputs_processor.rs (1)

602-603: Document the preemphasis scaling factor.

The magic number 32768.0 represents the scaling factor for 16-bit audio normalization. Consider adding a constant with documentation:

+// Scaling factor for 16-bit audio normalization
+const AUDIO_SCALE_FACTOR: f32 = 32768.0;
+
 // First sample: y[0] = x[0] * 32768
-preemphasized.push(wav[0] * 32768.0);
+preemphasized.push(wav[0] * AUDIO_SCALE_FACTOR);

 // Remaining samples: y[n] = (x[n] - preemphasis * x[n-1]) * 32768
 for i in 1..wav.len() {
-    let filtered = (wav[i] - preemphasis * wav[i - 1]) * 32768.0;
+    let filtered = (wav[i] - preemphasis * wav[i - 1]) * AUDIO_SCALE_FACTOR;
     preemphasized.push(filtered);
 }

Also applies to: 606-607

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 34ca7d4 and 5083095.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
  • mistralrs-core/Cargo.toml (1 hunks)
  • mistralrs-core/src/vision_models/conformer/encoder.rs (1 hunks)
  • mistralrs-core/src/vision_models/conformer/nemo.rs (1 hunks)
  • mistralrs-core/src/vision_models/conformer/pos_embed.rs (1 hunks)
  • mistralrs-core/src/vision_models/phi4/inputs_processor.rs (10 hunks)
  • mistralrs-core/src/vision_models/phi4/mm_embedding.rs (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • mistralrs-core/Cargo.toml
🚧 Files skipped from review as they are similar to previous changes (3)
  • mistralrs-core/src/vision_models/conformer/nemo.rs
  • mistralrs-core/src/vision_models/phi4/mm_embedding.rs
  • mistralrs-core/src/vision_models/conformer/pos_embed.rs
🧰 Additional context used
🧬 Code Graph Analysis (1)
mistralrs-core/src/vision_models/conformer/encoder.rs (6)
mistralrs-core/src/diffusion_models/flux/model.rs (1)
  • attention (79-84)
mistralrs-core/src/vision_models/phi4/mm_embedding.rs (2)
  • new (25-57)
  • forward (60-117)
mistralrs-core/src/vision_models/conformer/pos_embed.rs (4)
  • new (18-63)
  • new (83-102)
  • forward (66-72)
  • forward (104-139)
mistralrs-core/src/vision_models/conformer/nemo.rs (2)
  • new (20-121)
  • forward (144-178)
mistralrs-core/src/vision_models/phi4/audio_embedding.rs (2)
  • new (35-102)
  • forward (129-212)
mistralrs-core/src/layers.rs (27)
  • new (209-212)
  • new (251-256)
  • new (280-286)
  • new (530-561)
  • new (981-997)
  • new (1058-1074)
  • new (1274-1301)
  • new (1456-1466)
  • new (1585-1599)
  • new (1617-1637)
  • new (1666-1680)
  • linear (178-182)
  • forward (239-241)
  • forward (264-270)
  • forward (288-290)
  • forward (579-654)
  • forward (963-970)
  • forward (1039-1048)
  • forward (1115-1124)
  • forward (1303-1338)
  • forward (1484-1534)
  • forward (1601-1608)
  • forward (1639-1656)
  • layer_norm (56-69)
  • linear_b (189-200)
  • conv1d (149-162)
  • batch_norm (71-97)
🪛 GitHub Check: Typos
mistralrs-core/src/vision_models/conformer/encoder.rs

[warning] 329-329:
"seperable" should be "separable".


[warning] 326-326:
"seperable" should be "separable".


[warning] 325-325:
"seperable" should be "separable".


[warning] 321-321:
"Seperable" should be "Separable".


[warning] 293-293:
"Seperable" should be "Separable".


[warning] 184-184:
"seperable" should be "separable".


[warning] 169-169:
"Seperable" should be "Separable".


[warning] 164-164:
"Seperable" should be "Separable".

⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Test Suite (macOS-latest, stable)
  • GitHub Check: Test Suite (ubuntu-latest, stable)
  • GitHub Check: Test Suite (windows-latest, stable)
🔇 Additional comments (2)
mistralrs-core/src/vision_models/conformer/encoder.rs (2)

606-607: Clarify commented positional encoding code.

The positional encoding application is commented out. Is this intentional or should it be enabled?


633-698: Well-implemented tensor unfolding with comprehensive tests.

The unfold_tensor function correctly handles sequence chunking with proper dimension handling, and the unit tests provide good coverage of different scenarios.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🔭 Outside diff range comments (2)
mistralrs-server/src/interactive_mode.rs (1)

527-583: ⚠️ Potential issue

Support both images and audio in the same message

The current if-else if structure prevents processing both images and audio in the same message. When both are present, only images are processed and audio URLs are ignored.

Refactor to handle both modalities:

-                if !urls_image.is_empty() {
+                if !urls_image.is_empty() || !urls_audio.is_empty() {
                     let mut image_indexes = Vec::new();
+                    let mut audio_indexes = Vec::new();
+                    
                     // Load all images first
                     for url in &urls_image {
                         match util::parse_image_url(url).await {
                             Ok(image) => {
                                 image_indexes.push(images.len());
                                 images.push(image);
                             }
                             Err(e) => {
                                 error!("Failed to read image from URL/path {}: {}", url, e);
                                 continue 'outer;
                             }
                         }
                     }
+                    
+                    // Load all audios
+                    for url in &urls_audio {
+                        match AudioInput::read_wav(url) {
+                            Ok(audio) => {
+                                audio_indexes.push(audios.len());
+                                audios.push(audio);
+                            }
+                            Err(e) => {
+                                error!("Failed to read audio from URL/path {}: {}", url, e);
+                                continue 'outer;
+                            }
+                        }
+                    }
+                    
-                    // Build a single user message with multiple images and then the text
-                    let mut content_vec: Vec<IndexMap<String, Value>> = Vec::new();
-                    // Add one image part per URL
-                    for _ in &urls_image {
-                        content_vec.push(IndexMap::from([(
-                            "type".to_string(),
-                            Value::String("image".to_string()),
-                        )]));
-                    }
-                    // Add the text part once
-                    let text = prefixer.prefix_image(image_indexes, &text_image);
-                    content_vec.push(IndexMap::from([
-                        ("type".to_string(), Value::String("text".to_string())),
-                        ("text".to_string(), Value::String(text.clone())),
-                    ]));
+                    // Apply prefixing for both images and audio
+                    let mut text = text;
+                    if !image_indexes.is_empty() {
+                        text = prefixer.prefix_image(image_indexes, &text);
+                    }
+                    if !audio_indexes.is_empty() {
+                        text = prefixer.prefix_audio(audio_indexes, &text);
+                    }
+                    
                     let mut user_message: IndexMap<String, MessageContent> = IndexMap::new();
                     user_message.insert("role".to_string(), Either::Left("user".to_string()));
-                    user_message.insert("content".to_string(), Either::Right(content_vec));
-                    messages.push(user_message);
-                } else if !urls_audio.is_empty() {
-                    let mut audio_indexes = Vec::new();
-                    // Load all audios first
-                    for url in &urls_audio {
-                        match AudioInput::read_wav(url) {
-                            Ok(audio) => {
-                                audio_indexes.push(audios.len());
-                                audios.push(audio);
-                            }
-                            Err(e) => {
-                                error!("Failed to read audio from URL/path {}: {}", url, e);
-                                continue 'outer;
-                            }
-                        }
-                    }
-
-                    let text = prefixer.prefix_audio(audio_indexes, &text_audio);
-
-                    let mut user_message: IndexMap<String, MessageContent> = IndexMap::new();
-                    user_message.insert("role".to_string(), Either::Left("user".to_string()));
                     user_message.insert("content".to_string(), Either::Left(text));
-
                     messages.push(user_message);
                 } else {
mistralrs/src/messages.rs (1)

186-199: ⚠️ Potential issue

Inconsistent content structure between image and audio messages.

Image messages use a structured content format with type/text objects, while audio messages use a simple string. This inconsistency could cause issues with message processing.

Consider using consistent content structure. If the current implementation is intentional, please add a comment explaining why audio messages don't need the structured format:

         self.messages.push(IndexMap::from([
             ("role".to_string(), Either::Left(role.to_string())),
-            ("content".to_string(), Either::Left(prefixed)),
+            (
+                "content".to_string(),
+                Either::Right(vec![
+                    IndexMap::from([("type".to_string(), Value::String("audio".to_string()))]),
+                    IndexMap::from([
+                        ("type".to_string(), Value::String("text".to_string())),
+                        ("text".to_string(), Value::String(prefixed)),
+                    ]),
+                ]),
+            ),
         ]));

Also applies to: 227-231

♻️ Duplicate comments (5)
mistralrs/src/messages.rs (1)

487-495: Fix incorrect error message in RequestBuilder's add_audio_message.

Same issue as in VisionMessages - the error message references the wrong method name.

-                anyhow::bail!("`add_image_message` expects a vision model.")
+                anyhow::bail!("`add_audio_message` expects a vision model.")
mistralrs-core/src/vision_models/phi4/audio_embedding.rs (4)

62-90: 🛠️ Refactor suggestion

Eliminate code duplication in MLP creation

The MLP creation logic is duplicated between speech and vision modes, violating the DRY principle. This was already identified in a previous review.


155-163: ⚠️ Potential issue

Fix device/dtype conversion logic error

The condition logic is inverted - it checks if devices are the same when it should check if they're different. This was already identified in a previous review.

-let input_embeds = if input_embeds.device().same_device(&target_device)
-    || input_embeds.dtype() != target_dtype
-{
+let input_embeds = if !input_embeds.device().same_device(&target_device)
+    || input_embeds.dtype() != target_dtype
+{

199-200: ⚠️ Potential issue

Fix incorrect linear index calculation

The linear index calculation is incorrect. For a tensor of shape (b, l, d) reshaped to (b*l, d), the correct linear index for position (batch, seq_pos) should be batch * l + seq_pos. This was already identified in a previous review.

-let mut linear_index =
-    ((positions_transposed_0 * (hs_l * hs_b) as f64)? + positions_transposed_1)?;
+let mut linear_index =
+    ((positions_transposed_0 * hs_l as f64)? + positions_transposed_1)?;

185-188: ⚠️ Potential issue

Fix incorrect tensor slicing logic

The current code incorrectly uses enumeration index i as a batch dimension and doesn't track cumulative positions for slicing audio embeddings.

-let mut audio_sets = Vec::new();
-for (i, size) in audio_embed_sizes.into_iter().enumerate() {
-    audio_sets.push(audio_set_tensor.i((i, ..size, ..))?);
-}
-let merged_audio_set_tensor = Tensor::cat(&audio_sets, 0)?;
+let mut audio_sets = Vec::new();
+let mut start = 0;
+for size in audio_embed_sizes.into_iter() {
+    audio_sets.push(audio_set_tensor.i((start..start + size, ..))?);
+    start += size;
+}
+let merged_audio_set_tensor = Tensor::cat(&audio_sets, 0)?;
🧹 Nitpick comments (2)
mistralrs-server/src/interactive_mode.rs (2)

145-145: Consider documenting the WAV-only limitation

The audio regex only matches .wav files. While this aligns with the current AudioInput::read_wav implementation, consider adding a comment to document this limitation for clarity.

+/// Regex string used to extract audio URLs from prompts (currently only supports WAV format).
 const AUDIO_REGEX: &str = r#"((?:https?://|file://)?\S+?\.(?:wav)(?:\?\S+?)?)"#;

113-117: Add audio examples to the help text

The help text mentions audio support but only provides image examples. Consider adding an audio example for clarity.

 To specify a message with one or more images or audios, simply include the image/audio URL or path:
 
 - `Please describe this image: path/to/image1.jpg path/to/image2.png`
 - `What is in this image: <url here>`
+- `Transcribe this audio: path/to/audio.wav`
+- `What sounds are in this recording: https://example.com/audio.wav`
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4ea5b25 and 558eb99.

📒 Files selected for processing (12)
  • .typos.toml (1 hunks)
  • mistralrs-core/src/engine/add_request.rs (3 hunks)
  • mistralrs-core/src/lib.rs (1 hunks)
  • mistralrs-core/src/pipeline/amoe.rs (1 hunks)
  • mistralrs-core/src/request.rs (2 hunks)
  • mistralrs-core/src/sequence.rs (8 hunks)
  • mistralrs-core/src/vision_models/phi4/audio_embedding.rs (1 hunks)
  • mistralrs-core/src/vision_models/phi4/inputs_processor.rs (10 hunks)
  • mistralrs-pyo3/src/lib.rs (1 hunks)
  • mistralrs-server-core/src/chat_completion.rs (1 hunks)
  • mistralrs-server/src/interactive_mode.rs (7 hunks)
  • mistralrs/src/messages.rs (10 hunks)
✅ Files skipped from review due to trivial changes (5)
  • .typos.toml
  • mistralrs-server-core/src/chat_completion.rs
  • mistralrs-core/src/pipeline/amoe.rs
  • mistralrs-pyo3/src/lib.rs
  • mistralrs-core/src/lib.rs
🧰 Additional context used
🧬 Code Graph Analysis (1)
mistralrs-core/src/engine/add_request.rs (1)
mistralrs-core/src/sequence.rs (6)
  • images (238-240)
  • images (307-309)
  • images (1044-1046)
  • audios (200-202)
  • audios (335-337)
  • audios (1064-1066)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Test Suite (windows-latest, stable)
  • GitHub Check: Clippy
  • GitHub Check: Check (ubuntu-latest, stable)
  • GitHub Check: Test Suite (ubuntu-latest, stable)
  • GitHub Check: Check (windows-latest, stable)
  • GitHub Check: Docs
  • GitHub Check: Test Suite (macOS-latest, stable)
  • GitHub Check: Check (macOS-latest, stable)
🔇 Additional comments (4)
mistralrs-core/src/engine/add_request.rs (2)

130-139: LGTM!

The audio extraction logic correctly mirrors the existing pattern for image extraction and properly handles the cloning of audio data.


511-512: Consistent parameter passing

The audios parameter is correctly passed to the sequence initialization, maintaining consistency with the images parameter handling.

mistralrs-core/src/request.rs (2)

52-59: Consistent field addition with existing TODO

The audios field is correctly added with the same serialization skip pattern as images. Note the existing TODO comments suggest future serialization support might be planned.


121-128: Well-structured audio input type

The AudioInput struct is appropriately designed with standard audio data fields and necessary trait implementations.
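
For context, the shape implied by the usages later in this review (a sketch; the exact field set and derives live in request.rs):

#[derive(Clone, Debug)]
pub struct AudioInput {
    /// Decoded PCM samples.
    pub samples: Vec<f32>,
    /// Sampling rate in Hz.
    pub sample_rate: u32,
}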

Comment on lines 744 to 752
let (audio_data, sample_rate) = if let Some(mut audios) = seq.take_audios() {
if let Some(audio) = audios.pop() {
(audio.samples, audio.sample_rate)
} else {
self.load_dummy_audio()?
}
} else {
self.load_dummy_audio()?
};

🛠️ Refactor suggestion

Add warning when falling back to dummy audio.

The code silently falls back to dummy audio when no audio data is available. This could mask issues where audio data is expected but missing.

Consider logging a warning when falling back to dummy audio:

                 let (audio_data, sample_rate) = if let Some(mut audios) = seq.take_audios() {
                     if let Some(audio) = audios.pop() {
                         (audio.samples, audio.sample_rate)
                     } else {
+                        warn!("No audio data found in sequence, using dummy audio");
                         self.load_dummy_audio()?
                     }
                 } else {
+                    warn!("No audio inputs in sequence, using dummy audio");
                     self.load_dummy_audio()?
                 };
🤖 Prompt for AI Agents
In mistralrs-core/src/vision_models/phi4/inputs_processor.rs around lines 744 to
752, the code falls back to loading dummy audio silently when no audio data is
available, which can hide potential issues. Add a warning log message before
each call to self.load_dummy_audio() to indicate that the fallback to dummy
audio is occurring, helping to surface unexpected missing audio data during
execution.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (5)
mistralrs-core/src/request.rs (1)

142-148: ⚠️ Potential issue

Fix error type inconsistency (duplicate issue)

The error handling still uses candle_core::Error::Msg instead of anyhow::Error::msg, which is inconsistent with the rest of the method.

Apply this fix:

-                        .map_err(|e| candle_core::Error::Msg(e.to_string()))
+                        .map_err(|e| anyhow::Error::msg(e.to_string()))
mistralrs-core/src/vision_models/phi4/audio_embedding.rs (3)

57-85: 🛠️ Refactor suggestion

Eliminate code duplication in MLP creation

The MLP creation logic is duplicated between speech and vision modes, violating the DRY principle.

Extract the MLP creation into a helper function:

+fn create_mlp_layers(
+    input_dim: usize,
+    output_dim: usize,
+    depth: usize,
+    vb: ShardedVarBuilder,
+) -> Result<Vec<Arc<dyn Module + Send + Sync>>> {
+    let mut layers: Vec<Arc<dyn Module + Send + Sync>> = vec![Arc::new(layers::linear(
+        input_dim,
+        output_dim,
+        vb.pp(0),
+    )?)];
+    
+    for i in 1..depth {
+        layers.push(Arc::new(Activation::Gelu));
+        layers.push(Arc::new(layers::linear(
+            output_dim,
+            output_dim,
+            vb.pp(i + 1),
+        )?));
+    }
+    
+    Ok(layers)
+}

-            let mut layers_for_speech: Vec<Arc<dyn Module + Send + Sync>> =
-                vec![Arc::new(layers::linear(
-                    audio_dim_out * linear_downsample_rate,
-                    dim_projection,
-                    embedding_cls_vb.pp("speech").pp(0),
-                )?)];
-            for i in 1..depth {
-                layers_for_speech.push(Arc::new(Activation::Gelu));
-                layers_for_speech.push(Arc::new(layers::linear(
-                    dim_projection,
-                    dim_projection,
-                    embedding_cls_vb.pp("speech").pp(i + 1),
-                )?));
-            }
-
-            let mut layers_for_vision: Vec<Arc<dyn Module + Send + Sync>> =
-                vec![Arc::new(layers::linear(
-                    audio_dim_out * linear_downsample_rate,
-                    dim_projection,
-                    embedding_cls_vb.pp("vision").pp(0),
-                )?)];
-            for i in 1..depth {
-                layers_for_vision.push(Arc::new(Activation::Gelu));
-                layers_for_vision.push(Arc::new(layers::linear(
-                    dim_projection,
-                    dim_projection,
-                    embedding_cls_vb.pp("vision").pp(i + 1),
-                )?));
-            }
+            let layers_for_speech = create_mlp_layers(
+                audio_dim_out * linear_downsample_rate,
+                dim_projection,
+                depth,
+                embedding_cls_vb.pp("speech"),
+            )?;
+            
+            let layers_for_vision = create_mlp_layers(
+                audio_dim_out * linear_downsample_rate,
+                dim_projection,
+                depth,
+                embedding_cls_vb.pp("vision"),
+            )?;

147-155: ⚠️ Potential issue

Fix device/dtype conversion logic error

The condition logic is inverted: it checks whether the devices are the same when it should check whether they differ.

-let input_embeds = if input_embeds.device().same_device(&target_device)
-    || input_embeds.dtype() != target_dtype
-{
+let input_embeds = if !input_embeds.device().same_device(&target_device)
+    || input_embeds.dtype() != target_dtype
+{

191-192: ⚠️ Potential issue

Fix incorrect linear index calculation

The linear index calculation is incorrect. For a tensor of shape (b, l, d) reshaped to (b*l, d), the correct linear index for position (batch, seq_pos) should be batch * l + seq_pos.

-        let mut linear_index =
-            ((positions_transposed_0 * (hs_l * hs_b) as f64)? + positions_transposed_1)?;
+        let mut linear_index =
+            ((positions_transposed_0 * hs_l as f64)? + positions_transposed_1)?;
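
As a quick sanity check of the index math, a tensor of shape (b, l, d) flattened to (b*l, d) stores row (batch, seq_pos) at offset batch * l + seq_pos; the following standalone sketch (names are illustrative, not from the codebase) demonstrates this:

fn flat_row_index(batch: usize, seq_pos: usize, l: usize) -> usize {
    batch * l + seq_pos
}

// With b = 2, l = 3, position (1, 2) maps to 1 * 3 + 2 = 5,
// the last row of the 6-row flattened matrix.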
mistralrs-core/src/vision_models/phi4/inputs_processor.rs (1)

70-82: ⚠️ Potential issue

Add validation for required audio configuration fields

The code uses .expect() which will panic if these fields are missing. Consider providing default values or returning a proper error.

-            inputs_processor: Arc::new(Phi4MMInputsProcessor {
-                audio_compression_rate: pre_processor_config
-                    .audio_compression_rate
-                    .expect("audio_compression_rate"),
-                audio_downsample_rate: pre_processor_config
-                    .audio_downsample_rate
-                    .expect("audio_downsample_rate"),
-                audio_feat_stride: pre_processor_config
-                    .audio_feat_stride
-                    .expect("audio_feat_stride"),
-                eightk_method: "fillzero".to_string(), // Default to fillzero
-            }),
+            let audio_compression_rate = pre_processor_config
+                .audio_compression_rate
+                .ok_or_else(|| anyhow::Error::msg("audio_compression_rate is required"))?;
+            let audio_downsample_rate = pre_processor_config
+                .audio_downsample_rate
+                .ok_or_else(|| anyhow::Error::msg("audio_downsample_rate is required"))?;
+            let audio_feat_stride = pre_processor_config
+                .audio_feat_stride
+                .ok_or_else(|| anyhow::Error::msg("audio_feat_stride is required"))?;
+            
+            inputs_processor: Arc::new(Phi4MMInputsProcessor {
+                audio_compression_rate,
+                audio_downsample_rate,
+                audio_feat_stride,
+                eightk_method: "fillzero".to_string(),
+            }),
🧹 Nitpick comments (1)
mistralrs-server-core/src/util.rs (1)

84-122: Consider refactoring to reduce code duplication

The parse_audio_url function duplicates most of the logic from parse_image_url. Consider extracting a generic helper function to handle URL parsing and byte retrieval, then have both functions call it with their respective processors.

Example refactoring approach:

+async fn parse_url_to_bytes(url_unparsed: &str) -> Result<Vec<u8>, anyhow::Error> {
+    // Extract common URL parsing and byte retrieval logic here
+}

 pub async fn parse_audio_url(url_unparsed: &str) -> Result<AudioInput, anyhow::Error> {
-    // ... existing URL parsing logic ...
+    let bytes = parse_url_to_bytes(url_unparsed).await?;
     AudioInput::from_bytes(&bytes)
 }
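
A hedged sketch of what parse_url_to_bytes could look like, assuming the crate already pulls in reqwest and base64 (as the image path does) and that local file paths are an accepted fallback:

async fn parse_url_to_bytes(url_unparsed: &str) -> Result<Vec<u8>, anyhow::Error> {
    if let Some(rest) = url_unparsed.strip_prefix("data:") {
        // data URL: the base64 payload follows the first comma.
        let (_, b64) = rest
            .split_once(',')
            .ok_or_else(|| anyhow::Error::msg("malformed data URL"))?;
        use base64::Engine;
        Ok(base64::engine::general_purpose::STANDARD.decode(b64)?)
    } else if url_unparsed.starts_with("http://") || url_unparsed.starts_with("https://") {
        Ok(reqwest::get(url_unparsed).await?.bytes().await?.to_vec())
    } else {
        // Treat anything else as a local file path.
        Ok(std::fs::read(url_unparsed)?)
    }
}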
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 558eb99 and b81ca71.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (14)
  • Cargo.toml (1 hunks)
  • mistralrs-core/Cargo.toml (1 hunks)
  • mistralrs-core/src/engine/add_request.rs (6 hunks)
  • mistralrs-core/src/prefix_cacher.rs (13 hunks)
  • mistralrs-core/src/request.rs (2 hunks)
  • mistralrs-core/src/sequence.rs (8 hunks)
  • mistralrs-core/src/utils/debug.rs (1 hunks)
  • mistralrs-core/src/vision_models/phi4/audio_embedding.rs (1 hunks)
  • mistralrs-core/src/vision_models/phi4/config.rs (5 hunks)
  • mistralrs-core/src/vision_models/phi4/inputs_processor.rs (10 hunks)
  • mistralrs-core/src/vision_models/phi4/mm_embedding.rs (2 hunks)
  • mistralrs-core/src/vision_models/phi4/mod.rs (5 hunks)
  • mistralrs-server-core/src/util.rs (3 hunks)
  • mistralrs-server/src/interactive_mode.rs (6 hunks)
✅ Files skipped from review due to trivial changes (1)
  • Cargo.toml
🚧 Files skipped from review as they are similar to previous changes (7)
  • mistralrs-core/Cargo.toml
  • mistralrs-core/src/engine/add_request.rs
  • mistralrs-server/src/interactive_mode.rs
  • mistralrs-core/src/vision_models/phi4/mm_embedding.rs
  • mistralrs-core/src/vision_models/phi4/mod.rs
  • mistralrs-core/src/vision_models/phi4/config.rs
  • mistralrs-core/src/sequence.rs
🧰 Additional context used
🧬 Code Graph Analysis (3)
mistralrs-server-core/src/util.rs (1)
mistralrs-core/src/request.rs (1)
  • from_bytes (158-218)
mistralrs-core/src/prefix_cacher.rs (1)
mistralrs-core/src/sequence.rs (2)
  • audio_hashes (346-348)
  • audio_hashes (1075-1077)
mistralrs-core/src/request.rs (2)
mistralrs-core/src/vision_models/phi4/inputs_processor.rs (1)
  • audio (727-731)
mistralrs-core/src/sequence.rs (4)
  • new (181-194)
  • new (223-233)
  • new (274-290)
  • new (1130-1154)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Clippy
  • GitHub Check: Check (windows-latest, stable)
  • GitHub Check: Docs
  • GitHub Check: Check (macOS-latest, stable)
  • GitHub Check: Check (ubuntu-latest, stable)
  • GitHub Check: Test Suite (windows-latest, stable)
  • GitHub Check: Test Suite (macOS-latest, stable)
  • GitHub Check: Test Suite (ubuntu-latest, stable)
🔇 Additional comments (10)
mistralrs-core/src/utils/debug.rs (1)

24-26: LGTM: Appropriate log level adjustment for audio processing

The addition of the symphonia log suppression directive is well-placed and necessary. Audio decoding can generate verbose logs, and setting the level to warn helps maintain clean log output while preserving important messages.
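
For reference, a minimal sketch of the kind of directive involved, assuming tracing-subscriber's EnvFilter is the filtering mechanism (the actual setup in debug.rs may differ):

use tracing_subscriber::EnvFilter;

// Keep the default level but cap symphonia's verbose decode logs at WARN.
fn audio_aware_filter() -> EnvFilter {
    EnvFilter::new("info,symphonia=warn")
}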

mistralrs-server-core/src/util.rs (1)

195-200: Good test coverage for audio URL parsing

The test case appropriately validates the new audio URL parsing functionality using a base64-encoded WAV file and checks essential audio properties.

mistralrs-core/src/request.rs (2)

121-128: Well-designed AudioInput struct with comprehensive audio support

The AudioInput struct provides a clean interface for audio data, and the from_bytes method offers robust support for multiple audio formats through symphonia. The error handling is consistent and appropriate for the various failure modes in audio decoding.

Also applies to: 158-218


55-56: Consistent integration of audio support with existing vision patterns

The addition of the audios field follows the same pattern as the existing images field, maintaining consistency in the multimodal request structure.

mistralrs-core/src/prefix_cacher.rs (3)

60-60: Consistent extension of cache structures for audio support

The addition of audio_hashes and audios_to_keep fields appropriately extends the caching mechanism to support audio inputs alongside existing image support.

Also applies to: 69-69, 85-85, 95-95


255-261: Improved method signature for multimodal cache matching

The updated signature with explicit audio_hashes parameter provides more granular control for cache matching compared to the previous boolean flag approach.


304-325: Robust audio hash matching logic consistent with existing patterns

The audio hash matching implementation correctly mirrors the image hash matching logic, computing the longest matching prefix and integrating seamlessly with the existing cache matching algorithm.

Also applies to: 374-385, 410-448
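
The core of the longest-matching-prefix idea reduces to a zip-and-count over the hash sequences; an illustrative sketch (the real cache entry types in prefix_cacher.rs differ):

fn matching_prefix_len(cached: &[u64], incoming: &[u64]) -> usize {
    cached
        .iter()
        .zip(incoming.iter())
        .take_while(|(a, b)| a == b)
        .count()
}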

mistralrs-core/src/vision_models/phi4/inputs_processor.rs (3)

415-570: Well-implemented audio processing pipeline!

The audio processing implementation is comprehensive and follows best practices:

  • Proper resampling using rubato with appropriate parameters
  • Correct preemphasis filter implementation
  • Well-structured mel spectrogram extraction using FFT
  • Appropriate handling of different sample rates (8kHz/16kHz)

699-805: Robust audio sequence processing implementation!

The implementation correctly handles:

  • Dynamic checking for audio tokens
  • Proper multi-channel to mono conversion
  • Correct padding to maximum sequence length
  • Appropriate attention mask generation for batched sequences

660-679: Correct audio embed size calculation!

The implementation properly handles:

  • Ceiling division for compression rate
  • Two-stage compression (initial and qformer)
  • Correct rounding up for remainder values
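
A small sketch of the two-stage ceiling-division scheme described above; the parameter names are placeholders rather than the actual config fields:

fn audio_embed_size(feat_len: usize, compression_rate: usize, qformer_rate: usize) -> usize {
    // First stage: compress raw feature frames, rounding any remainder up.
    let after_compression = feat_len.div_ceil(compression_rate);
    // Second stage: qformer compression, again rounding up.
    after_compression.div_ceil(qformer_rate)
}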


let mut proj = HashMap::new();
{
assert_eq!(audio_embd_config.projection_cls, "mlp");

⚠️ Potential issue

Replace assertion with proper error handling

Using assert_eq! will cause a panic if the projection class is not "mlp". This should return a proper error instead.

-            assert_eq!(audio_embd_config.projection_cls, "mlp");
+            if audio_embd_config.projection_cls != "mlp" {
+                candle_core::bail!("Only 'mlp' projection class is supported, got: {}", audio_embd_config.projection_cls);
+            }
🤖 Prompt for AI Agents
In mistralrs-core/src/vision_models/phi4/audio_embedding.rs at line 49, replace
the assert_eq! macro that checks if audio_embd_config.projection_cls equals
"mlp" with proper error handling. Instead of panicking, implement a conditional
check that returns a Result with an appropriate error if the condition is not
met, ensuring the function gracefully handles the invalid projection class
scenario.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
mistralrs-core/src/vision_models/phi4/inputs_processor.rs (1)

70-81: Configuration validation needs improvement.

The use of .expect() calls for audio configuration fields will cause panics if these fields are missing. This has been flagged in past reviews but appears unaddressed.

This issue was previously identified and should be addressed by replacing .expect() calls with proper error handling that returns meaningful errors instead of panicking.

🧹 Nitpick comments (1)
mistralrs-audio/src/lib.rs (1)

112-151: Comprehensive test coverage with proper cleanup.

The tests cover both WAV file reading and byte decoding scenarios with appropriate assertions. The file cleanup in the WAV test ensures no temporary files are left behind. Consider using a more robust temporary file mechanism for better cross-platform compatibility.

For better cross-platform compatibility, consider using std::env::temp_dir() instead of the hardcoded /tmp/ path:

-        let mut writer = WavWriter::create("/tmp/test.wav", spec).unwrap();
+        let temp_path = std::env::temp_dir().join("test.wav");
+        let mut writer = WavWriter::create(&temp_path, spec).unwrap();

And update the cleanup accordingly:

-        let input = AudioInput::read_wav("/tmp/test.wav").unwrap();
+        let input = AudioInput::read_wav(&temp_path).unwrap();
        assert_eq!(input.samples.len(), 160);
        assert_eq!(input.sample_rate, 16000);
-        std::fs::remove_file("/tmp/test.wav").unwrap();
+        std::fs::remove_file(&temp_path).unwrap();
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b81ca71 and ff67bf8.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (8)
  • Cargo.toml (3 hunks)
  • mistralrs-audio/Cargo.toml (1 hunks)
  • mistralrs-audio/README.md (1 hunks)
  • mistralrs-audio/src/lib.rs (1 hunks)
  • mistralrs-core/Cargo.toml (1 hunks)
  • mistralrs-core/src/lib.rs (2 hunks)
  • mistralrs-core/src/request.rs (2 hunks)
  • mistralrs-core/src/vision_models/phi4/inputs_processor.rs (10 hunks)
✅ Files skipped from review due to trivial changes (2)
  • mistralrs-audio/README.md
  • mistralrs-audio/Cargo.toml
🚧 Files skipped from review as they are similar to previous changes (4)
  • mistralrs-core/Cargo.toml
  • mistralrs-core/src/lib.rs
  • mistralrs-core/src/request.rs
  • Cargo.toml
🧰 Additional context used
🧬 Code Graph Analysis (1)
mistralrs-core/src/vision_models/phi4/inputs_processor.rs (6)
mistralrs-core/src/sequence.rs (10)
  • new (181-194)
  • new (223-233)
  • new (274-290)
  • new (1130-1154)
  • has_audios (350-354)
  • has_audios (1079-1081)
  • toks (757-757)
  • audios (200-202)
  • audios (342-344)
  • audios (1071-1073)
mistralrs-core/src/vision_models/phi4/audio_embedding.rs (2)
  • new (30-97)
  • audio_embed_sizes (167-167)
mistralrs-core/src/pipeline/mod.rs (1)
  • device (153-153)
mistralrs-core/src/pipeline/vision.rs (1)
  • device (830-832)
mistralrs-core/src/pipeline/speech.rs (1)
  • device (361-363)
mistralrs-core/src/vision_models/phi4/image_embedding.rs (1)
  • device (24-24)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Docs
  • GitHub Check: Clippy
  • GitHub Check: Check (macOS-latest, stable)
  • GitHub Check: Test Suite (macOS-latest, stable)
  • GitHub Check: Check (windows-latest, stable)
  • GitHub Check: Check (ubuntu-latest, stable)
  • GitHub Check: Test Suite (windows-latest, stable)
  • GitHub Check: Test Suite (ubuntu-latest, stable)
🔇 Additional comments (15)
mistralrs-audio/src/lib.rs (4)

13-19: Well-designed public API structure.

The AudioInput struct provides a clean interface with appropriate public fields for audio data handling. The field types are well-chosen for audio processing needs.


23-40: Robust WAV file reading with proper format handling.

The implementation correctly handles both float and integer sample formats with appropriate normalization for i16 samples. The error propagation using ? operator ensures proper error handling throughout the chain.
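
For illustration, the i16 branch of that normalization reduces to a scale by the i16 range; a hedged sketch (the crate's reader also handles float WAVs and propagates hound errors):

fn normalize_i16(samples: &[i16]) -> Vec<f32> {
    samples.iter().map(|&s| s as f32 / i16::MAX as f32).collect()
}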


43-88: Comprehensive audio decoding with good error handling.

The symphonia-based decoding implementation properly handles format probing, track selection, and packet processing. The EOF handling with UnexpectedEof check is appropriate for audio streaming. The cast truncation allowance on line 62 is reasonable for channel count conversion.


91-104: Efficient mono conversion implementation.

The channel averaging logic is mathematically correct and handles the edge case of single-channel audio efficiently by returning a clone. The division by channel count appropriately normalizes the averaged samples.
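
A sketch of that channel-averaging logic on interleaved samples, with the single-channel early return described above:

fn to_mono(samples: &[f32], channels: usize) -> Vec<f32> {
    if channels <= 1 {
        return samples.to_vec(); // nothing to average
    }
    samples
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}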

mistralrs-core/src/vision_models/phi4/inputs_processor.rs (11)

12-17: Appropriate dependencies for audio processing.

The imported crates provide the necessary functionality for audio resampling (rubato), FFT operations (rustfft), and windowing functions (apodize). These are well-established libraries for audio signal processing.


42-57: Well-structured audio processing configuration.

The addition of audio-specific constants and configuration fields to the processor struct provides the necessary parameters for audio feature extraction. The eightk_method field allows flexibility in handling different sample rates.


142-194: Logical audio detection and processing flow.

The implementation correctly detects audio presence and processes audio sequences alongside existing image processing. The fallback to text processing when neither audio nor images are present maintains backward compatibility.


268-291: Consistent token replacement pattern.

The audio token replacement follows the same pattern as image token replacement, ensuring consistency in the token processing pipeline. The regex-based replacement is appropriate for handling various audio token formats.


293-331: Proper integration of audio processing.

The audio processing integration maintains the same structure as image processing, with appropriate error handling and token expansion logic. The audio embed size handling parallels the image token count handling.


399-413: Well-structured audio feature extraction interface.

The extract_audio_features method provides a clean interface that encapsulates the audio processing pipeline, from resampling to mel spectrogram extraction.


415-487: Robust audio resampling implementation.

The resampling logic properly handles different sample rates and uses the high-quality rubato library with appropriate parameters. The error handling and fallback logic for unsupported rates is well-implemented.
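
For orientation, a hedged sketch of 8 kHz -> 16 kHz upsampling with rubato's FFT-based resampler; the actual inputs_processor may configure rubato differently (e.g. a sinc resampler with other parameters):

use rubato::{FftFixedIn, Resampler};

fn upsample_8k_to_16k(mono: &[f32]) -> Result<Vec<f32>, Box<dyn std::error::Error>> {
    // Process the whole clip as a single fixed-size input chunk, mono only.
    let mut resampler = FftFixedIn::<f32>::new(8_000, 16_000, mono.len(), 1, 1)?;
    let out = resampler.process(&[mono.to_vec()], None)?;
    Ok(out.into_iter().next().unwrap_or_default())
}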


489-570: Comprehensive mel spectrogram extraction.

The FFT-based mel spectrogram implementation is mathematically sound with proper windowing, preemphasis, and mel filterbank application. The handling of different sample rates (8kHz vs 16kHz) with appropriate parameters is correct.


572-589: Correct preemphasis filter implementation.

The preemphasis filter correctly implements the standard formula y[n] = (x[n] - α * x[n-1]) * scale with appropriate scaling. The initialization with the first sample is mathematically correct.
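
A sketch of that recurrence; the alpha and scale values here are illustrative rather than the crate's constants:

fn preemphasis(x: &[f32], alpha: f32, scale: f32) -> Vec<f32> {
    // Initialize the delay element with the first sample, so
    // y[0] = (x[0] - alpha * x[0]) * scale.
    let mut prev = x.first().copied().unwrap_or(0.0);
    let mut y = Vec::with_capacity(x.len());
    for &s in x {
        y.push((s - alpha * prev) * scale);
        prev = s;
    }
    y
}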


591-658: Accurate mel filterbank implementation.

The mel filterbank creation uses correct mel scale conversions and triangular filter construction. The frequency-to-bin mapping and filter coefficient calculations follow standard audio processing practices.
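
The mel <-> Hz conversions underpinning those triangular filters follow the standard formulas; a small sketch:

fn hz_to_mel(f: f64) -> f64 {
    2595.0 * (1.0 + f / 700.0).log10()
}

fn mel_to_hz(m: f64) -> f64 {
    700.0 * (10f64.powf(m / 2595.0) - 1.0)
}

// Filter edges are spaced evenly on the mel axis, then mapped back
// to Hz (and on to FFT bins) via mel_to_hz.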


660-796: Complete audio processing pipeline.

The audio processing methods properly handle sequence-level audio data extraction, feature computation, tensor creation, and batch preparation. The attention mask creation and padding logic ensures proper integration with the transformer architecture.
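
A sketch of the pad-to-max-length plus attention-mask pattern described above, using plain vectors in place of candle tensors:

fn pad_and_mask(batch: &[Vec<f32>]) -> (Vec<Vec<f32>>, Vec<Vec<u8>>) {
    let max_len = batch.iter().map(|s| s.len()).max().unwrap_or(0);
    let mut padded = Vec::with_capacity(batch.len());
    let mut masks = Vec::with_capacity(batch.len());
    for seq in batch {
        let mut p = seq.clone();
        p.resize(max_len, 0.0); // zero-pad to the longest sequence
        let mut m = vec![1u8; seq.len()]; // 1 marks a real frame
        m.resize(max_len, 0); // 0 marks padding
        padded.push(p);
        masks.push(m);
    }
    (padded, masks)
}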

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
mistralrs-core/src/vision_models/phi4/inputs_processor.rs (1)

70-81: Add validation for required audio configuration fields.

The code still uses .expect() which will panic if these fields are missing. This was previously flagged and should use proper error handling instead.

🧹 Nitpick comments (2)
mistralrs-core/src/vision_models/phi4/inputs_processor.rs (2)

193-195: Clarify variable naming for audio-only processing.

Setting num_img_tokens to empty vectors when processing audio-only inputs is confusing since the variable name suggests image tokens. Consider renaming to something more generic like num_modal_tokens or handling audio and image tokens separately.


306-307: Simplify audio embed sizes handling.

The audio_sizes_tmp cloning and unwrap_or pattern is convoluted. Consider simplifying this logic for better readability.

-            let audio_sizes_tmp = audio_embed_sizes.clone().unwrap_or(vec![]);
-            let mut audio_embed_sizes = audio_sizes_tmp.iter();
+            let mut audio_embed_sizes = audio_embed_sizes.as_deref().unwrap_or(&[]).iter();
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ff67bf8 and 167f1b5.

📒 Files selected for processing (1)
  • mistralrs-core/src/vision_models/phi4/inputs_processor.rs (10 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Docs
  • GitHub Check: Check (macOS-latest, stable)
  • GitHub Check: Check (ubuntu-latest, stable)
  • GitHub Check: Clippy
  • GitHub Check: Check (windows-latest, stable)
  • GitHub Check: Test Suite (ubuntu-latest, stable)
  • GitHub Check: Test Suite (macOS-latest, stable)
  • GitHub Check: Test Suite (windows-latest, stable)
🔇 Additional comments (3)
mistralrs-core/src/vision_models/phi4/inputs_processor.rs (3)

12-50: LGTM - Well-structured audio processing additions.

The new audio processing imports, constants, and type definitions are well-organized and follow consistent patterns with the existing image processing code.


398-796: Comprehensive audio processing implementation - well done!

The audio processing functionality is well-implemented and addresses all the concerns from past reviews:

  • Uses actual audio data from sequences instead of dummy audio
  • Proper error handling with descriptive messages
  • Correct preemphasis filter implementation
  • Comprehensive mel spectrogram extraction with FFT

The implementation successfully integrates audio processing with the existing multimodal pipeline.


299-395: Excellent integration of audio processing with existing image pipeline.

The token expansion logic properly handles both image and audio tokens, and the integration maintains consistency with existing patterns. The multimodal input processing is well-structured.

@EricLBuehler EricLBuehler merged commit fff2665 into master Jun 9, 2025
12 of 13 checks passed
@EricLBuehler EricLBuehler deleted the phi4_multimodal_audio branch June 9, 2025 13:53
Jeadie added a commit to spiceai/mistral.rs that referenced this pull request Jul 14, 2025
* Fix handling of Metal fused attn head dims (EricLBuehler#1234)

* Fix handling of metal attn head dims

* Fix handling of gemma3 1b when images

* Tweak default for paged attn builder

* Support paged attn for vision model rust api (EricLBuehler#1235)

* [Breaking] Support setting HF cache path (EricLBuehler#1237)

* Add it internally

* Add the apis

* Support tool calling for DeepSeek models (EricLBuehler#1239)

* Support tool calling for deepseek models

* Format

* Fix deepseek

* Server image processing refactor and fixes (EricLBuehler#1244)

* Fix strict gemma3 case

* Accept multiple images in the content array

* Fix multiple images in one array ct

* Add it to the python api

* Typos

* Optimized CUDA RoPE kernels (EricLBuehler#1247)

* Add the kernels

* It works

* Works

* Buulds

* Typo fix (add_speial_tokens to add_special_tokens) (EricLBuehler#1246)

* Fix typo

* Update mistralrs.pyi

* Fixes for UQFF + distributed layers (EricLBuehler#1250)

* Fixes for uqff + distributed layers

* Typo

* Automatic agentic search integration (`web_search_options`) (EricLBuehler#1243)

* Add the tool

* Actually search

* Clippy

* Sort of works

* Remove some debuggers

* tweak

* Add some rules

* Works great

* Tweak 'system' prompt

* Update mistralrs-core/src/search/mod.rs

Co-authored-by: Copilot <[email protected]>

* Typo

* Add it to all the apis

* Add bert model for similarity reranking

* Typos

* Early detection of tools

* Alias max_tokens -> max_completion_tokens too

* Customizable bert model

* Flip the enabler around

* Add docs

* Update readme

* Typo

---------

Co-authored-by: Copilot <[email protected]>

* Format kernels (EricLBuehler#1251)

* Update readme

* Update readme

* Remove test

* Add quantize guards for uqff deserialize (EricLBuehler#1252)

* Refactor cuBLASlt-related code (EricLBuehler#1253)

* Centralize cublaslt into mistralrs-quant

* Use cublaslt in unquant layer

* Use beautiful trait constants for simpler code

* Move tests

* Dispatch to unquant for cublaslt

* Dispatch to unquant for cublaslt

* Fix feature

* Add convert_to_gptq script

* Update deps, bump pyo3 version (EricLBuehler#1259)

* Faster cuda FP8 performance (EricLBuehler#1257)

* Avoid fp8 sync

* Fix dtype

* Rust 1.86 clippy (EricLBuehler#1260)

* Rust 1.86 clippy

* Clippy

* Refactor engine arch (EricLBuehler#1262)

* Refactor engine add_request

* Don't recompile regex

* Clippy

* Revamped LoRA support - removing the Ordering system! (EricLBuehler#1263)

* Play with varbuilder lifetimes

* Merge lora weights

* Clippy

* Lora works

* Support multiple loras

* Cleanup, remove adapter activation

* Complete merge

* Fast Metal-specific quantization method: AFQ (EricLBuehler#1264)

* Add mlx quantized kernels

* Add mlx quantized kernels

* Kernel launcher

* Add AFQ isq quant and dequant

* Some quantmethod things

* Begin to implement the qmm caller

* Clippy

* Much faster

* Cache kernels

* Docs

* Clippy

* Add it to uqff

* Support prequantized models from MLX (EricLBuehler#1265)

* Refactor quantizedconfig

* Support AFQ prequantized

* Update docs

* Update docs

* Automatic ISQ to select fastest & most accurate method (EricLBuehler#1266)

* Automatic isq

* typo

* Doc

* Improved usage metrics (EricLBuehler#1267)

* Fix cuda

* Bump tokio from 1.44.1 to 1.44.2 (EricLBuehler#1270)

Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.44.1 to 1.44.2.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](tokio-rs/tokio@tokio-1.44.1...tokio-1.44.2)

---
updated-dependencies:
- dependency-name: tokio
  dependency-version: 1.44.2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Gather MM ops in mistralrs-quant (EricLBuehler#1272)

* Update the caller

* Wire things up

* Broadcase for afq gathermm

* Broadcase for afq gathermm

* Clippy

* Improve performance of deepseek models

* Typo fix

* BincountOp not used

* Implement Llama 4! (EricLBuehler#1268)

* Implement Llama 4

* Implement the main changes for the text model

* Make chunked mask

* Wire things up

* Add some EP

* Initial sketch of inputs processor

* Runs

* Progress

* all reduce moes

* It works!

* Some cleanup

* Faster moe block

* Add device map

* Make chunked matrix

* Fully working now!

* Reactivate cublaslt

* Fix shared mlp cublaslt

* Refactor to packed experts

* Complete merge

* It is a normal model now

* Fixes

* Set device for moe

* ISQ fixes

* Much faster sort kernel

* Faster loading!

* Faster loading!

* Fp8 cpu copy ops in candle backend

* Add the vision model

* Add mmproj layer

* Actually merge the inputs

* Sketch most of the image processor

* Add the rest of the image processor

* Implement the whole processor

* Add the loader

* Some fixes

* A batch of fixes

* Some fixes

* tmp

* Actually support isq

* Ok it works a bit

* Fix norm device

* It works

* A bit cleaner

* Support residul tensors

* Remove text loader

* Implement the device mapping system

* Fix auto device map

* Add examples

* Add model card

* Typo

* Remove superflous logging

* Fixes for Llama 4 UQFF loading (EricLBuehler#1275)

* Support sharding for UQFF (EricLBuehler#1276)

* Serialize sharded uqff files

* Loading

* Fix base64

* Fix bug for group-topk (group_limited_greedy) in deepseek models (EricLBuehler#1278)

* Support the DeepCoder model (EricLBuehler#1279)

* Add faq for metal not found

* Improved PagedAttn scheduling accuracy (EricLBuehler#1282)

* Scheduler ops by reference

* Ensure scheduler gets correct prompts

* Fix cuda build for copy_blocks

* Fixes for scheduling image seqs with pagedattn (EricLBuehler#1283)

* update to llguidance 0.7.16 (EricLBuehler#1284)

* update llguidance to 0.7.16 from crates.io; use ParserFactory

* add lark_llg.py example

* use new llguidance::Matcher APIs

* rework spec-decoding with llg

* more work on spec sampling

* check for parser stop

* fix clippy

* remove unneeded rollback

* update build_llg_factory to return Result

* Update dependencies (EricLBuehler#1286)

* Much faster image inputs processing (EricLBuehler#1289)

* Add more SDPA head dims for much faster SigLIP (EricLBuehler#1290)

* More sdpa head dims, faster vision models

* Move nonzero to above for faster metal synch

* Doc

* Update valid head dims

* Show throughput in interactive mode (EricLBuehler#1291)

* Update interactive mode throughput stats

* Accurate prompt t/s

* Accurate prompt t/s for usage

* Unify bitwise operations (EricLBuehler#1288)

* Unify bitwise ops

* Tests pass

* Fix cuda build

* Clippy

* Multimodal prefix caching support! (EricLBuehler#1298)

* Initial progress

* Support vision prefix caching

* Update docs

* Add multimodal data abstraction

* Interactive mode improvements (EricLBuehler#1299)

* More ergonomic image url parsing

* Add option to clear

* Add the Qwen 3 and Qwen 3 MoE models! (EricLBuehler#1285)

* Add qwen3 model

* Add enable_thinking

* Add initial qwen3 moe

* Add the moe model

* Format

* Fix order of norm

* Fix expert shapes

* Fix reverse

* Fix norm device for isq

* Fix nonzero when no nonzero

* Moe model runs

* Working qwen3 moe

* Add metal fp8 blockwise dequant

* Clean

* Typo

* Enable tool calling

* Streamlined ux

* Add some examples

* Add docs

* Fix dead link

* Remove interactive mode max_len

* Update QWEN3.md

* Hotfix for vision mode clear

* Revamped and streaming web search support (EricLBuehler#1301)

* Streaming web search

* Refactor a bit

* More refactoring

* Add some logging, parallelize some things

* Allow url

* Suppress warning, allow multi-turn searching

* Batch compute_similarities

* Cap content len

* Typos

* Doc

* Handle vision messages or different tool call prefixes (EricLBuehler#1302)

* Fix cuda

* Tune web search budget

* Simplify prefix cacher (EricLBuehler#1305)

* Use rustyline to handle non-ascii in interactive mode (EricLBuehler#1306)

The io::stdin().read_line() cannot handle non-ascii input, which caused
crash when use backspace to delete non-ascii characters.

Introduce rustyline to the interactive mode to solve the problem. Plus
it can bring more editing features in the future.

Close EricLBuehler#1140

* Add more tools for automatic search (EricLBuehler#1307)

* Add interactive mode history

* Add a website extraction tool

* Pass toks by reference

* Optimize prompt chunking

* Fix CPU hogging in interactive mode (EricLBuehler#1309)

The log enabler should be checked after the sleep instead of a busy
loop checking.

Since the interactive mode always disables the token speed logger, 100%
CPU was taken by this loop always.

* Add Metal precompilation support  (EricLBuehler#1311)

* Add metal precompilation for paged attn

* Add for mistralrs-quant

* Better constructor

* Dont always build

* Fix name for paged attn rebuild

* Reduce thrashing of Metal autorelease (EricLBuehler#1313)

* Reduce calls to autorelease

* Optimize clone_in_cache

* Refactor float8

* make `AdapterPaths` and `LoraAdapterPaths` public (EricLBuehler#1314)

Make `AdapterPaths` and `LoraAdapterPaths` public so `LocalModelPaths`
can be constructed outside of `mistralrs-core`.

* Refactor KV cache manager (EricLBuehler#1315)

* Refactor kv cache

* Refactor caches

* Fix some overflows

* Add `Audio` and `Speech` model categories (EricLBuehler#1317)

* add `Audio` to `ModelCategory`

* add `Speech` to `ModelCategory`

* fix to go back to PartialEq having an exhaustiveness check

* Remove has_conv2d from vision model API (EricLBuehler#1318)

* Unified/automatic flash attention enabler (EricLBuehler#1319)

* Remove from sdpa params

* Fix errors

* No warnings

* Log

* Clippy

* Fix cublaslt 4d mask (EricLBuehler#1320)

* Fix cublaslt 4d mask

* Clippy

* Keep caches on gpu

* Qwen VL models fixes (EricLBuehler#1322)

* Add some defaults

* Fix

* Fix one thing

* 2.5 vl works

* Use caching again

* Fix v2

* Move index inside loop

* Offset in ropeidx

* Default support for vision prefix caching is false

* Fixes for all vision models (EricLBuehler#1323)

* Fix phi input processor?

* Fix phi input processor

* Handle no_prefix_cache from pipeline

* Phi models confirmed 👍

* Fixed for phi inputs processors

* Fixed for phi4

* Llama 3 confirmed 😀

* Mistral 3 confirmed 😃

* Idefics 2/3 fixes

* Some fixes

* Remove unsafety

* Improved+faster LRU prefix cacher (EricLBuehler#1321)

* Show TTFT

* Use LRU prefix cacher

* Faster prefix cacher

* Inplace ISQ support and default to mmap (EricLBuehler#1277)

* Initial impl of immediate isq

* Immediate isq -> !loading_isq

* Varbuiler utils always using mmap!

* Log

* Add for packed experts

* Afq without copy

* Clarify

* Clippy

* Apple immediate isq

* Better logic for loading_isq

* Support showing ttft

* Rename

* Shared quantize guard

* Parallel progress bar

* Parallel loading for progress bars

* Actual ISQ support

* Conditional parallelism for NiceProgressBar

* Use conditional iterator

* Warn once

* Predicate for applying immediate isq

* Allow parallel

* Remove debug print

* Remove debug print

* Remove debug print

* Fix typos (EricLBuehler#1329)

* Fix Idefics 3 arch chat templating (EricLBuehler#1330)

* Update inputs merger

* Fix

* Better warning

* Better warning

* Better warning

* Nonzero ahead of time

* No f32

* Clippy

* Optimize get_logprobs

* Fix packed experts

* Update masking

* Use Sdpa in idefics3

* QuantMethod in idefics3 vision

* Remove a .contiguous

* Remove two space from PR comment (EricLBuehler#1331)

* Add automatic vision loader type (EricLBuehler#1332)

* Add automatic vision loader

* Remove references to --arch

* Update examples

* Add the Dia 1.6b TTS model! (EricLBuehler#1304)

* Add loading

* Add rope, mlp, most of attn

* Add encoder + encoder layer, decoder layer forwards

* Add decoder forwards

* Add prepare_audio_prompt

* prepare_generation mostly done

* Add a proper dia kvcache

* Add most of decoder_step

* Add the sampler

* Add the generation loop

* Wire things up

* Add speech pipeline

* Fixes

* Loads

* Some fixes

* f32

* Some progress

* Ok it runs upto dac decoding

* Add dac part loading

* Loads and runs at least

* Remove encodec

* Debugging

* Debugging

* Huh

* Complete merge

* Interactive

* Confirmed dac works at least

* Looks like encoder works

* Much progress

* Hmm

* Sampling

* Almost there

* Sampler

* Sampler

* Bf16 support

* Response

* Use it in interactive mode

* Fix oneshot

* Add openai api

* Add openai api

* Refactor loading

* Use naive sdpa for inplace

* Factor out

* Clippy

* Clippy

* Config

* Refactor config

* Metal clippy

* Fix t/s

* ISQ support

* Some fixes, nits

* Fix cuda

* Clippy

* Inhibit cublaslt for cuda

* Add server example

* Add python example

* Add rust api

* Add docs

* Update config.toml

* Fix .pyi

* Update readme

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* update `llguidance` to `0.7.20` (EricLBuehler#1334)

Update `llguidance` from `0.7.16` to `0.7.20` so that it has guidance-ai/llguidance#172 which is a fix for building on GCC 15.

* Add model category <> messages check (EricLBuehler#1335)

* Verify model category matches the messages

* Add vision chat

* Fixes

* Add element-wise normalization check (EricLBuehler#1340)

* Fix streaming example print statement (EricLBuehler#1339)

* Fix normalization formula in comment (EricLBuehler#1338)

* Fix image_to_pixels to handle non-RGB images (EricLBuehler#1337)

* Fix typo in expect messages (EricLBuehler#1342)

* Don't use mmap on cuda (EricLBuehler#1336)

* No mmap on cuda

* Simplify streaming tool call logic

* Remove debug

* Support AWQ format models (EricLBuehler#1350)

* Support AWQ format models

* Clippy fix

* Fix uqff dummy layer ISQ application (EricLBuehler#1351)

* Disable immediate isq if write_uqff (EricLBuehler#1352)

* Fixes for UQFF loading on CUDA, ISQ pack factor (EricLBuehler#1354)

* Fix logic for uqff on cuda

* Updated pack_factor

* Refactor Option references for model paths (EricLBuehler#1347)

* refactor: use Option refs in model path helpers

* Format

* Add a script for server benchmarking (EricLBuehler#1355)

* Serde alias

* Fix

* Update for tie_word_embeddings

* Print running/waiting

* 30 users

* Update num_users

* Update dummy paged attn

* Optimized Metal qmv_fast path (EricLBuehler#1356)

* Compile with lto

* Tweak profiles

* New, fast sampler for Metal! (EricLBuehler#1327)

* Show TTFT

* Use LRU prefix cacher

* Faster prefix cacher

* A bit of gpu sampling

* Minp but cpu for now

* Metal fast cumsum impl

* Sampling with fast topp kernel

* Hmm not perfect

* Add metal sort kernels

* Tmp

* Add single block sort

* Add most of multi block sort, just need copy op

* Add copy kernels

* Expose kernels

* Add a test

* Ok it works

* Structure things

* Add caching

* Rename

* Cpu is default

* CUDA case

* Topk

* Refactor Option references for model paths (EricLBuehler#1347)

* refactor: use Option refs in model path helpers

* Format

* Add a script for server benchmarking (EricLBuehler#1355)

* Serde alias

* Fix

* Update for tie_word_embeddings

* Print running/waiting

* 30 users

* Update num_users

* Update dummy paged attn

* Optimized Metal qmv_fast path (EricLBuehler#1356)

* Compile with lto

* Tweak profiles

* Fix topk

* Penalties

* Add logits processor, clippy fixes

* Fix chat port

* Remove warning

* Fix chat port

* Fix metal parallel sampling (EricLBuehler#1357)

* Cpu if parallel for now

* Tweak bench script

* Add immediate isq predicates for qwen3 (EricLBuehler#1358)

* Add immediate isq predicates for qwen3

* Fix parsing of "parse_isq_value" depedent of device

* Typo

* Fix gemma3 logging

* Regressions fixes (EricLBuehler#1359)

* Fix regression for mmap

* Revert EricLBuehler#1321

* Refactored matching_cache impl

* Clippy

* Revamped and smaller readme (EricLBuehler#1360)

* Expandable detail sections

* Refactor using derivative model

* Tweak quick examples

* Update llama

* Update llama

* Supported accelerators is a table

* Update installation guides

* Tweak apis

* Remove --port in quick examples

* Add demo gif

* Add gif in readme

* Update demo gif

* Update demo gif

* Update demo gif

* Add gif in readme

* Add gif in readme

* Add a web chat app! (EricLBuehler#1362)

* Initial

* Markdown

* Copy code

* Add model loading sidebar

* Support vision models

* Tweak isq

* Links go to another page

* Clear when switch model

* Fix html tags

* Add image support!

* More then one images

* Fix

* Improved textarea

* Tab for switching between vision and text

* No paged attn for now

* Prettier format

* Multiple models at once

* Better switching, clearing ability

* Mobile support

* Inline markdown parser

* Update examples

* Typos

* Support specifying isq

* Fix mobile

* Fixes

* Fix button on mobile

* Image height is capped

* Thumbnail

* Fix rotating kv cache edge case

* Add drag and drop for images

* Small things

* Sidebar is frozen now

* Better listner

* Add readme

* Tweak readme

* Add chat history support to web chat app (EricLBuehler#1363)

* Add chat history

* Support renaming

* Start immediately with new chat

* Add timestamp

* Prettier chat list

* Style

* Delete chat

* Fix copy button

* Fix markdown rendering

* Store things in cache

* Store things in cache

* Refactor web chat, fix multichat image restore (EricLBuehler#1364)

* Fix multichat image restoration.

* Clippy

* Refactor

* Refactor frontent

* Fix repeated immediate isq init (EricLBuehler#1365)

* Add images_ref

* Add debug impl

* Fix the bug

* Tweak style of buttons

* Add a spinner

* Move spinner

* Tweak emoji

* Add gif

* Tweak initial gif

* Include vision tower tensors in Mistral3 UQFF (EricLBuehler#1366)

* Fix mistral 3 uqff resitdual tensors for vision

* Rolling shard creation for uqff files (EricLBuehler#1367)

* Fix occasional unstability during isq of afq (EricLBuehler#1368)

* Fix unstability during isq of afq

* Clippy

* Fix web chat installation

* Support web chat file uploading (EricLBuehler#1370)

* Web chat fixes

* Fix thumbnail in message, reuse blank chat

* Add file uploading support

* Fix scroll

* Allowed extensions

* Preserve files as literals

* Support multiple clients

* Add a stop button

* New cache dir

* New cache dir

* Fix

* Refactor

* Update readme

* Tweak drag-and-drop css

* Add speech generation support to the web chat! (EricLBuehler#1373)

* Initial speech gen support for web chat

* Tweak ui

* Update docs

* Prefix caching for PagedAttention! (EricLBuehler#1369)

* Exposing some things for logical token blocks

* Prefix cache manager has the scheduler

* Refactor

* Get logical and physical blocks into the prefix cacher

* Hash and cache

* Pass physical block prefill

* Allocation of prefilled block tables

* Temp

* Dont always use 2

* Hmm

* Hmm

* It mostly works

* Increment refcount

* Support images!

* Add to dummy paged attn

* Fix some clippy

* Clippy

* More checks

* Include EricLBuehler#1371, closes EricLBuehler#1371

* Typos

* Update docs

* Metal PagedAttention accuracy improvements (EricLBuehler#1374)

* Fix subtle bug

* Fix half sum bug

* Format metal paged attention

* Handle images in paged attn scheduler (EricLBuehler#1375)

* Include schemas needed for chatcompletions endpoint (EricLBuehler#1353)

* EricLBuehler#1326: WIP include schemas needed for chat completions endpoint

 Conflicts:
	Cargo.lock
	mistralrs-server/src/main.rs

* EricLBuehler#1326: WIP define utoipa as a workspace dep since core and server both need it

* EricLBuehler#1326: first draft of handling schemas that use Either

* EricLBuehler#1326: first draft of handling schema for Grammar

* EricLBuehler#1326: Add in other endpoints to API docs.

* EricLBuehler#1326: Adjust code comments

* EricLBuehler#1326: Implement coderabbitai suggestions

- EricLBuehler#1353 (review)
- EricLBuehler#1353 (comment)

* Fix constraints with metal sampler

* Revert EricLBuehler#1375

* Fix case where prefix cacher returns no toks (EricLBuehler#1377)

* Fix AFQ UQFF serialization

* Faster UQFF serialization (EricLBuehler#1379)

* Faster UQFF serialization

* Fix uqff gemma3

* Improve gemma3 auto loader names

* UQFF creation for AFQ on CPU support (EricLBuehler#1380)

* Add afq cpu quantize/dequantize

* Clippy

* Improved device for afq quantize

* Improved dtype handling for cpu afq (de)quantize

* Improved generate_uqff_card

* Add fused CPU attention kernel! (EricLBuehler#1382)

* Working

* Fix warnings

* Allow mask

* Support bf16, f16

* Handle striding

* Parallelized

* Add initial vector flash attn

* Avoid repeated allocations

* Tiled kv

* Apply some clippy

* Some small fixes

* Chunked vec_dot

* Clipy

* Use T::zero

* Refactor attention backends (EricLBuehler#1384)

* Refactor attention code

* Refactor attention code

* Move into backends

* Set macOS thread affinity for CPU attn (EricLBuehler#1385)

* Use lazylock

* Format

* Fix metal warn build

* Faster Qwen 3 MoE support on Metal (EricLBuehler#1387)

* Fix load

* Use afq gather qmm

* Well it runs

* It works

* Polish

* Fast and slow options

* Remove quantized.rs

* Polish some more

* Refactor

* Add isq

* Update load in parallel

* Support fp8

* Refactor for FusedExperts

* Clippy

* Handle pack factor when loading prequantized models

* Use f32 only in moe

* Avoid using f32 so much

* Avoid using f32 so much

* Fix PagedAttention block leaks (EricLBuehler#1388)

* Warn and ignore if ignored

* Fix a block allocation leak

* Update bench.py

* Fix double free in block engine

* Do not apply ISQ if loading a prequantized model

* Fix cuda build again (EricLBuehler#1389)

* Fix cuda build

* Fix

* Format

* Fixes for cuda docker

* Update dockerfiles

* Bump version to 0.6.0 (EricLBuehler#1390)

* Bump version to 0.6.0

* Remove lower_level api

* Make a static dir

* Update deps

* Fix routing for static handler in web chat

* Fewer .contiguous calls for qwen3 moe (EricLBuehler#1391)

* Allow speech models to accept batched inputs (EricLBuehler#1393)

* Allow speech models to accept batched inputs

* Clippy

* Ring distributed backend for heterogeneous TP (EricLBuehler#1238)

* Begin work on ring distributed backend for Metal

* Add the actual ring functionality

* It loads and kind of runs

* It works

* Optimize buffer allocation

* Avoid copy

* It works

* Add allgather

* Fix load

* Ping-pong

* Small things

* Add config json

* Allow different ip address

* Read config once

* Read config when appropriate

* Replicate requests

* Small fix

* Fix small compat with openai

* Clippy

* Update docs

* Add deepseek tool calling chat template

* Add auto loader for vision/text detection! (EricLBuehler#1402)

* Add auto loader for vision/text detection

* Build fixes

* Add model loader

* Update docs

* Format

* Create Mistral.rs Server Core Lib: `mistralrs-server-core` (EricLBuehler#1346)

* First draft of exposing mistral server routes as lib

* make arg struct fields pub

* Take base path so utoipa swagger route can properly redirect

* Expose swagger routes and make it configurable

* Add base path option for swagger docs

* More work on modularizing mistralrs server

* Sync fork (+1 squashed commit)
Squashed commits:
[169ae9e] Sync fork

* Adjust fn params to use refs / individual params instead of args

* Start breaking down controller actions into smaller pieces

* Continue refactoring

* Make mods pub so they can be used outside crate

* Allow chat completion streamer to take a callback so that you can get the complete response when finished

WIP (+3 squashed commits)
Squashed commits:
[0061d87] WIP
[c484d56] WIP
[16f8a60] WIP

* Sync fork

* Adjust callback type

* Remove throughput_log arg that was removed in 26afcc3

* Implement defaults for Args (and use for Clap)

* Small code formatting tweaks

* Rename callback to match SSE event and code clean up

* Sync fork

* WIP: first very rough draft of server core builder. Doesn't meet parity with old functional approach yet (slower / unstable?).

* Clean up (+4 squashed commits)
Squashed commits:
[e1cff387] Sync fork
[d8301025] WIP debugging
[1ea9f8c8] Sync fork
[4fe28cf5] WIP: debug function

* WIP server core builders

* Code clean up

* Add on_chunk callback

* Code clean up

* First draft of creating version of mistral-server that uses server-core

Code clean up (+1 squashed commit)
Squashed commits:
[adea1693]

* Sync fork

* Add helper methods to builder to make optional args more ergonomic (since .build validates params)

* Start adding docs

* Start cleaning up crates deps

* Example commit of mistral-server with implementing server-core

* Start addressing CodeRabbit feedback

* Fix comment typo

* Tweak doc blocks

* - Update type alias naming for clarity (MistralRs instead of Mistral)
- CodeRabbit, don't use eprintln for lib (use trace)
- Allow buffer size to be passed in and default to Constant
- Allow router body limit to be passed in and default to Constant
- Update doc examples

* Typo

* Address CoderRabbitAI feedback

* Support linear rope for llama3 (EricLBuehler#1408)

* Hotfix for loading

* Fix vllama4 uqff loading (EricLBuehler#1409)

* Fix vllama4 uqff loading

* Fix regex

* Fix regex

* Maybe a fix

* Gracefully handle receiver disconnects (EricLBuehler#1410)

* Handle receiver disconnects

* Format

* Fix Qwen3 MoE device mapping irregularities (EricLBuehler#1411)

* Fix bias

* Fix lm_head packing case

* Account for gate

* Fix head dim

* Fix interactive mode URL parsing (EricLBuehler#1412)

* fix url regex in vision interactive mode

* Fix regex

* Clippy

* Refactor auto device map (EricLBuehler#1413)

* Refactor auto device map

* Refactor a bit more

* Clippy

* Enable runtime sampling tweaks in interactive mode (EricLBuehler#1414)

* Document runtime sampling commands

* Fix readme

* Tweak

* Bounds checking

* Tweak temp bounds

* Send streaming tokens every time

* Gumbel sampling for fast sampler (EricLBuehler#1416)

* Improved handling for initialize_logging

* Improved CPU flash attention accuracy & performance (EricLBuehler#1417)

* Downcast correctly

* Operate internally in f32

* Avoid some casts and striding

* Prefetch

* Provide chat_templates to container users (EricLBuehler#1419)

Models often come without chat templates requiring mapping them
from the source repository into a container for access by the
mistralrs-server.

Copy the templates from the build tree into the root of the image
to permit use via `--chat-template /chat_templates/something.json`

TODO:
  With the increase in quantized models and support for other
formats, the initial benchmark run during model load can be used
to qualify/select existing chat templates embedded into the binary
for models which do not come with any (to include output of the
functional failures in each test allowing users to modify the
ones already provided correctly to suit the model being loaded).

Co-authored-by: RageLtMan <rageltman [at] sempervictus>

* Faster cpu flash attn (EricLBuehler#1418)

* Faster cpu flash attn

* Prefetch

* Clippy

* Add some tests

* Add softcap tests

* Fix test_parse_image_url test

* Update tests

* Update tests

* Web search improvements (bm25, web chat) (EricLBuehler#1420)

* Fix web search blocking case

* Web search support in web chat

* Tweak ui

* Support fallback to bm25

* Clippy

* Reinject descriptions

* Propely handle consecutive searches (EricLBuehler#1421)

* Update extraction tool reinjection

* Looped

* Update docs (EricLBuehler#1422)

- lib.rs: clean up example var names and match logging change from EricLBuehler@201d6be
- server_builder: fix typo
- READMEs: link to crate docs

* Better tool call detection logic (EricLBuehler#1424)

* Add web search hook callbacks (EricLBuehler#1426)

* feat: add customizable search hook

* Move to builder

* Update docs

* Fix CUDA context switching, bind thread on CudaStorage drop (EricLBuehler#1428)

* Add CUDA context helper and use in Llama forward

* No flashparams?

* working

* Tweak

* Update to use dep

* conditionally build flash attention inputs (EricLBuehler#1429)

* Add AGENTS.md (EricLBuehler#1430)

* Support Qwen3 GGUF model (EricLBuehler#1432)

* Support QWen3 GGUF model

* Clippy fix

* cargo fmt

* Improved paged attn prefix caching (EricLBuehler#1434)

* Improved paged attn prefix caching

* Disable

* Clippy

* Temporary fix for qwen3 gguf tokenizer (EricLBuehler#1433)

* Temporary fix for qwen3 gguf tokenizer

* Typo fix

* Add tool callback support (EricLBuehler#1427)

* Add tool callback support

* Fixes

* Support named tool callbacks

* Update examples

* Update docs

* Clippy

* Centralize crate dependencies (EricLBuehler#1438)

* chore: centralize dependencies

* Format

* Fix bug in tokenizer created with gguf metadata (EricLBuehler#1440)

* Fix bug in tokenizer created with gguf metadata

* Clippy fix

* Update deps (EricLBuehler#1441)

* Small things

* Update deps

* Update deps

* Update breaking changes

* Doc fixes (EricLBuehler#1442)

* Mention uqff_maker

* Downgrade rustyline 16.0.0 -> 15.0.0 (EricLBuehler#1444)

* Add max_completion_tokens alias for server (EricLBuehler#1451)

* Audio input support (Phi 4 multimodal) (EricLBuehler#1448)

* Deps

* Add conformer

* Nemo loading

* Position embeds

* Load t5 attn bias

* Attn and feed forward

* Add conv module and glu pointwise

* Implement relative attn bias

* Add the forward methods

* Add encoder embedding

* Fix oproj

* Some loading

* Conformer loads!

* Fully loading speech stack

* Merger

* Dont need that

* First pass at audio processing

* Read samples

* Optional

* Small loading fix

* Runs but not correct yet

* Improved audio processing?

* Works with this

* Fix t5 attn bias

* It works!

* Comment

* Use some other crates

* Clippy

* Allow bf16 on metal

* Add prefix_audio

* Remove unused

* Typo

* User specified

* Add audio url parsing

* AudioProjectionMode -> InputMode

* Audio prefix caching

* Fix bug in audio prefix caching

* Support both at the same time!

* Tweak logging

* Support stereo

* Add mistralrs-audio

* Support batching

* Add server and rust api example

* Add python api

* Fix add_multimodal_message

* Fix unfold for conformer

* Streaming example

* Add web chat support

* Add modalities registry

* Fix offline cache issue for gguf models (EricLBuehler#1452)

* Add MCP server endpoints (EricLBuehler#1453)

* feat(server): add MCP server support

* Add mcp docs

* Add handle_list_tools_request

* Better launch, tool handling

* Tmp state

* Ok works

* Handle modalities

* Update docs

* Add ping

* Tweak temperature bounds, args

* MCP documentation pass (EricLBuehler#1455)

* Fix table

* Update mcp docs

* Improve readme header

* Improve readme header

* Integrate an MCP client (EricLBuehler#1456)

* Add builtin mcp client

* Use async loader

* Add headers

* Handle sse

* More flexible search request

* Add tool callbacks with tools, for mcp

* Add bearer token support

* Add websocket support

* Update docs

* Add python api

* Clippy

* Add http api, docs

* Tests pass

* Make these configs actually work

* Add docs

* Make mistralrs-mcp

* Refactor examples

* Update examples

* Add defaults

* Add defaults

* Add defaults

* Update docs

* Improved docs

* Add -y to npx usages

* Even better examples

* Update generate_wheels

* Update generate_wheels

* Update generate_wheels

* Fix Dockerfile.cuda-all

* Improve automatic tool call (EricLBuehler#1460)

* Improved auto tool call

* Add logging

* chore: `Dockerfile.cuda-all` configurable threads (EricLBuehler#1458)

* chore: `Dockerfile.cuda-all` - Merge `RUN` for `apt-get install` (EricLBuehler#1459)

* Add fallback definition for isnan (EricLBuehler#1463)

* chore: `Dockerfile` - Drop runtime rayon thread ENV (EricLBuehler#1465)

* chore: Dockerfile - Remove rayon threads env

* chore: Dockerfile - Improve formatting for `apt-get`

* Remove duplicate calls for api_dir_list (EricLBuehler#1474)

* Remove duplicate calls for api_dir_list

* Support local cache for api_dir_list

* Fix home folder for metal

* Capitalized

* Fix transient pyo3 dep (EricLBuehler#1478)

Co-authored-by: Eric Buehler <[email protected]>

* Fix objc dep with non macos (EricLBuehler#1480)

* Fix phi 3/4 + nccl issue (EricLBuehler#1481)

* Fix log

* Fix n kv heads

* Fix phi3.5 moe (EricLBuehler#1482)

* Fix phi3.5 moe accum device

* Fix again

* Fix again

* Support GLM4 model! (EricLBuehler#1437)

* Support GLM4 model

* Mention GLM4 model in ReadMe

* glm4 type hint

* Typo fix

* Fix unsupported chat_template function

* Clippy fix

* Refactor distributed backend (EricLBuehler#1484)

* Refactor distributed backend, check power of 2

* Fix compilation

* Cap metal paged attn kv allocation (EricLBuehler#1485)

* Better paged attn metal cap (EricLBuehler#1486)

* Better paged attn metal cap

* Small fix

* Comment

* Small fix

* Refactor

* Server core: consolidate and unify route handlers and API surface (EricLBuehler#1423)

* Start working on consolidating completion and chat_completion underlying implementations

* Move response channel to util mod for now (since it's used with streaming and non streaming)

* More work on consolidating completions and chat completions

* More WIP consolidation of server core handlers

* More WIP consolidation of server core handlers

* More WIP consolidation of server core handlers

* Update docs and restrict completion core visibility

* CodeRabbit feedback: remove logprobs warn from route handler since parse request also checks this

* Use consistent var name for completions mod

* Make route handler modules public API consistent (same fn names, etc.) and provide proxy fn that wrap core fns so core mod doesn't have to be pub
Make lib.rs example compile checked and update example

* Code formatting

* Typo

* Sync fork

* Sync fork

* Docs example fix

* Support qwen3 gguf (EricLBuehler#1488)

* Add qwen3 gguf

* Template fixup

* Make bos/eos token IDs optional (EricLBuehler#1493)

* Remove python deps from CUDA dockerfiles (EricLBuehler#1487)

* Handle noncontiguous v in naive_sdpa (EricLBuehler#1499)

Co-authored-by: Eric Buehler <[email protected]>

* Server Core: refactor Paged Attention configuration (EricLBuehler#1500)

* Use StorageModePrivate for Metal PA kv cache (EricLBuehler#1506)

* Fix OpenAI stream: emit field in tool-call deltas for schema compliance (EricLBuehler#1507)

* FP8 KV-cache quantization for PagedAttention (EricLBuehler#1400)

* Add most of paged attn kv quant

* It builds a bit

* All the functionality at least

* Small fix

* Add a scale

* Fix bf16 usage

* Make k_v_scale optional

* Collector

* Tweak collection

* Refactor

* Add to apis

* Add cuda impl

* Fix compilation

* Fixes

* Handle ENABLE_FP8

* Format

* Tweak

* Fix scaled_convert usage

* Fix cache_t size

* Fixed scale collection

* Actual fix

* Fix fp8 for CC<8
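
As background for the scale-collection commits above: per-tensor FP8 (e4m3) quantization conventionally derives the scale from the tensor's max absolute value and the format's largest finite value (448). A minimal sketch simulated in f32 with hypothetical names — not the actual kernels from EricLBuehler#1400:

```rust
// Illustrative per-tensor FP8 (e4m3-style) scale computation, simulated
// in f32. The real implementation runs as GPU kernels on cache tensors.
const E4M3_MAX: f32 = 448.0; // largest finite e4m3 value

fn compute_scale(values: &[f32]) -> f32 {
    let max_abs = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    if max_abs == 0.0 { 1.0 } else { max_abs / E4M3_MAX }
}

fn quantize(values: &[f32], scale: f32) -> Vec<f32> {
    // Stand-in for a cast to fp8: divide by the scale, clamp to range.
    values.iter().map(|v| (v / scale).clamp(-E4M3_MAX, E4M3_MAX)).collect()
}

fn dequantize(q: &[f32], scale: f32) -> Vec<f32> {
    q.iter().map(|v| v * scale).collect()
}

fn main() {
    let kv = [0.5f32, -3.2, 1200.0];
    let scale = compute_scale(&kv);
    let deq = dequantize(&quantize(&kv, scale), scale);
    assert!((deq[2] - 1200.0).abs() < 1e-3); // round-trips within fp error
}
```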

* Fix the usual String != &str bit (EricLBuehler#1483)

Co-authored-by: RageLtMan <rageltman [at] sempervictus>
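
The "usual `String` != `&str` bit" refers to Rust's classic owned-vs-borrowed string mismatch; a minimal illustration of the common fix (illustrative only, not the actual diff from EricLBuehler#1483):

```rust
fn main() {
    let models: Vec<String> = vec!["default".to_string(), "gpt-4".to_string()];

    // `Vec::<String>::contains` expects `&String`, so `&"gpt-4"` (a
    // `&&str`) does not type-check:
    //     models.contains(&"gpt-4"); // expected `&String`, found `&&str`

    // Usual fixes: compare through `&str` via an iterator, or allocate.
    assert!(models.iter().any(|m| m == "gpt-4"));
    assert!(models.contains(&"gpt-4".to_string()));
}
```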

* Handle USE_FP8 for cuda

* Fix cuda warn

* Add readme

* Saturating sub in sequence state
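
For context on the commit just above: `usize` subtraction panics on underflow in debug builds (and wraps in release), so sequence-state counters typically clamp at zero with `saturating_sub`. A minimal sketch with hypothetical field names, not the actual sequence code:

```rust
// Hypothetical illustration of the saturating-subtraction pattern.
struct SequenceState {
    prompt_len: usize,
    completed: usize,
}

impl SequenceState {
    fn remaining(&self) -> usize {
        // `self.prompt_len - self.completed` would underflow if
        // `completed` ever exceeds `prompt_len`; saturating_sub
        // returns 0 in that case instead of panicking.
        self.prompt_len.saturating_sub(self.completed)
    }
}

fn main() {
    let state = SequenceState { prompt_len: 8, completed: 10 };
    assert_eq!(state.remaining(), 0); // clamped, no panic
}
```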

---------

Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: RageLtMan <[email protected]>
Co-authored-by: Brennan Kinney <[email protected]>
Co-authored-by: Guoqing Bao <[email protected]>
Co-authored-by: Matthew Haynes <[email protected]>

* Validate model name in OpenAI API (EricLBuehler#1509)

* Validate model name in openai api

* Add docs, allow 'ignore'

* Updated examples for EricLBuehler#1509

* Fix mcp import in doc string (EricLBuehler#1510)

* Add multi-model support! (EricLBuehler#1512)

* Refactor MistralRs

* Working multi-model!

* Add multi-model docs initially

* Update mistralrs-pyo3, mistralrs-bench, mistralrs

* Update apis for consistency

* API tweaks

* Logging tweaks

* Add examples, tweak cli

* Clearer pipeline id

* Fix config key semantics

* Format and clippy

* Tweak logging, fix example

* Clippy refactor

* Update examples

* Remove unused multi model docs

* Replace 'ignore' with 'default'

* Update docs

* Add stars label to readme (EricLBuehler#1513)

* Add CLAUDE.md

* Handle base_model.model case in lora (EricLBuehler#1514)

* Add thread_local! for engine-specific const/static (EricLBuehler#1517)
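
The `thread_local!` change above follows the standard Rust pattern for per-thread state; a minimal sketch with a hypothetical per-engine buffer (illustrative, not the actual engine code):

```rust
use std::cell::RefCell;

thread_local! {
    // Hypothetical per-engine scratch state; each engine thread gets
    // its own instance instead of sharing one global `static`.
    static SCRATCH: RefCell<Vec<u32>> = RefCell::new(Vec::new());
}

fn push_token(tok: u32) {
    SCRATCH.with(|s| s.borrow_mut().push(tok));
}

fn main() {
    push_token(42);
    SCRATCH.with(|s| assert_eq!(s.borrow().as_slice(), &[42]));
}
```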

* Fix MCP doc test (EricLBuehler#1511)

* Allow disabling metal precompilation (EricLBuehler#1518)

* Allow disabling metal precompilation

* Simple preprocessor

* Simple docs

---------

Co-authored-by: Eric Buehler <[email protected]>

* Rust 1.88 clippy (EricLBuehler#1522)

* Rust 1.88 clippy

* Format

* Fix cuda warnings (EricLBuehler#1526)

* Avoid panic decoding tokens on error (EricLBuehler#1527)

* Split Marlin and Paged Attention kernels for faster build (EricLBuehler#1525)

* Split Marlin and Paged Attention kernels for faster build

* Typo fix

* chore: update llguidance (EricLBuehler#1535)

* chore: update llguidance

* chore: remove unused import

* Add the SmolLM3 model! (EricLBuehler#1501)

* Add model

* Update loader

* Fix llama config usage

* Docs

* Fix config no_rope_layers

* Fix tie_word_embeddings default

* Add chat template

* Embed the chat templates

* Fix embedding template

* enable_thinking default true

* Update examples

* XML tools for smollm3

* Add smollm3 docs

* Fix openai examples

* Clippy

---------

Co-authored-by: Eric Buehler <[email protected]>

* Add full Gemma 3n support! (EricLBuehler#1519)

* Add initial

* Loading for text model

* Add ple embeddings

* Add altup, laurel block

* Update rmsnorm

* Add mlp

* Update attn norm application

* Currently no kv shared

* Wire it up

* It runs

* Fix bf16

* Fix scaled embd

* Fixes for mean

* tmp

* Attn confirmed

* Fix target_magnitude

* Add shared kv

* Ok it works

* Remove npy

* Fix streaming

* Remove warnings

* Remove paged attn

* Refactor rope

* Add immediate isq

* Add vision & mproj

* Update image processor

* Vision merge runs, not correct

* Remove

* Add mobilenet v5

* Add multimodal vision embedding

* Fix load

* runs

* Fix gamma

* Works but just not vision tower

* It works!!

* Tweak

* Fix warnings

* Move vision tower

* Fix warn

* Update cache manager things

* Refactor

* Add audio model, it loads

* Add audio processing

* It runs at least

* tmp

* A bit better

* Audio works!!!!

* Fused attn in vision

* Clippy

* Update audio runner

* Optimized audio model

* Remove unused things

* Fix inputs processor bug

* Remove comments

* Clippy

* Small optimizations

* Format

* Correctly register modalities

* Add docs

* Update readme

* Runs there

* Fixed padding from Blaizzy/mlx-vlm#410

* Add better checks

* Fix sdpa n_kv_groups

* Vision encoder works!

* Rotate image

* Clippy

* Fix cuda loading

* Updated device mapper

* Fix overflow

* Fix dtype errors

* Refactor image/audio embeddings

* Fix metal

* Fix dtype mismatch

* Audio processing fixes

* Audio processing fixes

* Works

* Audio is good

* Fix boi/eoi too

* Embed the chat templates

* Better embedding accuracy in non f32

* More f32

* Support bf16 on metal

* Add more ISQ

* Fixed device map

* Clippy

* Gemma3n no paged attn

* Fix saturating sub

* Faster rmsnorm

* Use sdpa for vision model

* Fix ple bug

* Fix name

* Fix multiaudio

* Add matformer config loading

* Add docs

* Add support for matformer in auto device mapper

* Update docs

* Typos

* Tweak

* Tweak

* Fix multidevice

* Fix gemma3n text model auto device map

* Fix dims3

* Fix auto device map vision

* Non-metal keeps PLE on cpu

* Complete merge

* Vision dtype f16 -> f32

* Fix metal nm device

* Fix uqff

* Typos

* Reference uqff

* Fix tests

* Fix sequence length check (EricLBuehler#1546)

* update candle version (EricLBuehler#1545)

Co-authored-by: AlpineVibrations <[email protected]>

* add ios target to metal deps (EricLBuehler#1548)

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: edwko <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Guoqing Bao <[email protected]>
Co-authored-by: Michał Moskal <[email protected]>
Co-authored-by: Chen Mulong <[email protected]>
Co-authored-by: Steph Wolski <[email protected]>
Co-authored-by: omahs <[email protected]>
Co-authored-by: Viktor Szépe <[email protected]>
Co-authored-by: Matthew Haynes <[email protected]>
Co-authored-by: RageLtMan <[email protected]>
Co-authored-by: Brennan Kinney <[email protected]>
Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: Sbargaoui <[email protected]>
Co-authored-by: Gaétan Lepage <[email protected]>
Co-authored-by: Ammar Elsabe <[email protected]>
Co-authored-by: luke <[email protected]>
Co-authored-by: AlpineVibrations <[email protected]>
Co-authored-by: Michael Tissen <[email protected]>