Commit 91a5ad7

Jeadie, EricLBuehler, dependabot[bot], brrr, and etiennebalit authored
Updates from EricLBuehler/mistralrs (#27)
* Refactor NCCL device mappers (EricLBuehler#1172)
* Bump ring from 0.17.11 to 0.17.13 (EricLBuehler#1179)
  Bumps [ring](https://github.com/briansmith/ring) from 0.17.11 to 0.17.13 ([changelog](https://github.com/briansmith/ring/blob/main/RELEASES.md), [commits](https://github.com/briansmith/ring/commits)). Dependency-name: ring; dependency-type: indirect.
  Signed-off-by: dependabot[bot] <[email protected]>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* DSV3/R1 fixes (EricLBuehler#1173): fix the launch of blockwise fp8 dequant, use async ops, optimize the non-MLA path with cat, fix the non-CUDA build, add more CUDA_CHECK calls; now fully working with PagedAttention
* Fix diffusion device mapping (EricLBuehler#1187)
* Internal abstraction for distributed ops (EricLBuehler#1188, EricLBuehler#1189)
* Make Sequence::set_toks safer (EricLBuehler#1190)
* Fix CI tests running out of storage (EricLBuehler#1191)
* Fix the build_cuda_all.yaml CI workflow (EricLBuehler#1193)
* Support tensor parallelism for vision models (EricLBuehler#1194): refactor distributed mapper prep, add vision-model TP including mllama, update docs
* Always pass _USE_MATH_DEFINES for CUDA (EricLBuehler#1195)
* Remove the matmul-via-f16 framework (EricLBuehler#1196) and its API (EricLBuehler#1197)
* Add UQFF text/vision model API (EricLBuehler#1198)
* Implement Qwen 2.5 VL (EricLBuehler#1184): reverse window index select, switch to rmsnorm, fix the config so the model loads, add examples for lower_level_qwen2_5vl. Note: set_use_matmul_via_f16(true) from the pipeline inputs processor caused a significant loss of precision that was hard to track down in later debugging; globally setting matmul precision may not be an ideal solution, so for now the precision is changed back in mistralrs-core/src/vision_models/qwen2_5_vl/inputs_processor.rs. Also fixed: for deterministic sampling, top_k should be Some(1) rather than None. Fix CUDA, Rustfmt, and Clippy issues; code cleanup.
* Implement Gemma 3, text only (EricLBuehler#1201): add the config, text model, and inputs processor; add to the APIs
* Implement Gemma 3 vision support (EricLBuehler#1202): add the image preprocessor and processor, make the mask contiguous, update docs
* Manually fix up sentencepiece detokenization (EricLBuehler#1204)
* More vision models with TP (EricLBuehler#1200)
* Fix the topology link in the docs (EricLBuehler#1205)
* Gemma 3 1b support and an optimized rotating cache (EricLBuehler#1206): support text-only Gemma 3, add a rotating KV cache that is not preallocated
* Improve the rotating KV cache and prefix-cacher system (EricLBuehler#1207): improve rotating-KV-cache set_len, add a more intelligent prefix cacher v2, remove prefix cacher v1
* Better handling for KV-cache set_len (EricLBuehler#1208); fix the Gemma 3 vision device in ISQ
* Update deps and use rand 0.9 (EricLBuehler#1210); fix the flash-attn v3 build
* Update the hf-hub dep so it no longer requires openssl, and add initial blockwise fp8 GEMM tests (EricLBuehler#1212)
* Growable RotatingKvCache and fixes for Phi-4 mini (EricLBuehler#1215): fix the causal mask, use a docker build for x86 pyo3 wheels, fix a CUDA warning
* Vision-model PagedAttention fixes (EricLBuehler#1217): Gemma 3 CUDA fixes, fix a PagedAttention bug, a small fix for the rotating cache
* Add pydantic schema examples (EricLBuehler#1219)
* Sliding window attention fixes (EricLBuehler#1220): fix SWA (initially without the prefix cache), handle multiple EOS tokens
* Adapt to the rig crate as a client (EricLBuehler#1214)
* Implement Mistral 3 (EricLBuehler#1221): add the vision model and language-model loading, implement the mmproj and patch merger, reshape patch embeds with image sizes, build the block attention mask, add input merging and the forward pass, the loader, and the inputs processor; support automatic device mapping and ISQ for the vision model; docs
* Fused Metal SDPA with masking (EricLBuehler#1225): materialize the mask, much faster quantization on Metal, fix CUDA
* Send the [DONE] SSE chunk per the OpenAI spec (EricLBuehler#1226)
* Fix device handling when compiled with NCCL but NCCL is disabled (EricLBuehler#1227); fix the NCCL blocking case (EricLBuehler#1228)
* Native Llama, Mistral Small 3.1, Mistral Nemo, Hermes 2 Pro, and Hermes 3 tool calling (EricLBuehler#1229): tool calling works for Llama models, Mistral 3, Mistral Nemo, and Hermes; update the server tool-calling example
* OpenAI API compatibility fixes (EricLBuehler#1230): content is optional, tool calls are only provided when non-empty, add response_format support, fix response_format handling and the json_schema.py example
* [Breaking] Automatic server logging (EricLBuehler#1231): add a configurable logger for the server; remove simple_tool_calling.py as deprecated
* Use the default stream for flash attention (EricLBuehler#1232); more accurate throughput logging
* Bump version to 0.5.0 (EricLBuehler#1233)
* Fix handling of Metal fused-attention head dims (EricLBuehler#1234): also fix Gemma 3 1b when images are passed, and tweak the PagedAttention builder default
* Support PagedAttention in the vision-model Rust API (EricLBuehler#1235)
* [Breaking] Support setting the HF cache path (EricLBuehler#1237), internally and in the APIs
* Support tool calling for DeepSeek models (EricLBuehler#1239)
* Server image-processing refactor and fixes (EricLBuehler#1244): fix the strict Gemma 3 case, accept multiple images in one content array, add the same to the Python API
* Optimized CUDA RoPE kernels (EricLBuehler#1247)
* Typo fix (add_speial_tokens to add_special_tokens) (EricLBuehler#1246); update mistralrs.pyi
* Fixes for UQFF + distributed layers (EricLBuehler#1250)
* Automatic agentic search integration via `web_search_options` (EricLBuehler#1243): add the search tool, add a customizable BERT model for similarity reranking, early detection of tools, alias max_tokens to max_completion_tokens too, flip the enabler around, tweak the 'system' prompt, add docs and update the readme
  Co-authored-by: Copilot <[email protected]>
* Format kernels (EricLBuehler#1251); readme updates; remove a test
* Add quantize guards for UQFF deserialization (EricLBuehler#1252)
* Refactor cuBLASlt-related code (EricLBuehler#1253): centralize cuBLASlt into mistralrs-quant, use cuBLASlt in the unquantized layer, use trait constants for simpler code, dispatch to unquant for cuBLASlt, move tests, fix a feature; add the convert_to_gptq script
* Update deps and bump the pyo3 version (EricLBuehler#1259)
* Faster CUDA FP8 performance (EricLBuehler#1257): avoid an fp8 sync, fix a dtype
* Rust 1.86 clippy fixes (EricLBuehler#1260)
* Refactor the engine architecture (EricLBuehler#1262): refactor engine add_request, don't recompile regexes
* Revamped LoRA support, removing the Ordering system (EricLBuehler#1263): merge LoRA weights, support multiple LoRAs, remove adapter activation
* Fast Metal-specific quantization method: AFQ (EricLBuehler#1264): add MLX quantized kernels and a kernel launcher, add AFQ ISQ quant and dequant, implement the qmm caller, cache kernels, add AFQ to UQFF, docs
* Support prequantized models from MLX (EricLBuehler#1265): refactor the quantized config, support AFQ-prequantized models, update docs
* Automatic ISQ to select the fastest and most accurate method (EricLBuehler#1266)
* Improved usage metrics (EricLBuehler#1267); fix CUDA
* Bump tokio from 1.44.1 to 1.44.2 (EricLBuehler#1270)
  Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.44.1 to 1.44.2 ([release notes](https://github.com/tokio-rs/tokio/releases), [commits](tokio-rs/tokio@tokio-1.44.1...tokio-1.44.2)). Dependency-name: tokio; dependency-version: 1.44.2; dependency-type: direct:production.
  Signed-off-by: dependabot[bot] <[email protected]>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Gather-MM ops in mistralrs-quant (EricLBuehler#1272): update the caller, wire things up, broadcast for AFQ gather-MM, improve DeepSeek model performance, remove the unused BincountOp
* Implement Llama 4 (EricLBuehler#1268): implement the main changes for the text model, build the chunked mask and chunked matrix, add some EP, all-reduce the MoEs, a faster MoE block, add the device map, reactivate cuBLASlt and fix the shared-MLP cuBLASlt path, refactor to packed experts, set the device for MoE, ISQ fixes, a much faster sort kernel, faster loading, fp8 CPU copy ops in the candle backend; add the vision model and mmproj layer, merge the inputs, implement the full image processor and the loader, support residual tensors, remove the text loader, implement the device-mapping system and fix the auto device map, add examples and a model card, remove superfluous logging
* Fixes for Llama 4 UQFF loading (EricLBuehler#1275)
* Support sharding for UQFF (EricLBuehler#1276): serialize and load sharded UQFF files, fix base64
* Fix a bug for group-topk (group_limited_greedy) in DeepSeek models (EricLBuehler#1278)
* Support the DeepCoder model (EricLBuehler#1279)
* Add an FAQ entry for "Metal not found"
* Updates from candle; fixes; relax the tokio requirement
* Make AdapterPaths and LoraAdapterPaths public

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: brrr <[email protected]>
Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: Etienne Balit <[email protected]>
Co-authored-by: benliao <[email protected]>
Co-authored-by: edwko <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Guoqing Bao <[email protected]>
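A few of the items above are easier to see in code than in a commit title. First, the Qwen 2.5 VL note that deterministic sampling needs top_k = Some(1) rather than None: passing None skips top-k truncation entirely, so sampling still draws from the whole distribution, while Some(1) keeps only the argmax token. A minimal Rust sketch of the distinction (not the mistral.rs sampler):

```rust
// Hypothetical sketch: why deterministic sampling wants top_k = Some(1)
// rather than None. With None, no truncation happens and every token keeps
// nonzero probability; Some(1) restricts the candidate set to the argmax.
fn top_k_filter(logits: &[f32], top_k: Option<usize>) -> Vec<(usize, f32)> {
    let mut indexed: Vec<(usize, f32)> = logits.iter().copied().enumerate().collect();
    // Sort descending by logit value.
    indexed.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    match top_k {
        Some(k) => indexed.into_iter().take(k).collect(), // Some(1) => argmax only
        None => indexed, // no truncation: sampling remains stochastic
    }
}

fn main() {
    let logits = [1.0_f32, 3.0, 2.0];
    assert_eq!(top_k_filter(&logits, Some(1)).len(), 1); // deterministic
    assert_eq!(top_k_filter(&logits, None).len(), 3);    // still stochastic
}
```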
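The rotating-KV-cache work (EricLBuehler#1206, EricLBuehler#1207, EricLBuehler#1215) revolves around a sliding-window buffer that overwrites its oldest entries once the window is full; "growable" in EricLBuehler#1215 means the window can also be enlarged instead of silently truncating. The toy sketch below shows only that rotation idea and assumes nothing about the real RotatingKvCache beyond its name:

```rust
// Toy rotating (sliding-window) cache, for illustration only.
struct RotatingCache<T> {
    buf: Vec<T>,
    capacity: usize,
    next: usize, // index of the slot the next append overwrites (the oldest slot)
}

impl<T: Clone> RotatingCache<T> {
    fn new(capacity: usize) -> Self {
        Self { buf: Vec::with_capacity(capacity), capacity, next: 0 }
    }

    fn append(&mut self, item: T) {
        if self.buf.len() < self.capacity {
            self.buf.push(item); // still filling the window
        } else {
            self.buf[self.next] = item; // rotate: overwrite the oldest slot
        }
        self.next = (self.next + 1) % self.capacity;
    }

    /// Entries in chronological order (oldest first).
    fn ordered(&self) -> Vec<T> {
        if self.buf.len() < self.capacity {
            self.buf.clone()
        } else {
            // Slots at `next..` were written before slots at `..next`.
            let (newest_part, oldest_part) = self.buf.split_at(self.next);
            oldest_part.iter().chain(newest_part.iter()).cloned().collect()
        }
    }
}

fn main() {
    let mut c = RotatingCache::new(3);
    for t in 0..5 {
        c.append(t);
    }
    assert_eq!(c.ordered(), vec![2, 3, 4]); // window keeps the most recent 3
}
```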
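For the server change in EricLBuehler#1226: under the OpenAI streaming spec, each chunk is framed as an SSE data: event, and the stream is terminated by a literal data: [DONE] sentinel. A minimal sketch of that framing, with placeholder payloads rather than the server's actual chunk JSON:

```rust
// Sketch of the SSE framing referenced by the "[DONE] SSE chunk" change.
// Payloads here are placeholders, not mistral.rs's real response schema.
fn sse_event(payload: &str) -> String {
    format!("data: {payload}\n\n")
}

fn main() {
    for chunk in [
        r#"{"choices":[{"delta":{"content":"Hel"}}]}"#,
        r#"{"choices":[{"delta":{"content":"lo"}}]}"#,
    ] {
        print!("{}", sse_event(chunk));
    }
    // Clients following the OpenAI spec stop reading at this sentinel.
    print!("{}", sse_event("[DONE]"));
}
```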
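And the max_tokens -> max_completion_tokens alias from EricLBuehler#1243 is the usual serde pattern. This sketch assumes the serde and serde_json crates; the struct and field names are illustrative, not mistral.rs's actual request types:

```rust
use serde::Deserialize;

// Accept the legacy OpenAI field name as well as the current one.
#[derive(Deserialize)]
struct CompletionParams {
    #[serde(default, alias = "max_tokens")]
    max_completion_tokens: Option<usize>,
}

fn main() {
    let old: CompletionParams = serde_json::from_str(r#"{"max_tokens": 128}"#).unwrap();
    let new: CompletionParams =
        serde_json::from_str(r#"{"max_completion_tokens": 128}"#).unwrap();
    assert_eq!(old.max_completion_tokens, new.max_completion_tokens);
}
```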
1 parent 808f39b commit 91a5ad7

File tree

319 files changed: +29402 -10676 lines


.github/workflows/build_cuda_all.yaml

Lines changed: 4 additions & 3 deletions
@@ -43,6 +43,7 @@ jobs:
            ${{ runner.os }}-buildx-

      - name: Inject slug/short variables
+       id: slug
        uses: rlespinasse/github-slug-action@…

      - name: Login to GitHub Container Registry
@@ -58,14 +59,14 @@
        uses: docker/metadata-action@v5
        with:
          images: |
-           ghcr.io/${{env.GITHUB_REPOSITORY_OWNER_PART}}/${{env.GITHUB_REPOSITORY_NAME_PART}}
+           ghcr.io/${{ github.repository_owner }}/$(basename ${{ github.repository }})
          flavor: |
            latest=false
          tags: |
            type=semver,pattern=cuda-${{matrix.compute_capability}}-{{version}}
            type=semver,pattern=cuda-${{matrix.compute_capability}}-{{major}}.{{minor}}
-           type=raw,value=cuda-${{matrix.compute_capability}}-latest,enable=${{ github.ref == format('refs/heads/{0}', github.event.repository.default_branch) }}
-           type=raw,value=cuda-${{matrix.compute_capability}}-sha-${{ env.GITHUB_SHA_SHORT }}
+           type=raw,value=cuda-${{matrix.compute_capability}}-sha-${{ steps.slug.outputs.short_sha }}
+           type=raw,value=cuda-${{matrix.compute_capability}}-sha-${{ github.sha }}

      - name: Build and push Docker image
        id: build-and-push-cuda
        uses: docker/build-push-action@v6

.github/workflows/ci.yml

Lines changed: 1 addition & 1 deletion
@@ -49,7 +49,7 @@ jobs:
          TESTS_HF_TOKEN: ${{ secrets.HF_TOKEN }}
        with:
          command: test
-         args: --workspace
+         args: -p mistralrs-core -p mistralrs-quant -p mistralrs-vision

  fmt:
    name: Rustfmt

.gitignore

Lines changed: 2 additions & 1 deletion
@@ -2,4 +2,5 @@
 .ruff_cache
 .vscode
 *.a
-.DS_Store
+.DS_Store
+.idea

.typos.toml

Lines changed: 4 additions & 1 deletion
@@ -6,7 +6,10 @@ extend-ignore-identifiers-re = [
     "Nd",
     "nin",
     "cudaDevAttrMaxSharedMemoryPerBlockOptin",
-    "_thw"
+    "_thw",
+    "thr",
+    "nd",
+    "uneeded"
 ]

 [files]
