Update deps and use rand 0.9 #1210
Merged
Conversation
Code Metrics Report
===============================================================================
 Language             Files     Lines      Code   Comments    Blanks
===============================================================================
 C Header                 2        34        29          0         5
 Dockerfile               1        41        22         10         9
 JSON                    12       105       104          0         1
 Makefile                 1         6         5          0         1
 Python                  73      3126      2710         85       331
 Shell                    1        58        22         18        18
 Plain Text               3      3723         0       2413      1310
 TOML                    19       531       492          2        37
 YAML                     2        21        19          2         0
-------------------------------------------------------------------------------
 Jupyter Notebooks        4         0         0          0         0
 |- Markdown              2        77        32         31        14
 |- Python                2       205       178          1        26
 (Total)                          282       210         32        40
-------------------------------------------------------------------------------
 Markdown                50      4205         0       3196      1009
 |- BASH                  6       103       100          0         3
 |- JSON                  1        12        12          0         0
 |- Python                7       121       109          0        12
 |- Rust                 17       586       495          0        91
 |- TOML                  2        75        63          0        12
 (Total)                         5102       779       3196      1127
-------------------------------------------------------------------------------
 Rust                   339    112404    100684       2173      9547
 |- Markdown            158      1808        25       1642       141
 (Total)                       114212    100709       3815      9688
===============================================================================
 Total                  507    124254    104087       7899     12268
===============================================================================
Jeadie added a commit to spiceai/mistral.rs that referenced this pull request on Apr 20, 2025
* Refactor NCCL device mappers (EricLBuehler#1172)
* Bump ring from 0.17.11 to 0.17.13 (EricLBuehler#1179)
  Bumps [ring](https://github.com/briansmith/ring) from 0.17.11 to 0.17.13.
  - [Changelog](https://github.com/briansmith/ring/blob/main/RELEASES.md)
  - [Commits](https://github.com/briansmith/ring/commits)
  ---
  updated-dependencies:
  - dependency-name: ring
    dependency-type: indirect
  ...
  Signed-off-by: dependabot[bot] <[email protected]>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* DSV3/R1 fixes (EricLBuehler#1173)
* DSv3 fixes
* Just save the progress
* Fix launch of blockwise fp8 dequant
* It actually works
* Async ops
* Optimize non-mla with cat
* Fix non-cuda build
* Update build
* Add more CUDA_CHECK
* Works really now
* Working fully now with pagedattn
* Format everything
* Fix diffusion device mapping (EricLBuehler#1187)
* Internal abstraction for distributed op (EricLBuehler#1188)
* Make Sequence::set_toks safer (EricLBuehler#1190)
* Fix CI tests out of storage (EricLBuehler#1191)
* Internal abstraction for distributed op (EricLBuehler#1189)
* Fix build_cuda_all.yaml CI (EricLBuehler#1193)
* Support tensor parallelism for vision models! (EricLBuehler#1194)
* Refactor distributed mapper prep
* Support vision model TP
* Update docs
* Add vision model TP for mllama
* Always pass _USE_MATH_DEFINES for CUDA (EricLBuehler#1195)
* Always pass _USE_MATH_DEFINES
* Cargo.lock
* Remove matmul via f16 framework (EricLBuehler#1196)
* Remove API for matmul_via_f16 (EricLBuehler#1197)
* Add UQFF text/vision model API (EricLBuehler#1198)
* Add UQFF text/vision model API
* Typos
* Implement Qwen 2.5 VL! (EricLBuehler#1184)
* Implement Qwen 2.5 VL
* Reverse window index select
* Switch to rmsnorm
* Warn
* Fix config, loads now
* Fixes
* Complete qwen2_5vl feature
  Todo: set_use_matmul_via_f16(true) from "pipeline/inputs_processor" causes a significant loss of precision that is hard to diagnose during subsequent debugging. In any case, globally setting matmul precision may not be an ideal solution. For now, change the precision back in mistralrs-core/src/vision_models/qwen2_5_vl/inputs_processor.rs.
  The qwen2_5vl feature is functional; start to clean code.
  Add examples for lower_level_qwen2_5vl.
  Fix: for deterministic sampling, top_k should be Some(1) rather than None.
  Clean code. Rebase. Clean code. Fix cuda.
* Fix Rustfmt and Clippy issues
* Clean code
* Merge branch ‘main’
  ---------
  Co-authored-by: Eric Buehler <[email protected]>
* Implement Gemma 3 (text only)! (EricLBuehler#1201)
* Add config
* Add the text model
* Add inputs processor, loads/runs now
* It works!
* Add to APIs
* Implement Gemma 3 vision support! (EricLBuehler#1202)
* Add vision support for Gemma 3
* Implement image preprocessor and processor
* It works, kind of
* It works great
* Mask must be contiguous
* Update docs
* Format
* Manually fixup sentencepiece detok (EricLBuehler#1204)
* More vision models with TP (EricLBuehler#1200)
* More models for tp
* Fix clippy
* Fix topology link in the docs (EricLBuehler#1205)
* Gemma3 1b support and optimized rotating cache (EricLBuehler#1206)
* Support text-only gemma3
* Add rotating kv cache
* Do not preallocate rotating kv cache
* Improve rotating kv cache, prefix cacher system (EricLBuehler#1207)
* Improve rotating kv cache set_len and more intelligent prefix cacher v2
* Remove prefix cacher v1
* Better handling for kvcache set_len (EricLBuehler#1208)
* Fix gemma3 vision device in isq
* Update deps and use rand 0.9 (EricLBuehler#1210)
* Fix flash-attn v3 build
* Update hf hub dep, add initial blockwise fp8 GEMM tests (EricLBuehler#1212)
* Update hf_hub dep to not require openssl and add tests
* Update deps
* Fixes
* Undo 'fix' from clippy
* Ok maybe finally fix it
* Growable RotatingKvCache and fixes for Phi-4 mini (EricLBuehler#1215)
* Fixes for phi4 mini
* Fix causal mask
* Growable rotating kv cache
* Fix clippy
* Use docker build for x86 pyo3 wheels
* Fix cuda warn
* Vision model pagedattn fixes (EricLBuehler#1217)
* Gemma 3 cuda fixes
* Fix pagedattn bug
* Clippy
* Small fix for rotating cache?
* Add pydantic schema examples! (EricLBuehler#1219)
* Sliding window attention fixes (EricLBuehler#1220)
* Initial fixes for sliding window
* Fix swa, still without prefix cache
* Ok finally it works
* Handle multiple eos toks
* adapt to rig crate as client (EricLBuehler#1214)
* adapt to rig crate as client
* adapt to rig crate as client
* Implement Mistral 3! (EricLBuehler#1221)
* Add vision model and load language model
* Implement the mmproj and patch merger!
* Remove plot
* Reshaping patch embeds with image sizes, make block attn mask
* Add the inputs merging and forward
* Basic loader, a bunch of todos still
* Add the inputs processor
* Clippy
* Some fixes
* It works!
* Implement for the automatic device mapping
* ISQ support for the vision model too
* Docs
* Fused Metal SDPA with masking! (EricLBuehler#1225)
* Metal SDPA with masking
* Much faster quantization on metal!
* Check if actually metal
* Materialize the mask
* Fix cuda
* Format
* Send [DONE] SSE chunk per openai spec (EricLBuehler#1226)
* Fix handling of device when compiled for but disabled nccl (EricLBuehler#1227)
* Fix nccl blocking case (EricLBuehler#1228)
* Native Llama, Mistral Small 3.1, Mistral Nemo, Hermes 2 Pro, Hermes 3 tool calling! (EricLBuehler#1229)
* Llama model tool calling support
* Llama tool calling works
* Nice tool calling support
* Tool calling working with Mistral 3
* Support hermes
* Mistral nemo support
* Update server tool calling example
* OpenAI API compatibility fixes (EricLBuehler#1230)
* Content itself is optional
* Only provide tool calls if they are not empty
* Add response_format support
* Fix response-format
* Fix json_schema.py example
* [Breaking] Automatic server logging (EricLBuehler#1231)
* Add logger for server
* Clippy
* Tweak
* Configurable
* Format
* Remove simple_tool_calling.py as deprecated
* Use default stream for flash attn (EricLBuehler#1232)
* More accurate throughput logging
* Bump version to 0.5.0 (EricLBuehler#1233)
* Fix handling of Metal fused attn head dims (EricLBuehler#1234)
* Fix handling of metal attn head dims
* Fix handling of gemma3 1b when images
* Tweak default for paged attn builder
* Support paged attn for vision model rust api (EricLBuehler#1235)
* [Breaking] Support setting HF cache path (EricLBuehler#1237)
* Add it internally
* Add the apis
* Support tool calling for DeepSeek models (EricLBuehler#1239)
* Support tool calling for deepseek models
* Format
* Fix deepseek
* Server image processing refactor and fixes (EricLBuehler#1244)
* Fix strict gemma3 case
* Accept multiple images in the content array
* Fix multiple images in one array ct
* Add it to the python api
* Typos
* Optimized CUDA RoPE kernels (EricLBuehler#1247)
* Add the kernels
* It works
* Works
* Builds
* Typo fix (add_speial_tokens to add_special_tokens) (EricLBuehler#1246)
* Fix typo
* Update mistralrs.pyi
* Fixes for UQFF + distributed layers (EricLBuehler#1250)
* Fixes for uqff + distributed layers
* Typo
* Automatic agentic search integration (`web_search_options`) (EricLBuehler#1243)
* Add the tool
* Actually search
* Clippy
* Sort of works
* Remove some debuggers
* tweak
* Add some rules
* Works great
* Tweak 'system' prompt
* Update mistralrs-core/src/search/mod.rs
  Co-authored-by: Copilot <[email protected]>
* Typo
* Add it to all the apis
* Add bert model for similarity reranking
* Typos
* Early detection of tools
* Alias max_tokens -> max_completion_tokens too
* Customizable bert model
* Flip the enabler around
* Add docs
* Update readme
* Typo
  ---------
  Co-authored-by: Copilot <[email protected]>
* Format kernels (EricLBuehler#1251)
* Update readme
* Update readme
* Remove test
* Add quantize guards for uqff deserialize (EricLBuehler#1252)
* Refactor cuBLASlt-related code (EricLBuehler#1253)
* Centralize cublaslt into mistralrs-quant
* Use cublaslt in unquant layer
* Use beautiful trait constants for simpler code
* Move tests
* Dispatch to unquant for cublaslt
* Dispatch to unquant for cublaslt
* Fix feature
* Add convert_to_gptq script
* Update deps, bump pyo3 version (EricLBuehler#1259)
* Faster cuda FP8 performance (EricLBuehler#1257)
* Avoid fp8 sync
* Fix dtype
* Rust 1.86 clippy (EricLBuehler#1260)
* Rust 1.86 clippy
* Clippy
* Refactor engine arch (EricLBuehler#1262)
* Refactor engine add_request
* Don't recompile regex
* Clippy
* Revamped LoRA support - removing the Ordering system! (EricLBuehler#1263)
* Play with varbuilder lifetimes
* Merge lora weights
* Clippy
* Lora works
* Support multiple loras
* Cleanup, remove adapter activation
* Complete merge
* Fast Metal-specific quantization method: AFQ (EricLBuehler#1264)
* Add mlx quantized kernels
* Add mlx quantized kernels
* Kernel launcher
* Add AFQ isq quant and dequant
* Some quantmethod things
* Begin to implement the qmm caller
* Clippy
* Much faster
* Cache kernels
* Docs
* Clippy
* Add it to uqff
* Support prequantized models from MLX (EricLBuehler#1265)
* Refactor quantizedconfig
* Support AFQ prequantized
* Update docs
* Update docs
* Automatic ISQ to select fastest & most accurate method (EricLBuehler#1266)
* Automatic isq
* typo
* Doc
* Improved usage metrics (EricLBuehler#1267)
* Fix cuda
* Bump tokio from 1.44.1 to 1.44.2 (EricLBuehler#1270)
  Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.44.1 to 1.44.2.
  - [Release notes](https://github.com/tokio-rs/tokio/releases)
  - [Commits](tokio-rs/tokio@tokio-1.44.1...tokio-1.44.2)
  ---
  updated-dependencies:
  - dependency-name: tokio
    dependency-version: 1.44.2
    dependency-type: direct:production
  ...
  Signed-off-by: dependabot[bot] <[email protected]>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Gather MM ops in mistralrs-quant (EricLBuehler#1272)
* Update the caller
* Wire things up
* Broadcast for afq gathermm
* Broadcast for afq gathermm
* Clippy
* Improve performance of deepseek models
* Typo fix
* BincountOp not used
* Implement Llama 4! (EricLBuehler#1268)
* Implement Llama 4
* Implement the main changes for the text model
* Make chunked mask
* Wire things up
* Add some EP
* Initial sketch of inputs processor
* Runs
* Progress
* all reduce moes
* It works!
* Some cleanup
* Faster moe block
* Add device map
* Make chunked matrix
* Fully working now!
* Reactivate cublaslt
* Fix shared mlp cublaslt
* Refactor to packed experts
* Complete merge
* It is a normal model now
* Fixes
* Set device for moe
* ISQ fixes
* Much faster sort kernel
* Faster loading!
* Faster loading!
* Fp8 cpu copy ops in candle backend
* Add the vision model
* Add mmproj layer
* Actually merge the inputs
* Sketch most of the image processor
* Add the rest of the image processor
* Implement the whole processor
* Add the loader
* Some fixes
* A batch of fixes
* Some fixes
* tmp
* Actually support isq
* Ok it works a bit
* Fix norm device
* It works
* A bit cleaner
* Support residual tensors
* Remove text loader
* Implement the device mapping system
* Fix auto device map
* Add examples
* Add model card
* Typo
* Remove superfluous logging
* Fixes for Llama 4 UQFF loading (EricLBuehler#1275)
* Support sharding for UQFF (EricLBuehler#1276)
* Serialize sharded uqff files
* Loading
* Fix base64
* Fix bug for group-topk (group_limited_greedy) in deepseek models (EricLBuehler#1278)
* Support the DeepCoder model (EricLBuehler#1279)
* Add faq for metal not found
* updates from candle
* fixes
* relax tokio
* make AdapterPaths, LoraAdapterPaths public

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: brrr <[email protected]>
Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: Etienne Balit <[email protected]>
Co-authored-by: benliao <[email protected]>
Co-authored-by: edwko <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Guoqing Bao <[email protected]>
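One item in the list above, "Send [DONE] SSE chunk per openai spec" (EricLBuehler#1226), refers to the OpenAI streaming convention: each response chunk is delivered as a server-sent-events `data:` frame, and the stream is terminated by the literal sentinel `data: [DONE]`. Below is a minimal sketch of that wire format only; it is a standalone illustration, not the crate's actual server code.

```rust
// Sketch of OpenAI-style SSE framing. The chunk payloads are hypothetical
// examples shaped like chat-completion deltas; only the `data: ...\n\n`
// framing and the final `[DONE]` sentinel are part of the OpenAI convention.
fn sse_frame(payload: &str) -> String {
    format!("data: {payload}\n\n")
}

fn main() {
    let chunks = [
        r#"{"choices":[{"delta":{"content":"Hello"}}]}"#,
        r#"{"choices":[{"delta":{"content":" world"}}]}"#,
    ];
    for chunk in &chunks {
        print!("{}", sse_frame(chunk));
    }
    // Per the OpenAI spec, the terminating event is the plain string [DONE],
    // not a JSON object; clients use it to detect end-of-stream.
    print!("{}", sse_frame("[DONE]"));
}
```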
No description provided.
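Since the PR carries no description, here is a hedged before/after sketch of the call-site renames a rand 0.8 -> 0.9 bump typically involves. The renames themselves are documented in the rand 0.9 release notes; the surrounding code is illustrative and not taken from this PR's diff.

```rust
// Minimal sketch of common rand 0.8 -> 0.9 migrations, assuming rand = "0.9".
use rand::Rng;

fn main() {
    // rand 0.8: let mut rng = rand::thread_rng();
    let mut rng = rand::rng(); // 0.9 deprecates `thread_rng()` in favor of `rng()`

    // rand 0.8: let x: f32 = rng.gen();
    let x: f32 = rng.random(); // `gen` renamed (`gen` is a keyword in Rust 2024)

    // rand 0.8: let k = rng.gen_range(0..10);
    let k = rng.random_range(0..10); // `gen_range` -> `random_range`

    println!("x = {x}, k = {k}");
}
```

A dependency bump like this may also touch other call sites, for example distribution imports moving from `rand::distributions` to `rand::distr`, which is likewise a rename documented in the 0.9 changelog.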