Refactor NCCL device mappers #1172


Merged: 1 commit merged into master from refactor_nccl_device_map on Mar 5, 2025

Conversation

EricLBuehler
Owner

No description provided.

github-actions bot commented Mar 5, 2025

Code Metrics Report
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                2           35           28            0            7
 Dockerfile              1           41           22           10            9
 JSON                   12          105          104            0            1
 Python                 71         3026         2622           81          323
 Shell                   1           58           22           18           18
 Plain Text              3         3723            0         2413         1310
 TOML                   19          529          491            2           36
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       4            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          205          178            1           26
 (Total)                            282          210           32           40
-------------------------------------------------------------------------------
 Markdown               49         4044            0         3071          973
 |- BASH                 6          103          100            0            3
 |- JSON                 1           12           12            0            0
 |- Python               7          121          109            0           12
 |- Rust                16          549          464            0           85
 |- TOML                 2           75           63            0           12
 (Total)                           4904          748         3071         1085
-------------------------------------------------------------------------------
 Rust                  325       106732        95539         2135         9058
 |- Markdown           157         1783           25         1621          137
 (Total)                         108515        95564         3756         9195
===============================================================================
 Total                 489       118314        98847         7732        11735
===============================================================================

@EricLBuehler EricLBuehler merged commit b73e2e9 into master Mar 5, 2025
10 of 12 checks passed
@EricLBuehler EricLBuehler deleted the refactor_nccl_device_map branch March 5, 2025 01:36
Jeadie added a commit to spiceai/mistral.rs that referenced this pull request Apr 20, 2025
* Refactor NCCL device mappers (EricLBuehler#1172)

* Bump ring from 0.17.11 to 0.17.13 (EricLBuehler#1179)

Bumps [ring](https://github.com/briansmith/ring) from 0.17.11 to 0.17.13.
- [Changelog](https://github.com/briansmith/ring/blob/main/RELEASES.md)
- [Commits](https://github.com/briansmith/ring/commits)

---
updated-dependencies:
- dependency-name: ring
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* DSV3/R1 fixes (EricLBuehler#1173)

* DSv3 fixes

* Just save the progress

* Fix launch of blockwise fp8 dequant

* It actually works

* Async ops

* Optimize non-mla with cat

* Fix non-cuda build

* Update build

* Add more CUDA_CHECK

* Works really now

* Working fully now with pagedattn

* Format everything

* Fix diffusion device mapping (EricLBuehler#1187)

* Internal abstraction for distributed op (EricLBuehler#1188)

* Make Sequence::set_toks more safe (EricLBuehler#1190)

* Fix CI tests out of storage (EricLBuehler#1191)

* Internal abstraction for distributed op (EricLBuehler#1189)

* Fix build_cuda_all.yaml CI (EricLBuehler#1193)

* Support tensor parallelism for vision models! (EricLBuehler#1194)

* Refactor distributed mapper prep

* Support vision model TP

* Update docs

* Add vision model TP for mllama

* Always pass _USE_MATH_DEFINES for CUDA (EricLBuehler#1195)

* Always pass _USE_MATH_DEFINES

* Cargo.lock

* Remove matmul via f16 framework (EricLBuehler#1196)

* Remove API for matmul_via_f16 (EricLBuehler#1197)

* Add UQFF text/vision model API (EricLBuehler#1198)

* Add UQFF text/vision model API

* Typos

* Implement Qwen 2.5 VL! (EricLBuehler#1184)

* Implement Qwen 2.5 VL

* Reverse window index select

* Switch to rmsnorm

* Warn

* Fix config, loads now

* Fixes

* Complete qwen2_5vl feature

Todo: set_use_matmul_via_f16(true) from "pipeline/inputs_processor" causes a significant loss of precision that is hard to track down during subsequent debugging.
In any case, globally setting matmul precision may not be an ideal solution.
For now, change the precision back in mistralrs-core/src/vision_models/qwen2_5_vl/inputs_processor.rs

Qwen2_5vl feature is functional; start cleaning up the code

Add examples for lower_level_qwen2_5vl

Fix: for deterministic sampling, top_k SHOULD be Some(1) rather than None (see the sketch after this entry)

Clean code

Rebase

Clean code

Fix cuda

* Fix Rustfmt and Clippy issues

* Clean code

* Merge branch 'main'

---------

Co-authored-by: Eric Buehler <[email protected]>
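
A minimal sketch of why that sampling fix matters, assuming a plain logits slice (names here are illustrative, not the mistral.rs sampler API): with top_k = Some(1), only the argmax token survives truncation, so decoding is deterministic, whereas top_k = None applies no truncation and leaves sampling stochastic.

```rust
// Hypothetical sketch: top_k = Some(1) reduces sampling to argmax.
fn sample_token(logits: &[f32], top_k: Option<usize>) -> usize {
    match top_k {
        // Only the highest-logit token survives: deterministic.
        Some(1) => (0..logits.len())
            .max_by(|&i, &j| logits[i].total_cmp(&logits[j]))
            .expect("non-empty logits"),
        // None (or k > 1) keeps multiple candidates: stochastic sampling.
        _ => unimplemented!("multinomial sampling over remaining candidates"),
    }
}
```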

* Implement Gemma 3 (text only)! (EricLBuehler#1201)

* Add config

* Add the text model

* Add inputs processor, loads/runs now

* It works!

* Add to APIs

* Implement Gemma 3 vision support! (EricLBuehler#1202)

* Add vision support for Gemma 3

* Implement image preprocessor and processor

* It works, kind of

* It works great

* Mask must be contiguous

* Update docs

* Format

* Manually fixup sentencepiece detok (EricLBuehler#1204)

* More vision models with TP (EricLBuehler#1200)

* More models for tp

* Fix clippy

* Fix topology link in the docs (EricLBuehler#1205)

* Gemma3 1b support and optimized rotating cache (EricLBuehler#1206)

* Support text-only gemma3

* Add rotating kv cache

* Do not preallocate rotating kv cache

* Improve rotating kv cache, prefix cacher system (EricLBuehler#1207)

* Improve rotating kv cache set_len and more intelligent prefix cacher v2

* Remove prefix cacher v1
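
The rotating-cache entries above are easiest to picture as a ring buffer over the sliding window. A minimal sketch, assuming one generic entry per token (the real RotatingKvCache stores tensors and also handles set_len and growth):

```rust
// Illustrative ring buffer for a sliding-window KV cache.
struct RotatingCache<T> {
    buf: Vec<T>,
    capacity: usize, // sliding-window size
    next: usize,     // absolute index of the next token
}

impl<T> RotatingCache<T> {
    fn new(capacity: usize) -> Self {
        Self { buf: Vec::with_capacity(capacity), capacity, next: 0 }
    }

    // Append one token's KV entry, overwriting the oldest once full.
    fn push(&mut self, kv: T) {
        if self.buf.len() < self.capacity {
            self.buf.push(kv); // still growing toward the window size
        } else {
            self.buf[self.next % self.capacity] = kv; // rotate over the oldest
        }
        self.next += 1;
    }
}
```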

* Better handling for kvcache set_len (EricLBuehler#1208)

* Fix gemma3 vision device in isq

* Update deps and use rand 0.9 (EricLBuehler#1210)

* Fix flash-attn v3 build

* Update hf hub dep, add initial blockwise fp8 GEMM tests (EricLBuehler#1212)

* Update hf_hub dep to not require openssl and add tests

* Update deps

* Fixes

* Undo 'fix' from clippy

* Ok maybe finally fix it

* Growable RotatingKvCache and fixes for Phi-4 mini (EricLBuehler#1215)

* Fixes for phi4 mini

* Fix causal mask

* Growable rotating kv cache

* Fix clippy

* Use docker build for x86 pyo3 wheels

* Fix cuda warn

* Vision model pagedattn fixes (EricLBuehler#1217)

* Gemma 3 cuda fixes

* Fix pagedattn bug

* Clippy

* Small fix for rotating cache?

* Add pydantic schema examples! (EricLBuehler#1219)

* Sliding window attention fixes (EricLBuehler#1220)

* Initial fixes for sliding window

* Fix swa, still without prefix cache

* Ok finally it works

* Handle multiple eos toks
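
Handling multiple EOS tokens amounts to checking membership in a set rather than equality with a single id. A hedged sketch (the token ids below are made up):

```rust
use std::collections::HashSet;

// Stop generation when the sampled token is any of the EOS ids.
fn should_stop(token: u32, eos_toks: &HashSet<u32>) -> bool {
    eos_toks.contains(&token)
}

fn main() {
    // e.g. a chat model with distinct end-of-turn and end-of-text ids
    let eos: HashSet<u32> = [2, 32007].into_iter().collect();
    assert!(should_stop(32007, &eos));
    assert!(!should_stop(17, &eos));
}
```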

* adapt to rig crate as client (EricLBuehler#1214)

* adapt to rig crate as client

* adapt to rig crate as client

* Implement Mistral 3! (EricLBuehler#1221)

* Add vision model and load language model

* Implement the mmproj and patch merger!

* Remove plot

* Reshaping patch embeds with image sizes, make block attn mask

* Add the inputs merging and forward

* Basic loader, a bunch of todos still

* Add the inputs processor

* Clippy

* Some fixes

* It works!

* Implement for the automatic device mapping

* ISQ support for the vision model too

* Docs

* Fused Metal SDPA with masking! (EricLBuehler#1225)

* Metal SDPA with masking

* Much faster quantization on metal!

* Check if actually metal

* Materialize the mask

* Fix cuda

* Format

* Send [DONE] SSE chunk per openai spec (EricLBuehler#1226)
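
Per the OpenAI streaming spec, each chunk is a server-sent `data:` event carrying JSON, and the stream must end with the literal sentinel `data: [DONE]`. A minimal sketch of the wire framing only, not the server code:

```rust
// One SSE frame per streamed completion chunk.
fn sse_frame(json_chunk: &str) -> String {
    format!("data: {json_chunk}\n\n")
}

// Terminal sentinel required by OpenAI-compatible clients.
fn sse_done() -> &'static str {
    "data: [DONE]\n\n"
}
```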

* Fix handling of device when compiled for but disabled nccl (EricLBuehler#1227)

* Fix nccl blocking case (EricLBuehler#1228)

* Native Llama, Mistral Small 3.1, Mistral Nemo, Hermes 2 Pro, Hermes 3 tool calling! (EricLBuehler#1229)

* Llama model tool calling support

* Llama tool calling works

* Nice tool calling support

* Tool calling working with Mistral 3

* Support hermes

* Mistral nemo support

* Update server tool calling example
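
The native tool calling above is exercised through the OpenAI `tools` schema. A hedged request sketch (model name and function are illustrative, not taken from the repo):

```rust
use serde_json::json;

// OpenAI-style chat request declaring one callable function.
fn tool_request() -> serde_json::Value {
    json!({
        "model": "default",
        "messages": [{ "role": "user", "content": "Weather in Boston?" }],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": { "city": { "type": "string" } },
                    "required": ["city"]
                }
            }
        }]
    })
}
```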

* OpenAI API compatibility fixes (EricLBuehler#1230)

* Content itself is optional

* Only provide tool calls if they are not empty

* Add response_format support (see the request sketch after this list)

* Fix response-format

* Fix json_schema.py example
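
A hedged sketch of the response_format usage referenced above (field names follow the OpenAI chat-completions schema; the exact variants mistral.rs accepts may differ):

```rust
use serde_json::json;

// Constrain the model's output to valid JSON.
fn request_body() -> serde_json::Value {
    json!({
        "model": "default",
        "messages": [{ "role": "user", "content": "Give me a JSON object." }],
        "response_format": { "type": "json_object" }
    })
}
```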

* [Breaking] Automatic server logging (EricLBuehler#1231)

* Add logger for server

* Clippy

* Tweak

* Configurable

* Format

* Remove simple_tool_calling.py as deprecated

* Use default stream for flash attn (EricLBuehler#1232)

* More accurate throughput logging

* Bump version to 0.5.0 (EricLBuehler#1233)

* Fix handling of Metal fused attn head dims (EricLBuehler#1234)

* Fix handling of metal attn head dims

* Fix handling of gemma3 1b when images

* Tweak default for paged attn builder

* Support paged attn for vision model rust api (EricLBuehler#1235)

* [Breaking] Support setting HF cache path (EricLBuehler#1237); see the sketch after this entry

* Add it internally

* Add the apis
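
For context on the cache-path entry above: the Hugging Face ecosystem conventionally resolves the hub cache from the HF_HUB_CACHE environment variable, falling back to ~/.cache/huggingface/hub. A sketch of that convention only; the new mistral.rs API surface is not reproduced here:

```rust
use std::env;
use std::path::PathBuf;

// Resolve the HF hub cache directory the conventional way.
fn hf_hub_cache() -> PathBuf {
    env::var("HF_HUB_CACHE").map(PathBuf::from).unwrap_or_else(|_| {
        let home = env::var("HOME").unwrap_or_default();
        PathBuf::from(home).join(".cache/huggingface/hub")
    })
}
```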

* Support tool calling for DeepSeek models (EricLBuehler#1239)

* Support tool calling for deepseek models

* Format

* Fix deepseek

* Server image processing refactor and fixes (EricLBuehler#1244)

* Fix strict gemma3 case

* Accept multiple images in the content array

* Fix multiple images in one array ct

* Add it to the python api

* Typos

* Optimized CUDA RoPE kernels (EricLBuehler#1247)

* Add the kernels

* It works

* Works

* Builds

* Typo fix (add_speial_tokens to add_special_tokens) (EricLBuehler#1246)

* Fix typo

* Update mistralrs.pyi

* Fixes for UQFF + distributed layers (EricLBuehler#1250)

* Fixes for uqff + distributed layers

* Typo

* Automatic agentic search integration (`web_search_options`) (EricLBuehler#1243); see the request sketch after this entry

* Add the tool

* Actually search

* Clippy

* Sort of works

* Remove some debuggers

* tweak

* Add some rules

* Works great

* Tweak 'system' prompt

* Update mistralrs-core/src/search/mod.rs

Co-authored-by: Copilot <[email protected]>

* Typo

* Add it to all the apis

* Add bert model for similarity reranking

* Typos

* Early detection of tools

* Alias max_tokens -> max_completion_tokens too

* Customizable bert model

* Flip the enabler around

* Add docs

* Update readme

* Typo

---------

Co-authored-by: Copilot <[email protected]>
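
A hedged request sketch for the `web_search_options` field added in this entry (the field name is from the PR title; using an empty object to enable default search behavior is purely an assumption, and sub-options are not shown):

```rust
use serde_json::json;

// Chat request opting in to agentic web search.
fn search_request() -> serde_json::Value {
    json!({
        "model": "default",
        "messages": [{ "role": "user", "content": "What's new in mistral.rs?" }],
        "web_search_options": {}
    })
}
```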

* Format kernels (EricLBuehler#1251)

* Update readme

* Update readme

* Remove test

* Add quantize guards for uqff deserialize (EricLBuehler#1252)

* Refactor cuBLASlt-related code (EricLBuehler#1253)

* Centralize cublaslt into mistralrs-quant

* Use cublaslt in unquant layer

* Use beautiful trait constants for simpler code

* Move tests

* Dispatch to unquant for cublaslt

* Dispatch to unquant for cublaslt

* Fix feature

* Add convert_to_gptq script

* Update deps, bump pyo3 version (EricLBuehler#1259)

* Faster cuda FP8 performance (EricLBuehler#1257)

* Avoid fp8 sync

* Fix dtype

* Rust 1.86 clippy (EricLBuehler#1260)

* Rust 1.86 clippy

* Clippy

* Refactor engine arch (EricLBuehler#1262)

* Refactor engine add_request

* Don't recompile regex

* Clippy

* Revamped LoRA support - removing the Ordering system! (EricLBuehler#1263)

* Play with varbuilder lifetimes

* Merge lora weights

* Clippy

* Lora works

* Support multiple loras

* Cleanup, remove adapter activation

* Complete merge

* Fast Metal-specific quantization method: AFQ (EricLBuehler#1264)

* Add mlx quantized kernels

* Add mlx quantized kernels

* Kernel launcher

* Add AFQ isq quant and dequant

* Some quantmethod things

* Begin to implement the qmm caller

* Clippy

* Much faster

* Cache kernels

* Docs

* Clippy

* Add it to uqff

* Support prequantized models from MLX (EricLBuehler#1265)

* Refactor quantizedconfig

* Support AFQ prequantized

* Update docs

* Update docs

* Automatic ISQ to select fastest & most accurate method (EricLBuehler#1266)

* Automatic isq

* typo

* Doc

* Improved usage metrics (EricLBuehler#1267)

* Fix cuda

* Bump tokio from 1.44.1 to 1.44.2 (EricLBuehler#1270)

Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.44.1 to 1.44.2.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](tokio-rs/tokio@tokio-1.44.1...tokio-1.44.2)

---
updated-dependencies:
- dependency-name: tokio
  dependency-version: 1.44.2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Gather MM ops in mistralrs-quant (EricLBuehler#1272)

* Update the caller

* Wire things up

* Broadcast for afq gathermm

* Broadcast for afq gathermm

* Clippy

* Improve performance of deepseek models

* Typo fix

* BincountOp not used

* Implement Llama 4! (EricLBuehler#1268)

* Implement Llama 4

* Implement the main changes for the text model

* Make chunked mask (see the mask sketch after this entry)

* Wire things up

* Add some EP

* Initial sketch of inputs processor

* Runs

* Progress

* all reduce moes

* It works!

* Some cleanup

* Faster moe block

* Add device map

* Make chunked matrix

* Fully working now!

* Reactivate cublaslt

* Fix shared mlp cublaslt

* Refactor to packed experts

* Complete merge

* It is a normal model now

* Fixes

* Set device for moe

* ISQ fixes

* Much faster sort kernel

* Faster loading!

* Faster loading!

* Fp8 cpu copy ops in candle backend

* Add the vision model

* Add mmproj layer

* Actually merge the inputs

* Sketch most of the image processor

* Add the rest of the image processor

* Implement the whole processor

* Add the loader

* Some fixes

* A batch of fixes

* Some fixes

* tmp

* Actually support isq

* Ok it works a bit

* Fix norm device

* It works

* A bit cleaner

* Support residual tensors

* Remove text loader

* Implement the device mapping system

* Fix auto device map

* Add examples

* Add model card

* Typo

* Remove superfluous logging
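
The "chunked mask" referenced earlier in this entry can be pictured as a causal mask restricted to fixed-size chunks: token i may attend to token j only if j <= i and both indices fall in the same chunk. Illustrative only, assuming a boolean mask (the real implementation builds this as a tensor):

```rust
// Boolean chunked-causal mask: true means "may attend".
fn chunked_causal_mask(seq_len: usize, chunk: usize) -> Vec<Vec<bool>> {
    (0..seq_len)
        .map(|i| (0..seq_len).map(|j| j <= i && i / chunk == j / chunk).collect())
        .collect()
}
```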

* Fixes for Llama 4 UQFF loading (EricLBuehler#1275)

* Support sharding for UQFF (EricLBuehler#1276)

* Serialize sharded uqff files

* Loading

* Fix base64

* Fix bug for group-topk (group_limited_greedy) in deepseek models (EricLBuehler#1278)

* Support the DeepCoder model (EricLBuehler#1279)

* Add faq for metal not found

* Updates from candle

* Fixes

* Relax tokio

* Make AdapterPaths, LoraAdapterPaths public

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: brrr <[email protected]>
Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: Etienne Balit <[email protected]>
Co-authored-by: benliao <[email protected]>
Co-authored-by: edwko <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Guoqing Bao <[email protected]>