Refactor NCCL device mappers #1172


Merged: 1 commit merged into master from refactor_nccl_device_map on Mar 5, 2025

Conversation

EricLBuehler
Owner

No description provided.

github-actions bot commented Mar 5, 2025

Code Metrics Report
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                2           35           28            0            7
 Dockerfile              1           41           22           10            9
 JSON                   12          105          104            0            1
 Python                 71         3026         2622           81          323
 Shell                   1           58           22           18           18
 Plain Text              3         3723            0         2413         1310
 TOML                   19          529          491            2           36
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       4            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          205          178            1           26
 (Total)                            282          210           32           40
-------------------------------------------------------------------------------
 Markdown               49         4044            0         3071          973
 |- BASH                 6          103          100            0            3
 |- JSON                 1           12           12            0            0
 |- Python               7          121          109            0           12
 |- Rust                16          549          464            0           85
 |- TOML                 2           75           63            0           12
 (Total)                           4904          748         3071         1085
-------------------------------------------------------------------------------
 Rust                  325       106732        95539         2135         9058
 |- Markdown           157         1783           25         1621          137
 (Total)                         108515        95564         3756         9195
===============================================================================
 Total                 489       118314        98847         7732        11735
===============================================================================

@EricLBuehler EricLBuehler merged commit b73e2e9 into master Mar 5, 2025
10 of 12 checks passed
@EricLBuehler EricLBuehler deleted the refactor_nccl_device_map branch March 5, 2025 01:36
Jeadie added a commit to spiceai/mistral.rs that referenced this pull request Apr 20, 2025
* Refactor NCCL device mappers (EricLBuehler#1172)

* Bump ring from 0.17.11 to 0.17.13 (EricLBuehler#1179)

Bumps [ring](https://github.com/briansmith/ring) from 0.17.11 to 0.17.13.
- [Changelog](https://github.com/briansmith/ring/blob/main/RELEASES.md)
- [Commits](https://github.com/briansmith/ring/commits)

---
updated-dependencies:
- dependency-name: ring
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* DSV3/R1 fixes (EricLBuehler#1173)

* DSv3 fixes

* Just save the progress

* Fix launch of blockwise fp8 dequant

* It actually works

* Async ops

* Optimize non-mla with cat

* Fix non-cuda build

* Update build

* Add more CUDA_CHECK

* Works really now

* Working fully now with pagedattn

* Format everything

* Fix diffusion device mapping (EricLBuehler#1187)

* Internal abstraction for distributed op (EricLBuehler#1188)

* Make Sequence::set_toks more safe (EricLBuehler#1190)

* Fix CI tests out of storage (EricLBuehler#1191)

* Internal abstraction for distributed op (EricLBuehler#1189)

* Fix build_cuda_all.yaml CI (EricLBuehler#1193)

* Support tensor parallelism for vision models! (EricLBuehler#1194)

* Refactor distributed mapper prep

* Support vision model TP

* Update docs

* Add vision model TP for mllama

* Always pass _USE_MATH_DEFINES for CUDA (EricLBuehler#1195)

* Always pass _USE_MATH_DEFINES

* Cargo.lock

* Remove matmul via f16 framework (EricLBuehler#1196)

* Remove API for matmul_via_f16 (EricLBuehler#1197)

* Add UQFF text/vision model API (EricLBuehler#1198)

* Add UQFF text/vision model API

* Typos

* Implement Qwen 2.5 VL! (EricLBuehler#1184)

* Implement Qwen 2.5 VL

* Reverse window index select

* Switch to rmsnorm

* Warn

* Fix config, loads now

* Fixes

* Complete qwen2_5vl feature

Todo: set_use_matmul_via_f16(true) from "pipeline/inputs_processor" causes a significant loss of precision that is hard to track down during subsequent debugging.
In any case, globally setting matmul precision may not be an ideal solution.
For now, change the precision back in mistralrs-core/src/vision_models/qwen2_5_vl/inputs_processor.rs

Qwen2_5vl feature is functional; start cleaning up the code

Add examples for lower_level_qwen2_5vl

Fix: for deterministic sampling, top_k SHOULD be Some(1) rather than None (see the sketch after this entry)

Clean code

Rebase

Clean code

Fix cuda

* Fix Rustfmt and Clippy issues

* Clean code

* Merge branch 'main'

---------

Co-authored-by: Eric Buehler <[email protected]>
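
A minimal sketch of why that sampling fix matters, assuming a plain logits slice (names here are illustrative, not the mistral.rs sampler API): with top_k = Some(1), only the argmax token survives truncation, so decoding is deterministic, whereas top_k = None applies no truncation and leaves sampling stochastic.

```rust
// Hypothetical sketch: top_k = Some(1) reduces sampling to argmax.
fn sample_token(logits: &[f32], top_k: Option<usize>) -> usize {
    match top_k {
        // Only the highest-logit token survives: deterministic.
        Some(1) => (0..logits.len())
            .max_by(|&i, &j| logits[i].total_cmp(&logits[j]))
            .expect("non-empty logits"),
        // None (or k > 1) keeps multiple candidates: stochastic sampling.
        _ => unimplemented!("multinomial sampling over remaining candidates"),
    }
}
```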

* Implement Gemma 3 (text only)! (EricLBuehler#1201)

* Add config

* Add the text model

* Add inputs processor, loads/runs now

* It works!

* Add to APIs

* Implement Gemma 3 vision support! (EricLBuehler#1202)

* Add vision support for Gemma 3

* Implement image preprocessor and processor

* It works, kind of

* It works great

* Mask must be contiguous

* Update docs

* Format

* Manually fixup sentencepiece detok (EricLBuehler#1204)

* More vision models with TP (EricLBuehler#1200)

* More models for tp

* Fix clippy

* Fix topology link in the docs (EricLBuehler#1205)

* Gemma3 1b support and optimized rotating cache (EricLBuehler#1206)

* Support text-only gemma3

* Add rotating kv cache

* Do not preallocate rotating kv cache

* Improve rotating kv cache, prefix cacher system (EricLBuehler#1207)

* Improve rotating kv cache set_len and more intelligent prefix cacher v2

* Remove prefix cacher v1
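
The rotating-cache entries above are easiest to picture as a ring buffer over the sliding window. A minimal sketch, assuming one generic entry per token (the real RotatingKvCache stores tensors and also handles set_len and growth):

```rust
// Illustrative ring buffer for a sliding-window KV cache.
struct RotatingCache<T> {
    buf: Vec<T>,
    capacity: usize, // sliding-window size
    next: usize,     // absolute index of the next token
}

impl<T> RotatingCache<T> {
    fn new(capacity: usize) -> Self {
        Self { buf: Vec::with_capacity(capacity), capacity, next: 0 }
    }

    // Append one token's KV entry, overwriting the oldest once full.
    fn push(&mut self, kv: T) {
        if self.buf.len() < self.capacity {
            self.buf.push(kv); // still growing toward the window size
        } else {
            self.buf[self.next % self.capacity] = kv; // rotate over the oldest
        }
        self.next += 1;
    }
}
```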

* Better handling for kvcache set_len (EricLBuehler#1208)

* Fix gemma3 vision device in isq

* Update deps and use rand 0.9 (EricLBuehler#1210)

* Fix flash-attn v3 build

* Update hf hub dep, add initial blockwise fp8 GEMM tests (EricLBuehler#1212)

* Update hf_hub dep to not require openssl and add tests

* Update deps

* Fixes

* Undo 'fix' from clippy

* Ok maybe finally fix it

* Growable RotatingKvCache and fixes for Phi-4 mini (EricLBuehler#1215)

* Fixes for phi4 mini

* Fix causal mask

* Growable rotating kv cache

* Fix clippy

* Use docker build for x86 pyo3 wheels

* Fix cuda warn

* Vision model pagedattn fixes (EricLBuehler#1217)

* Gemma 3 cuda fixes

* Fix pagedattn bug

* Clippy

* Small fix for rotating cache?

* Add pydantic schema examples! (EricLBuehler#1219)

* Sliding window attention fixes (EricLBuehler#1220)

* Initial fixes for sliding window

* Fix swa, still without prefix cache

* Ok finally it works

* Handle multiple eos toks
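
Handling multiple EOS tokens amounts to checking membership in a set rather than equality with a single id. A hedged sketch (the token ids below are made up):

```rust
use std::collections::HashSet;

// Stop generation when the sampled token is any of the EOS ids.
fn should_stop(token: u32, eos_toks: &HashSet<u32>) -> bool {
    eos_toks.contains(&token)
}

fn main() {
    // e.g. a chat model with distinct end-of-turn and end-of-text ids
    let eos: HashSet<u32> = [2, 32007].into_iter().collect();
    assert!(should_stop(32007, &eos));
    assert!(!should_stop(17, &eos));
}
```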

* adapt to rig crate as client (EricLBuehler#1214)

* adapt to rig crate as client

* adapt to rig crate as client

* Implement Mistral 3! (EricLBuehler#1221)

* Add vision model and load language model

* Implement the mmproj and patch merger!

* Remove plot

* Reshaping patch embeds with image sizes, make block attn mask

* Add the inputs merging and forward

* Basic loader, a bunch of todos still

* Add the inputs processor

* Clippy

* Some fixes

* It works!

* Implement for the automatic device mapping

* ISQ support for the vision model too

* Docs

* Fused Metal SDPA with masking! (EricLBuehler#1225)

* Metal SDPA with masking

* Much faster quantization on metal!

* Check if actually metal

* Materialize the mask

* Fix cuda

* Format

* Send [DONE] SSE chunk per openai spec (EricLBuehler#1226)
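
Per the OpenAI streaming spec, each chunk is a server-sent `data:` event carrying JSON, and the stream must end with the literal sentinel `data: [DONE]`. A minimal sketch of the wire framing only, not the server code:

```rust
// One SSE frame per streamed completion chunk.
fn sse_frame(json_chunk: &str) -> String {
    format!("data: {json_chunk}\n\n")
}

// Terminal sentinel required by OpenAI-compatible clients.
fn sse_done() -> &'static str {
    "data: [DONE]\n\n"
}
```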

* Fix handling of device when compiled for but disabled nccl (EricLBuehler#1227)

* Fix nccl blocking case (EricLBuehler#1228)

* Native Llama, Mistral Small 3.1, Mistral Nemo, Hermes 2 Pro, Hermes 3 tool calling! (EricLBuehler#1229)

* Llama model tool calling support

* Llama tool calling works

* Nice tool calling support

* Tool calling working with Mistral 3

* Support hermes

* Mistral nemo support

* Update server tool calling example
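
The native tool calling above is exercised through the OpenAI `tools` schema. A hedged request sketch (model name and function are illustrative, not taken from the repo):

```rust
use serde_json::json;

// OpenAI-style chat request declaring one callable function.
fn tool_request() -> serde_json::Value {
    json!({
        "model": "default",
        "messages": [{ "role": "user", "content": "Weather in Boston?" }],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": { "city": { "type": "string" } },
                    "required": ["city"]
                }
            }
        }]
    })
}
```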

* OpenAI API compatibility fixes (EricLBuehler#1230)

* Content itself is optional

* Only provide tool calls if they are not empty

* Add response_format support (see the request sketch after this list)

* Fix response-format

* Fix json_schema.py example
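
A hedged sketch of the response_format usage referenced above (field names follow the OpenAI chat-completions schema; the exact variants mistral.rs accepts may differ):

```rust
use serde_json::json;

// Constrain the model's output to valid JSON.
fn request_body() -> serde_json::Value {
    json!({
        "model": "default",
        "messages": [{ "role": "user", "content": "Give me a JSON object." }],
        "response_format": { "type": "json_object" }
    })
}
```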

* [Breaking] Automatic server logging (EricLBuehler#1231)

* Add logger for server

* Clippy

* Tweak

* Configurable

* Format

* Remove simple_tool_calling.py as deprecated

* Use default stream for flash attn (EricLBuehler#1232)

* More accurate throughput logging

* Bump version to 0.5.0 (EricLBuehler#1233)

* Fix handling of Metal fused attn head dims (EricLBuehler#1234)

* Fix handling of metal attn head dims

* Fix handling of gemma3 1b when images

* Tweak default for paged attn builder

* Support paged attn for vision model rust api (EricLBuehler#1235)

* [Breaking] Support setting HF cache path (EricLBuehler#1237); see the sketch after this entry

* Add it internally

* Add the apis
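
For context on the cache-path entry above: the Hugging Face ecosystem conventionally resolves the hub cache from the HF_HUB_CACHE environment variable, falling back to ~/.cache/huggingface/hub. A sketch of that convention only; the new mistral.rs API surface is not reproduced here:

```rust
use std::env;
use std::path::PathBuf;

// Resolve the HF hub cache directory the conventional way.
fn hf_hub_cache() -> PathBuf {
    env::var("HF_HUB_CACHE").map(PathBuf::from).unwrap_or_else(|_| {
        let home = env::var("HOME").unwrap_or_default();
        PathBuf::from(home).join(".cache/huggingface/hub")
    })
}
```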

* Support tool calling for DeepSeek models (EricLBuehler#1239)

* Support tool calling for deepseek models

* Format

* Fix deepseek

* Server image processing refactor and fixes (EricLBuehler#1244)

* Fix strict gemma3 case

* Accept multiple images in the content array

* Fix multiple images in one array ct

* Add it to the python api

* Typos

* Optimized CUDA RoPE kernels (EricLBuehler#1247)

* Add the kernels

* It works

* Works

* Builds

* Typo fix (add_speial_tokens to add_special_tokens) (EricLBuehler#1246)

* Fix typo

* Update mistralrs.pyi

* Fixes for UQFF + distributed layers (EricLBuehler#1250)

* Fixes for uqff + distributed layers

* Typo

* Automatic agentic search integration (`web_search_options`) (EricLBuehler#1243); see the request sketch after this entry

* Add the tool

* Actually search

* Clippy

* Sort of works

* Remove some debuggers

* tweak

* Add some rules

* Works great

* Tweak 'system' prompt

* Update mistralrs-core/src/search/mod.rs

Co-authored-by: Copilot <[email protected]>

* Typo

* Add it to all the apis

* Add bert model for similarity reranking

* Typos

* Early detection of tools

* Alias max_tokens -> max_completion_tokens too

* Customizable bert model

* Flip the enabler around

* Add docs

* Update readme

* Typo

---------

Co-authored-by: Copilot <[email protected]>
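
A hedged request sketch for the `web_search_options` field added in this entry (the field name is from the PR title; using an empty object to enable default search behavior is purely an assumption, and sub-options are not shown):

```rust
use serde_json::json;

// Chat request opting in to agentic web search.
fn search_request() -> serde_json::Value {
    json!({
        "model": "default",
        "messages": [{ "role": "user", "content": "What's new in mistral.rs?" }],
        "web_search_options": {}
    })
}
```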

* Format kernels (EricLBuehler#1251)

* Update readme

* Update readme

* Remove test

* Add quantize guards for uqff deserialize (EricLBuehler#1252)

* Refactor cuBLASlt-related code (EricLBuehler#1253)

* Centralize cublaslt into mistralrs-quant

* Use cublaslt in unquant layer

* Use beautiful trait constants for simpler code

* Move tests

* Dispatch to unquant for cublaslt

* Dispatch to unquant for cublaslt

* Fix feature

* Add convert_to_gptq script

* Update deps, bump pyo3 version (EricLBuehler#1259)

* Faster cuda FP8 performance (EricLBuehler#1257)

* Avoid fp8 sync

* Fix dtype

* Rust 1.86 clippy (EricLBuehler#1260)

* Rust 1.86 clippy

* Clippy

* Refactor engine arch (EricLBuehler#1262)

* Refactor engine add_request

* Don't recompile regex

* Clippy

* Revamped LoRA support - removing the Ordering system! (EricLBuehler#1263)

* Play with varbuilder lifetimes

* Merge lora weights

* Clippy

* Lora works

* Support multiple loras

* Cleanup, remove adapter activation

* Complete merge

* Fast Metal-specific quantization method: AFQ (EricLBuehler#1264)

* Add mlx quantized kernels

* Add mlx quantized kernels

* Kernel launcher

* Add AFQ isq quant and dequant

* Some quantmethod things

* Begin to implement the qmm caller

* Clippy

* Much faster

* Cache kernels

* Docs

* Clippy

* Add it to uqff

* Support prequantized models from MLX (EricLBuehler#1265)

* Refactor quantizedconfig

* Support AFQ prequantized

* Update docs

* Update docs

* Automatic ISQ to select fastest & most accurate method (EricLBuehler#1266)

* Automatic isq

* typo

* Doc

* Improved usage metrics (EricLBuehler#1267)

* Fix cuda

* Bump tokio from 1.44.1 to 1.44.2 (EricLBuehler#1270)

Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.44.1 to 1.44.2.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](tokio-rs/tokio@tokio-1.44.1...tokio-1.44.2)

---
updated-dependencies:
- dependency-name: tokio
  dependency-version: 1.44.2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Gather MM ops in mistralrs-quant (EricLBuehler#1272)

* Update the caller

* Wire things up

* Broadcast for afq gathermm

* Broadcast for afq gathermm

* Clippy

* Improve performance of deepseek models

* Typo fix

* BincountOp not used

* Implement Llama 4! (EricLBuehler#1268)

* Implement Llama 4

* Implement the main changes for the text model

* Make chunked mask (see the mask sketch after this entry)

* Wire things up

* Add some EP

* Initial sketch of inputs processor

* Runs

* Progress

* all reduce moes

* It works!

* Some cleanup

* Faster moe block

* Add device map

* Make chunked matrix

* Fully working now!

* Reactivate cublaslt

* Fix shared mlp cublaslt

* Refactor to packed experts

* Complete merge

* It is a normal model now

* Fixes

* Set device for moe

* ISQ fixes

* Much faster sort kernel

* Faster loading!

* Faster loading!

* Fp8 cpu copy ops in candle backend

* Add the vision model

* Add mmproj layer

* Actually merge the inputs

* Sketch most of the image processor

* Add the rest of the image processor

* Implement the whole processor

* Add the loader

* Some fixes

* A batch of fixes

* Some fixes

* tmp

* Actually support isq

* Ok it works a bit

* Fix norm device

* It works

* A bit cleaner

* Support residual tensors

* Remove text loader

* Implement the device mapping system

* Fix auto device map

* Add examples

* Add model card

* Typo

* Remove superfluous logging
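
The "chunked mask" referenced earlier in this entry can be pictured as a causal mask restricted to fixed-size chunks: token i may attend to token j only if j <= i and both indices fall in the same chunk. Illustrative only, assuming a boolean mask (the real implementation builds this as a tensor):

```rust
// Boolean chunked-causal mask: true means "may attend".
fn chunked_causal_mask(seq_len: usize, chunk: usize) -> Vec<Vec<bool>> {
    (0..seq_len)
        .map(|i| (0..seq_len).map(|j| j <= i && i / chunk == j / chunk).collect())
        .collect()
}
```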

* Fixes for Llama 4 UQFF loading (EricLBuehler#1275)

* Support sharding for UQFF (EricLBuehler#1276)

* Serialize sharded uqff files

* Loading

* Fix base64

* Fix bug for group-topk (group_limited_greedy) in deepseek models (EricLBuehler#1278)

* Support the DeepCoder model (EricLBuehler#1279)

* Add faq for metal not found

* Updates from candle

* Fixes

* Relax tokio

* Make AdapterPaths, LoraAdapterPaths public

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: brrr <[email protected]>
Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: Etienne Balit <[email protected]>
Co-authored-by: benliao <[email protected]>
Co-authored-by: edwko <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Guoqing Bao <[email protected]>