CUDA: batched+noncont MMQ, refactor bs>1 MoE code #13199
Merged: JohannesGaessler merged 1 commit into ggml-org:master from JohannesGaessler:cuda-moe-mmq-5 on Apr 30, 2025
Conversation
I also see a good improvement on Windows:
slaren approved these changes on Apr 30, 2025.
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request on May 1, 2025:
* origin/master:
  sync : ggml
  whisper : add check that target name exists (whisper/3103)
  ggml : suppress Windows compiler warnings (whisper/3075)
  mtmd : add **vision** support for Mistral Small 3.1 (ggml-org#13231)
  arg : remove CURLINFO_EFFECTIVE_METHOD (ggml-org#13228)
  llama-model : fix the reported size class for nomic-embed-text-v2-moe (ggml-org#13223)
  sync : ggml
  ggml : fix ggml_gallocr_ptr type (ggml/1205)
  cuda : fix unused variable compile warning (whisper/0)
  CUDA: batched+noncont MMQ, refactor bs>1 MoE code (ggml-org#13199)
  arg : -hf do not fail if url mismatch (ggml-org#13219)
  fix typo: `n_ctx_pre_seq` -> `n_ctx_per_seq` (ggml-org#13221)
  convert : improve model arch handling (ggml-org#13122)
  llava : remove duplicate include (ggml-org#13207)
  common : add -jf / --json-schema-file flag (ggml-org#12011)
@JohannesGaessler Sadly I'm getting:
If I revert the commit, then everything works fine. I'm using an H100 and CUDA 12.6.
Using which model and which exact command?
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request on May 2, 2025.
Labels: ggml (changes relating to the ggml tensor library for machine learning), Nvidia GPU (issues specific to Nvidia GPUs), testing (everything test related)
This PR makes the following changes:

- Extend `GET_ROWS` to allow for type conversion during the operation.
- Sort `src1` by expert via `GET_ROWS`, and use it for the inverse operation on `dst` as well. The sorting in either direction can be done in a single kernel launch, so the dedicated kernels that have been used so far can be removed (a rough sketch of the idea follows below).
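As an illustration of the sorting step, here is a minimal CUDA sketch of a row-gather kernel: given a precomputed row index (rows grouped by expert, or the inverse permutation for `dst`), a single launch copies rows into the new order and can convert the element type on the fly. The kernel name, index layout, and launch configuration are assumptions for illustration only, not the kernel added by this PR.

```cpp
// Minimal sketch, not the PR's kernel: gather rows of a matrix according to
// row_index, optionally converting the element type while copying.
#include <cuda_fp16.h>
#include <cuda_runtime.h>

template <typename src_t, typename dst_t>
__global__ void gather_rows(const src_t * __restrict__ src, dst_t * __restrict__ dst,
                            const int * __restrict__ row_index, // row_index[i] = source row of output row i
                            const int64_t ncols) {
    const int64_t row_dst = blockIdx.x;
    const int64_t row_src = row_index[row_dst];
    for (int64_t col = threadIdx.x; col < ncols; col += blockDim.x) {
        // The plain cast performs on-the-fly type conversion, analogous to the extended GET_ROWS.
        dst[row_dst*ncols + col] = (dst_t) src[row_src*ncols + col];
    }
}

// Hypothetical launch: sort src1 (FP32) into expert order while converting to FP16.
//   gather_rows<<<n_rows, 256, 0, stream>>>(src1_f32, src1_sorted_f16, rows_by_expert, ncols);
// The same kernel with the inverse index scatters the results back into the original order of dst.
```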
Performance changes

Performance increases most for small batch sizes and fast GPUs, where the kernel launch overhead has more impact. I think there is still a lot of potential for optimization in the MMQ kernel. For the generic MoE code there are currently still unnecessary type conversions for FP16 and BF16; eliminating them will require some changes to the cuBLAS code. I did not try cublasGemmGroupedBatchedEx because, to my disappointment, it only supports CUBLAS_COMPUTE_32F, so no tensor cores. It may be worthwhile to instead do an implementation with regular batched GEMM by padding all src1 matrices to the maximum number of tokens per expert; on modern GPUs this may end up being faster even if some of the work is wasted.
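A padded variant along those lines could look roughly like the following. Each expert's slice of src1 is zero-padded to n_max tokens and laid out contiguously so that a single cublasGemmStridedBatchedEx call covers all experts with FP16 tensor cores. The buffer names, layout, and the choice of CUBLAS_COMPUTE_32F_FAST_16F are assumptions for illustration, not code from this PR.

```cpp
// Rough sketch of the padded batched-GEMM alternative (not the PR's code):
// C_e (m x n_max) = W_e^T (m x k) * X_e (k x n_max) for each expert e, column-major.
#include <cublas_v2.h>
#include <cuda_fp16.h>

static cublasStatus_t moe_padded_batched_gemm(
        cublasHandle_t handle, int m, int n_max, int k, int n_expert,
        const half * W,   // [n_expert][k * m]     expert weight matrices, lda = k
        const half * X,   // [n_expert][k * n_max] expert-sorted activations, zero-padded to n_max tokens
        half       * C) { // [n_expert][m * n_max] padded outputs
    const float alpha = 1.0f;
    const float beta  = 0.0f;
    return cublasGemmStridedBatchedEx(
        handle, CUBLAS_OP_T, CUBLAS_OP_N,
        m, n_max, k,
        &alpha,
        W, CUDA_R_16F, k, (long long) k * m,
        X, CUDA_R_16F, k, (long long) k * n_max,
        &beta,
        C, CUDA_R_16F, m, (long long) m * n_max,
        n_expert,
        CUBLAS_COMPUTE_32F_FAST_16F, // FP16 tensor cores with FP32 accumulation
        CUBLAS_GEMM_DEFAULT);
}
```

The padded rows contribute only zeros to the output and are discarded when scattering the results back, which is where the wasted work comes from; whether that loss is outweighed by the single large GEMM depends on how unevenly the tokens are distributed across experts.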