Releases · ggml-org/llama.cpp

03 Jul 22:14

28657a8

b5823 Latest

ggml : implement GEGLU_ERF and GEGLU_QUICK ops (#14445)
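As a rough illustration of what these two ops compute, the sketch below assumes the usual GEGLU form, GELU(x) * g, with GEGLU_ERF using the exact erf-based GELU and GEGLU_QUICK using the sigmoid approximation x * sigmoid(1.702 * x). The function names, the 1.702 constant, and the plain-list interface are assumptions for illustration; the release note does not state which half of the input ggml activates and which half it uses as the gate.

```python
import math


def gelu_erf(x: float) -> float:
    """Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_quick(x: float) -> float:
    """Sigmoid approximation of GELU: x * sigmoid(1.702 * x).

    The constant 1.702 is the commonly used value (an assumption here).
    The sigmoid is evaluated in a form that cannot overflow for large |x|.
    """
    z = 1.702 * x
    if z >= 0.0:
        s = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)
        s = e / (1.0 + e)
    return x * s


def geglu(x, g, act):
    """Gated unit: apply `act` to each element of x and scale by the gate g."""
    if len(x) != len(g):
        raise ValueError("x and g must have the same length")
    return [act(xi) * gi for xi, gi in zip(x, g)]
```

Calling `geglu(x, g, gelu_erf)` corresponds to the erf variant and `geglu(x, g, gelu_quick)` to the quick variant under these assumptions.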

Assets 15

cudart-llama-bin-win-cuda-12.4-x64.zip
sha256:8c79a9b226de4b3cacfd1f83d24f962d0773be79f1e7b75c6af4ded7e32ae1d6
373 MB 2025-07-03T22:14:27Z

llama-b5823-bin-macos-arm64.zip
sha256:3940ba8785ce4d362dc34dfc26438650f38a76c19aa2fefac9b16675d554caaa
10.5 MB 2025-07-03T22:14:40Z

llama-b5823-bin-macos-x64.zip
sha256:7ff280f6a993601827e5b868bbcc4bac57bc4d6db28740eee98de12008ff4d7b
26.3 MB 2025-07-03T22:14:41Z

llama-b5823-bin-ubuntu-vulkan-x64.zip
sha256:68910260e144b5e1bc7a20bfacc562680fac7f4ab3d2d7f0f0c16f1c8ead6bc0
20.1 MB 2025-07-03T22:14:42Z

llama-b5823-bin-ubuntu-x64.zip
sha256:34e781d099c2a4824ae5d489972ecbcbdc1e57fbae6d4c20bd1859a9f7b46d64
12.4 MB 2025-07-03T22:14:43Z

llama-b5823-bin-win-cpu-arm64.zip
sha256:15d80a5ecd22c0303f152bd7031e4b3d497e664a635358ef59eddcb64cbb2f99
10.8 MB 2025-07-03T22:14:44Z

llama-b5823-bin-win-cpu-x64.zip
sha256:71440986b8aa5bf1b6080c203608a3bb55c2c5786f5cdd79818b5c82fa22af70
13.6 MB 2025-07-03T22:14:45Z

llama-b5823-bin-win-cuda-12.4-x64.zip
sha256:8782cb5a21cff59946f4fb54b3ad916433d8e24693bb3de86221f92b04faef66
128 MB 2025-07-03T22:14:46Z

llama-b5823-bin-win-hip-radeon-x64.zip
sha256:9a859bec73ab8bd9bc34d02bcd4880b500354075a158acb1455a367341162f1e
298 MB 2025-07-03T22:14:50Z

llama-b5823-bin-win-opencl-adreno-arm64.zip
sha256:0fe52a1d90752322ceaccb852fa1232fe4fe59d03de8115a0857ea35181370f7
11.1 MB 2025-07-03T22:15:00Z

Source code (zip)
2025-07-03T21:07:22Z

Source code (tar.gz)
2025-07-03T21:07:22Z
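
The sha256 values listed with each asset can be used to check a download before unpacking it. A minimal sketch, assuming the archive has already been saved locally; the helper names `sha256_of_file` and `matches_listed_digest` are hypothetical and not part of llama.cpp:

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_digest(path, listed):
    """Compare a file against a digest written as 'sha256:<hex>' or plain hex."""
    expected = listed.strip().lower()
    if expected.startswith("sha256:"):
        expected = expected[len("sha256:"):]
    return hmac.compare_digest(sha256_of_file(path), expected)
```

For example, `matches_listed_digest("llama-b5823-bin-ubuntu-x64.zip", "sha256:34e781d099c2a4824ae5d489972ecbcbdc1e57fbae6d4c20bd1859a9f7b46d64")` should return True for an intact copy of that asset.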

03 Jul 19:24

github-actions

b5822

bee2842

opencl : broadcast for soft_max (#14510)

Assets 15

03 Jul 18:57

github-actions

b5821

2b72bed

vulkan: support mixed/deepseekR1 FA head sizes (#14509)

* vulkan: better parameterize FA by head sizes

* vulkan: support mixed/deepseekR1 FA head sizes

Assets 15

03 Jul 15:39

github-actions

b5820

c8c4495

ggml: backward pass for split swiglu (#14483)

Assets 15

03 Jul 11:18

github-actions

b5819

7b63a71

Fix conditional enabling following arch checks for ggml-sycl (#14504)

Signed-off-by: nscipione

Assets 15

03 Jul 11:17

github-actions

b5817

a70c8a0

kv-cache : use ggml_set_rows (#14285)

* kv-cache : use ggml_set_rows

ggml-ci

* graph : separate k and v indices

ggml-ci

* cont : remove redundant ifs

ggml-ci

* kv-cache : improve find_slot impl

* kv-cache : bounds-check when accessing slot_info indices

* kv-cache : add comments

ggml-ci

* ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends

ggml-ci
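
The idea behind writing to the cache with a "set rows" operation can be sketched as a row scatter: row i of the source is written into the destination row named by the i-th index. This is only an illustration on plain Python lists under that assumption; the function below is hypothetical and does not reproduce the ggml_set_rows signature or tensor layout.

```python
def set_rows(dst, src, row_ids):
    """Scatter rows of `src` into `dst` in place: dst[row_ids[i]] = src[i]."""
    if len(src) != len(row_ids):
        raise ValueError("need exactly one destination index per source row")
    for row, idx in zip(src, row_ids):
        # Bounds-check each index before writing, so a bad slot is an error
        # rather than a silent write to the wrong row.
        if not 0 <= idx < len(dst):
            raise IndexError(f"row index {idx} out of range")
        if len(row) != len(dst[idx]):
            raise ValueError("row width mismatch")
        dst[idx] = list(row)
    return dst
```

With a four-row cache of zeros, `set_rows(cache, [[1, 2], [3, 4]], [2, 0])` writes the first source row to row 2 and the second to row 0, so the destination rows need not be contiguous.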

Assets 15

03 Jul 11:13

github-actions

b5816

9067487

ggml : fix FA mask dim 2 and 3 (#14505)

* ggml : fix FA mask dim 2 and 3

ggml-ci

* backends : unsupport batched FA in CUDA and Vulkan

ggml-ci

* vulkan : disable FA for mask->ne[2] != 1

Assets 15

03 Jul 05:05

github-actions

b5815

d4cdd9c

ggml : remove kompute backend (#14501)

ggml-ci

Assets 15

03 Jul 00:53

github-actions

b5814

55c2646

CUDA: add dynamic shared mem to softmax, refactor general usage (#14497)

Assets 15

02 Jul 19:15

github-actions

b5812

5d46bab

llama : initial Mamba-2 support (#9126)

* llama : initial Mamba-2 support

* ggml : SIMD ggml_ssm_scan for Mamba-2

* ggml : improve ggml_mul speed when masking recurrent states

* llama : support running Mamba-Codestral-7B-v0.1

* llama : fix Mamba-2 conv state saving

* ggml : make the ggml_mul fast broadcast path more consistently formatted

* llama : remove unused variable

* llama : add missing break

* convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present

The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires
workarounds to work correctly.

* llama : avoid redundant state copy for Mamba 1 and 2

* metal : attempt to adapt SSM_SCAN for Mamba-2

* metal : fix SSM_SCAN pipeline scope

* metal : use log and exp instead of log1pf and expf in SSM_SCAN

* metal : remove unused arguments for SSM_SCAN

The max index is 31, so trimming the arguments is necessary.

* metal : add back n_seqs to SSM_SCAN args

Whoops, this is needed for the offset in the concatenated output.

* metal : fix SSM_SCAN state head offset

* metal : fix wrong number of tokens per sequence in SSM_SCAN

* ggml : remove unused fast broadcast path in GGML_MUL

This was initially added because states were masked with ggml_mul,
but this is no longer done and so this "optimisation" is no longer
necessary, or at least not worth the additional code complexity.

* ggml : avoid multiply by D in GGML_OP_SSM_SCAN

This makes the weight buft detection in src/llama.cpp simpler.

* convert : transpose Mamba-2 A, D and reshape SSM_NORM

This breaks existing conversions of Mamba-2 models
to avoid some reshapes.

Not sure if it's a good idea,
but it makes the graph slightly cleaner.

* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks

* convert : fix flake8 lint

* metal : fix confusion between ; and ,

* metal : add missing args for nb references in ssm_scan_f32_group

* metal : single-user mamba2 inference works

* kv-cache : remove const_cast when setting inputs for s_copy

And also fix multi-user inference for recurrent models
by using cell_id instead of i as the kv cell index
when populating s_copy.

* convert : avoid AutoConfig for Mamba and Mamba2 hparams

* kv-cache : allow context shift for recurrent models

* graph : fix recurrent state copies when avoiding copies

Works, but using lambda functions might not be that clean.

* ggml : fix mamba2 ssm scan when compiled with SVE

* ggml-cpu : reorder SVE FMA for consistency with other SIMD arches

* cuda : implement ssm scan for Mamba2

There is still room for improvement, but it works!

* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2

* mamba : fix mismatched new and delete size for llm_build_mamba

Subclasses of llm_graph_context cannot have extra fields,
because the called destructor is not the one from the subclass.
This would otherwise cause problems when running Mamba-(1|2) inference
when compiled with -DGGML_SANITIZE_ADDRESS=ON

* cuda : graceful fallback for Mamba-1 models with weird embd size

Assets 15
