Releases · ggml-org/llama.cpp

03 Jul 11:13

9067487

b5816

ggml : fix FA mask dim 2 and 3 (#14505)

* ggml : fix FA mask dim 2 and 3

ggml-ci

* backends : unsupport batched FA in CUDA and Vulkan

ggml-ci

* vulkan : disable FA for mask->ne[2] != 1

Assets 15

03 Jul 05:05

github-actions

b5815

d4cdd9c

b5815

ggml : remove kompute backend (#14501)

ggml-ci

Assets 15

03 Jul 00:53

github-actions

b5814

55c2646

b5814

CUDA: add dynamic shared mem to softmax, refactor general usage (#14497)

Assets 15

02 Jul 19:15

github-actions

b5812

5d46bab

b5812

llama : initial Mamba-2 support (#9126)

* llama : initial Mamba-2 support

* ggml : SIMD ggml_ssm_scan for Mamba-2

* ggml : improve ggml_mul speed when masking recurrent states

* llama : support running Mamba-Codestral-7B-v0.1

* llama : fix Mamba-2 conv state saving

* ggml : make the ggml_mul fast broadcast path more consistently formatted

* llama : remove unused variable

* llama : add missing break

* convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present

The tokenzier.json of Mamba-Codestral-7B-v0.1 otherwise requires
workarounds to work correctly.

* llama : avoid redundant state copy for Mamba 1 and 2

* metal : attempt to adapt SSM_SCAN for Mamba-2

* metal : fix SSM_SCAN pipeline scope

* metal : use log and exp instead of log1pf and expf in SSM_SCAN

* metal : remove unused arguments for SSM_SCAN

The max index is 31, so trimming the arguments is necessary.

* metal : add back n_seqs to SSM_SCAN args

Whoops, this is needed for the offset in the concatenated output.

* metal : fix SSM_SCAN state head offset

* metal : fix wrong number of tokens per sequence in SSM_SCAN

* ggml : remove unused fast broadcast path in GGML_MUL

This was initially added because states were masked with ggml_mul,
but this is no longer done and so this "optimisation" is no longer
necessary, or at least not worth the additional code complexity.

* ggml : avoid multiply by D in GGML_OP_SSM_SCAN

This makes the weight buft detection in src/llama.cpp simpler.

* convert : transpose Mamba-2 A, D and reshape SSM_NORM

This breaks existing conversions of Mamba-2 models
to avoid some reshapes.

Not sure if it's a good idea,
but it makes the graph slightly cleaner.

* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks

* convert : fix flake8 lint

* metal : fix confusion between ; and ,

* metal : add missing args for nb references in ssm_scan_f32_group

* metal : single-user mamba2 inference works

* kv-cache : remove const_cast when setting inputs for s_copy

And also fix multi-user inference for recurrent models
by using cell_id instead of i as the kv cell index
when populating s_copy.

* convert : avoid AutoConfig for Mamba and Mamba2 hparams

* kv-cache : allow context shift for recurrent models

* graph : fix recurrent state copies when avoiding copies

Works, but using lambda functions might not be that clean.

* ggml : fix mamba2 ssm scan when compiled with SVE

* ggml-cpu : reorder SVE FMA for consistency with other SIMD arches

* cuda : implement ssm scan for Mamba2

There is still room for improvement, but it works!

* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2

* mamba : fix mismatched new and delete size for llm_build_mamba

Subclasses of llm_graph_context cannot have extra fields,
because the called destructor is not the one from the subclass.
This otherwise would cause problems when runnning Mamba-(1|2) inference
when compiled -DGGML_SANITIZE_ADDRESS=ON

* cuda : graceful fallback for Mamba-1 models with weird embd size

Assets 15

02 Jul 19:03

github-actions

b5811

e17991c

b5811

sync : ggml

ggml-ci

Assets 15

02 Jul 17:05

github-actions

b5809

f3ed38d

b5809

Set RPATH to "@loader_path" / "$ORIGIN" to ensure executables and dyn…

Assets 15

02 Jul 15:29

github-actions

b5808

55a1c5a

b5808

CUDA: add softmax broadcast (#14475)

* CUDA: add softmax broadcast

* Pass by const ref

* Review: Use blockDims for indexing, remove designated initializers

* Add TODO for noncontigous input/output

Assets 15

02 Jul 15:06

github-actions

b5804

307e79d

b5804

opencl : fix possible buffer overflow in dump_tensor (#14490)

Assets 15

02 Jul 13:16

github-actions

b5803

d7f5f4e

b5803

simple-chat : fix context-exceeded condition (#14494)

* simple-chat : fix context-exceeded condition

ggml-ci

* cont : fix n_ctx_used computation

ggml-ci

Assets 15

02 Jul 12:39

github-actions

b5802

c8a4e47

b5802

opencl : skip empty nodes on cgraph compute (#14491)

Assets 15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Releases: ggml-org/llama.cpp

b5816

Uh oh!

b5815

Uh oh!

b5814

Uh oh!

b5812

Uh oh!

b5811

Uh oh!

b5809

Uh oh!

b5808

Uh oh!

b5804

Uh oh!

b5803

Uh oh!

b5802

Uh oh!