
kv-cache : separate recurrent vs non-recurrent impl #12799


Merged
29 commits merged into master from gg/llama-kv-cache-v6 on May 2, 2025

Conversation


ggerganov (Member) commented on Apr 7, 2025

Overview

Attempting to make two separate classes for the 2 types of KV cache:

  • llama_kv_cache_unified : llama_kv_cache
  • llama_kv_cache_recurrent : llama_kv_cache
```mermaid
graph TD;
llama_memory_i --> llama_kv_cache
llama_kv_cache --> llama_kv_cache_unified
llama_kv_cache --> llama_kv_cache_recurrent
```
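
A rough C++ sketch of the resulting hierarchy (the method names here are illustrative assumptions, not the exact interfaces in the PR):

```cpp
#include <cstdint>

// basic types, as in llama.h
using llama_pos    = int32_t;
using llama_seq_id = int32_t;

// abstract memory interface that llama_context operates with
struct llama_memory_i {
    virtual ~llama_memory_i() = default;

    virtual void clear() = 0;
    virtual bool seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) = 0;
    // ... other sequence-level operations (seq_cp, seq_keep, seq_add, ...)
};

// common KV-cache interface on top of the memory interface
struct llama_kv_cache : public llama_memory_i {
    // e.g. batch preparation, state commit/restore, ...
};

// standard attention KV cache (per-layer K/V tensors)
class llama_kv_cache_unified : public llama_kv_cache { /* ... */ };

// cache for recurrent states (Mamba, RWKV, ...), currently still close to the unified code
class llama_kv_cache_recurrent : public llama_kv_cache { /* ... */ };
```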

The main goal of this change is to simplify the logic in the primary llama_kv_cache_unified class so that we can more easily extend it with new features such as SWA. It also introduces a level of abstraction that would allow adding new types of KV cache implementations in the future.

Main changes

  • The llama_context now operates with the abstract llama_memory_i interface.

  • Add llama_memory_params and use it to implement llama_model::create_memory() for creating the model-specific cache

  • llama_kv_cache_recurrent is currently mostly a copy of llama_kv_cache_unified, but it is now completely separate, so a new recurrent-specific implementation can be developed

  • Move KV cache shift and defrag code from llama_context to llama_kv_cache_unified

  • The llama_sbatch -> llama_ubatch logic inside llama_context::decode() is now implemented by (see the sketch after this list):

    • llama_kv_cache::sbatch_init()
    • llama_kv_cache::ubatch_next()

    The thinking is that certain KV cache implementations could require different types of micro-batching (e.g. same-sequence-length ubatch, single-sequence ubatch, etc.)

  • Remove llama_context::output_reorder() - it seemed to be relevant only for recurrent caches. The logic is now inlined in llama_context::decode()

  • Remove llama_context::sbatch. Instead, create a new one for each decode
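
As referenced above, a hedged sketch of the new sbatch/ubatch flow during decode. The function names sbatch_init() and ubatch_next() come from this PR description, but the surrounding code, parameter names, and signatures are assumptions; the actual llama_context::decode() implementation differs.

```cpp
// sketch only: llama_batch, llama_sbatch, llama_ubatch and llama_kv_cache are
// llama.cpp internal types assumed to be available here; the extra parameters
// (logits_all, embd_pooled) are guesses, not the exact signatures
static void decode_sketch(llama_kv_cache & kv, const llama_batch & batch, uint32_t n_ubatch) {
    // a fresh sbatch is created for every decode call instead of being stored in llama_context
    llama_sbatch sbatch = kv.sbatch_init(batch, /*logits_all=*/false);

    while (sbatch.n_tokens > 0) {
        // the cache decides how to slice the sbatch into ubatches, so that e.g. a
        // recurrent cache can require single-sequence or equal-length ubatches
        llama_ubatch ubatch = kv.ubatch_next(sbatch, n_ubatch, /*embd_pooled=*/false);

        // find a slot in the cache, build and evaluate the compute graph for this ubatch
        // ...
        (void) ubatch;
    }
}
```
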

TODO before merge

  • Clean-up llama_kv_cache interface
  • Make llama_kv_cache_xxx more private
  • Add comments

Next PRs

  • Support cache-less context (needed for embedding-only models such as BERT) (context : allow cache-less context for embeddings #13108)
  • Remove llama_context_params.logits_all logic - an unnecessary complication; the same result can be achieved by explicitly requesting logits for all tokens
  • Remove infill example - obsolete
  • Add proper SWA support to llama_kv_cache_unified



slaren (Member) commented on Apr 29, 2025

What is the reasoning for using llama_kv_cache as the base class for llama_kv_cache_recurrent? Is there enough code shared between these types to justify this? It seems that there is a lot of complexity in llama_kv_cache_recurrent, and it would be good if that could be simplified a bit.

On a more general note, I think the way std::function callbacks are mixed with inheritance is not very usual. I think the more typical way to do this would be to create virtual functions that can be overridden in a child class. I wonder if I am missing something here that would prevent implementing it this way.
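
For illustration, a minimal sketch of the two styles being contrasted here (hypothetical names; not the actual llama.cpp code, although get_rope_factors is one of the callbacks discussed below):

```cpp
#include <cstdint>
#include <functional>

struct ggml_tensor; // opaque for the purpose of this sketch

// current style (roughly): behaviour injected through std::function callbacks
struct kv_cache_callbacks_sketch {
    std::function<ggml_tensor * (uint32_t il)> get_rope_factors;
};

// suggested style: behaviour provided by overriding virtual functions
struct kv_cache_base_sketch {
    virtual ~kv_cache_base_sketch() = default;
    virtual ggml_tensor * get_rope_factors(uint32_t il) = 0;
};

struct kv_cache_custom_sketch : kv_cache_base_sketch {
    ggml_tensor * get_rope_factors(uint32_t il) override {
        // model-specific behaviour lives in the subclass instead of a captured lambda
        (void) il;
        return nullptr;
    }
};
```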

ggerganov (Member, Author) replied:

What is the reasoning for using llama_kv_cache as the base class for llama_kv_cache_recurrent? Is there enough code shared between these types to justify this? It seems that there is a lot of complexity in llama_kv_cache_recurrent, and it would be good if that could be simplified a bit.

The public API currently works with struct llama_kv_cache *, so both the recurrent and non-recurrent implementation have to implement it for now.

I think what we need to do in a follow-up PR is:

  • Deprecate the public API llama_kv_cache_
  • Add llama_memory_ API that works with struct llama_memory

At this point, a completely new recurrent-specific implementation can be added: class llama_memory_recurrent : public llama_memory_i that would replace the current llama_kv_cache_recurrent.

The existing recurrent cache implementation has to be rewritten from scratch, because it was hacked on top of the KV cache implementation by repurposing the K and V tensors for the state-space requirements.
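
Purely for illustration, a sketch of what a llama_memory_-prefixed public API could look like; these names and signatures are assumptions about a possible future API, not something that exists in this PR:

```cpp
#include <cstdint>

// basic types, as in llama.h
typedef int32_t llama_pos;
typedef int32_t llama_seq_id;

// hypothetical opaque memory handle obtained from the context
struct llama_memory;

// hypothetical counterparts of today's llama_kv_cache_seq_* functions
void llama_memory_clear (struct llama_memory * mem);
bool llama_memory_seq_rm(struct llama_memory * mem, llama_seq_id seq_id, llama_pos p0, llama_pos p1);
void llama_memory_seq_cp(struct llama_memory * mem, llama_seq_id seq_id_src, llama_seq_id seq_id_dst, llama_pos p0, llama_pos p1);
```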


On a more general note, I think the way std::function callbacks are mixed with inheritance is not very usual. I think the more typical way to do this would be to create virtual functions that can be overridden in a child class. I wonder if I am missing something here that would prevent implementing it this way.

I'll try to update this. Just to make sure, you mean the current:

  • struct llama_kv_cache::callbacks
  • struct llama_kv_cache::graph_params

to become interfaces with different implementations based on the type of memory?


slaren (Member) commented on Apr 29, 2025

I don't fully understand the code, but I think get_rope_factors and get_buft could be virtual/abstract functions of llama_kv_cache, and a new class could be created if these need different implementations. But I am not sure that is the case; maybe they can just be regular functions, and llama_kv_cache could simply hold a reference to the llama_model. graph_params looks like it should be an interface, but it feels out of place.


compilade (Collaborator) commented on Apr 29, 2025

The public API currently works with struct llama_kv_cache *, so both the recurrent and non-recurrent implementation have to implement it for now.

There will need to be some top-level type which can contain multiple types of KV caches to ease supporting hybrid models. A shared interface for recurrent and non-recurrent state caches is useful to get to that point, at least for maintainability.

The hardest part will be handling errors and properly keeping coherence between the different types of caches (because they don't necessarily roll back states in the same way). That is relevant mostly for hybrid models, though.
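
As a purely hypothetical illustration of such a top-level type (names are made up; real hybrid support would need much more, especially around rollback):

```cpp
// hypothetical container exposing a single memory interface over two caches,
// reusing the llama_memory_i / llama_kv_cache_* names sketched earlier in the PR description
class llama_memory_hybrid_sketch : public llama_memory_i {
public:
    llama_memory_hybrid_sketch(llama_kv_cache_unified & attn, llama_kv_cache_recurrent & rec)
        : kv_attn(attn), kv_recurrent(rec) {}

    void clear() override {
        kv_attn     .clear();
        kv_recurrent.clear();
    }

    bool seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) override {
        // the hard part: both caches must stay coherent, and they do not
        // necessarily roll back their states in the same way on failure
        const bool ok_attn = kv_attn     .seq_rm(seq_id, p0, p1);
        const bool ok_rec  = kv_recurrent.seq_rm(seq_id, p0, p1);
        return ok_attn && ok_rec;
    }

private:
    llama_kv_cache_unified   & kv_attn;      // for the attention layers
    llama_kv_cache_recurrent & kv_recurrent; // for the recurrent-state layers
};
```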

The existing recurrent cache implementation has to be rewritten from scratch, because is was hacked on top of the KV cache implementation by repurposing the K and V tensors for the state space requirements.

Yes, it will need to be rewritten, at least to be able to support proper state rollback.

But even if it was repurposing the K and V tensors, there are still some things which I think will remain, since Mamba and RWKV do have two types of recurrent state per layer.

ggerganov force-pushed the gg/llama-kv-cache-v6 branch 2 times, most recently from e37f112 to 7e4b545 on April 30, 2025 at 07:22
ggerganov (Member, Author) commented:

@slaren In 7e4b545 I replaced the struct callbacks by keeping a reference to the llama_model in the llama_kv_cache implementation. And in 73df685 I replaced the struct graph_params by passing a reference to the llama_context.

PTAL if you think these changes are good.
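
Roughly, the shape of the change (a sketch with assumed member and function names; see the commits for the actual code):

```cpp
// sketch: llama_model, llama_context, ggml_context, ggml_cgraph and ggml_tensor
// are existing llama.cpp/ggml types, assumed to be declared here
class llama_kv_cache_unified_sketch {
public:
    explicit llama_kv_cache_unified_sketch(const llama_model & model) : model(model) {}

    // previously routed through struct llama_kv_cache::callbacks
    ggml_tensor * get_rope_factors(uint32_t il) const;

    // previously routed through struct llama_kv_cache::graph_params;
    // the context is now passed in explicitly when building the shift/defrag graphs
    void build_shift(llama_context & lctx, ggml_context * ctx, ggml_cgraph * gf);

private:
    const llama_model & model; // replaces the callback indirection
};
```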

ggerganov force-pushed the gg/llama-kv-cache-v6 branch from 73df685 to eb623f2 on April 30, 2025 at 08:30
slaren (Member) left a review:

The changes look good. While testing this, I noticed that the KV cache is always allocated on the CPU.


Review comment (Collaborator) on:

```cpp
//////////////////////////////////////////////
// TODO: this should not mutate the KV cache !
kv_cell & cell = const_cast<kv_cell &>(cells[i]);
```

Suggested change:

```diff
-kv_cell & cell = const_cast<kv_cell &>(cells[i]);
+kv_cell & cell = const_cast<kv_cell &>(cells[cell_id]);
```

Otherwise multi-user inference is broken for recurrent models. See #9126 (comment).


Review comment (Collaborator) on:

```cpp
//////////////////////////////////////////////
// TODO: this should not mutate the KV cache !
kv_cell & cell = const_cast<kv_cell &>(cells[i]);
```

Suggested change:

```diff
-kv_cell & cell = const_cast<kv_cell &>(cells[i]);
+kv_cell & cell = const_cast<kv_cell &>(cells[cell_id]);
```

Same, this should fix multi-user inference.

ggerganov (Member, Author) replied:

We should add a small multi-user test with a recurrent model to server/tests to be able to spot such regressions.

ggerganov force-pushed the gg/llama-kv-cache-v6 branch from 780d6fb to 58115a2 on May 2, 2025 at 10:28
ggerganov force-pushed the gg/llama-kv-cache-v6 branch from 58115a2 to 7e79a42 on May 2, 2025 at 13:02
ggerganov (Member, Author) commented:

@slaren @compilade I think this should be good to merge - any additional comments?

ggerganov (Member, Author) commented:

There will need to be some top-level type which can contain multiple types of KV caches to ease supporting hybrid models. A shared interface for recurrent and non-recurrent state caches is useful to get to that point, at least for maintainability.

The hardest part will be handling errors and properly keeping coherence between the different types of caches (because they don't necessarily roll back states in the same way). That is relevant mostly for hybrid models, though.

I think that when we introduce the llama_memory_ API (see #12799 (comment)) we can redesign how the caches are used. The existing llama_kv_cache_seq_ API is not great in general (error-prone and a bit hacky to use), so it would be a good opportunity to think about ways to simplify and improve it.

ggerganov merged commit c642bc0 into master on May 2, 2025
1 check passed
ggerganov deleted the gg/llama-kv-cache-v6 branch on May 2, 2025 at 14:48
Comment on lines +1069 to +1070:

```cpp
// make the outputs have the same order they had in the user-provided batch
// note: this is mostly relevant for recurrent models atm
```

Review comment (Collaborator):

It's also only relevant when using get_embeddings, because the buffer in that case has to be ordered to keep the API backward compatible. When purely using get_embeddings_ith, it's not required.

Unconditionally sorting is unnecessary and likely slower. Also, it seems that some assertions here break multi-user inference for recurrent models (the line right after this block, where n_outputs = n_outputs_all, is assumed to have run before the sorting routine, but it has not).

ggerganov (Member, Author) replied:

The main reason for deciding to always reorder is that otherwise we have to maintain the sbatch in the state of the context. This introduces complexity that is hard to reason about, so I decided to take the hit.

We should add a test that exercises this branch. What is a server scenario that would trigger the reordering?

I'll PR the fix to set n_outputs = n_outputs_all before the sorting.
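
For context, a hedged sketch of the reordering being discussed (data layout and names are assumptions; the real code in llama_context also handles embeddings and works differently):

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// out_ids[i] = index in the user-provided batch of the output currently stored
// in row i of the logits buffer; after this call, row j holds the j-th output
// in user-batch order
static void reorder_outputs_sketch(const std::vector<int32_t> & out_ids,
                                   std::vector<float>         & logits,
                                   size_t                       n_vocab) {
    std::vector<float> sorted(logits.size());
    for (size_t i = 0; i < out_ids.size(); ++i) {
        std::copy(logits.begin() + i*n_vocab,
                  logits.begin() + (i + 1)*n_vocab,
                  sorted.begin() + out_ids[i]*n_vocab);
    }
    logits = std::move(sorted);
}
```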
