Prefilling assistant message in openai compatible API #13174

matteoserva · 2025-04-29T07:22:20Z

This adds support for prefilling the assistant response (or its thought process) through the OpenAI-compatible API.

The Claude API, for example, offers this feature.

It can be tested using open-webui or with the following curl command:

curl http://localhost:8080/apply-template \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "system",
        "content": "SYSTEM"
      },
      {
        "role": "user",
        "content": "USERMESSAGE"
      },
      {
        "role": "assistant",
        "content": "ASSISTANT"
      }
    ]
  }'

Example advanced scenario: time limit for the thinking process

- Launch a reasoning model and stop its thought early.
- Append </think> to its partial response.
- Prefill the response and let it continue generating tokens.
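
The three steps above could be driven from a client roughly as follows. This is only a sketch, assuming the server continues a trailing assistant message as described in this PR. The helper names build_prefill_messages and continue_generation are hypothetical, and whether the partial thought must start with an opening <think> tag depends on the model's chat template.

```python
import json
import urllib.request

THINK_CLOSE = "</think>"


def build_prefill_messages(messages, partial_thought):
    """Return a new list ending with an assistant message the server should continue.

    The interrupted thought is closed with </think> so the model moves on
    to the final answer instead of thinking further.
    """
    prefill = partial_thought.rstrip() + "\n" + THINK_CLOSE + "\n"
    return list(messages) + [{"role": "assistant", "content": prefill}]


def continue_generation(base_url, messages, partial_thought):
    """POST the prefilled conversation to a running server (hypothetical helper)."""
    body = {"messages": build_prefill_messages(messages, partial_thought)}
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["message"]["content"]
```

Only continue_generation needs a live server; build_prefill_messages is pure and leaves the caller's message list untouched.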

examples/server/utils.hpp

isaac-mcfadyen · 2025-04-30T04:26:08Z

Just a heads-up that this is potentially a very breaking change, especially because this is an OpenAI compatible API but this is not OpenAI's behavior.

The main situation I can think of is someone wanting to generate a new assistant message after the last one, i.e., for ChatML they want <|im_end|><|im_start|>assistant added between the last message and the new one, rather than having the last message simply continued.

I'd suggest we add this to #9291 at a minimum.

99991 · 2025-05-09T08:58:25Z

> Just a heads-up that this is potentially a very breaking change, especially because this is an OpenAI compatible API but this is not OpenAI's behavior.

A better alternative would be to use an additional "prefix": True key in the message dict as in the Mistral API.

There is also this issue about a prefix API. I think there is an issue with token healing.

matteoserva · 2025-05-09T14:03:10Z

The feature is aligned with the claude api and the open-webui client.

Using "prefix": True would break most clients that expect the current api.

99991 · 2025-05-09T14:10:45Z

> The feature is aligned with the claude api and the open-webui client.
>
> Using "prefix": True would break most clients that expect the current api.

That is because the Claude API is strictly worse than the Mistral API. You can't even tell whether the Claude API is broken without inspecting the output and you can't shut it off if you don't want that behavior.

isaac-mcfadyen · 2025-05-09T15:32:45Z

> The feature is aligned with the claude api and the open-webui client.

I believe llama-server is meant to be OpenAI compatible (which does not have this behavior), not Claude compatible.

> Using "prefix": True would break most clients that expect the current api.

I believe those clients would still allow adding custom metadata, correct? In which case using prefix: True in the metadata as suggested would work and still allow them to work with the official Claude API because that metadata entry would just be ignored.

matteoserva · 2025-05-09T16:16:37Z

> I believe those clients would still allow adding custom metadata, correct? In which case using prefix: True in the metadata as suggested would work and still allow them to work with the official Claude API because that metadata entry would just be ignored.

I am not aware of clients that support prefix: True in the message item but my knowledge is very limited.

An alternative implementation is continue_final_message in the request body, as used by vLLM.
Another alternative: add a command line option to disable the prefill feature.

For reference, here is example code that shows how to use both request-level options ("prefix" and continue_final_message):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="test")

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
        # Option 1: Mistral-style per-message flag on the message to continue
        {"role": "assistant", "content": "Hello!", "prefix": True},
    ],
    # Option 2: vLLM-style flag, sent as a top-level field of the request body
    extra_body={"continue_final_message": True},
)
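
On the receiving side, a server that wanted explicit opt-in could check for either field along these lines. This is a hypothetical sketch, not llama-server's code: prefill_requested is an invented name, and it assumes the body is the already-parsed JSON request, with continue_final_message arriving as a top-level field because the openai client merges extra_body into the request body.

```python
def prefill_requested(body):
    """Return True only when the client explicitly asked to continue the last message.

    `body` is the parsed JSON request as a dict. Either opt-in style counts:
    a top-level "continue_final_message": true, or "prefix": true on the
    final assistant message.
    """
    messages = body.get("messages", [])
    # Nothing to continue unless the conversation ends with an assistant turn.
    if not messages or messages[-1].get("role") != "assistant":
        return False
    if body.get("continue_final_message") is True:
        return True
    return messages[-1].get("prefix") is True
```

With this shape, a request that ends in an assistant message but carries neither field gets the old behavior (a fresh assistant turn), which is the opt-in semantics argued for above.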

isaac-mcfadyen · 2025-05-09T16:30:01Z

That sounds good, I'd very much vote for this being changed to a field in the body rather than default. 😄

isaac-mcfadyen · 2025-05-17T02:03:09Z

I just noticed that this change affected /apply-template and actually broke one of my deployed applications that uses that endpoint with an assistant message at the end (and expects <|im_start|>assistant\n to be added, which it no longer is).

@ngxson Apologies for not pinging when all of this was discussed last week, but this might be something we want to revert? Aside from not being OpenAI compatible (OpenAI does not have this behavior), it breaks applications that don't expect this behavior... perhaps this could be put behind an optional parameter as discussed above (in a future PR)?

For context this was my use-case that this change broke (with different text but same idea):

curl http://127.0.0.1:8080/apply-template --json '{"messages": [{"role": "user", "content": "Hello?"}, {"role": "assistant", "content": "Hello! How are you?"}]}'
# before: {"prompt":"<|im_start|>user\nHello?<|im_end|>\n<|im_start|>assistant\nHello! How are you?<|im_end|>\n<|im_start|>assistant\n"}
# after: {"prompt":"<|im_start|>user\nHello?<|im_end|>\n<|im_start|>assistant\nHello! How are you?"}
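
To make the difference concrete, here is a minimal ChatML renderer that reproduces both prompts above. It is an illustrative sketch, not the server's template engine: render_chatml and its prefill_assistant parameter are hypothetical, and real chat templates are Jinja templates shipped with the model.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages, prefill_assistant):
    """Render messages as a ChatML prompt.

    With prefill_assistant=True, a trailing assistant message is left open so
    the model continues it. Otherwise every message is closed and a fresh
    assistant turn is opened at the end.
    """
    continue_last = (
        prefill_assistant
        and bool(messages)
        and messages[-1]["role"] == "assistant"
    )
    parts = []
    for index, message in enumerate(messages):
        parts.append(IM_START + message["role"] + "\n" + message["content"])
        is_open_tail = continue_last and index == len(messages) - 1
        if not is_open_tail:
            parts.append(IM_END + "\n")
    if not continue_last:
        parts.append(IM_START + "assistant\n")
    return "".join(parts)
```

Calling it with the two messages from the curl example and prefill_assistant=False yields the "before" prompt; prefill_assistant=True yields the "after" prompt.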

ngxson · 2025-05-17T08:01:59Z

I think what we can do is to add a boolean to control this behaviour.

Re. your point about OAI compat, I think OAI doesn't allow 2 assistant messages (correct me if I'm wrong). The original PR suggests that this feature is indeed copied from Claude API though tbh I haven't had time to test it myself.

Nevertheless, I think we should still keep this feature because it's the simplest way to control reasoning models.

matteoserva · 2025-05-17T08:20:10Z

AFAIK, OAI doesn't allow "assistant" as last role. It was allowed in older models for prefilling using the same api as in this PR. In recent models that feature is disabled.

I'm thinking of adding a command line flag to optionally disable that and revert to older behavior. I have been busy IRL but I'll submit a PR when I have some time for writing the code.

isaac-mcfadyen · 2025-05-17T12:28:27Z

Makes sense, and @matteoserva I'm happy to PR a flag if you're good with that.

Do we want opt-in or opt-out behavior for the flag? Personally I think opt-in might be better to prevent surprises, but given this is already added we could also do opt-out.

ngxson · 2025-05-17T14:20:29Z

> Do we want opt-in or opt-out behavior for the flag?

I have no preference, but logically speaking, since we already introduced this as an "official" feature, we should avoid a breaking change by allowing opt-out.

matteoserva · 2025-05-17T14:44:39Z

Yeah, I'm certainly happy if you submit the PR.

My vote is for opt-out.

isaac-mcfadyen · 2025-05-17T14:45:42Z

Opt-out makes sense, I'll see about PRing later today when I get the chance!
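
The opt-out decision agreed on here can be sketched as follows. This is a hypothetical illustration, not the merged implementation: should_prefill and its prefill_assistant parameter are invented names, the parameter stands in for whatever a command line flag would set, and the error on two trailing assistant messages is inferred from the commit titled "no more than one assistant message at end of messages".

```python
def should_prefill(messages, prefill_assistant=True):
    """Decide whether the final assistant message should be continued.

    prefill_assistant defaults to True (opt-out): users who want the old
    behavior turn it off, and a new assistant turn is opened instead.
    """
    if not prefill_assistant or not messages:
        return False
    if messages[-1].get("role") != "assistant":
        return False
    # Assumption: with prefill active, two trailing assistant messages are
    # ambiguous and rejected rather than silently merged.
    if len(messages) >= 2 and messages[-2].get("role") == "assistant":
        raise ValueError("cannot have 2 or more assistant messages at the end of the list")
    return True
```

Note that with the feature switched off this sketch skips the validation entirely, so a conversation ending in two assistant messages renders as it did before the PR; that choice is an assumption of the sketch.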

Prefilling assistant message in openai compatible API

e829173

matteoserva requested a review from ngxson as a code owner April 29, 2025 07:22

github-actions bot added examples server labels Apr 29, 2025

fixed indentation

9d96e5c

ngxson reviewed Apr 29, 2025

matteo added 2 commits April 29, 2025 09:46

fixed code convention

496f08e

simplify method usage

79eb825

ngxson reviewed Apr 29, 2025

no more than one assistant message at end of messages

0c316cd

ngxson reviewed Apr 29, 2025

merge checks into prefill code

cb7fe04

ngxson reviewed Apr 29, 2025

Update examples/server/utils.hpp

836015d

ngxson approved these changes Apr 29, 2025

ngxson merged commit e2e1ddb into ggml-org:master Apr 29, 2025
47 of 48 checks passed

ngxson mentioned this pull request Apr 30, 2025

changelog : llama-server REST API #9291

Open

matteoserva mentioned this pull request May 9, 2025

Feature Request: Prefix assistant answer #11536

Closed

4 tasks

isaac-mcfadyen mentioned this pull request May 17, 2025

server : added --no-prefill-assistant flag #13608

Merged

remixer-dec mentioned this pull request May 18, 2025

Feature Request: llama-server support continue_final_message #11755

Closed

4 tasks
