feat(responses): add output_text delta events to responses #2265


Merged
merged 5 commits into from
May 27, 2025

Conversation

ashwinb
Contributor

@ashwinb ashwinb commented May 26, 2025

This adds initial streaming support to the Responses API.

This PR makes sure that the first inference call made to chat completions streams its output.

There's more to be done:

  • tool call output tokens need to stream out when possible
  • we need to loop through multiple rounds of inference, and each round needs to stream out.
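To illustrate the shape of the change, here is a minimal, hypothetical sketch (not the PR's actual code) of re-emitting streamed chat-completion chunks as Responses-style `output_text` delta events, with a placeholder message item id as mentioned in the review below. The event names and chunk shape are assumptions for illustration.

```python
# Hypothetical sketch: fold a streaming chat-completions call into
# Responses-style output_text delta events. All names are illustrative.
import asyncio
import uuid


async def fake_chat_completion_stream():
    # Stand-in for a real streaming chat-completions call.
    for piece in ["Hello", ", ", "world", "!"]:
        yield {"choices": [{"delta": {"content": piece}}]}


async def stream_output_text_deltas(inference_stream):
    """Yield (event_type, payload) tuples while accumulating the full text."""
    # Placeholder message item for delta events, as in the PR.
    message_item_id = f"msg_{uuid.uuid4()}"
    accumulated = []
    async for chunk in inference_stream:
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            accumulated.append(delta)
            yield ("response.output_text.delta",
                   {"item_id": message_item_id, "delta": delta})
    # Final event carries the fully accumulated text.
    yield ("response.output_text.done",
           {"item_id": message_item_id, "text": "".join(accumulated)})


async def main():
    return [e async for e in stream_output_text_deltas(fake_chat_completion_stream())]


events = asyncio.run(main())
print(events[-1][1]["text"])  # → Hello, world!
```

The point of the accumulation is that callers downstream (tool execution, storage) still see one complete message even though clients receive incremental deltas.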

Test Plan

Added a test. Executed as:

FIREWORKS_API_KEY=... \
  pytest -s -v 'tests/verifications/openai_api/test_responses.py' \
  --provider=stack:fireworks --model meta-llama/Llama-4-Scout-17B-16E-Instruct

Then, started a llama stack fireworks distro and tested against it like this:

OPENAI_API_KEY=blah \
  pytest -s -v 'tests/verifications/openai_api/test_responses.py' \
  --base-url http://localhost:8321/v1/openai/v1 \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 26, 2025
Contributor

@ehhuang ehhuang left a comment


A few questions. Also, if you haven't run the verification tests against OpenAI's implementation, it would be good to do so, just to verify that the tests are correctly checking for the official behavior.

Comment on lines 406 to 407
# Process response choices (tool execution and message creation)
output_messages = await self._process_response_choices(
Contributor


should we just name this function like _execute_tools or something more descriptive?

Contributor Author


@ehhuang I think we need to simplify further, because this function is muddled in how it thinks of itself :) The next set of PRs, which add multi-turn execution, will refactor it to be better. Thanks for the feedback.

# Create a placeholder message item for delta events
message_item_id = f"msg_{uuid.uuid4()}"

async for chunk in inference_result:
Contributor


async def stream_and_store_openai_completion(

Looks like we're doing these delta accumulations in more than one place (I recall seeing another instance somewhere, but can't recall the exact location); maybe some of the above can be reused. Could be a follow-up.
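The reuse suggested above could take the shape of a small shared accumulator. The sketch below is hypothetical (not code from this PR or from `stream_and_store_openai_completion`); the class name and API are assumptions for illustration.

```python
# Hypothetical shared helper for accumulating streamed content deltas,
# usable by both completion storage and Responses delta emission.
from dataclasses import dataclass, field


@dataclass
class DeltaAccumulator:
    parts: list = field(default_factory=list)

    def add(self, delta: "str | None") -> None:
        # Streaming chunks may carry no content (e.g. role-only deltas).
        if delta:
            self.parts.append(delta)

    @property
    def text(self) -> str:
        return "".join(self.parts)


acc = DeltaAccumulator()
for d in ["str", "eam", None, "ed"]:
    acc.add(d)
print(acc.text)  # → streamed
```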

@ashwinb
Contributor Author

ashwinb commented May 27, 2025

Tested against OpenAI client too. See updated test plan.

@ashwinb ashwinb merged commit 5cdb297 into meta-llama:main May 27, 2025
27 checks passed
@ashwinb ashwinb deleted the resp_stream branch May 27, 2025 20:07