
Add support for previous text in elevenlabs http processor #1590


Closed
wants to merge 12 commits into from

Conversation

danthegoodman1
Contributor

Please describe the changes in your PR. If it is addressing an issue, please reference that as well.

Allows the user to optionally provide the context array for the previous n messages that the assistant has said to pass to the previous_text parameter to enable more natural sounding speech with elevenlabs.

Only for the HTTP processor atm.

Closes #1399
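
The idea in this PR can be sketched as a small helper (hypothetical, not the actual pipecat code): pull the last N assistant messages out of the conversation context and pass them as `previous_text` in the ElevenLabs HTTP request body. The `previous_text` field is a real parameter of the ElevenLabs text-to-speech API; the helper name and the `max_previous_text` parameter are illustrative.

```python
def build_tts_payload(text, context_messages, max_previous_text=2):
    """Build an ElevenLabs /v1/text-to-speech request body (sketch).

    `context_messages` is a list of {"role": ..., "content": ...} dicts,
    as in an OpenAI-style context. Only the most recent
    `max_previous_text` assistant messages are folded into previous_text,
    so the accumulated text cannot grow without bound.
    """
    assistant_texts = [
        m["content"] for m in context_messages if m["role"] == "assistant"
    ]
    payload = {"text": text}
    previous = " ".join(assistant_texts[-max_previous_text:])
    if previous:
        payload["previous_text"] = previous
    return payload
```

The resulting dict would be sent as the JSON body of the POST to the text-to-speech endpoint alongside the model and voice settings.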


codecov bot commented Apr 14, 2025

Codecov Report

Attention: Patch coverage is 0% with 11 lines in your changes missing coverage. Please review.

Files with missing lines                | Patch % | Lines
src/pipecat/services/elevenlabs/tts.py  | 0.00%   | 11 Missing ⚠️

Files with missing lines                | Coverage Δ
src/pipecat/services/elevenlabs/tts.py  | 0.00% <0.00%> (ø)

@markbackman
Contributor

@danthegoodman1 in reading the docs, it seems like this feature is intended to provide audio continuity when a single response is split into segments. This does make sense to add and seems like something that should happen automatically.

I'm curious about your proposal. It seems to take previous turns from the context to provide continuity to the output. This seems unexpected as the previous turns may have had different context.

Anyway, I reached out to the 11Labs team to get a better understanding of how this feature should be used.

@danthegoodman1
Contributor Author

danthegoodman1 commented Apr 15, 2025

How would they have different context? It’s captured at generation time.

I can tell you that this massively improves audio quality. You don’t get any more random screaming for short sentences and such.

@markbackman
Contributor

I heard back from the 11Labs team. They're confirming that this feature is intended to ensure audio continuity when a single response is split into segments.

For example, Pipecat breaks streaming LLM responses into sentences and sends each sentence to TTS individually. previous_text is intended to include earlier sentences from the current response (i.e. turn) to maintain speech continuity.

Given this, I think it makes sense to provide previous sentences from the same turn with subsequent generations. But, it seems like providing previous context messages from previous turns isn't the intended use case for this feature.

@markbackman markbackman requested a review from jamsea April 16, 2025 02:42
@markbackman
Contributor

I spent a bunch of time on ElevenLabsHttpTTSService tonight. At the tail end, I added: #1600, which I think is how we want previous_text to be implemented. I'm inclined to close this PR @danthegoodman1 but am interested in your POV first.

@danthegoodman1
Contributor Author

This solution just uses the last N messages from the context; I don’t think you want to infinitely accumulate

@markbackman
Contributor

This solution just uses the last N messages from the context; I don’t think you want to infinitely accumulate

Which solution is that?

@danthegoodman1
Contributor Author

This PR, controlled with context_max_previous_text

@markbackman
Contributor

Right, but context messages are not what should be added to previous_text. It should be previous sentences from the current generation (i.e. bot's turn).

For example, "Hello. I'm chatbot, your assistant. How can I help you today?" would be:

  1. Input: "Hello.", previous_text: ""
  2. Input: "I'm chatbot, your assistant.", previous_text: "Hello."
  3. Input: "How can I help you today?", previous_text: "Hello. I'm chatbot, your assistant."

The previous_text is cleared after the turn ends.

Providing messages from previous turns would skew the response as that information isn't being spoken contextually with the words produced by the TTS.
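
The intra-turn behavior described above can be sketched as a small tracker (a sketch of the idea, not pipecat's actual implementation): each TTS request carries the sentences already spoken in the current turn as previous_text, and the new sentence is then folded into the accumulated text.

```python
class PreviousTextTracker:
    """Accumulates sentences spoken so far in the current bot turn."""

    def __init__(self):
        self._previous_text = ""

    def next_request(self, sentence):
        """Return (text, previous_text) for the next TTS call, then
        append the sentence to the accumulated turn text."""
        previous = self._previous_text
        self._previous_text = (previous + " " + sentence).strip()
        return sentence, previous

    def reset(self):
        """Clear accumulated text when the turn ends."""
        self._previous_text = ""
```

Running the three-sentence example through it reproduces the sequence above: the first call gets an empty previous_text, the second gets "Hello.", and the third gets "Hello. I'm chatbot, your assistant."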

@danthegoodman1
Contributor Author

Ah I see what you mean, I think your solution is better if you can trim the tail to some max length

@markbackman
Contributor

Ah I see what you mean, I think your solution is better if you can trim the tail to some max length

Turns are self-limiting, so I'm really not concerned about that. previous_text resets after an interruption (StartInterruptionFrame, TTSStoppedFrame) or end of turn (LLMFullResponseEndFrame). So, this will prevent the text length from getting out of control.

With that, I'll close this out. I'll leave your issue open (#1399). I'm hoping to get some tips from the 11Labs team on the single word case, as that's something we can still improve on.
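
The reset-on-frame logic described above can be sketched like this. The frame class names match the pipecat frames mentioned in the comment, but here they are stand-in classes defined locally so the example is self-contained; the real pipecat frames live in pipecat.frames.frames.

```python
# Stand-ins for the pipecat frame types referenced above.
class StartInterruptionFrame: pass
class TTSStoppedFrame: pass
class LLMFullResponseEndFrame: pass

class TTSTextFrame:
    def __init__(self, text):
        self.text = text

# Any of these frames means the turn ended or was interrupted,
# so previous_text must start fresh.
RESET_FRAMES = (StartInterruptionFrame, TTSStoppedFrame, LLMFullResponseEndFrame)

class PreviousTextState:
    def __init__(self):
        self.previous_text = ""

    def process(self, frame):
        if isinstance(frame, RESET_FRAMES):
            self.previous_text = ""
        elif isinstance(frame, TTSTextFrame):
            # This frame's text goes to TTS with the current
            # previous_text, then is appended for later requests.
            self.previous_text = (self.previous_text + " " + frame.text).strip()
```

Because every interruption and end-of-turn frame clears the state, the accumulated text is bounded by the length of a single turn, which is the "self-limiting" property noted above.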

@danthegoodman1
Contributor Author

In favor of #1600? Want to get this feature merged one way or another

Development

Successfully merging this pull request may close these issues.

Elevenlabs use previous_text to improve generation
3 participants