Add support for previous text in elevenlabs http processor #1590
Conversation
@danthegoodman1 In reading the docs, it seems like this feature is intended to provide audio continuity when a single response is split into segments. This does make sense to add and seems like something that should happen automatically. I'm curious about your proposal: it seems to take previous turns from the context to provide continuity to the output. This seems unexpected, as the previous turns may have had different context. Anyway, I reached out to the 11Labs team to get a better understanding of how this feature should be used.
How would they have different context? It's captured at generation time. I can tell you that this massively improves audio quality. You don't get any more random screaming for short sentences and such.
I heard back from the 11Labs team. They're confirming that this feature is intended to ensure audio continuity when a single response is split into segments. For example, Pipecat breaks streaming LLM responses into sentences and sends each sentence to TTS individually. previous_text is intended to include earlier sentences from the current response (i.e. turn) to maintain speech continuity. Given this, I think it makes sense to provide previous sentences from the same turn with subsequent generations. But providing context messages from previous turns isn't the intended use case for this feature.
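The intended usage above can be sketched as a small helper that accumulates previous_text within a single turn. This is a hypothetical illustration, not Pipecat's actual implementation; the class and method names here are made up for the example.

```python
# Hypothetical sketch: accumulate previous_text within one turn so each
# per-sentence TTS request carries the sentences already spoken in that turn.

class TurnPreviousText:
    """Tracks the text spoken so far in the current turn."""

    def __init__(self) -> None:
        self._previous_text = ""

    def next_request_params(self, sentence: str) -> dict:
        """Return TTS params for `sentence`, then record it for continuity."""
        params = {"text": sentence}
        if self._previous_text:
            params["previous_text"] = self._previous_text
        self._previous_text = (self._previous_text + " " + sentence).strip()
        return params

    def reset(self) -> None:
        """Call on interruption or end of turn."""
        self._previous_text = ""
```

With the turn "Hello. I'm chatbot, your assistant.", the first request carries no previous_text and the second carries previous_text="Hello.".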
I spent a bunch of time on ElevenLabsHttpTTSService tonight. At the tail end, I added #1600, which I think is how we want to support previous_text.
This solution just uses the last N messages from the context; I don't think you want to infinitely accumulate.
Which solution is that? |
This PR, controlled with context_max_previous_text |
Right, but context messages are not what should be added to previous_text. For example, "Hello. I'm chatbot, your assistant. How can I help you today?" would be: "Hello." with no previous_text, then "I'm chatbot, your assistant." with previous_text="Hello.", then "How can I help you today?" with previous_text="Hello. I'm chatbot, your assistant.".
The previous_text is cleared after the turn ends. Providing messages from previous turns would skew the response, as that information isn't being spoken contextually with the words produced by the TTS.
Ah, I see what you mean. I think your solution is better if you can trim the tail to some max length.
Turns are self limiting, so I'm really not concerned about that. previous_text resets after an interruption (StartInterruptionFrame, TTSStoppedFrame) or end of turn (LLMFullResponseEndFrame). So, this will prevent the text length from getting out of control. With that, I'll close this out. I'll leave your issue open (#1399). I'm hoping to get some tips from the 11Labs team on the single word case, as that's something we can still improve on. |
In favor of #1600? Want to get this feature merged one way or another |
Allows the user to optionally provide the context array for the previous n messages that the assistant has said to pass to the previous_text parameter, enabling more natural sounding speech with ElevenLabs. Only for the HTTP processor at the moment.
Closes #1399
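For reference, a hedged sketch of how the request body for the HTTP endpoint might be assembled. The ElevenLabs text-to-speech endpoint accepts a previous_text field alongside text; the helper name and the model_id value below are illustrative, not taken from this PR:

```python
# Hypothetical payload builder for ElevenLabs' HTTP TTS endpoint
# (POST /v1/text-to-speech/{voice_id}); field names per ElevenLabs docs,
# but verify against the current API before relying on this.

def build_tts_payload(text: str, previous_text: str = "") -> dict:
    payload = {
        "text": text,
        "model_id": "eleven_turbo_v2",  # illustrative model choice
    }
    if previous_text:
        # Earlier sentences from the same turn, for speech continuity.
        payload["previous_text"] = previous_text
    return payload
```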