If the input audio from the user is shorter than the VAD start_secs, the interruption is still triggered #1391


Open
sphatate opened this issue Mar 18, 2025 · 11 comments

@sphatate

sphatate commented Mar 18, 2025

We have VAD and STT + LLM + TTS setup.

We have set the VAD start_secs to 0.6, but even when the VAD doesn't detect an interruption for short utterances like "hmm", "ok", "fine", etc., the STT does, and the bot is interrupted.

Ideally, the interruption should not be detected, or the audio should not be passed to the STT.

Because of this, if the bot is currently speaking and the user says something shorter than start_secs, the bot completes its answer and then immediately responds to what the user said as if it were an interruption.

@sphatate sphatate changed the title from "When bot is in speaking state, is there a way to interrupt speech only if the user has spoken more than N number of words" to "is the input audio for users is less than start_secs of VAD still the interruption is getting triggered" Mar 19, 2025
@sphatate sphatate changed the title from "is the input audio for users is less than start_secs of VAD still the interruption is getting triggered" to "IF the input audio for users is less than start_secs of VAD still the interruption is getting triggered" Mar 19, 2025
@zoahmed-xyz

Can confirm this has been happening for us as well.

@markbackman
Contributor

We have set the VAD start_secs to 0.6, but even when the VAD doesn't detect an interruption for short utterances like "hmm", "ok", "fine", etc., the STT does, and the bot is interrupted.

This is intentional behavior. Here's some info to consider:

  • The VAD is used to detect if a user is speaking. This provides the fastest and most reliable signal, so Pipecat can interrupt the bot when the user is starting to speak.
  • There are times when the user speaks, the VAD doesn't trigger, but a transcription is still produced. This was happening with short utterances like "Yup", "No", "OK", etc. We had an overwhelming amount of feedback that this was causing problems and user frustration; specifically, the user would speak, but the bot would not respond.

As a result, we modified Pipecat's logic such that interruptions occur when:

  • VAD detects the start of speech
  • The STT service generates a final transcript

Both are signals of a user's intention to speak. This helps solve the "short utterances" problem. We made additional improvements in 0.0.60 that should help further.

The VAD start_secs default is 0.2 sec. May I ask why you've increased the time to 0.6 sec?
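For context, start_secs is set through the VAD analyzer's parameters. A minimal configuration sketch, assuming the SileroVADAnalyzer and VADParams import paths and defaults from recent pipecat-ai releases (verify against your installed version):

```python
# Sketch: tuning VAD timing in Pipecat. Import paths and default values
# below are assumptions based on recent pipecat-ai releases.
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams

vad_analyzer = SileroVADAnalyzer(
    params=VADParams(
        start_secs=0.2,  # how long speech must persist before "user started speaking"
        stop_secs=0.8,   # how long silence must persist before "user stopped speaking"
    )
)
```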

@markbackman markbackman self-assigned this Mar 21, 2025
@sphatate
Author

Thanks @markbackman for the clarification.

We have kept start_secs in the range of 0.4 to 0.6 because we want the bot to detect an interruption only if the user has spoken for at least that many seconds.

Suggestion: could we have a flag like allow_short_interruption=True/False in the VAD?

  • If "True", then the audio is passed from the VAD to the STT even for short utterances like "ok", "hmm", "ha", etc.
    In such cases, interruptions occur when:
    • VAD detects the start of speech
    • The STT service generates a final transcript
  • If "False", then start_secs is respected: only if the user has spoken for at least start_secs is the audio passed from the VAD to the STT and an interruption detected.

@markbackman
Contributor

Good to know. The issue with changing start_secs like that: if the bot is not speaking and the user says "Yes" to the bot's yes-or-no question, the bot will ignore them. This would be a poor user experience.

What if, while the bot is speaking, the user can only interrupt after they've spoken N number of words, where N is configurable?

That would allow:

  • The VAD to be set to 0.2 sec for fast speech detection in all cases
  • But only interrupt the bot (while the bot is speaking) if N words are detected in the TranscriptionFrame

In this algorithm, I think N being 2 would be a good setting, as it would filter out affirmation words that are natural to say while someone speaks.
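That gate could be sketched as a small helper (hypothetical names; this is not current Pipecat API):

```python
# Hypothetical sketch of the proposed N-word interruption gate.
# While the bot is speaking, a final transcript only counts as an
# interruption if it contains at least `min_words` words.

def should_interrupt(transcript: str, bot_speaking: bool, min_words: int = 2) -> bool:
    """Decide whether a final transcript should interrupt the bot."""
    if not bot_speaking:
        # Bot is silent: every final transcript is a normal user turn.
        return True
    return len(transcript.split()) >= min_words
```

With min_words=2, a bare "OK" while the bot speaks is ignored, but "please stop" interrupts.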

Would that be a viable solution?

@sphatate
Author

Yes, having an N-number-of-words threshold would be a viable solution.

@markbackman
Contributor

This type of logic is something we're interested in adding to Pipecat soon. I'd say within the next month.

@devedgecio

Good to know. The issue with changing start_secs like that: if the bot is not speaking and the user says "Yes" to the bot's yes-or-no question, the bot will ignore them. This would be a poor user experience.

What if, while the bot is speaking, the user can only interrupt after they've spoken N number of words, where N is configurable?

That would allow:

  • The VAD to be set to 0.2 sec for fast speech detection in all cases
  • But only interrupt the bot (while the bot is speaking) if N words are detected in the TranscriptionFrame

In this algorithm, I think N being 2 would be a good setting, as it would filter out affirmation words that are natural to say while someone speaks.

Would that be a viable solution?

@markbackman The problem with that approach is that if we set N equal to 2, some filler words might also have a count of 2 (e.g., "yeah yeah," "oh okay," "mhmm okay," etc.). I believe the solution should be more robust—for example, it should ignore filler words. However, if the bot is asking a question and the user interrupts with "yes," "no," or "yeah sure," it should recognize that as an actual interruption.

Basing the approach solely on the number of words wouldn't be as effective as enabling the bot to understand the context and handle interruptions accordingly.

Moreover, if the user interrupts the bot, there should be an option to restart the interrupted response from where the user left off. This could enhance the user experience. In that case, developers could leverage an LLM to handle the scenario by instructing it as follows:

  • If the user responds with a filler word, repeat the same response.
  • If the user asks a new question, answer it accordingly.
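Those instructions could be expressed as a system-prompt fragment; the wording below is illustrative only, not a Pipecat feature:

```python
# Illustrative system-prompt fragment for handling interruptions.
INTERRUPTION_INSTRUCTIONS = """\
If the user's last message is only a filler or acknowledgement
(for example "ok", "hmm", "yeah"), continue your previous answer
from where you left off.
If the user asks a new question instead, abandon the previous
answer and respond to the new question.
"""
```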

@markbackman
Contributor

The problem with that approach is that if we set N equal to 2, some filler words might also have a count of 2 (e.g., "yeah yeah," "oh okay," "mhmm okay," etc.). I believe the solution should be more robust—for example, it should ignore filler words.

Ignoring filler words means string matching, which is not a robust solution. Users say a lot of things and transcripts are not perfectly accurate. We'll think through this, but I'm hesitant to string match. (You can always build your own version that does this!)
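For anyone who does want to roll their own, a minimal string-matching filter might look like the sketch below; the FILLERS set is an arbitrary example, and, as noted, this approach stays brittle against imperfect transcripts:

```python
# DIY filler-word filter: treat a transcript as "filler only" if every
# word is in a small allow list. Brittle by design; illustrative only.
FILLERS = {"ok", "okay", "yeah", "yep", "yup", "mhmm", "hmm", "uh-huh", "right"}

def is_filler_only(transcript: str) -> bool:
    words = [w.strip(".,!?") for w in transcript.lower().split()]
    return bool(words) and all(w in FILLERS for w in words)
```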

However, if the bot is asking a question and the user interrupts with "yes," "no," or "yeah sure," it should recognize that as an actual interruption.

This is even more difficult! You're talking about contextual interruptions. That is, the code handling the interruption detection has to be aware that the bot is asking a question and what the user is saying is a response to the question. You could use an LLM for this but that adds cost and latency. So, that's not a viable solution.

So, while number of words filtering is not a perfect solution, it offers an option for filtering out short responses while the bot is speaking. The idea being, if the user was actually ignored when they shouldn't have been, they'll escalate with the bot to force an interruption (e.g. say more until the bot listens).

Moreover, if the user interrupts the bot, there should be an option to restart the interrupted response from where the user left off.

This already happens, but in a natural-language way. If you're using Cartesia or Rime, the context is updated with only the words the bot has actually spoken (i.e. what the user has heard). The bot uses this context and the latest user input to determine whether it should repeat itself or pursue the user's new thought. You can test this today by interrupting with just "OK". The bot will mostly keep going (assuming you're using Cartesia or Rime). For example, ask the bot to tell you a story. While it's telling the story, just say "OK"; it will keep telling it. Conversely, interrupt by asking it to tell you a joke; it will skip the story and tell the joke. This is already pretty good in my experience!

Just to point out: what you're sharing makes sense, but we (the entire tech community) don't have the tools available to solve these problems yet. As we get better models that help us understand the semantics, we can add these tools to make the conversation more natural.

@devedgecio

devedgecio commented Apr 3, 2025

@markbackman Thanks for your response!

This already happens, but in a natural-language way. If you're using Cartesia or Rime, the context is updated with only the words the bot has actually spoken (i.e. what the user has heard). The bot uses this context and the latest user input to determine whether it should repeat itself or pursue the user's new thought. You can test this today by interrupting with just "OK". The bot will mostly keep going (assuming you're using Cartesia or Rime). For example, ask the bot to tell you a story. While it's telling the story, just say "OK"; it will keep telling it. Conversely, interrupt by asking it to tell you a joke; it will skip the story and tell the joke. This is already pretty good in my experience!

I am using Deepgram TTS and STT services. I tried to control the interruption on my end. For example, if the user says a filler word, the sentence is repeated, and if the user asks a new question, it is handled accordingly. However, the issue is that instead of continuing the response from the point where the user interrupts, it repeats the response from the beginning. Are you saying that Cartesia or Rime will resolve this issue?

@markbackman
Contributor

Are you saying that Cartesia or Rime will resolve this issue?

Yes, CartesiaTTSService, RimeTTSService, and ElevenLabsTTSService all offer word/timestamp pairs. That means each API provides Pipecat with exact word timing. Pipecat uses that timing to update the context with the exact words the bot communicated. So, if the bot is interrupted mid-sentence, the context reflects that. With other TTS services that don't offer word/timestamp pairs, the context receives whatever entire phrases were generated, which is often the bot's entire turn of output.
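Conceptually, word/timestamp pairs let the pipeline keep only what the user actually heard. An illustrative sketch (not Pipecat's internals):

```python
# Given (word, start_time_secs) pairs reported by the TTS service, keep
# only the words spoken before the moment of interruption, so the LLM
# context matches what the user actually heard.

def spoken_before(words_with_times, interrupted_at: float) -> str:
    return " ".join(word for word, t in words_with_times if t < interrupted_at)
```

If the bot was interrupted 0.7 seconds into a sentence, only the words that started before 0.7 seconds survive into the context.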

Anyway, give it a try.

@devedgecio

@markbackman Okay, thanks! I'll give it a try.

4 participants