Azure STT transcription does not update the context aggregator instantly after speech fully recognized #1440
Comments
I just wrote this long reply to another user to explain how the system works and why the 1 sec timer exists. Hopefully this helps to explain things.

Some background: the current system is built to handle a wide variety of cases where transcription frames are received at different times relative to the user's speaking status. It's optimized to not trigger multiple completions back to back if you happen to get two consecutive final transcripts within a short period of time. All of this means that it's not uncommon to see a final `TranscriptionFrame` arrive after the user has stopped speaking. In fact, because the VAD runs locally, it's very uncommon to get the final `TranscriptionFrame` before the `UserStoppedSpeakingFrame`. This isn't a problem though, as the current logic is set up to handle this case and a number of other more complex, but also common, cases.

For the cases you linked, I'm not sure any of them are an issue. For example, the 1 second timeout that the user in #1440 mentions: this waiting period is to ensure that all `TranscriptionFrame`s have been received after speech. Without it, if you received two final transcripts shortly after the user stopped speaking, the LLM would generate two completions, which would result in two text outputs and two TTS outputs. Because the user was still speaking, these could be two related parts of the same sentence split across two transcripts. The LLM response in this case may be weird, including the fact that it may repeat itself. In our experience, it's worth waiting a short bit of time to ensure the completion takes into account the user's full turn.
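A simplified sketch of that timeout pattern may help. This is illustrative, not the actual Pipecat implementation; it just reuses the names mentioned in this thread (`_aggregation_event`, `aggregation_timeout`, `push_aggregation`), and the sample utterance is made up:

```python
import asyncio


# Sketch of the timeout behavior described above (not Pipecat's actual code).
# Each final transcript sets an event; the aggregation is only pushed once
# `aggregation_timeout` elapses with no new transcript arriving.
class UserAggregator:
    def __init__(self, aggregation_timeout: float = 1.0):
        self._aggregation_timeout = aggregation_timeout
        self._aggregation_event = asyncio.Event()
        self._aggregation: list[str] = []
        self._task = asyncio.create_task(self._aggregation_task())

    def handle_transcription(self, text: str) -> None:
        # Called for every final transcript; arms (or re-arms) the timeout.
        self._aggregation.append(text)
        self._aggregation_event.set()

    async def _aggregation_task(self) -> None:
        while True:
            await self._aggregation_event.wait()
            self._aggregation_event.clear()
            # Keep extending the window while new transcripts keep arriving,
            # so two back-to-back finals produce a single completion.
            while True:
                try:
                    await asyncio.wait_for(
                        self._aggregation_event.wait(), self._aggregation_timeout
                    )
                    self._aggregation_event.clear()
                except asyncio.TimeoutError:
                    break
            await self.push_aggregation()

    async def push_aggregation(self) -> None:
        if self._aggregation:
            print("one LLM completion for:", " ".join(self._aggregation))
            self._aggregation.clear()


async def main() -> None:
    agg = UserAggregator(aggregation_timeout=1.0)
    agg.handle_transcription("What's the weather")  # first final transcript
    await asyncio.sleep(0.3)
    agg.handle_transcription("like today?")  # second final, within the window
    await asyncio.sleep(1.5)  # timeout elapses -> exactly one completion


asyncio.run(main())
```

The point of the inner loop is that the window slides: the push only happens after a full quiet period, not after a fixed delay from the first transcript.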
Thanks for the explanation. I think it makes sense for most use cases. I just want to emphasize that, because of the way an STT service like Azure is implemented, we will always receive the final `TranscriptionFrame` after the user has finished speaking. I might be wrong, but if this is true, we're just adding a free 1s of latency to every user turn. Locally, the fix is to call `push_aggregation()` directly instead of resetting the timer, as in the sketch below.
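To make that local change concrete, here is a rough sketch of such an override. It assumes Pipecat's standard `process_frame()` hook and that `push_aggregation()` is an async method; hook points and import paths vary across Pipecat versions, so treat this as illustrative rather than the project's recommended fix:

```python
from pipecat.frames.frames import Frame, TranscriptionFrame
from pipecat.processors.aggregators.llm_response import LLMUserContextAggregator
from pipecat.processors.frame_processor import FrameDirection


# Hypothetical subclass (not part of Pipecat): since Azure only emits a final
# TranscriptionFrame once speech has fully ended, push the aggregation
# immediately instead of waiting out the 1s timer.
class AzureEagerUserContextAggregator(LLMUserContextAggregator):
    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, TranscriptionFrame):
            # Assumes push_aggregation() is async and is a no-op when there is
            # nothing left to aggregate; verify against your Pipecat version.
            await self.push_aggregation()
```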
Original issue description:

The AzureSTTService pushes the transcribed text (`TranscriptionFrame`) after the user has finished speaking. It does not push any `InterimTranscriptionFrame`.
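For context, this matches the event model of the underlying Azure Speech SDK. Here is a standalone sketch using `azure-cognitiveservices-speech` (this is not Pipecat's `AzureSTTService` code): the `recognizing` event carries interim hypotheses while the user is talking, while `recognized` fires once with the final text after the utterance has ended, which is why a final-only integration always delivers its transcript after speech is over.

```python
import time

import azure.cognitiveservices.speech as speechsdk

# Standalone sketch of the Azure Speech SDK event model (not Pipecat code).
speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

# Interim hypotheses: what an InterimTranscriptionFrame would be built from.
recognizer.recognizing.connect(lambda evt: print("interim:", evt.result.text))

# Final result: what becomes the TranscriptionFrame; fires after speech ends.
recognizer.recognized.connect(lambda evt: print("final:", evt.result.text))

recognizer.start_continuous_recognition()
time.sleep(10)  # speak during this window
recognizer.stop_continuous_recognition()
```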
In the `LLMUserContextAggregator` code, when the `TranscriptionFrame` is received, we reset a timer (`_aggregation_event`) to 1s before calling `push_aggregation()`. While I guess this makes sense for most providers, it seems that with Azure, when we receive the transcription, we should be able to call `push_aggregation()` instantly.
To fix this locally (it only works with Azure), I call `push_aggregation()` directly instead of resetting the timer. I also tried decreasing `aggregation_timeout`, which works. I'm not sure either of these solutions is valid project-wide.

I'd be happy to help, but I need some pointers first.
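If shrinking the timeout turns out to be the acceptable route, newer Pipecat versions expose it as a parameter when building the context aggregator. A sketch assuming the `LLMUserAggregatorParams` style (older versions may accept `aggregation_timeout` directly on the aggregator, so check the API of the version in use; the OpenAI service, model, and 0.2s value below are placeholders):

```python
from pipecat.processors.aggregators.llm_response import LLMUserAggregatorParams
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.openai.llm import OpenAILLMService

# Sketch of the second workaround: shrink the aggregation window instead of
# bypassing it entirely.
llm = OpenAILLMService(api_key="YOUR_KEY", model="gpt-4o")
context = OpenAILLMContext()
context_aggregator = llm.create_context_aggregator(
    context,
    user_params=LLMUserAggregatorParams(aggregation_timeout=0.2),
)
```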