
Azure STT transcription does not update the context aggregator instantly after speech fully recognized. #1440

Open
aurelien-ldp opened this issue Mar 24, 2025 · 2 comments

@aurelien-ldp

The AzureSTTService pushes the transcribed text (TranscriptionFrame) after the user has finished speaking.
It does not push any InterimTranscriptionFrame.

In the LLMUserContextAggregator code, when the TranscriptionFrame is received, we set _aggregation_event, which resets a roughly 1 s aggregation timer before push_aggregation() is called.

# llm_response.py (LLMUserContextAggregator)
async def _handle_transcription(self, frame: TranscriptionFrame):
    self._aggregation += f" {frame.text}" if self._aggregation else frame.text
    # We just got a final result, so let's reset interim results.
    self._seen_interim_results = False
    # Reset aggregation timer.
    self._aggregation_event.set()

While this makes sense for most providers, it seems that with Azure we should be able to call push_aggregation() immediately when the transcription is received.

To fix this locally (it only works with Azure), I call push_aggregation() directly instead of resetting the timer.
I also tried decreasing aggregation_timeout, which also works.
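
For concreteness, a minimal sketch of the first workaround (the class name is made up, the import paths are my best guess, and it reuses the private _handle_transcription() / push_aggregation() hooks quoted above, so treat it as illustrative only):

# Hypothetical Azure-only subclass that pushes the aggregation immediately.
from pipecat.frames.frames import TranscriptionFrame
from pipecat.processors.aggregators.llm_response import LLMUserContextAggregator

class AzureEagerUserContextAggregator(LLMUserContextAggregator):
    async def _handle_transcription(self, frame: TranscriptionFrame):
        self._aggregation += f" {frame.text}" if self._aggregation else frame.text
        self._seen_interim_results = False
        # Azure sends a single final transcript per utterance, so push the
        # aggregation right away instead of arming the 1s timer.
        await self.push_aggregation()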

I'm not sure either of these solutions is valid project-wide.
I'd be happy to help, but I need some guidance first.

@markbackman
Contributor

I just wrote this long reply to another user explaining how the system works and why the 1 second timer exists. Hopefully this helps explain things:


Some background:

  • The VAD and STT are two decoupled services that work in conjunction to represent when a user speaks and what they said.
  • The VAD runs locally and detects speech very fast; this is important for triggering an interruption quickly.
  • The STT service runs remotely and requires a network roundtrip in addition to generation; this usually runs more slowly.
  • The STT service can emit any number of interim and final transcripts for a given utterance. This is variable and uncontrollable.

The current system is designed to handle a wide variety of cases where transcription frames arrive at different times relative to the user's speaking status. It's optimized to avoid triggering multiple completions back to back when two consecutive final transcripts arrive close together.

All of this means that it's not uncommon to see:

UserStartedSpeakingFrame
UserStoppedSpeakingFrame
TranscriptionFrame

In fact, because the VAD runs locally, it's very uncommon to get the TranscriptionFrame before the UserStoppedSpeakingFrame, simply because of the network transit time required to receive it.

This isn't a problem though, as the current logic is set up to handle this case and a number of other more complex, but also common cases.

For the cases you linked, I'm not sure any of them are an issue.

Take the 1 second timeout that the user in #1440 mentions: this waiting period ensures that all TranscriptionFrames have been received after speech ends. Without it, if you received:

UserStartedSpeakingFrame
UserStoppedSpeakingFrame
TranscriptionFrame (t=0s)
TranscriptionFrame (t=0.5s)

This would result in the LLM generating two completions, which would result in two text outputs and two TTS outputs. Because the user was still speaking, these could be two related parts of the same sentence, e.g.:

Hi, I'm good.
How are you?

The LLM response in this case may be odd, and it may even repeat itself. In our experience, it's worth waiting a short amount of time to ensure the completion takes the user's full turn into account.
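
To make the timer's role concrete, here is a small self-contained sketch of the debounce pattern described above. It is not Pipecat's actual code (the class and names below are invented for illustration): each final transcript re-arms a quiet-period timer, and the aggregation is only pushed once no new transcript arrives within aggregation_timeout, so the two fragments above become a single completion.

# Illustrative debounce sketch (invented names, not Pipecat's implementation).
import asyncio

class DebouncedAggregator:
    def __init__(self, aggregation_timeout: float = 1.0):
        self._timeout = aggregation_timeout
        self._aggregation = ""
        self._event = asyncio.Event()
        self._task = None

    def handle_transcription(self, text: str):
        # Append the final transcript and (re)arm the quiet-period timer.
        self._aggregation += f" {text}" if self._aggregation else text
        self._event.set()
        if self._task is None:
            self._task = asyncio.create_task(self._wait_and_push())

    async def _wait_and_push(self):
        # Keep waiting as long as new transcripts arrive within the timeout.
        while True:
            self._event.clear()
            try:
                await asyncio.wait_for(self._event.wait(), self._timeout)
            except asyncio.TimeoutError:
                break
        print(f"push_aggregation: {self._aggregation!r}")
        self._aggregation = ""
        self._task = None

async def main():
    agg = DebouncedAggregator(aggregation_timeout=1.0)
    agg.handle_transcription("Hi, I'm good.")  # TranscriptionFrame (t=0s)
    await asyncio.sleep(0.5)
    agg.handle_transcription("How are you?")   # TranscriptionFrame (t=0.5s)
    await asyncio.sleep(2)                     # one merged completion is pushed

asyncio.run(main())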

@aurelien-ldp
Author

Thanks for the explanation. I think it makes sense for most use-cases.

I just want to emphasize that, given how an STT service like Azure is implemented, we will always receive the TranscriptionFrame after the UserStoppedSpeakingFrame.

I might be wrong, but if that's true, we're adding a flat 1 s of latency to every user turn.
In my experience this is always the case.

Locally, my fix is to call push_aggregation() when the TranscriptionFrame is received, but it only works for some STT providers.
I'd be happy to help if you think this is something that can be improved.
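
For illustration, the reduced-timeout workaround could look roughly like this. I'm assuming the aggregator exposes the aggregation_timeout mentioned above (in seconds) as a constructor argument; the exact parameter plumbing may differ between Pipecat versions, so check llm_response.py in your install first:

# Hypothetical configuration sketch for a shorter aggregation timeout.
from pipecat.processors.aggregators.llm_response import LLMUserContextAggregator
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext

context = OpenAILLMContext(
    messages=[{"role": "system", "content": "You are a helpful voice assistant."}]
)
# Azure emits one final transcript per utterance, so a much shorter quiet
# period (e.g. 0.2s) recovers most of the extra latency without changing
# the aggregation logic itself.
user_aggregator = LLMUserContextAggregator(context, aggregation_timeout=0.2)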
