Azure STT transcription does not update the context aggregator instantly after speech fully recognized. #1440

Closed
@aurelien-ldp

The AzureSTTService pushes the transcribed text (TranscriptionFrame) after the user has finished speaking.
It does not push any InterimTranscriptionFrame.

In the LLMUserContextAggregator code, when the TranscriptionFrame is received, we set _aggregation_event, which resets the aggregation timer to 1s. push_aggregation() is only called once that timer expires.

# llm_response.py (LLMUserContextAggregator)
async def _handle_transcription(self, frame: TranscriptionFrame):
    self._aggregation += f" {frame.text}" if self._aggregation else frame.text
    # We just got a final result, so let's reset interim results.
    self._seen_interim_results = False
    # Reset aggregation timer.
    self._aggregation_event.set()

This delay probably makes sense for providers that send interim results, but Azure only sends the final transcription, so when we receive it we should be able to call push_aggregation() right away.

As a local fix (which only works with Azure), I call push_aggregation() directly instead of resetting the timer.
I also tried decreasing aggregation_timeout, which works too.
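To make the two workarounds concrete, here is a minimal, self-contained sketch. It is a toy model, not Pipecat's actual implementation: ToyUserAggregator, run_timer, measure_delay and the push_on_final flag are hypothetical names used only for illustration, and the timer loop is an assumption about how an event-plus-timeout aggregator could behave.

```python
import asyncio


class ToyUserAggregator:
    """Toy stand-in for a user context aggregator (hypothetical, not Pipecat's class)."""

    def __init__(self, aggregation_timeout=1.0, push_on_final=False):
        self.aggregation_timeout = aggregation_timeout
        self.push_on_final = push_on_final  # hypothetical flag for workaround 1
        self.pushed = []  # aggregations that were "pushed downstream"
        self._aggregation = ""
        self._aggregation_event = asyncio.Event()

    async def handle_transcription(self, text):
        self._aggregation += f" {text}" if self._aggregation else text
        if self.push_on_final:
            # Workaround 1: push as soon as the final transcription arrives.
            await self.push_aggregation()
        else:
            # Default path: restart the timer and wait for it to expire.
            self._aggregation_event.set()

    async def push_aggregation(self):
        if self._aggregation:
            self.pushed.append(self._aggregation)
            self._aggregation = ""

    async def run_timer(self):
        # Push only after aggregation_timeout passes with no new event.
        while True:
            try:
                await asyncio.wait_for(
                    self._aggregation_event.wait(), self.aggregation_timeout
                )
            except asyncio.TimeoutError:
                await self.push_aggregation()
            finally:
                self._aggregation_event.clear()


async def measure_delay(**kwargs):
    """Return the seconds between receiving a transcription and pushing it."""
    aggregator = ToyUserAggregator(**kwargs)
    timer = asyncio.create_task(aggregator.run_timer())
    loop = asyncio.get_running_loop()
    start = loop.time()
    await aggregator.handle_transcription("hello there")
    while not aggregator.pushed:
        await asyncio.sleep(0.005)
    timer.cancel()
    return loop.time() - start


if __name__ == "__main__":
    # Workaround 2 is simply a smaller aggregation_timeout.
    print("timer path:", asyncio.run(measure_delay(aggregation_timeout=0.2)))
    print("direct push:", asyncio.run(measure_delay(push_on_final=True)))
```

In this model the timer path always pays roughly aggregation_timeout before the text is pushed, while the direct push adds no wait. The trade-off is that a direct push cannot merge several final transcriptions into a single aggregation.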

I'm not sure either of these solutions is valid project-wide.
I'd be happy to help, but I need some guidance first.
