docs: speech to speech #593

Merged: 13 commits, May 28, 2025
14 changes: 6 additions & 8 deletions docs/API_documentation/connectors/overview.md
@@ -40,10 +40,9 @@ Extends BaseConnector with Human-Robot Interaction capabilities:

### Concrete Implementations

| Connector | Description | Documentation Link |
| ---------------------- | -------------------------------------- | ------------------------------------------ |
| ROS 2 Connectors | Robot Operating System 2 integration | [ROS2 Connectors](./ROS_2_Connectors.md) |
| Sound Device Connector | Audio streaming and playback/recording | [Sound Device Connector](./sounddevice.md) |
| Connector | Description | Documentation Link |
| ---------------- | ------------------------------------ | ---------------------------------------- |
| ROS 2 Connectors | Robot Operating System 2 integration | [ROS2 Connectors](./ROS_2_Connectors.md) |

## Key Features

@@ -89,10 +88,9 @@ Connectors implement thread-safe operations:

## Usage Examples

| Connector | Example Usage Documentation |
| ------------ | -------------------------------------------------------- |
| ROS 2 | [ROS2 Connectors](./ROS_2_Connectors.md#example-usage) |
| Sound Device | [Sound Device Connector](./sounddevice.md#example-usage) |
| Connector | Example Usage Documentation |
| --------- | ------------------------------------------------------ |
| ROS 2 | [ROS2 Connectors](./ROS_2_Connectors.md#example-usage) |

## Error Handling

83 changes: 83 additions & 0 deletions docs/speech_to_speech/agents/asr.md
@@ -0,0 +1,83 @@
# SpeechRecognitionAgent

## Overview

The `SpeechRecognitionAgent` in the RAI framework is a specialized agent that performs voice activity detection (VAD), audio recording, and transcription. It integrates tightly with audio input sources and ROS2 messaging, allowing it to serve as a real-time voice interface for robotic systems.

This agent manages multiple pipelines for detecting when to start and stop recording, performs transcription using configurable models, and broadcasts messages to relevant ROS2 topics.

## Class Definition

??? info "SpeechRecognitionAgent class definition"

::: rai_s2s.asr.agents.asr_agent.SpeechRecognitionAgent

## Purpose

The `SpeechRecognitionAgent` class enables real-time voice processing with the following responsibilities:

- Detecting speech through VAD
- Managing recording state and grace periods
- Buffering and threading transcription processes
- Publishing transcriptions and control messages to ROS2 topics
- Supporting multiple VAD and transcription model types

## Initialization Parameters

| Parameter | Type | Description |
| --------------------- | -------------------------- | ----------------------------------------------------------------------------- |
| `microphone_config` | `SoundDeviceConfig` | Configuration for the microphone input. |
| `ros2_name` | `str` | Name of the ROS2 node. |
| `transcription_model` | `BaseTranscriptionModel` | Model instance for transcribing speech. |
| `vad` | `BaseVoiceDetectionModel` | Model for detecting voice activity. |
| `grace_period` | `float` | Time (in seconds) to continue buffering after speech ends. Defaults to `1.0`. |
| `logger` | `Optional[logging.Logger]` | Logger instance. If `None`, defaults to module logger. |

## Key Methods

### `from_config()`

Creates a `SpeechRecognitionAgent` instance from a YAML config file. Dynamically loads the required transcription and VAD models.

### `run()`

Starts the microphone stream and handles incoming audio samples.

### `stop()`

Stops the agent gracefully, joins all running transcription threads, and shuts down ROS2 connectors.
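
Taken together, a minimal lifecycle sketch could look as follows, assuming a YAML configuration compatible with `from_config()`; the config path is hypothetical, the argument name is assumed, and `run()` is treated here as non-blocking:

```python
from rai_s2s.asr.agents.asr_agent import SpeechRecognitionAgent

# Hypothetical config path; the file is expected to describe the transcription and VAD models.
agent = SpeechRecognitionAgent.from_config("config/asr_agent.yaml")

try:
    agent.run()  # starts the microphone stream and begins handling audio samples
    input("Press Enter to stop...\n")
finally:
    agent.stop()  # joins transcription threads and shuts down the ROS2 connectors
```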

### `add_detection_model(model, pipeline="record")`

Adds a custom VAD model to a processing pipeline.

- `pipeline` can be either `'record'` or `'stop'`

!!! note "`'stop'` pipeline"

The `'stop'` pipeline is present for forward compatibility. It currently does not affect the Agent's functioning.
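
As an illustrative sketch, a wake-word gate could be added to the `'record'` pipeline as shown below; the config path and the `OpenWakeWord` constructor argument are assumptions, not documented signatures:

```python
from rai_s2s.asr.agents.asr_agent import SpeechRecognitionAgent
from rai_s2s.asr.models.open_wake_word import OpenWakeWord

agent = SpeechRecognitionAgent.from_config("config/asr_agent.yaml")  # hypothetical path

# The wake-word name passed to OpenWakeWord is an assumed constructor argument.
agent.add_detection_model(OpenWakeWord("hey_jarvis"), pipeline="record")
agent.run()
```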

## Best Practices

1. **Graceful Shutdown**: Always call `stop()` to ensure transcription threads complete.
2. **Model Compatibility**: Ensure all transcription and VAD models are compatible with the sample rate (typically 16 kHz).
3. **Thread Safety**: Use provided locks for shared state, especially around the transcription model.
4. **Logging**: Utilize `self.logger` for debug and info logs to aid in tracing activity.
5. **Config-driven Design**: Use `from_config()` to ensure modular and portable deployment.

## Architecture

The `SpeechRecognitionAgent` typically interacts with the following components:

- **SoundDeviceConnector**: Interfaces with microphone audio input.
- **BaseVoiceDetectionModel**: Determines whether speech is present.
- **BaseTranscriptionModel**: Converts speech audio into text.
- **ROS2Connector / ROS2HRIConnector**: Publishes transcription and control messages to ROS2 topics.
- **Config Loader**: Dynamically creates agent from structured config files.

## See Also

- [BaseAgent](../../API_documentation/agents/overview.md): Abstract agent class providing lifecycle and logging support.
- [ROS2 Connectors](../../API_documentation/connectors/ROS_2_Connectors.md): Communication layer for ROS2 topics.
- [Models](../models/overview.md): For available voice-based models and instructions for creating new ones.
- [TextToSpeech](tts.md): For the TextToSpeechAgent, meant for distributed deployment.
45 changes: 45 additions & 0 deletions docs/speech_to_speech/agents/overview.md
@@ -0,0 +1,45 @@
# S2S Agents

## Overview

Agents in RAI are modular components that encapsulate specific functionalities and behaviors. They follow a consistent interface defined by the `BaseAgent` class and can be combined to create complex robotic systems. The Speech to Speech Agents are used for voice-based interaction, and communicate with other agents.

## SpeechToSpeechAgent

`SpeechToSpeechAgent` is the abstract base class for locally deployable S2S Agents. It provides functionality to manage sound device integration and defines the communication schema for integration with the rest of the system.

### Class Definition

??? info "SpeechToSpeechAgent class definition"

::: rai_s2s.s2s.agents.SpeechToSpeechAgent

### Communication

The Agent communicates through two channels provided during initialization: `from_human` and `to_human`.
Text transcribed from the human's voice is published on the `from_human` channel.
The `to_human` channel receives text to be played back to the human through text-to-speech.

### Voice interaction

Voice interaction is performed through two audio streams, each using a sound device.
These devices can be different, but don't have to be; in most local deployments they will be the same.
The list of available sound devices can be obtained by running `python -c "import sounddevice as sd; print(sd.query_devices())"`.
The configuration requires the user to specify the name of the sound device to be used for interfacing.
This is the entire string after the device index and up to the comma before the host API (typically `ALSA` on Ubuntu).
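
For example, the following sketch uses the `sounddevice` package directly to print just the device names for input-capable devices; this name string should correspond to the value the configuration expects:

```python
import sounddevice as sd

# Print the name of every device that can record audio. The configuration expects
# exactly this name string, without the index or the host API suffix.
for device in sd.query_devices():
    if device["max_input_channels"] > 0:
        print(device["name"])
```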

The voice interaction works as follows:

- The user speaks, which activates the `VoiceActivityDetection` model.
- \[Optional\] The recording pipeline (containing other models like [OpenWakeWord](../models/overview.md)) runs checks.
- The recording starts.
- The recording continues until the user stops talking (based on the silence grace period).
- The recording is transcribed and sent to the system.
- The Agent receives text data to be played to the user.
- The playback begins.
- The playback can be interrupted by the user speaking:
    - If there is an additional recording pipeline, the playback will pause while the user speaks (and continue if the pipeline returns false).
    - Otherwise, the new recording will be sent to the system, and its transcription will stop the playback.

### Implementations

A ROS 2-based implementation is available in `ROS2S2SAgent`.

??? info "ROS2S2SAgent class definition"

::: rai_s2s.s2s.agents.ros2s2s_agent.ROS2S2SAgent

## See Also

- [Models](../models/overview.md): For available voice-based models and instructions for creating new ones.
- [SpeechRecognition](asr.md): For the SpeechRecognitionAgent, meant for distributed deployment.
- [TextToSpeech](tts.md): For the TextToSpeechAgent, meant for distributed deployment.
78 changes: 78 additions & 0 deletions docs/speech_to_speech/agents/tts.md
@@ -0,0 +1,78 @@
# TextToSpeechAgent

## Overview

The `TextToSpeechAgent` in the RAI framework is a modular agent responsible for converting incoming text into audio using a text-to-speech (TTS) model and playing it through a configured audio output device. It supports real-time playback control through ROS2 messages and handles asynchronous speech processing using threads and queues.

## Class Definition

??? info "TextToSpeechAgent class definition"

::: rai_s2s.tts.agents.TextToSpeechAgent

## Purpose

The `TextToSpeechAgent` enables:

- Real-time conversion of text to speech
- Playback control (play/pause/stop) via ROS2 messages
- Dynamic loading of TTS models from configuration
- Robust audio handling using queues and event-driven logic
- Integration with human-robot interaction topics (HRI)

## Initialization Parameters

| Parameter | Type | Description |
| -------------------- | -------------------------- | ------------------------------------------------------- |
| `speaker_config` | `SoundDeviceConfig` | Configuration for the audio output (speaker). |
| `ros2_name` | `str` | Name of the ROS2 node. |
| `tts` | `TTSModel` | Text-to-speech model instance. |
| `logger` | `Optional[logging.Logger]` | Logger instance, or default logger if `None`. |
| `max_speech_history` | `int` | Number of speech message IDs to remember (default: 64). |

## Key Methods

### `from_config(cfg_path: Optional[str])`

Instantiates the agent from a configuration file, dynamically selecting the TTS model and setting up audio output.

### `run()`

Initializes the agent:

- Starts a thread to handle queued text-to-speech conversion
- Launches speaker playback via `SoundDeviceConnector`

### `stop()`

Gracefully stops the agent by setting the termination flag and joining the transcription thread.
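
A minimal lifecycle sketch, assuming a configuration file compatible with `from_config()`; the path below is hypothetical and `run()` is treated here as non-blocking:

```python
from rai_s2s.tts.agents import TextToSpeechAgent

agent = TextToSpeechAgent.from_config("config/tts_agent.yaml")  # hypothetical path

try:
    agent.run()  # starts the TTS worker thread and speaker playback
    input("Press Enter to stop...\n")  # the agent now plays any text arriving on /to_human
finally:
    agent.stop()  # sets the termination flag and joins the worker thread
```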

## Communication

The Agent uses the `ROS2HRIConnector` to communicate over two ROS2 topics:

- `/to_human`: Incoming text messages to convert. Uses `rai_interfaces/msg/HRIMessage`.
- `/voice_commands`: Playback control with ROS2 `std_msgs/msg/String`. Valid values: `"play"`, `"pause"`, `"stop"`
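
For example, playback can be paused from any ROS 2 node; a minimal `rclpy` sketch (the node name is chosen arbitrarily) might look like this:

```python
import rclpy
from rclpy.node import Node
from std_msgs.msg import String

rclpy.init()
node = Node("tts_command_sender")
publisher = node.create_publisher(String, "/voice_commands", 10)
publisher.publish(String(data="pause"))  # valid commands: "play", "pause", "stop"
rclpy.spin_once(node, timeout_sec=0.1)  # give the message a moment to go out
node.destroy_node()
rclpy.shutdown()
```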

## Best Practices

1. **Queue Management**: Properly track transcription IDs to avoid queue collisions or memory leaks.
2. **Playback Sync**: Ensure audio queues are flushed on `stop` to avoid replaying outdated speech.
3. **Graceful Shutdown**: Always call `stop()` to terminate threads cleanly.
4. **Model Configuration**: Ensure model-specific settings (e.g., voice selection for ElevenLabs) are defined in config files.

## Architecture

The `TextToSpeechAgent` interacts with the following core components:

- **TTSModel**: Converts text into audio (e.g., ElevenLabsTTS, OpenTTS)
- **SoundDeviceConnector**: Sends synthesized audio to output hardware
- **ROS2HRIConnector**: Handles incoming HRI and command messages
- **Queues and Threads**: Enable asynchronous and buffered audio processing

## See Also

- [BaseAgent](../../API_documentation/agents/overview.md#baseagent): Abstract base for all agents in RAI
- [SoundDeviceConnector](../sounddevice.md): For details on speaker configuration and streaming
- [Text-to-Speech Models](../models/overview.md): Supported TTS engines and usage
- [ROS2 HRI Messaging](../../API_documentation/connectors/ROS_2_Connectors.md): Interfacing with `/to_human` and `/voice_commands`
152 changes: 152 additions & 0 deletions docs/speech_to_speech/models/overview.md
@@ -0,0 +1,152 @@
# Models

## Overview

This package provides three primary types of models:

- **Voice Activity Detection (VAD)**
- **Wake Word Detection**
- **Transcription**

These models are designed with simple and consistent interfaces to allow chaining and integration into audio processing pipelines.

## Model Interfaces

### VAD and Wake Word Detection API

All VAD and Wake Word detection models implement a common `detect` interface:

```python
def detect(
self, audio_data: NDArray, input_parameters: dict[str, Any]
) -> Tuple[bool, dict[str, Any]]:
```

This design supports chaining multiple models together by passing the output dictionary (`input_parameters`) from one model into the next.
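
A minimal sketch of what such chaining might look like; the helper function and the all-models-must-agree policy are illustrative choices, not necessarily how the agents combine detection results:

```python
from typing import Any, Dict, List, Optional, Tuple

from numpy.typing import NDArray


def run_detection_chain(
    models: List[Any], audio: NDArray, params: Optional[Dict[str, Any]] = None
) -> Tuple[bool, Dict[str, Any]]:
    """Pass an audio chunk through each model, forwarding the output dict to the next one."""
    params = params or {}
    fired_all = True
    for model in models:
        fired, params = model.detect(audio, params)
        fired_all = fired_all and fired  # require every model in the chain to fire
    return fired_all, params
```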

### Transcription API

Transcription models implement the `transcribe` method:

```python
def transcribe(self, data: NDArray[np.int16]) -> str:
```

This method takes raw audio data encoded as 2-byte integers and returns the corresponding text transcription.
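
A hedged usage sketch: the model instance is passed in because constructor arguments differ per implementation, and the 16 kHz sample rate reflects what these models typically expect:

```python
import numpy as np
import sounddevice as sd
from numpy.typing import NDArray

SAMPLE_RATE = 16_000  # the transcription models described here typically expect 16 kHz mono audio


def record_and_transcribe(model, seconds: float = 5.0) -> str:
    """Record from the default microphone and transcribe the resulting int16 samples."""
    frames = int(seconds * SAMPLE_RATE)
    audio: NDArray[np.int16] = sd.rec(frames, samplerate=SAMPLE_RATE, channels=1, dtype="int16")
    sd.wait()  # block until the recording has finished
    return model.transcribe(audio.flatten())
```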

## Included Models

### SileroVAD

- Open source model: [GitHub](https://github.com/snakers4/silero-vad)
- No additional setup required
- Returns a confidence value indicating the presence of speech in the audio

??? info "SileroVAD"

::: rai_s2s.asr.models.silero_vad.SileroVAD

### OpenWakeWord

- Open source project: [GitHub](https://github.com/dscripka/openWakeWord)
- Supports predefined and custom wake words
- Returns `True` when the specified wake word is detected in the audio

??? info "OpenWakeWord"

::: rai_s2s.asr.models.open_wake_word.OpenWakeWord

### OpenAIWhisper

- Cloud-based transcription model: [Documentation](https://platform.openai.com/docs/guides/speech-to-text)
- Requires setting the `OPENAI_API_KEY` environment variable
- Offers language and model customization via the API

??? info "OpenAIWhisper"

::: rai_s2s.asr.models.open_ai_whisper.OpenAIWhisper

### LocalWhisper

- Local deployment of OpenAI Whisper: [GitHub](https://github.com/openai/whisper)
- Supports GPU acceleration
- Same configuration interface as OpenAIWhisper

??? info "LocalWhisper"

::: rai_s2s.asr.models.local_whisper.LocalWhisper

### FasterWhisper

- Optimized Whisper variant: [GitHub](https://github.com/SYSTRAN/faster-whisper)
- Designed for high speed and low memory usage
- Follows the same API as Whisper models

??? info "FasterWhisper"

::: rai_s2s.asr.models.local_whisper.FasterWhisper

### ElevenLabs

- Cloud-based TTS model: [Website](https://elevenlabs.io/)
- Requires the environment variable `ELEVENLABS_API_KEY` with a valid key

??? info "ElevenLabs"

::: rai_s2s.tts.models.elevenlabs_tts.ElevenLabsTTS

### OpenTTS

- Open source TTS solution: [GitHub](https://github.com/synesthesiam/opentts)
- Easy setup via Docker:

```bash
docker run -it -p 5500:5500 synesthesiam/opentts:en --no-espeak
```

- Provides a TTS server running on port 5500
- Supports multiple voices and configurations

??? info "OpenTTS"

::: rai_s2s.tts.models.open_tts.OpenTTS

## Custom Models

### Voice Detection Models

To implement a custom VAD or Wake Word model, inherit from `rai_asr.base.BaseVoiceDetectionModel` and implement the following methods:

```python
class MyDetectionModel(BaseVoiceDetectionModel):
def detect(self, audio_data: NDArray, input_parameters: dict[str, Any]) -> Tuple[bool, dict[str, Any]]:
...

def reset(self):
...
```
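
As a concrete (hypothetical) illustration, the sketch below implements a simple RMS-energy gate using the interface above; the threshold value and the `energy_rms` key are arbitrary choices:

```python
from typing import Any, Dict, Tuple

import numpy as np
from numpy.typing import NDArray

from rai_asr.base import BaseVoiceDetectionModel


class EnergyGateVAD(BaseVoiceDetectionModel):
    """Fires when the RMS energy of the audio chunk exceeds a fixed threshold."""

    def __init__(self, threshold: float = 0.01):
        self.threshold = threshold

    def detect(
        self, audio_data: NDArray, input_parameters: Dict[str, Any]
    ) -> Tuple[bool, Dict[str, Any]]:
        # Normalize int16 samples to [-1, 1] before measuring energy.
        samples = audio_data.astype(np.float32) / 32768.0
        rms = float(np.sqrt(np.mean(samples**2)))
        input_parameters["energy_rms"] = rms  # forward the measurement to chained models
        return rms > self.threshold, input_parameters

    def reset(self):
        pass  # stateless, nothing to clear
```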

### Transcription Models

To implement a custom transcription model, inherit from `rai_asr.base.BaseTranscriptionModel` and implement:

```python
class MyTranscriptionModel(BaseTranscriptionModel):
def transcribe(self, data: NDArray[np.int16]) -> str:
...
```

### TTS Models

To create a custom TTS model, inherit from `rai_tts.models.base.TTSModel` and implement the required interface:

```python
class MyTTSModel(TTSModel):
def get_speech(self, text: str) -> AudioSegment:
...
return AudioSegment()

def get_tts_params(self) -> Tuple[int, int]:
...
return sample_rate, channels
```