feat: kokoro tts support #643

Open · wants to merge 25 commits into `main`
Conversation

**@MagdalenaKotynia** (Member) commented on Jun 25, 2025

Purpose

  • To support the usage of the Kokoro-TTS model. Kokoro-TTS was selected for its high-quality speech output, small size, and potential to run on edge devices (it is distributed in ONNX format).

Proposed Changes

  • Developed a class implementing the TTSModel interface for the Kokoro-TTS model.
  • Updated the docs with the newly supported model.
  • Updated the TTSAgent example so it can use the newly supported model.

Testing

With TTSAgent

  • Run the TTSAgent example: `python examples/s2s/tts.py`
  • In another terminal, run the following script to send a ROS2HRIMessage to the ROS 2 topic:
```python
from rai.communication.ros2.connectors import ROS2HRIConnector
from rai.communication.ros2.messages import ROS2HRIMessage
import rclpy
import time

rclpy.init()
my_hri_msg = ROS2HRIMessage(
    text="Hello, human! This is a test message. How are you?",
    message_author="ai",
)

hri_connector = ROS2HRIConnector()

hri_connector.send_message(
    message=my_hri_msg,
    target="/to_human",
)

try:
    print("Sending message... Press Ctrl+C to exit")
    time.sleep(10)
except KeyboardInterrupt:
    print("Shutting down...")
finally:
    hri_connector.shutdown()
    rclpy.shutdown()
```

After a while, you should hear speech output from TTSAgent.

With ROS2S2SAgent

Run the following script and converse with the agent:

```python
from rai_s2s.sound_device import SoundDeviceConfig
from rai.communication.ros2 import ROS2Context, ROS2HRIConnector
from rai_s2s.s2s.agents.s2s_agent import SpeechToSpeechAgent
from rai_s2s.s2s.agents.ros2s2s_agent import ROS2S2SAgent
from rai.agents.langchain.react_agent import ReActAgent
from rai_s2s.asr.models import OpenAIWhisper, SileroVAD
from rai_s2s import KokoroTTS

from rai.agents import AgentRunner


@ROS2Context()
def main():
    speaker_config = SoundDeviceConfig(
        stream=True,
        is_output=True,
        # device_name="EPOS PC 8 USB: Audio (hw:1,0)",
        # device_name="Sennheiser USB headset: Audio (hw:1,0)",
        # device_name="Jabra Speak2 40 MS: USB Audio (hw:2,0)",
        device_name="default",
    )

    microphone_config = SoundDeviceConfig(
        stream=True,
        channels=1,
        device_name="default",
        consumer_sampling_rate=16000,
        dtype="int16",
        is_input=True,
    )

    # whisper = LocalWhisper("tiny", 16000)
    whisper = OpenAIWhisper("gpt-4o-mini-transcribe", 16000)
    vad = SileroVAD(16000, 0.5)

    tts = KokoroTTS()

    agent = ROS2S2SAgent(
        from_human_topic="/from_human",
        to_human_topic="/to_human",
        microphone_config=microphone_config,
        speaker_config=speaker_config,
        transcription_model=whisper,
        vad=vad,
        tts=tts,
    )

    hri_connector = ROS2HRIConnector()
    llm = ReActAgent(
        target_connectors={"/to_human": hri_connector},
    )
    llm.subscribe_source("/from_human", hri_connector)
    runner = AgentRunner([agent, llm])
    runner.run_and_wait_for_shutdown()


if __name__ == "__main__":
    main()
```

The KokoroTTS model works well together with the ROS2S2SAgent.
In my experience it sounds nicer than OpenTTS, and I didn't observe any significant difference in inference time between the two models.
The model sometimes did not put a space between sentences. EDIT: this was fixed by setting `trim` to `False` in Kokoro's `create` method.

@MagdalenaKotynia MagdalenaKotynia marked this pull request as ready for review June 26, 2025 13:35
@MagdalenaKotynia MagdalenaKotynia requested review from boczekbartek and removed request for boczekbartek June 26, 2025 17:45
Comment on lines +24 to +26

```toml
# To avoid yanked version 3.0.6
zarr = "!=3.0.6"
```

Does zarr 3.0.6 break rai?

Comment on lines +34 to +36
> [!WARNING]
> It is not recommended to use device_name set to `'default'` in `SoundDeviceConfig` due to potential issues with audio.

Suggested change:

```diff
- > [!WARNING]
- > It is not recommended to use device_name set to `'default'` in `SoundDeviceConfig` due to potential issues with audio.
+ > [!TIP]
+ > If you're experiencing audio issues and device_name is set to 'default', try specifying the exact device name instead, as this often resolves the problem.
```


Also, please add this note to the configurator.

Comment on lines +292 to +318
```python
def _preprocess_text(self, text: str) -> str:
    """
    Preprocesses text by removing formatting characters that would be
    read aloud as words (like 'asterisk' for '*').

    Parameters
    ----------
    text : str
        The input text that may contain formatting characters.

    Returns
    -------
    str
        The cleaned text with formatting characters removed.
    """
    # Remove markdown headers (# symbols at start of line)
    text = re.sub(r"^#+\s*", "", text)

    # Remove bold markdown (** or __)
    text = re.sub(r"\*\*(.*?)\*\*", r"\1", text)
    text = re.sub(r"__(.*?)__", r"\1", text)

    # Remove italic markdown (* or _)
    text = re.sub(r"\*(.*?)\*", r"\1", text)
    text = re.sub(r"_(.*?)_", r"\1", text)

    return text
```

Is kokoro pronouncing markdown elements?
Why does this method remove only a certain subset of markdown symbols?
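For comparison, a fuller strip could also cover inline code, strikethrough, links, and list bullets. This is only a sketch of what a broader pass might look like; the extra patterns are assumptions, not part of the PR:

```python
import re

def strip_markdown(text: str) -> str:
    """Remove common markdown syntax that a TTS engine would otherwise read aloud."""
    text = re.sub(r"^#+\s*", "", text, flags=re.MULTILINE)     # headers on any line
    text = re.sub(r"\*\*(.*?)\*\*", r"\1", text)               # bold **...**
    text = re.sub(r"__(.*?)__", r"\1", text)                   # bold __...__
    text = re.sub(r"\*(.*?)\*", r"\1", text)                   # italic *...*
    text = re.sub(r"_(.*?)_", r"\1", text)                     # italic _..._
    text = re.sub(r"`([^`]*)`", r"\1", text)                   # inline code
    text = re.sub(r"~~(.*?)~~", r"\1", text)                   # strikethrough
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)       # links: keep label only
    text = re.sub(r"^[-*+]\s+", "", text, flags=re.MULTILINE)  # list bullets
    return text
```

Note that the PR's version anchors the header regex without `re.MULTILINE`, so it only strips a header at the very start of the string.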

```python
    )

    if samples.dtype == np.float32:
        samples = (samples * 32768).clip(-32768, 32767).astype(np.int16)
```

Are we expecting values outside of the provided range?
Clipping audio should only be used as a last resort, as it introduces massive quality degradation.
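For context: float32 audio is nominally in [-1.0, 1.0], so scaling by 32767 maps in-range samples without distortion, and the clip then only guards against occasional overshoot (e.g. from resampling filters). A minimal sketch of that conversion, assuming normalized input (not the PR's code):

```python
import numpy as np

def float32_to_int16(samples: np.ndarray) -> np.ndarray:
    """Convert normalized float32 audio ([-1.0, 1.0]) to int16 PCM.

    In-range samples map cleanly; the clip only touches out-of-range
    outliers, so it should be a no-op for well-behaved TTS output.
    """
    scaled = samples * 32767.0
    return np.clip(scaled, -32768, 32767).astype(np.int16)
```

If clipping ever fires on a significant fraction of samples, that points to an upstream gain problem rather than something the conversion should hide.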

For the voices and languages available in the currently used version of the model, use the `get_available_voices()` and `get_supported_languages()` methods of `rai_s2s.tts.models.KokoroTTS`, respectively.

> [!NOTE]
> You may encounter phonemizer warnings like "words count mismatch on x% of the lines". These warnings do not indicate that something is wrong with text to speech processing and can be safely ignored.

Can we configure kokoro's logger to drop these warnings?
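One option, assuming the warnings are emitted through Python's `logging` under a `phonemizer` logger name (worth verifying; they may also be written to stderr directly), is a targeted filter that drops only this message:

```python
import logging

class WordsMismatchFilter(logging.Filter):
    """Drop phonemizer's 'words count mismatch' warnings, keep everything else."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False suppresses the record.
        return "words count mismatch" not in record.getMessage()

# Attach to the phonemizer logger (the logger name is an assumption).
logging.getLogger("phonemizer").addFilter(WordsMismatchFilter())
```

This is narrower than raising the logger's level, which would also hide unrelated warnings.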
