feat: kokoro tts support #643
base: main
…oid yanked version, regenerated poetry lock
…re read aloud as words by TTS model
…wo separate chunks
# To avoid yanked version 3.0.6
zarr = "!=3.0.6"
Does zarr 3.0.6 break rai?
> [!WARNING]
> It is not recommended to use device_name set to `'default'` in `SoundDeviceConfig` due to potential issues with audio.
> [!WARNING]
> It is not recommended to use device_name set to `'default'` in `SoundDeviceConfig` due to potential issues with audio.
> [!TIP]
> If you're experiencing audio issues and device_name is set to `'default'`, try specifying the exact device name instead, as this often resolves the problem.
Also, please add this note to the configurator.
def _preprocess_text(self, text: str) -> str:
    """
    Preprocesses text by removing formatting characters that would be
    read aloud as words (like 'asterisk' for '*').

    Parameters
    ----------
    text : str
        The input text that may contain formatting characters.

    Returns
    -------
    str
        The cleaned text with formatting characters removed.
    """
    # Remove markdown headers (# symbols at start of line)
    text = re.sub(r"^#+\s*", "", text)

    # Remove bold markdown (** or __)
    text = re.sub(r"\*\*(.*?)\*\*", r"\1", text)
    text = re.sub(r"__(.*?)__", r"\1", text)

    # Remove italic markdown (* or _)
    text = re.sub(r"\*(.*?)\*", r"\1", text)
    text = re.sub(r"_(.*?)_", r"\1", text)

    return text
Is kokoro pronouncing markdown elements?
Why does this method remove only a certain subset of markdown symbols?
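To make the question about the covered subset concrete, here is a minimal sketch of a broader cleanup pass. It is not the PR's implementation: `strip_markdown` and `_PATTERNS` are hypothetical names, and the set of constructs handled is an assumption about what a TTS model might otherwise read aloud. Two details differ from the diff above: header removal uses `re.MULTILINE` (without it, `^` only matches at the start of the whole string), and the italic pattern skips underscores inside identifiers such as `device_name`.

```python
import re

# Hypothetical helper for illustration only; not part of the PR.
# Each entry is (compiled pattern, replacement), applied in order.
_PATTERNS = [
    (re.compile(r"```.*?```", re.DOTALL), " "),                  # fenced code blocks
    (re.compile(r"`([^`]*)`"), r"\1"),                            # inline code -> contents
    (re.compile(r"!\[([^\]]*)\]\([^)]*\)"), r"\1"),               # images -> alt text
    (re.compile(r"\[([^\]]*)\]\([^)]*\)"), r"\1"),                # links -> link text
    (re.compile(r"^[ \t]{0,3}#{1,6}[ \t]*", re.MULTILINE), ""),   # headers on any line
    (re.compile(r"^[ \t]{0,3}>[ \t]?", re.MULTILINE), ""),        # blockquote markers
    (re.compile(r"^[ \t]*[-*+][ \t]+", re.MULTILINE), ""),        # bullet markers
    (re.compile(r"(\*\*|__)(.+?)\1"), r"\2"),                     # bold
    # Italic: require non-word characters around the markers so that
    # snake_case identifiers and "2 * 3 * 4" are left alone.
    (re.compile(r"(?<!\w)([*_])(?=\S)(.+?)(?<=\S)\1(?!\w)"), r"\2"),
]


def strip_markdown(text: str) -> str:
    """Remove common Markdown markup while keeping the readable text."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    sample = "# Title\n## Sub\n**bold** and *it* and snake_case_name"
    print(strip_markdown(sample))
```

A table-driven list like this keeps the supported subset explicit, so adding or dropping a construct is a one-line change that is easy to review.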
)

if samples.dtype == np.float32:
    samples = (samples * 32768).clip(-32768, 32767).astype(np.int16)
Are we expecting values outside of the provided range?
Clipping audio should only be used as a last resort, as it introduces massive quality degradation.
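One possible alternative to hard clipping, sketched under assumptions: the model output is float audio nominally in [-1, 1], and occasional overshoot is acceptable to fix by scaling the whole buffer. `float_to_int16` and `peak_limit` are hypothetical names, not part of the PR or of Kokoro.

```python
import numpy as np


def float_to_int16(samples: np.ndarray, peak_limit: float = 1.0) -> np.ndarray:
    """Convert float audio to int16, scaling down only when the peak exceeds the limit."""
    samples = np.asarray(samples, dtype=np.float32)
    peak = float(np.max(np.abs(samples))) if samples.size else 0.0
    if peak > peak_limit:
        # Scale the whole buffer so the loudest sample lands on the limit.
        # This keeps the waveform shape, unlike clipping individual samples.
        samples = samples * (peak_limit / peak)
    # Multiplying by 32767 keeps +1.0 and -1.0 inside the int16 range,
    # so no clip is needed after the conversion.
    return np.round(samples * 32767.0).astype(np.int16)


if __name__ == "__main__":
    print(float_to_int16(np.array([0.0, 1.0, -1.0], dtype=np.float32)))
```

Note that scaling each chunk independently can change the relative loudness between chunks, so this only makes sense if out-of-range values are rare or if the scale factor is computed over the whole utterance.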
For available voices and languages supported within currently used version of the model - use `get_available_voices()` and `get_supported_languages()` methods of the `rai_s2s.tts.models.KokoroTTS` respectively.

> [!NOTE]
> You may encounter phonemizer warnings like "words count mismatch on x% of the lines". These warnings do not indicate that something is wrong with text to speech processing and can be safely ignored.
Can we configure kokoro's logger to drop these warnings?
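One way this could be done with the standard `logging` module is a filter that drops only the matching records. This is a sketch under assumptions: the warnings are taken to be emitted directly on a logger named `"phonemizer"`, which should be verified against the installed version, and `DropWordCountMismatch` and `silence_phonemizer_warnings` are hypothetical names.

```python
import logging


class DropWordCountMismatch(logging.Filter):
    """Reject log records that contain the 'words count mismatch' warning."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False drops the record; everything else passes through.
        return "words count mismatch" not in record.getMessage()


def silence_phonemizer_warnings(logger_name: str = "phonemizer") -> None:
    # Assumption: the warning is logged on this exact logger. Filters attached
    # to a logger are not applied to records created by its child loggers.
    logging.getLogger(logger_name).addFilter(DropWordCountMismatch())


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    silence_phonemizer_warnings()
    log = logging.getLogger("phonemizer")
    log.warning("words count mismatch on 50.0% of the lines (1/2)")  # dropped
    log.warning("some other warning")  # still shown
```

Filtering by message rather than raising the logger level keeps other phonemizer warnings visible.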
Purpose
Proposed Changes
Testing
poetry install --with s2s
With TTSAgent
python examples/s2s/tts.py
After a while, you should hear speech output from TTSAgent.
With ROS2S2SAgent
Run the following script and converse with agent:
The KokoroTTS model works well together with the ROS2S2SAgent.
My experience: it sounds nicer than OpenTTS. I didn't observe any significant difference in inference time between the two models.
The model sometimes does not put a space between sentences. EDIT: This was fixed by setting `trim` to false in the `create` method of Kokoro.