Individual Word Timestamps - Kokoro TTS #278


Open
staxxman opened this issue Mar 16, 2025 · 12 comments

Comments

@staxxman

Hi, and thank you for a great repo. Kokoro TTS now supports individual word timestamps in its output. Is that something you have for Kokoro (and possibly other models) in RealtimeTTS as well? If not, it would be an awesome feature.

Here is how to get it from Kokoro:
hexgrad/kokoro#32

@KoljaB
Owner

KoljaB commented Mar 17, 2025

Thanks for the inspiration, I integrated it in v0.4.51

@staxxman
Author

staxxman commented Mar 18, 2025

> Thanks for the inspiration, I integrated it in v0.4.51

Awesome! How can I get the words and timestamps from your implementation in RealtimeTTS?

@KoljaB
Owner

KoljaB commented Mar 18, 2025

Check out the kokoro_test.py file and use the on_word callback. The object returned to the callback includes the properties "word" (the text), "start_time", and "end_time" (the time offsets in seconds for when the word starts and ends). This callback is triggered right when a word starts playing.
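A minimal sketch of a handler for this callback, for illustration only. The `WordTiming` dataclass is a hypothetical stand-in for the object RealtimeTTS passes to `on_word`; the sketch assumes only the three attributes named above (`word`, `start_time`, `end_time`). The stream wiring appears as a comment and mirrors the `on_word=` keyword used later in this thread.

```python
from dataclasses import dataclass


@dataclass
class WordTiming:
    """Hypothetical stand-in for the object passed to on_word."""
    word: str
    start_time: float  # offset in seconds where the word starts
    end_time: float    # offset in seconds where the word ends


def format_word(timing) -> str:
    """Render one word together with its start and end offsets."""
    return f"{timing.start_time:6.2f}s - {timing.end_time:6.2f}s  {timing.word}"


def on_word(timing) -> None:
    # Per the comment above, this fires right when a word starts playing.
    print(format_word(timing))


# Assumed wiring (not executed here, needs RealtimeTTS and an engine):
#   stream = TextToAudioStream(engine, on_word=on_word)

if __name__ == "__main__":
    on_word(WordTiming("hello", 0.0, 0.42))
```

Keeping the formatting in a separate function makes it easy to check without audio hardware or a loaded model.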

@KoljaB
Owner

KoljaB commented Mar 18, 2025

I hope I implemented it in a way that lets you use the feature as you intended. If you had a different approach in mind for the timestamps, please let me know, and I'll think about a way to integrate it.

@staxxman
Author

Looks like a logical implementation. I tried it but sometimes got errors like these:

⚡ synthesizing → '1. Answer your questions: I've been trained on a vast amount of knowledge, so I can provide information on a wide range of topics, from science and history to entertainment and culture.'
Traceback (most recent call last):
  File "/home/Ubuntu/.local/lib/python3.10/site-packages/RealtimeTTS/engines/kokoro_engine.py", line 275, in synthesize
    t.start_ts + self.audio_duration,
TypeError: unsupported operand type(s) for +: 'NoneType' and 'float'
[KokoroEngine] Error generating audio: unsupported operand type(s) for +: 'NoneType' and 'float'
WARNING:root:engine unknown failed to synthesize sentence "1. Answer your questions: I've been trained on a vast amount of knowledge, so I can provide information on a wide range of topics, from science and history to entertainment and culture.", unknown error
WARNING:root:engine unknown is the only engine available, can't switch to another engine

I'm using play_async and muted=True, similar to the async_server example
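The `TypeError` in the log says that `t.start_ts` was `None` for at least one token when it was added to a float. A minimal sketch of how such tokens could be skipped when shifting timestamps by the audio already generated follows. The `Token` class is a hypothetical stand-in, the `end_ts` attribute is an assumption (only `start_ts` appears in the traceback), and this is not necessarily how the engine handles it.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Token:
    """Hypothetical stand-in for a token whose timestamps may be missing."""
    text: str
    start_ts: Optional[float]
    end_ts: Optional[float]  # assumed attribute, not shown in the traceback


def offset_timings(
    tokens: Iterable[Token], audio_offset: float
) -> List[Tuple[str, float, float]]:
    """Shift token timestamps by audio_offset, skipping untimed tokens."""
    timings = []
    for t in tokens:
        # Tokens without timestamps (e.g. stray markup) would raise
        # "NoneType + float" if added directly, so skip them.
        if t.start_ts is None or t.end_ts is None:
            continue
        timings.append((t.text, t.start_ts + audio_offset, t.end_ts + audio_offset))
    return timings
```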

@KoljaB
Owner

KoljaB commented Mar 18, 2025

Oh, that wasn't supposed to happen. I'll try to reproduce it.

@KoljaB
Owner

KoljaB commented Mar 18, 2025

Probably me not handling the muted=True situation correctly, I guess. Will look into that.

@KoljaB
Owner

KoljaB commented Mar 18, 2025

Hmm, I could not reproduce it. I took the sentence from your log and ran it with async and muted=True. It did not raise the callback (because the callback is ignored in the muted case), but it did not throw any errors either. Hopefully it's not OS-dependent. Could you share the code?

@staxxman
Author

staxxman commented Mar 18, 2025

Okay, muted=True was probably why I wasn't seeing the word printout then. What I am trying to achieve is to send the words to a web client so I can see the text as it is being spoken, like live subtitles. I think the error I got was because of the asterisks ('**') or the numbers. I see that the ** just made "Answer your questions" bold in the text I pasted here, but it was actually 1. ** Answer your questions **. I got several of those errors with similar text, i.e. the Markdown syntax for bold text.


Some code, basically from the async_server example

import asyncio
import logging
from queue import Queue

from RealtimeTTS import TextToAudioStream

# create_wave_header_for_engine is the helper defined in the async_server example.


class TTSRequestHandler:
    def __init__(self, engine):
        self.engine = engine
        self.audio_queue = Queue()
        self.stream = TextToAudioStream(
            engine,
            on_audio_stream_stop=self.on_audio_stream_stop,
            muted=True,
            on_word=self.process_word,
        )
        self.speaking = False

    def process_word(self, word):
        print(word)
        # word timings only work for english voices (american and british)
        # global last_word
        # if last_word and word.word not in set(string.punctuation):
        #     print(" ", end="", flush=True)

        # print(f"{word.word}", end="", flush=True)
        # last_word = word.word

    def on_audio_chunk(self, chunk):
        self.audio_queue.put(chunk)

    def on_audio_stream_stop(self):
        # None is the sentinel that tells the generator to stop.
        self.audio_queue.put(None)
        self.speaking = False

    def play_text_to_speech(self, text):
        self.speaking = True
        self.stream.feed(text)
        logging.debug(f"Playing audio for text: {text}")
        print(f'Synthesizing: "{text}"')
        self.stream.play_async(on_audio_chunk=self.on_audio_chunk, muted=True, log_synthesized_text=True)

    async def audio_chunk_generator(self, send_wave_headers):
        first_chunk = False
        try:
            while True:
                # Blocking get runs in a worker thread so the event loop stays free
                chunk = await asyncio.to_thread(self.audio_queue.get)
                if chunk is None:
                    print("Terminating stream")
                    break
                if not first_chunk:
                    if send_wave_headers:
                        print("Sending wave header")
                        yield create_wave_header_for_engine(self.engine)
                    first_chunk = True
                yield chunk
        except Exception as e:
            print(f"Error during streaming: {str(e)}")

@KoljaB
Owner

KoljaB commented Mar 19, 2025

The asterisk was the problem; it should be fixed now in v0.4.52.
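For anyone pinned to an older version, one possible workaround is to strip the asterisks before feeding text to the stream. The sketch below is an assumption about what would help, not the fix that went into v0.4.52, and `strip_asterisks` is a hypothetical helper name.

```python
import re


def strip_asterisks(text: str) -> str:
    """Remove Markdown bold/italic asterisks and tidy leftover spacing."""
    # Drop every run of asterisks, e.g. "**Answer**" -> "Answer".
    cleaned = re.sub(r"\*+", "", text)
    # Removing "** " style markers can leave double spaces behind.
    return re.sub(r"[ \t]{2,}", " ", cleaned).strip()


if __name__ == "__main__":
    print(strip_asterisks("1. ** Answer your questions **: I've been trained"))
```

The trade-off is that literal asterisks (for example in "2 * 3") are removed as well, so this only suits text where asterisks are purely formatting.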

@staxxman
Author

Okay, great. Is it possible to get the word timestamps with muted=True and play_async now as well? That would be needed to use the functionality when RealtimeTTS runs as an async server, streaming voice and text with timestamps to clients.

@KoljaB
Owner

KoljaB commented Mar 20, 2025

Yeah, agreed. Will integrate that.
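A rough sketch of how the live-subtitle use case could be wired up on the server side once `on_word` fires in the muted case. It assumes the callback runs on a non-event-loop thread and receives objects with `word`, `start_time`, and `end_time`; `WordEventBridge` is a hypothetical helper, and the transport to the client (WebSocket, SSE, and so on) is left out.

```python
import asyncio
import json


class WordEventBridge:
    """Forwards on_word callbacks from a worker thread to an asyncio queue."""

    def __init__(self, loop: asyncio.AbstractEventLoop):
        self.loop = loop
        self.queue: asyncio.Queue = asyncio.Queue()

    def on_word(self, timing) -> None:
        # Assumed to run on the TTS thread, so hand the event over
        # to the event loop in a thread-safe way.
        event = json.dumps({
            "word": timing.word,
            "start": timing.start_time,
            "end": timing.end_time,
        })
        self.loop.call_soon_threadsafe(self.queue.put_nowait, event)

    def close(self) -> None:
        # None is the sentinel that ends the events() generator.
        self.loop.call_soon_threadsafe(self.queue.put_nowait, None)

    async def events(self):
        """Yield JSON strings until the None sentinel arrives."""
        while True:
            event = await self.queue.get()
            if event is None:
                break
            yield event
```

A request handler could pass `bridge.on_word` as the `on_word` callback, call `bridge.close()` from `on_audio_stream_stop`, and forward each item from `bridge.events()` to the client next to the audio chunks.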
