Individual Word Timestamps - Kokoro TTS #278

staxxman · 2025-03-16T09:31:55Z

Hi, and thank you for a great repo. Kokoro TTS now supports individual word timestamps in its output. Is that something you have for Kokoro (and possibly other models) in RealtimeTTS as well? If not, it would be an awesome feature.

Here is how to get it from Kokoro:
hexgrad/kokoro#32

KoljaB · 2025-03-17T23:27:55Z

Thanks for the inspiration, I integrated it in v0.4.51

staxxman · 2025-03-18T12:55:01Z

> Thanks for the inspiration, I integrated it in v0.4.51

Awesome, how can I get the words and timestamps from your implementation in RealtimeTTS?

KoljaB · 2025-03-18T13:09:11Z

Check out the kokoro_test.py file and use the on_word callback. The object returned to the callback includes the properties "word" (the text), "start_time", and "end_time" (the time offsets in seconds for when the word starts and ends). This callback is triggered right when a word starts playing.
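A minimal sketch of such a callback, assuming only what is described above: the object passed to on_word exposes word, start_time, and end_time. The WordTiming dataclass and the format_word helper are hypothetical stand-ins so the snippet runs without RealtimeTTS installed; the registration with TextToAudioStream is shown as a comment.

```python
from dataclasses import dataclass


@dataclass
class WordTiming:
    """Hypothetical stand-in for the object RealtimeTTS passes to on_word."""
    word: str
    start_time: float  # offset in seconds where the word starts
    end_time: float    # offset in seconds where the word ends


def format_word(timing) -> str:
    """Render one word with its start/end offsets, subtitle style."""
    return f"[{timing.start_time:6.2f}s - {timing.end_time:6.2f}s] {timing.word}"


def on_word(timing) -> None:
    # Called once per word; this version only prints the formatted line.
    print(format_word(timing))


# With RealtimeTTS installed, the callback would be registered roughly like this
# (on_word is the parameter name used in this thread):
#   stream = TextToAudioStream(engine, on_word=on_word)
```

Calling on_word(WordTiming("Hello", 0.0, 0.42)) prints one line containing the two offsets followed by the word.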

KoljaB · 2025-03-18T13:14:19Z

I hope I implemented it in a way that lets you use the feature as you intended. If you had a different approach in mind for the timestamps, please let me know, and I'll think about a way to integrate it.

staxxman · 2025-03-18T16:43:43Z

Looks like a logical implementation. I tried it but sometimes got errors like these:

⚡ synthesizing → '1. Answer your questions: I've been trained on a vast amount of knowledge, so I can provide information on a wide range of topics, from science and history to entertainment and culture.'
Traceback (most recent call last):
  File "/home/Ubuntu/.local/lib/python3.10/site-packages/RealtimeTTS/engines/kokoro_engine.py", line 275, in synthesize
    t.start_ts + self.audio_duration,
TypeError: unsupported operand type(s) for +: 'NoneType' and 'float'
[KokoroEngine] Error generating audio: unsupported operand type(s) for +: 'NoneType' and 'float'
WARNING:root:engine unknown failed to synthesize sentence "1. Answer your questions: I've been trained on a vast amount of knowledge, so I can provide information on a wide range of topics, from science and history to entertainment and culture.", unknown error
WARNING:root:engine unknown is the only engine available, can't switch to another engine

I'm using play_async and muted=True, similar to the async_server example

KoljaB · 2025-03-18T16:56:50Z

Oh, that wasn't supposed to happen. I'll try to reproduce it.

KoljaB · 2025-03-18T16:57:31Z

Probably me not handling the muted=True situation correctly, I guess. Will look into that.

KoljaB · 2025-03-18T17:06:23Z

Hmm, I could not reproduce it. I took the sentence from your log and ran it with async and muted=True. It did not raise the callback (because in the muted case it gets ignored), but it did not throw any errors either. Hopefully it's not OS-dependent. Could you share the code?

staxxman · 2025-03-18T17:26:10Z

Okay, muted=True was probably why I wasn't seeing the word printout then. What I am trying to achieve is to send the words to a web client so I can see the text as it is being spoken, like live subtitles. I think the error I got was because of the asterisks '**' or the numbers. I see that the ** just made "Answer your questions" bold in the text I pasted here, but it was actually 1. ** Answer your questions **. I got several of those errors with similar text, i.e. text containing the markdown syntax for bold.

Some code, basically from the async_server example

import asyncio
import logging
from queue import Queue

from RealtimeTTS import TextToAudioStream

# create_wave_header_for_engine is the helper defined in the async_server example.


class TTSRequestHandler:
    def __init__(self, engine):
        self.engine = engine
        self.audio_queue = Queue()
        self.stream = TextToAudioStream(
            engine,
            on_audio_stream_stop=self.on_audio_stream_stop,
            muted=True,
            on_word=self.process_word,
        )
        self.speaking = False

    def process_word(self, word):
        print(word)
        # Word timings only work for English voices (American and British)
        # global last_word
        # if last_word and word.word not in set(string.punctuation):
        #     print(" ", end="", flush=True)

        # print(f"{word.word}", end="", flush=True)
        # last_word = word.word

    def on_audio_chunk(self, chunk):
        self.audio_queue.put(chunk)

    def on_audio_stream_stop(self):
        # None is the sentinel that tells the generator to stop
        self.audio_queue.put(None)
        self.speaking = False

    def play_text_to_speech(self, text):
        self.speaking = True
        self.stream.feed(text)
        logging.debug(f"Playing audio for text: {text}")
        print(f'Synthesizing: "{text}"')
        self.stream.play_async(on_audio_chunk=self.on_audio_chunk, muted=True, log_synthesized_text=True)

    async def audio_chunk_generator(self, send_wave_headers):
        first_chunk = False
        try:
            while True:
                # Blocking get runs in a worker thread so the event loop stays free
                chunk = await asyncio.to_thread(self.audio_queue.get)
                if chunk is None:
                    print("Terminating stream")
                    break
                if not first_chunk:
                    if send_wave_headers:
                        print("Sending wave header")
                        yield create_wave_header_for_engine(self.engine)
                    first_chunk = True
                yield chunk
        except Exception as e:
            print(f"Error during streaming: {str(e)}")

KoljaB · 2025-03-19T22:20:30Z

Asterisk was the problem, it should be fixed now with v0.4.52
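For anyone still on an older version, a caller-side workaround is to strip the asterisks before feeding text to the stream. The sketch below is not the fix that went into v0.4.52; strip_asterisks is a hypothetical helper, and it removes every asterisk, including literal ones such as a multiplication sign.

```python
import re


def strip_asterisks(text: str) -> str:
    """Remove markdown emphasis asterisks and tidy the spaces they leave behind."""
    without_stars = re.sub(r"\*+", "", text)
    # Collapse runs of spaces or tabs left where the markers used to be.
    return re.sub(r"[ \t]{2,}", " ", without_stars).strip()


# The cleaned text would then be passed on, e.g. stream.feed(strip_asterisks(text)).
```

This handles both the tight form (`**Answer**`) and the spaced form (`** Answer **`) quoted earlier in the thread.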

staxxman · 2025-03-20T08:05:44Z

Okay, great. Is it possible to get the word timestamps with muted=True and play_async now as well? That would be needed to use the functionality when RealtimeTTS runs as an async server, streaming voice and text with timestamps to clients.

KoljaB · 2025-03-20T10:41:13Z

Yeah, agreed. Will integrate that.
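Once the callback fires in the muted case, the server side still has to hand each word from the synthesis thread to the async code that talks to clients. The following is a rough sketch of that hand-off only, under two assumptions: on_word is invoked on a thread other than the event loop's, and the timing object has the word, start_time, and end_time properties described earlier. WordRelay and demo are hypothetical names, and a SimpleNamespace stands in for the real timing object.

```python
import asyncio
import json
import threading
from types import SimpleNamespace


class WordRelay:
    """Bridges a callback fired on a worker thread to an asyncio consumer."""

    def __init__(self, loop: asyncio.AbstractEventLoop):
        self.loop = loop
        self.queue: asyncio.Queue = asyncio.Queue()

    def on_word(self, timing) -> None:
        # Assumed to run on the synthesis thread: schedule the put on the loop.
        payload = json.dumps(
            {"word": timing.word, "start": timing.start_time, "end": timing.end_time}
        )
        self.loop.call_soon_threadsafe(self.queue.put_nowait, payload)

    def close(self) -> None:
        # None is the end-of-stream sentinel, mirroring the audio queue above.
        self.loop.call_soon_threadsafe(self.queue.put_nowait, None)

    async def messages(self):
        """Yield JSON strings until the sentinel arrives."""
        while True:
            item = await self.queue.get()
            if item is None:
                break
            yield item


async def demo() -> list:
    relay = WordRelay(asyncio.get_running_loop())

    def producer():
        # Stand-in for the TTS engine calling the callback from its own thread.
        relay.on_word(SimpleNamespace(word="Hi", start_time=0.0, end_time=0.3))
        relay.close()

    worker = threading.Thread(target=producer)
    worker.start()
    received = [msg async for msg in relay.messages()]
    worker.join()
    return received
```

In a server, relay.on_word would be passed as the on_word callback, and each string yielded by messages() could be sent over a WebSocket or server-sent events alongside the audio chunks.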
