Alignment issues after #986 (support timestamp for numbers) #1016

Open
@MarkMLCode

Description

The fix recently implemented in #986 (support timestamp for numbers) appears to break the alignment of the last word in a segment. When there is a sound at the end of the file, the entire span between the last word and that noise is now attributed to the last word (about a second in my test). The end of the word can even land after the total duration of the file. I've also seen this when the file has no noise at the end: the last word still ends slightly past the file's duration.

File used: test.wav
test sound file.zip

File duration: 7.8 s
Last word detected in the list (before the change): {'word': 'back,', 'start': 6.903, 'end': 7.104, 'score': 1.0}
Last word detected in the list (after the change): {'word': 'back,', 'start': 6.941, 'end': 7.847, 'score': 0.966}
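
For reference, the condition described above can be checked mechanically. This is a minimal sketch, not part of whisperX: `words_past_end` is a hypothetical helper, and it assumes the reported duration of 7.8 is exact and expressed in seconds, like the word timestamps.

```python
def words_past_end(words, duration):
    """Return the words whose 'end' timestamp exceeds the audio duration (seconds)."""
    return [w for w in words if w["end"] > duration]


# Assumption: the reported file duration is exact and in seconds.
duration = 7.8

# Last word as reported above, before and after the change.
before = {'word': 'back,', 'start': 6.903, 'end': 7.104, 'score': 1.0}
after = {'word': 'back,', 'start': 6.941, 'end': 7.847, 'score': 0.966}

flagged_before = words_past_end([before], duration)
flagged_after = words_past_end([after], duration)

# Length of the last word in each case, rounded to milliseconds.
length_before = round(before["end"] - before["start"], 3)
length_after = round(after["end"] - after["start"], 3)

print(flagged_before, flagged_after)
print(length_before, length_after)
```

With the values above, only the post-change entry is flagged, and the last word grows from roughly 0.2 s to roughly 0.9 s.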

Transcribe function results :
{'segments': [{'text': ' I focus my energy, aiming a forceful magic missile at the remaining spectres and try to send them reeling back,', 'start': 0.031, 'end': 7.827}], 'language': 'en'}

Align (before the change) :

[{'word': 'I', 'start': 0.131, 'end': 0.232, 'score': 0.822}, {'word': 'focus', 'start': 0.373, 'end': 0.795, 'score': 0.821}, {'word': 'my', 'start': 0.835, 'end': 0.975, 'score': 0.953}, {'word': 'energy,', 'start': 1.136, 'end': 1.558, 'score': 0.72}, {'word': 'aiming', 'start': 1.98, 'end': 2.241, 'score': 0.857}, {'word': 'a', 'start': 2.301, 'end': 2.342, 'score': 0.5}, {'word': 'forceful', 'start': 2.442, 'end': 2.904, 'score': 0.893}, {'word': 'magic', 'start': 2.965, 'end': 3.286, 'score': 0.999}, {'word': 'missile', 'start': 3.366, 'end': 3.688, 'score': 0.748}, {'word': 'at', 'start': 3.748, 'end': 3.829, 'score': 0.744}, {'word': 'the', 'start': 3.849, 'end': 3.929, 'score': 0.825}, {'word': 'remaining', 'start': 3.969, 'end': 4.371, 'score': 0.719}, {'word': 'spectres', 'start': 4.431, 'end': 4.934, 'score': 0.851}, {'word': 'and', 'start': 5.295, 'end': 5.396, 'score': 0.817}, {'word': 'try', 'start': 5.456, 'end': 5.717, 'score': 0.937}, {'word': 'to', 'start': 5.757, 'end': 5.818, 'score': 0.777}, {'word': 'send', 'start': 5.878, 'end': 6.099, 'score': 0.89}, {'word': 'them', 'start': 6.119, 'end': 6.28, 'score': 0.865}, {'word': 'reeling', 'start': 6.421, 'end': 6.822, 'score': 0.925}, {'word': 'back,', 'start': 6.903, 'end': 7.104, 'score': 1.0}]

Align (after the change):

[{'word': 'I', 'start': 0.031, 'end': 0.373, 'score': 0.92}, {'word': 'focus', 'start': 0.394, 'end': 0.796, 'score': 0.846}, {'word': 'my', 'start': 0.857, 'end': 0.998, 'score': 0.953}, {'word': 'energy,', 'start': 1.159, 'end': 1.562, 'score': 0.831}, {'word': 'aiming', 'start': 1.602, 'end': 2.247, 'score': 0.912}, {'word': 'a', 'start': 2.327, 'end': 2.348, 'score': 0.997}, {'word': 'forceful', 'start': 2.469, 'end': 2.932, 'score': 0.88}, {'word': 'magic', 'start': 2.992, 'end': 3.315, 'score': 0.999}, {'word': 'missile', 'start': 3.395, 'end': 3.717, 'score': 0.83}, {'word': 'at', 'start': 3.778, 'end': 3.838, 'score': 0.865}, {'word': 'the', 'start': 3.858, 'end': 3.939, 'score': 0.826}, {'word': 'remaining', 'start': 4.0, 'end': 4.402, 'score': 0.749}, {'word': 'spectres', 'start': 4.463, 'end': 4.946, 'score': 0.823}, {'word': 'and', 'start': 5.329, 'end': 5.41, 'score': 0.983}, {'word': 'try', 'start': 5.49, 'end': 5.752, 'score': 0.937}, {'word': 'to', 'start': 5.792, 'end': 5.833, 'score': 0.936}, {'word': 'send', 'start': 5.913, 'end': 6.115, 'score': 0.971}, {'word': 'them', 'start': 6.155, 'end': 6.296, 'score': 0.947}, {'word': 'reeling', 'start': 6.336, 'end': 6.84, 'score': 0.941}, {'word': 'back,', 'start': 6.941, 'end': 7.847, 'score': 0.966}]
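
To see which words changed the most between the two runs, the lists can be compared pairwise. The sketch below uses only a handful of entries copied from the two outputs above; `grown_words` and the 0.2 s threshold are illustrative assumptions, not anything defined by whisperX.

```python
def grown_words(before, after, threshold=0.2):
    """Pair entries by position and return (word, growth) for words whose
    duration increased by more than `threshold` seconds."""
    result = []
    for b, a in zip(before, after):
        growth = round((a["end"] - a["start"]) - (b["end"] - b["start"]), 3)
        if growth > threshold:
            result.append((a["word"], growth))
    return result


# Subset of the "before the change" alignment shown above.
before = [
    {'word': 'I', 'start': 0.131, 'end': 0.232},
    {'word': 'aiming', 'start': 1.98, 'end': 2.241},
    {'word': 'magic', 'start': 2.965, 'end': 3.286},
    {'word': 'back,', 'start': 6.903, 'end': 7.104},
]

# The same words from the "after the change" alignment.
after = [
    {'word': 'I', 'start': 0.031, 'end': 0.373},
    {'word': 'aiming', 'start': 1.602, 'end': 2.247},
    {'word': 'magic', 'start': 2.992, 'end': 3.315},
    {'word': 'back,', 'start': 6.941, 'end': 7.847},
]

grown = grown_words(before, after)
print(grown)
```

On this subset, 'I', 'aiming' and 'back,' are reported while 'magic' is not. In the data above, these are the words that sit next to a gap: the start of the segment, the pause after 'energy,', and the end of the file.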

I cloned the project at tag v3.3.1 and tested it with and without the fix. I also tried reducing the changes made to alignment.py to a minimum to pinpoint the issue. The issue still occurs when only the changes to the get_trellis and backtrack functions are applied, so the problem appears to lie there. I haven't been able to tell exactly what causes the discrepancy.

Minimal changes branch: https://github.com/MarkMLCode/whisperX/tree/minimal-changes
