Inference: pad very short signals before embedding them #14055
Conversation
The speaker embedding model crashes on very short signals, so we zero-pad the end of any signal shorter than 0.5 seconds before running it through the speaker embedding model. Signed-off-by: Fejgin, Roy <[email protected]>
LGTM. Left some minor comments.
```python
def extract_embedding(model, extractor, audio_path, device, sv_model_type):
    speech_array, sampling_rate = librosa.load(audio_path, sr=16000)

    # pad to 0.5 seconds as the extractor may not be able to handle very short signals
    speech_array = pad_audio_to_min_length(speech_array, int(sampling_rate), min_seconds=0.5)
```
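The body of `pad_audio_to_min_length` is not shown in this snippet; a minimal sketch of what it could look like, assuming a 1-D NumPy array and trailing zero-padding as the PR description states:

```python
import numpy as np

def pad_audio_to_min_length(speech_array, sampling_rate, min_seconds=0.5):
    """Zero-pad the end of a 1-D signal so it is at least `min_seconds` long.

    Signals that are already long enough are returned unchanged.
    """
    min_samples = int(min_seconds * sampling_rate)
    shortfall = min_samples - len(speech_array)
    if shortfall > 0:
        # append `shortfall` zeros at the end; the start is left untouched
        speech_array = np.pad(speech_array, (0, shortfall), mode="constant")
    return speech_array
```

With a 0.1-second signal at 16 kHz (1600 samples), this would return an 8000-sample array whose tail is all zeros.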
Has this been tested? Does this affect the final evaluation metrics in any way?
I tested that the padding works correctly, but I did not collect pre/post evaluation stats. This should only kick in very rarely, when the generated speech is shorter than 0.5 sec.
Following up on this: I ran libri_unseen_test with and without the padding fix and found no statistically significant differences in WER and SSIM.
Thanks @subhankar-ghosh. It had been set to auto-merge, so it got merged as soon as you approved, but I'll still look at your comments.
Looks good to me.