Inference: pad very short signals before embedding them #14055

Conversation

@rfejgin rfejgin commented Jun 28, 2025

The speaker embedding model crashes on very short signals, so if the generated signal is shorter than 0.5 seconds we zero-pad its end to that length before running it through the speaker embedding model.

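For context, here is a minimal sketch of what a padding helper along these lines could look like (an illustrative assumption based on the call site quoted later in the review, not necessarily the implementation merged in this PR):

import numpy as np

def pad_audio_to_min_length(audio: np.ndarray, sampling_rate: int, min_seconds: float = 0.5) -> np.ndarray:
    # Hypothetical sketch: zero-pad the end of `audio` so it spans at least `min_seconds`.
    min_samples = int(min_seconds * sampling_rate)
    if len(audio) < min_samples:
        audio = np.pad(audio, (0, min_samples - len(audio)), mode="constant")
    return audio

Signals that already meet the minimum length pass through unchanged, so the padding only affects the rare very short outputs.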

Signed-off-by: Fejgin, Roy <[email protected]>
@ko3n1g ko3n1g added Run CICD and removed Run CICD labels Jun 28, 2025
@rfejgin rfejgin requested a review from paarthneekhara June 28, 2025 00:48
@rfejgin rfejgin marked this pull request as ready for review June 28, 2025 00:48
@rfejgin rfejgin requested a review from shehzeen June 28, 2025 00:49
@rfejgin rfejgin enabled auto-merge (squash) June 28, 2025 00:51
@ko3n1g ko3n1g added Run CICD and removed Run CICD labels Jun 29, 2025

@subhankar-ghosh subhankar-ghosh left a comment

LGTM. Left some minor comments.

def extract_embedding(model, extractor, audio_path, device, sv_model_type):
    speech_array, sampling_rate = librosa.load(audio_path, sr=16000)

    # pad to 0.5 seconds as the extractor may not be able to handle very short signals
    speech_array = pad_audio_to_min_length(speech_array, int(sampling_rate), min_seconds=0.5)

Has this been tested, does this affect the final evaluation metrics in any way?

@rfejgin rfejgin Jul 1, 2025

I tested that the padding works correctly, but I did not collect pre/post evaluation stats. This should only kick in very rarely, when the generated speech is shorter than 0.5 sec.

Following up on this: I ran libri_unseen_test with and without the padding fix and found no statistically significant differences in WER and SSIM.

@rfejgin rfejgin merged commit dbc6e78 into NVIDIA:magpietts_2503 Jul 1, 2025
73 checks passed

rfejgin commented Jul 1, 2025

> LGTM. Left some minor comments.

Thanks @subhankar-ghosh. The PR had been set to auto-merge, so it got merged as soon as you approved, but I'll still look at your comments.


@paarthneekhara paarthneekhara left a comment

Looks good to me.
