# Quality Metrics
- [General Information](#general-information)
- [Metrics](#metrics)
  - [VS other](#silero-vad-vs-other-available-solutions)
  - [VS old Silero VAD](#silero-vad-vs-old-silero-vad)
Dataset | Duration, hours | Domain |
---|---|---|
ESC-50 | 2.7 | Environmental noise |
AliMeeting test | 43 | Far/near meetings speech |
Earnings 21 | 39 | Calls |
MSDWild | 80 | Noisy speech |
AISHELL-4 test | 12.7 | Meetings |
VoxConverse test | 43.5 | Noisy speech |
Libriparty test | 4 | Noisy speech |
Private noise | 0.5 | Noisy calls without speech |
Private speech | 3.7 | Speech |
Multi-Domain Validation | 17 | Multi |
The Multi-Domain Validation Set was used for the main testing and model comparison. It includes 2 random hours (where available) of data from each dataset listed above, for a total duration of 17 hours.
Modern Voice Activity Detectors output the speech probability (a float between 0 and 1) for an audio chunk of a desired length using:
- Some pre-trained model;
- Some function of its state or some internal buffer;
The threshold is a value selected by the user that determines whether there is speech in the audio. If the speech probability of an audio chunk is higher than the set threshold, we assume the chunk contains speech. Depending on the desired result, the threshold should be tailored to a specific dataset or domain.
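The thresholding step above can be sketched in a few lines. This is a minimal illustration, not the Silero VAD API: the per-chunk probabilities below are made-up values standing in for the output of some pre-trained model.

```python
def apply_threshold(speech_probs, threshold=0.5):
    """Mark each audio chunk as speech (True) or non-speech (False),
    comparing its speech probability against a user-selected threshold."""
    return [p > threshold for p in speech_probs]

# Hypothetical per-chunk speech probabilities for a short audio stream
probs = [0.02, 0.10, 0.85, 0.97, 0.60, 0.15]

print(apply_threshold(probs, threshold=0.5))
# → [False, False, True, True, True, False]
```

Raising the threshold trades missed speech for fewer false positives, which is why it should be tuned on data from the target domain.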
### ROC-AUC score
Datasets containing only non-speech data (ESC-50, Private noise) are not included in the ROC-AUC calculations, since ROC-AUC is undefined without positive (speech) examples.
Model | AliMeeting test | Earnings 21 | MSDWild | AISHELL-4 test | VoxConverse test | Libriparty test | Private speech | Multi-Domain Validation |
---|---|---|---|---|---|---|---|---|
Silero v3 | 0.85 | 0.95 | 0.78 | 0.89 | 0.93 | 0.93 | 0.98 | 0.92 |
Silero v4 | 0.89 | 0.95 | 0.77 | 0.83 | 0.91 | 0.99 | 0.97 | 0.91 |
Unnamed commercial VAD | 0.91 | 0.87 | 0.76 | 0.87 | 0.92 | 0.96 | 0.95 | 0.93 |
Webrtc | 0.82 | 0.86 | 0.62 | 0.74 | 0.65 | 0.79 | 0.86 | 0.73 |
Silero v5 | 0.96 | 0.95 | 0.79 | 0.94 | 0.94 | 0.97 | 0.98 | 0.96 |
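For reference, ROC-AUC equals the probability that a randomly chosen speech chunk receives a higher score than a randomly chosen non-speech chunk (ties counted as half). A minimal pure-Python sketch, with toy labels and scores rather than real VAD output:

```python
def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison: fraction of (positive, negative)
    pairs where the positive chunk is scored higher (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Undefined if either class is missing -- which is why datasets
    # containing only non-speech data are excluded from ROC-AUC above.
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos) * len(neg))

# Toy example: 2 non-speech chunks, 2 speech chunks
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
# → 0.75
```

Because ROC-AUC sweeps over all possible thresholds, it compares models independently of any single threshold choice.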
A fixed threshold is required to calculate model accuracy. The optimal threshold for each model was found by validation on the Multi-Domain Validation Set; these thresholds were then used for accuracy calculation on the remaining datasets.
Model | ESC-50 | AliMeeting test | Earnings 21 | MSDWild | AISHELL-4 test | VoxConverse test | Libriparty test | Private noise | Private speech | Multi-Domain Validation |
---|---|---|---|---|---|---|---|---|---|---|
Silero v3 | 0.87 | 0.73 | 0.92 | 0.85 | 0.61 | 0.91 | 0.89 | 0.60 | 0.94 | 0.84 |
Silero v4 | 0.89 | 0.83 | 0.90 | 0.85 | 0.49 | 0.90 | 0.96 | 0.83 | 0.93 | 0.85 |
Unnamed commercial VAD | 0.94 | 0.84 | 0.80 | 0.84 | 0.75 | 0.90 | 0.92 | 0.92 | 0.89 | 0.87 |
Webrtc | 0.38 | 0.82 | 0.89 | 0.83 | 0.57 | 0.84 | 0.80 | 0.84 | 0.86 | 0.74 |
Silero v5 | 0.95 | 0.91 | 0.92 | 0.86 | 0.85 | 0.93 | 0.92 | 0.94 | 0.95 | 0.91 |
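The procedure above, picking a per-model threshold on a validation set and reusing it on the other datasets, can be sketched as a simple sweep. The labels, scores, and candidate thresholds below are toy illustration values, not the real validation data:

```python
def accuracy(labels, scores, threshold):
    """Fraction of chunks whose thresholded prediction matches the label."""
    preds = [1 if s > threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def best_threshold(labels, scores, candidates):
    """Pick the candidate threshold maximizing validation accuracy."""
    return max(candidates, key=lambda t: accuracy(labels, scores, t))

# Toy validation set: 3 non-speech chunks, 3 speech chunks
val_labels = [0, 0, 0, 1, 1, 1]
val_scores = [0.1, 0.2, 0.3, 0.6, 0.8, 0.9]

t = best_threshold(val_labels, val_scores, [0.1, 0.3, 0.5, 0.7, 0.9])
print(t, accuracy(val_labels, val_scores, t))
# → 0.3 1.0
```

The selected threshold is then frozen and applied unchanged to the test datasets, which is why accuracy can drop on domains that differ from the validation mix.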
Parameters: 16000 Hz sampling rate.
The WebRTC VAD algorithm is extremely fast and fairly good at separating noise from silence, but fairly poor at separating speech from noise.
The unnamed commercial VAD is good overall, but Silero VAD was able to surpass it in quality (as of 2024).
Parameters: 16000 Hz sampling rate.
As you can see, v5 delivers a large jump in quality over the previous versions.