# Quality Metrics
- [General Information](#general-information)
- [Metrics](#metrics)
  - [VS other](#silero-vad-vs-other-available-solutions)
  - [VS old Silero VAD](#silero-vad-vs-old-silero-vad)
Dataset | Duration, hours | Domain |
---|---|---|
ESC-50 | 2.7 | Environmental noise |
AliMeeting test | 43 | Far/near meetings speech |
Earnings 21 | 39 | Calls |
MSDWild | 80 | Noisy speech |
AISHELL-4 test | 12.7 | Meetings |
VoxConverse test | 43.5 | Noisy speech |
Libriparty test | 4 | Noisy speech |
Private noise | 0.5 | Noisy calls without speech |
Private speech | 3.7 | Speech |
Multi-Domain Validation | 17 | Multi |
The Multi-Domain Validation Set was used for the main testing and model comparison. It includes 2 random hours (where available) of data from each dataset listed above, for a total duration of 17 hours.
Modern Voice Activity Detectors output the speech probability (a float between 0 and 1) for an audio chunk of a desired length using:
- Some pre-trained model;
- Some function of its state or some internal buffer;
The threshold is a value selected by the user that determines whether there is speech in the audio. If the speech probability of an audio chunk is higher than the set threshold, we assume the chunk contains speech. Depending on the desired result, the threshold should be tailored to a specific dataset or domain.
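The thresholding step above can be sketched in a few lines. This is a minimal illustration, not the Silero VAD API: the per-chunk probabilities below are made-up values standing in for the output of some pre-trained model.

```python
def apply_threshold(speech_probs, threshold=0.5):
    """Mark each audio chunk as speech (True) or non-speech (False),
    comparing its speech probability against a user-selected threshold."""
    return [p > threshold for p in speech_probs]

# Hypothetical per-chunk speech probabilities for a short audio stream
probs = [0.02, 0.10, 0.85, 0.97, 0.60, 0.15]

print(apply_threshold(probs, threshold=0.5))
# → [False, False, True, True, True, False]
```

Raising the threshold trades missed speech for fewer false positives, which is why it should be tuned on data from the target domain.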
### ROC-AUC score
Datasets containing only non-speech data (ESC-50, Private noise) are not included in the ROC-AUC calculations, since ROC-AUC is undefined without positive (speech) examples.
Model | AliMeeting test | Earnings 21 | MSDWild | AISHELL-4 test | VoxConverse test | Libriparty test | Private speech | Multi-Domain Validation |
---|---|---|---|---|---|---|---|---|
Silero v3 | 0.85 | 0.95 | 0.78 | 0.89 | 0.93 | 0.93 | 0.98 | 0.92 |
Silero v4 | 0.89 | 0.95 | 0.77 | 0.83 | 0.91 | 0.99 | 0.97 | 0.91 |
Unnamed commercial VAD | 0.91 | 0.87 | 0.76 | 0.87 | 0.92 | 0.96 | 0.95 | 0.93 |
Webrtc | 0.82 | 0.86 | 0.62 | 0.74 | 0.65 | 0.79 | 0.86 | 0.73 |
Silero v5 | 0.96 | 0.95 | 0.79 | 0.94 | 0.94 | 0.97 | 0.98 | 0.96 |
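For reference, ROC-AUC equals the probability that a randomly chosen speech chunk receives a higher score than a randomly chosen non-speech chunk (ties counted as half). A minimal pure-Python sketch, with toy labels and scores rather than real VAD output:

```python
def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison: fraction of (positive, negative)
    pairs where the positive chunk is scored higher (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Undefined if either class is missing -- which is why datasets
    # containing only non-speech data are excluded from ROC-AUC above.
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos) * len(neg))

# Toy example: 2 non-speech chunks, 2 speech chunks
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
# → 0.75
```

Because ROC-AUC sweeps over all possible thresholds, it compares models independently of any single threshold choice.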
A fixed threshold is required to calculate model accuracy. The optimal threshold for each model was found by validation on the Multi-Domain Validation Set; these thresholds were then used for accuracy calculation on the remaining datasets.
Model | ESC-50 | AliMeeting test | Earnings 21 | MSDWild | AISHELL-4 test | VoxConverse test | Libriparty test | Private noise | Private speech | Multi-Domain Validation |
---|---|---|---|---|---|---|---|---|---|---|
Silero v3 | 0.87 | 0.73 | 0.92 | 0.85 | 0.61 | 0.91 | 0.89 | 0.60 | 0.94 | 0.84 |
Silero v4 | 0.89 | 0.83 | 0.90 | 0.85 | 0.49 | 0.90 | 0.96 | 0.83 | 0.93 | 0.85 |
Unnamed commercial VAD | 0.94 | 0.84 | 0.80 | 0.84 | 0.75 | 0.90 | 0.92 | 0.92 | 0.89 | 0.87 |
Webrtc | 0.38 | 0.82 | 0.89 | 0.83 | 0.57 | 0.84 | 0.80 | 0.84 | 0.86 | 0.74 |
Silero v5 | 0.95 | 0.91 | 0.92 | 0.86 | 0.85 | 0.93 | 0.92 | 0.94 | 0.95 | 0.91 |
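The procedure above, picking a per-model threshold on a validation set and reusing it on the other datasets, can be sketched as a simple sweep. The labels, scores, and candidate thresholds below are toy illustration values, not the real validation data:

```python
def accuracy(labels, scores, threshold):
    """Fraction of chunks whose thresholded prediction matches the label."""
    preds = [1 if s > threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def best_threshold(labels, scores, candidates):
    """Pick the candidate threshold maximizing validation accuracy."""
    return max(candidates, key=lambda t: accuracy(labels, scores, t))

# Toy validation set: 3 non-speech chunks, 3 speech chunks
val_labels = [0, 0, 0, 1, 1, 1]
val_scores = [0.1, 0.2, 0.3, 0.6, 0.8, 0.9]

t = best_threshold(val_labels, val_scores, [0.1, 0.3, 0.5, 0.7, 0.9])
print(t, accuracy(val_labels, val_scores, t))
# → 0.3 1.0
```

The selected threshold is then frozen and applied unchanged to the test datasets, which is why accuracy can drop on domains that differ from the validation mix.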
Parameters: 16000 Hz sampling rate.
The WebRTC VAD algorithm is extremely fast and fairly good at separating noise from silence, but fairly poor at separating speech from noise.
The unnamed commercial VAD is good overall, but Silero VAD was able to surpass it in quality (as of 2024).
Parameters: 16000 Hz sampling rate.
As you can see, v5 delivers a large jump in quality over the previous versions.