Quality Metrics
Some test dataset stats:
- 2,200 audio clips with an average duration of 7 seconds
- 30+ languages
- varying speech audibility (from clear studio recordings to background speech)
- varying noise levels
- each audio clip is manually annotated as a whole (does it contain speech?), not in short chunks
- 55% of the test audio contains speech
Modern voice activity detectors (VADs) output a speech probability (a float between 0 and 1) for an audio chunk of a chosen length, using:
- features extracted from the sound itself
- parameters obtained during training
- previous chunk state
The threshold is a user-selectable value that separates speech from non-speech. If the speech probability of an audio chunk is higher than the threshold, the chunk is treated as speech. Depending on the desired result, the threshold should be tailored to the specific dataset.
We assume that a given VAD algorithm predicted speech for the whole audio if its chunk predictions contain a sequence of probabilities above the threshold that lasts longer than 250 milliseconds.
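The whole-audio rule can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used for these metrics: the function name `clip_has_speech`, the default `threshold` of 0.5, and the `chunk_ms` value of 32.0 (512 samples at 16,000 Hz) are all assumptions made for the example.

```python
def clip_has_speech(probs, threshold=0.5, chunk_ms=32.0, min_speech_ms=250.0):
    """Return True if `probs` contains a run of consecutive chunks that are
    all above `threshold` and together last longer than `min_speech_ms`.

    probs     -- per-chunk speech probabilities for one audio clip
    chunk_ms  -- assumed duration of one chunk in milliseconds
    """
    run = 0  # number of consecutive chunks above the threshold so far
    for p in probs:
        # Extend the current run, or reset it on a below-threshold chunk.
        run = run + 1 if p > threshold else 0
        if run * chunk_ms > min_speech_ms:
            return True
    return False
```

With 32 ms chunks, eight consecutive chunks above the threshold (256 ms) are enough to mark the clip as speech, while seven (224 ms) are not.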
So the test method can be described as follows:
- Get raw model predictions (a sequence of speech probabilities between 0 and 1) for each audio in the test set.
- Apply different thresholds to the raw predictions to decide whether there is speech in the whole audio.
- Calculate recall, precision, accuracy, and zero-class recall for each threshold.
- Draw the Precision-Recall curve.
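The steps above could be sketched as follows. This is a hedged example rather than the actual benchmark script: all function names are hypothetical, the 32 ms chunk duration is an assumption, and "zero-class recall" is interpreted here as recall on the non-speech class (true negatives divided by all non-speech clips).

```python
def has_speech(probs, threshold, chunk_ms=32.0, min_speech_ms=250.0):
    """Whole-clip decision: a run above `threshold` longer than `min_speech_ms`."""
    run = 0
    for p in probs:
        run = run + 1 if p > threshold else 0
        if run * chunk_ms > min_speech_ms:
            return True
    return False


def safe_div(numerator, denominator):
    """Avoid ZeroDivisionError when a class or prediction set is empty."""
    return numerator / denominator if denominator else 0.0


def metrics_at(threshold, clip_probs, labels):
    """Compute clip-level metrics for one threshold.

    clip_probs -- list of per-chunk probability sequences, one per clip
    labels     -- list of booleans, True if the clip contains speech
    """
    preds = [has_speech(probs, threshold) for probs in clip_probs]
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    tn = sum(1 for y, p in zip(labels, preds) if not y and not p)
    return {
        "threshold": threshold,
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
        "accuracy": safe_div(tp + tn, len(labels)),
        "zero_class_recall": safe_div(tn, tn + fp),  # assumed definition
    }


def sweep(clip_probs, labels, thresholds):
    """One metrics dict per threshold; plot recall against precision
    from these dicts to get the Precision-Recall curve."""
    return [metrics_at(t, clip_probs, labels) for t in thresholds]
```

Each dictionary returned by `sweep` corresponds to one point on the Precision-Recall curve, so the curve can be drawn with any plotting library.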
Parameters: 16,000 Hz sample rate, chunks of about 30 milliseconds (512 samples).
The WebRTC VAD algorithm is extremely fast and pretty good at separating noise from silence, but poor at separating speech from noise.
Picovoice VAD is good overall, but we were able to surpass it in quality.
Parameters: 16,000 Hz sample rate, chunks of about 100 milliseconds (1536 samples) for the new VAD models and 250 milliseconds (4000 samples) for the old one.
As you can see, there was a huge jump in the model's quality. The Silero Big model is not publicly available; please contact us if you are interested in it.