Quality Metrics

Dimitrii Voronin edited this page Dec 6, 2021 · 14 revisions

Silero VAD quality metrics

Info

Test Dataset

Some Test Dataset stats:

  • 2,200 audio clips with an average duration of 7 seconds.
  • 30+ languages
  • different speech audibility (from clear studio to background speech)
  • different noise levels
  • each audio is manually annotated as a whole (does it contain speech?), not in short chunks
  • 55% of test audio is speech.

Probability

Modern voice activity detectors output a speech probability (a float between 0 and 1) for an audio chunk of a chosen length, based on:

  • sound external features
  • parameters obtained during training
  • previous chunk state

Threshold

Threshold is a user-selectable value that separates speech from non-speech. If the speech probability of an audio chunk is higher than the threshold, the chunk is treated as speech. The threshold should be tuned for the specific dataset and the desired result.
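
A minimal sketch of the thresholding rule, for illustration only. The function name `is_speech`, the default of 0.5, and the sample probabilities are assumptions made up for this example and are not taken from the Silero VAD code.

```python
def is_speech(probability: float, threshold: float = 0.5) -> bool:
    """Return True when a chunk's speech probability is above the threshold."""
    return probability > threshold


# Hypothetical per-chunk probabilities, invented for illustration.
chunk_probs = [0.05, 0.20, 0.65, 0.91, 0.40]

# One boolean decision per chunk.
labels = [is_speech(p, threshold=0.5) for p in chunk_probs]
```

Raising the threshold trades recall for precision, which is why it is tuned per dataset.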

Method

We assume that a VAD algorithm predicted speech for a whole audio if its chunk predictions contain a sequence of probabilities above a given threshold that lasts longer than 250 milliseconds.
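
The rule above could be sketched as follows. This is one possible reading, assuming the sequence must be made of consecutive chunks and that every chunk has the same duration. The name `audio_has_speech` and its parameters are hypothetical.

```python
def audio_has_speech(probs, threshold, chunk_ms, min_speech_ms=250.0):
    """Decide whether a whole audio contains speech.

    Assumption: "a sequence longer than 250 ms" means a run of
    consecutive chunks, each with probability above the threshold,
    whose total duration exceeds min_speech_ms.
    """
    run = 0  # length of the current run of above-threshold chunks
    for p in probs:
        run = run + 1 if p > threshold else 0
        if run * chunk_ms > min_speech_ms:
            return True
    return False


# With 32 ms chunks, 8 consecutive chunks (256 ms) are needed.
short_run = audio_has_speech([0.9] * 7, threshold=0.5, chunk_ms=32.0)
long_run = audio_has_speech([0.9] * 8, threshold=0.5, chunk_ms=32.0)
```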

The test method can be described as follows:

  • Get raw model predictions (sequence of speech probabilities between 0 and 1) for each audio in the test set.
  • Apply different thresholds to the raw predictions to decide whether there is speech in the whole audio
  • Calculate recall/precision/accuracy/zero class recall for each threshold
  • Draw Precision-Recall curve
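
The steps above might look like the sketch below. It is not the evaluation code used for this page: the helper names, the toy predictions, and the 32 ms chunk duration are assumptions, and "zero class recall" is interpreted here as recall of the non-speech class, TN / (TN + FP).

```python
def has_speech(probs, threshold, chunk_ms=32.0, min_speech_ms=250.0):
    """Whole-audio decision: a consecutive above-threshold run longer than 250 ms."""
    run = 0
    for p in probs:
        run = run + 1 if p > threshold else 0
        if run * chunk_ms > min_speech_ms:
            return True
    return False


def metrics_at(threshold, predictions, labels):
    """Compute whole-audio metrics for one threshold."""
    tp = fp = tn = fn = 0
    for probs, label in zip(predictions, labels):
        pred = has_speech(probs, threshold)
        if pred and label:
            tp += 1
        elif pred and not label:
            fp += 1
        elif not pred and not label:
            tn += 1
        else:
            fn += 1

    def ratio(num, den):
        # Avoid division by zero when a class is never predicted or present.
        return num / den if den else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "zero_class_recall": ratio(tn, tn + fp),
    }


# Toy data, invented for illustration: four audios of ten chunks each.
predictions = [[0.9] * 10, [0.6] * 10, [0.2] * 10, [0.7] * 10]
labels = [True, True, False, False]

# Sweep thresholds; the (recall, precision) pairs are the points of the curve.
sweep = {t: metrics_at(t, predictions, labels) for t in (0.5, 0.8)}
pr_points = [(m["recall"], m["precision"]) for m in sweep.values()]
```

Plotting `pr_points` with any charting library gives the Precision-Recall curve; a real sweep would use many more thresholds.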

Metrics

Vs other available solutions

Parameters: 16000 sample rate, chunks of 512 samples (32 milliseconds, nominally 30).
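
Chunk size in samples and chunk duration are related through the sample rate. A small sketch of the conversion, where the function name `chunk_duration_ms` is hypothetical:

```python
def chunk_duration_ms(num_samples: int, sample_rate: int) -> float:
    """Duration of a chunk in milliseconds."""
    return 1000.0 * num_samples / sample_rate


# 512 samples at 16000 Hz.
duration = chunk_duration_ms(512, 16000)
```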

The WebRTC VAD algorithm is extremely fast and pretty good at separating noise from silence, but poor at separating speech from noise.

Picovoice VAD is good overall, but we were able to surpass it in quality.

Vs_competitors

Vs old Silero VAD

Parameters: 16000 sample rate, chunks of 1536 samples (96 milliseconds, nominally 100) for the new VAD models, 4000 samples (250 milliseconds) for the old one.

As you can see, there was a huge jump in the model's quality. The Silero Big model is not publicly available; please contact us if you are interested in it.

Vs_old

Sample rate comparison

Diff_sr

Chunk size comparison

Diff_chunk_size
