- Voice Activity Detection / Speech Activity Detection
Real-time versus offline.
https://www.sciencedirect.com/science/article/abs/pii/S0885230813000533 Compares Statistical Model (SM) based VAD and Gaussian Mixture Model (GMM) based VAD on NIST data.
Findings include: (1) noise reduction is vital for energy-based VAD under low SNR; (3) spectral subtraction makes better use of background spectra than the likelihood-ratio tests in the SM-based VAD.
Datasets
Any dataset with marked time regions for speech can be used. What matters is how noisy the datasets are, i.e. how much non-speech content (for example music) they contain.
Code exists inside the Chromium tree, under webrtc/modules/audio_processing/vad
A fork has been created that separates out just the VAD as a C library. https://github.com/dpirch/libfvad
Available for Python via py-webrtcvad. Expects int16 input of the audio waveform, so typical floating-point input needs some massaging. TODO, link
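A minimal sketch of that massaging, assuming 16 kHz mono float32 audio in [-1, 1] (the frame length and placeholder signal are illustrative):

```python
import numpy as np
import webrtcvad

def float_to_pcm16(audio):
    # Scale and clip float samples in [-1, 1] to 16-bit PCM bytes
    return (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16).tobytes()

vad = webrtcvad.Vad(2)  # aggressiveness: 0 (least) to 3 (most)
sample_rate = 16000
frame_samples = int(sample_rate * 0.030)  # frames must be 10, 20 or 30 ms
frame_bytes = frame_samples * 2  # int16 = 2 bytes per sample

audio = np.zeros(sample_rate, dtype=np.float32)  # placeholder: 1 s of silence
pcm = float_to_pcm16(audio)
flags = [vad.is_speech(pcm[i:i + frame_bytes], sample_rate)
         for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes)]
```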
The WebRTC VAD model is described in section 2.3 of Voice Activity Detection Scheme by Combining DNN Model with GMM Model. It is based on a two-mixture Gaussian Mixture Model (noise and speech). The input features are 6 frequency sub-bands. Does online updating of the GMM, with different update coefficients for the GMM coefficients such that the noise model increases slowly but decreases quickly, under the assumption that the sound is mostly noise with occasional speech segments.
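A toy illustration of that asymmetric updating, applied to a single noise-energy estimate rather than the actual GMM (names and coefficients are made up):

```python
def update_noise_estimate(noise_est, frame_energy, alpha_up=0.01, alpha_down=0.2):
    # The noise estimate rises slowly, so brief speech is not absorbed
    # into the noise model, but falls quickly when the background quiets.
    alpha = alpha_up if frame_energy > noise_est else alpha_down
    return (1 - alpha) * noise_est + alpha * frame_energy
```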
Works well for speech versus silence, but does poorly with music or other noise, which will often get flagged as speech.
CNN-based audio segmentation toolkit. Primarily for performing gender detection (tuned for French), but also has a model that distinguishes between music, speech and noise. Uses a 4-layer CNN per frame; per-frame predictions are then combined using a Hidden Markov Model. Has command-line tools for processing a whole batch.
https://github.com/ina-foss/inaSpeechSegmenter
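A minimal usage sketch, following the project README (the input file name is illustrative):

```python
from inaSpeechSegmenter import Segmenter

seg = Segmenter()  # default model covers speech/music/noise plus gender
segmentation = seg('media.wav')  # returns a list of (label, start, end)
for label, start, end in segmentation:
    print(f'{label}: {start:.2f}s - {end:.2f}s')
```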
The Opus audio codec includes a Voice / Music detector, which is used in the encoder to switch between speech-optimized encoding (SILK) and music/general encoding (CELT). In Opus v1.1 it used a multilayer perceptron followed by a Hidden Markov Model. Since Opus v1.3 (October 2018) it is based on an RNN using Gated Recurrent Units. The RNN has around 5000 weights, with 8-bit quantization. This network has two outputs, one for Voice and one for Music, so a low output on both indicates neither speech nor music.
The opus_sm fork has a command-line tool for the old Opus 1.1 speech/music detector. No-one seems to have described how to use the Opus 1.3 detector in another application; there is one question about it on the mailing list.
Voice activity detection for low-resource settings. Abhipray Sahoo, Stanford CS230. Used a 3-layer RNN on a log mel spectrogram with 40 bands and 32 ms windows. Trained and evaluated on a dataset made by combining VCTK with NOISEX-92. Large improvements over WebRTC VAD under noisy conditions.
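For reference, the described feature setup could be computed roughly like this (using librosa; the 16 kHz sample rate and hop length are my assumptions):

```python
import librosa

y, sr = librosa.load('speech.wav', sr=16000)  # illustrative input file
n_fft = int(0.032 * sr)  # 32 ms window = 512 samples at 16 kHz
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                     hop_length=n_fft // 2, n_mels=40)
log_mel = librosa.power_to_db(mel)  # shape (40, n_frames), the RNN input
```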
Voice Activity Detector (VAD) Based on Long-Term Mel Frequency Band Features. VAD using long-term (200 ms) Mel frequency band statistics, auditory masking, and a pre-trained two-level decision tree ensemble classifier. Near 100 % acceptance of clear voice for English, Chinese, Russian, and Polish speech, and 100 % rejection of stationary noise independently of loudness. Reuses short-term FFT analysis (STFFT) from the ASR frontend, with an additional 2 KB of memory and 15 % complexity overhead.
Personal VAD: Speaker-Conditioned Voice Activity Detection. Outputs the probabilities for three classes: non-speech, target speaker speech, and non-target speaker speech. Combined with a speaker verification system. No online / real-time adaptation to speakers; seems to need to be pre-trained for a particular speaker.
https://github.com/snakers4/silero-vad
Released as pretrained PyTorch models. The training setup and train/test datasets are not released, and it does not specify which PyTorch versions to use. Not installable as a pip library. The latest ("big") models are not publicly available. Has some nice examples of real-time usage.
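A usage sketch via torch.hub, following the repository README at the time of writing (the API may have changed since):

```python
import torch

model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad')
(get_speech_timestamps, save_audio, read_audio,
 VADIterator, collect_chunks) = utils

wav = read_audio('speech.wav', sampling_rate=16000)  # illustrative file
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
print(speech_timestamps)  # list of {'start': ..., 'end': ...} in samples
```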
Also known as distance-based segmentation.
First segmentation, then classification.
Distance metrics
- Euclidean distance
- Bayesian information criterion (BIC), see the ΔBIC sketch after this list
- Gaussian likelihood ratio (GLR)
- Kullback-Leibler (KL2) distance
- Hotelling T² statistic
- Cosine distance (with self-similarity analysis)
- Learned distances (using neural networks)
- Hidden Markov Model (HMM)
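As an example of one of these, a minimal sketch of the BIC-based distance (ΔBIC) commonly used for change-point detection; variable names and the λ default are my own:

```python
import numpy as np

def delta_bic(X, Y, lam=1.0):
    # X, Y: feature segments of shape (frames, dims); segments need more
    # frames than dimensions for the covariance estimates to be valid.
    # Positive values favour a change point between X and Y.
    Z = np.vstack([X, Y])
    nx, ny, nz = len(X), len(Y), len(Z)
    d = Z.shape[1]

    def logdet_cov(A):
        _, logdet = np.linalg.slogdet(np.cov(A, rowvar=False))
        return logdet

    score = 0.5 * (nz * logdet_cov(Z) - nx * logdet_cov(X) - ny * logdet_cov(Y))
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(nz)
    return score - penalty
```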
Segmentation performed by classifying consecutive fixed-length segments.
Post-processing of the segments is common with both kinds of approaches. Can be done with an HMM, or with heuristics such as removing segments below a certain length.
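A hypothetical version of the short-segment heuristic, merging too-short segments into the preceding one:

```python
def remove_short_segments(segments, min_duration=0.5):
    # segments: list of (label, start, end) tuples, sorted by time
    out = []
    for label, start, end in segments:
        if out and (end - start) < min_duration:
            prev_label, prev_start, _ = out[-1]
            out[-1] = (prev_label, prev_start, end)  # absorb into previous
        else:
            out.append((label, start, end))
    return out
```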
- Albayzín 2010
- Albayzín 2012
- MIREX 2018
MaxEnt / Maximum Entropy classifier. Same as logistic regression; also known as a conditional exponential classifier.
https://www.programmersought.com/article/81806889517/
How to use HMM model for audio segmentation https://stackoverflow.com/questions/66817625/how-to-use-hmm-model-for-audio-segmentation No answers
https://hal.archives-ouvertes.fr/hal-00957418/document HMM with a fixed structure, decomposing audio into silent, transient, tonal and residual components.
Joint Segmentation and Classification. https://www.sciencedirect.com/book/9780080993881/introduction-to-audio-analysis https://www.sciencedirect.com/topics/engineering/audio-segmentation Uses dynamic programming (Viterbi decoding), as sketched below.
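A compact sketch of that idea: per-frame class log-probabilities plus a fixed penalty for switching classes, decoded with Viterbi into contiguous segments (the penalty value is an arbitrary assumption):

```python
import numpy as np

def viterbi_segment(log_probs, switch_penalty=5.0):
    # log_probs: array (n_frames, n_classes); returns per-frame labels
    n, k = log_probs.shape
    trans = np.full((k, k), -switch_penalty)
    np.fill_diagonal(trans, 0.0)  # no cost for staying in the same class
    dp = log_probs[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        scores = dp[:, None] + trans          # scores[prev, cur]
        back[t] = scores.argmax(axis=0)
        dp = scores.max(axis=0) + log_probs[t]
    path = [int(dp.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```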
An Overview of Automatic Audio Segmentation (October 2014). https://www.researchgate.net/publication/287718330_An_Overview_of_Automatic_Audio_Segmentation
Optimized Audio Classification and Segmentation Algorithm by Using Ensemble Methods (2015). https://www.hindawi.com/journals/mpe/2015/209814/
https://asmp-eurasipjournals.springeropen.com/articles/10.1186/s13636-020-00172-6 EURASIP 2020. An RNN is complemented by a resegmentation module, gaining long-term stability by means of the tied-state concept from hidden Markov models. Explores different neural architectures, introducing temporal pooling layers to reduce the neural network output sampling rate.
Improving automated segmentation of radio shows with audio embeddings https://arxiv.org/abs/2002.05194 ICASSP 2020