VANPY (Voice Analysis framework in Python) is a flexible and extensible framework for voice analysis, feature extraction, and model inference. It provides a modular pipeline architecture for processing audio segments with state-of-the-art and near-state-of-the-art deep learning models.
VANPY consists of three optional pipelines that can be used independently or in combination:
- Preprocessing Pipeline: Handles audio format conversion and voice segment extraction
- Feature Extraction Pipeline: Generates feature/latent vectors from voice segments
- Model Inference Pipeline: Applies pretrained models (classification, regression, speech-to-text) to voice segments or their extracted features
You can use these pipelines flexibly based on your needs:
- Use only preprocessing for voice separation
- Combine preprocessing and model inference for direct audio analysis
- Use all pipelines for complete feature extraction and classification
| Task | Dataset | Performance |
|---|---|---|
| Gender Identification (Accuracy) | VoxCeleb2 | 98.9% |
| | Mozilla Common Voice v10.0 | 92.3% |
| | TIMIT | 99.6% |
| Emotion Recognition (Accuracy) | RAVDESS (8-class) | 84.71% |
| | RAVDESS (7-class) | 86.24% |
| Age Estimation (MAE in years) | VoxCeleb2 | 7.88 |
| | TIMIT | 4.95 |
| | Combined VoxCeleb2-TIMIT | 6.93 |
| Height Estimation (MAE in cm) | VoxCeleb2 | 6.01 |
| | TIMIT | 6.02 |
All of the models can be used as part of the VANPY pipeline or on their own, and are available on 🤗HuggingFace.
- Create a `pipeline.yaml` configuration file. You can use `src/pipeline.yaml` as a template.
- For HuggingFace models (Pyannote components), create a `.env` file:

  ```
  huggingface_ACCESS_TOKEN=<your_token>
  ```

- Pipeline examples are available in `src/run.py`.
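
The HuggingFace token is needed because the Pyannote models are gated on the Hub. If you want to verify from your own script that the token in `.env` is visible, a minimal sketch using the `python-dotenv` package (an extra dependency assumed here; VANPY's components consume the token themselves once it is set) could look like this:

```python
# Minimal check that the HuggingFace token from .env is available.
# Assumes the python-dotenv package is installed.
import os

from dotenv import load_dotenv

load_dotenv()  # loads variables from the .env file in the working directory
token = os.getenv("huggingface_ACCESS_TOKEN")
if not token:
    raise RuntimeError("huggingface_ACCESS_TOKEN is missing from .env")
print("HuggingFace token loaded.")
```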
Each component takes a `ComponentPayload` object as input and returns one as output.
Each component supports:
- Batch processing (if applicable)
- Progress tracking
- Performance monitoring and logging
- Incremental processing (skip already processed files)
- GPU acceleration where applicable
- Configurable parameters
Component | Description |
---|---|
Filelist-DataFrame Creator | Initializes data pipeline by creating a DataFrame of audio file paths. Supports both directory scanning and loading from existing CSV files. Manages path metadata for downstream components. |
WAV Converter | Standardizes audio format to WAV with configurable parameters including bit rate (default: 256k), channels (default: mono), sample rate (default: 16kHz), and codec (default: PCM 16-bit). Uses FFMPEG for robust conversion. |
WAV Splitter | Handles large audio files by splitting them into manageable segments based on either duration or file size limits. Maintains audio quality and creates properly labeled segments with original file references. |
INA Voice Separator | Separates audio into voice and non-voice segments, distinguishing between male and female speakers. Filters out non-speech content while preserving speaker gender information. |
Pyannote VAD | Performs Voice Activity Detection using Pyannote's state-of-the-art deep learning model. Identifies and extracts speech segments with configurable sensitivity. |
Silero VAD | Alternative Voice Activity Detection using Silero's efficient model. Optimized for real-time performance with customizable parameters. |
Pyannote SD | Speaker Diarization component that identifies and separates different speakers in audio. Creates individual segments for each speaker with timing information. Supports overlapping speech handling. |
MetricGAN SE | Speech Enhancement using MetricGAN+ model from SpeechBrain. Reduces background noise and improves speech clarity. |
SepFormer SE | Speech Enhancement using SepFormer model, specialized in separating speech from complex background noise. |
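
The WAV Converter defaults are internally consistent: 16 kHz × 16-bit × 1 channel works out to 256 kb/s. As a point of reference only, a roughly equivalent standalone conversion with FFMPEG (invoked here from Python via `subprocess`; the exact flags the component passes may differ) is sketched below:

```python
# Rough standalone equivalent of the WAV Converter defaults; illustrative only.
import subprocess

def to_wav(src: str, dst: str) -> None:
    """Convert an audio file to mono, 16 kHz, 16-bit PCM WAV."""
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", src,
            "-ac", "1",           # mono
            "-ar", "16000",       # 16 kHz sample rate
            "-c:a", "pcm_s16le",  # 16-bit PCM codec
            dst,
        ],
        check=True,
    )

to_wav("interview.mp3", "interview.wav")  # hypothetical file names
```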
Component | Description |
---|---|
Librosa Features Extractor | Comprehensive audio feature extraction using the Librosa library. Supports multiple feature types including: MFCC (Mel-frequency cepstral coefficients), Delta-MFCC, zero-crossing rate, spectral features (centroid, bandwidth, contrast, flatness), fundamental frequency (F0), and tonnetz. |
Pyannote Embedding | Generates speaker embeddings using Pyannote's deep learning models. Uses sliding window analysis with configurable duration and step size. Outputs high-dimensional embeddings optimized for speaker differentiation. |
SpeechBrain Embedding | Extracts neural embeddings using SpeechBrain's pretrained models, particularly the ECAPA-TDNN architecture (default: spkrec-ecapa-voxceleb). |
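
For orientation, the sketch below computes the same feature families directly with Librosa. It is not the VANPY component itself, and the parameter values (`n_mfcc`, the F0 search range, the input file) are illustrative assumptions:

```python
# Illustrative Librosa feature extraction; parameter values are assumptions.
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=16000)  # hypothetical input file

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
delta_mfcc = librosa.feature.delta(mfcc)
zcr = librosa.feature.zero_crossing_rate(y)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)
contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
flatness = librosa.feature.spectral_flatness(y=y)
f0 = librosa.yin(y, fmin=65, fmax=300, sr=sr)  # fundamental frequency
tonnetz = librosa.feature.tonnetz(y=y, sr=sr)

# One fixed-length vector per segment: average each feature over time.
segment_vector = np.concatenate(
    [np.atleast_2d(f).mean(axis=1)
     for f in (mfcc, delta_mfcc, zcr, centroid, bandwidth,
               contrast, flatness, f0, tonnetz)]
)
```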
Component | Description |
---|---|
VanpyGender Classifier | SVM-based binary gender classification using speech embeddings. Supports two models: ECAPA-TDNN (192-dim) and XVECT (512-dim) embeddings from SpeechBrain. Trained on VoxCeleb2 dataset with optimized hyperparameters. Provides both verbal ('female'/'male') and numeric label options. |
VanpyAge Regressor | Multi-architecture age estimation supporting SVR and ANN models. Features multiple variants: pure SpeechBrain embeddings (192-dim), combined SpeechBrain and Librosa features (233-dim), and dataset-specific models (VoxCeleb2/TIMIT). |
VanpyEmotion Classifier | 7-class SVM emotion classifier trained on RAVDESS dataset using SpeechBrain embeddings. Classifies emotions into: angry, disgust, fearful, happy, neutral/calm, sad, surprised. |
IEMOCAP Emotion | SpeechBrain-based emotion classifier trained on the IEMOCAP dataset. Uses Wav2Vec2 for feature extraction. Supports four emotion classes: angry, happy, neutral, sad. |
Wav2Vec2 ADV | Advanced emotion analysis using Wav2Vec2, providing continuous scores for arousal, dominance, and valence dimensions. |
Wav2Vec2 STT | Speech-to-text transcription using Facebook's Wav2Vec2 model. |
Whisper STT | OpenAI's Whisper model for robust speech recognition. Supports multiple model sizes and languages. Includes automatic language detection. |
Cosine Distance Clusterer | Clustering method for speaker diarization based on cosine similarity between embeddings. Groups speech segments by speaker identity. |
GMM Clusterer | Gaussian Mixture Model-based speaker clustering. |
Agglomerative Clusterer | Hierarchical clustering for speaker diarization. Uses distance-based merging with configurable threshold and maximum clusters. |
YAMNet Classifier | Google's YAMNet model for general audio classification. Supports 521 audio classes from AudioSet ontology. |
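
To illustrate how the distance-based clusterers group segments by speaker, the sketch below applies scikit-learn's agglomerative clustering with a cosine metric to per-segment embeddings. It stands in for, rather than reproduces, VANPY's clusterer components, and the threshold is an arbitrary example value:

```python
# Illustrative cosine-distance speaker clustering with scikit-learn.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Stand-in for real per-segment embeddings (e.g. 192-dim ECAPA-TDNN vectors).
embeddings = np.random.rand(20, 192)

clusterer = AgglomerativeClustering(
    n_clusters=None,          # let the threshold decide the number of speakers
    metric="cosine",          # use affinity="cosine" on scikit-learn < 1.2
    linkage="average",
    distance_threshold=0.3,   # example value; tune per dataset
)
speaker_labels = clusterer.fit_predict(embeddings)  # one cluster id per segment
print(speaker_labels)
```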
The `ComponentPayload` class manages data flow between pipeline components:

```python
from typing import Dict
import pandas as pd

class ComponentPayload:
    metadata: Dict     # Pipeline metadata
    df: pd.DataFrame   # Processing results
```
- `metadata` (Dict): pipeline-level information shared between components:
  - `input_path`: path to the input directory (required for `FilelistDataFrameCreator` if no `df` is provided)
  - `paths_column`: column name for audio file paths
  - `all_paths_columns`: list of all path columns
  - `feature_columns`: list of feature columns
  - `meta_columns`: list of metadata columns
  - `classification_columns`: list of classification columns
- `df` (pd.DataFrame): includes all of the information collected through preprocessing and classification:
  - each preprocessor adds a column of paths where the processed files are held
  - embedding/feature extraction components add embedding/feature columns
  - each model adds a model-results column

Key methods:
- `get_features_df()`: extract the features DataFrame
- `get_classification_df()`: extract the classification results DataFrame
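
A hypothetical downstream consumer of a payload returned at the end of a pipeline run, using only the attributes and methods documented above, might look like this:

```python
# Hypothetical helper; `payload` is a ComponentPayload produced by a pipeline run.
import pandas as pd

def summarize_payload(payload) -> None:
    """Print an overview of what the pipeline produced."""
    print("metadata keys:", list(payload.metadata.keys()))
    print("rows processed:", len(payload.df))

    features_df: pd.DataFrame = payload.get_features_df()
    classification_df: pd.DataFrame = payload.get_classification_df()
    print("feature columns:", list(features_df.columns))
    print("classification columns:", list(classification_df.columns))
```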
- Custom classifier integration guide
- Additional preprocessing components
- Extended model support
- Support for newer Python and dependency versions
Please cite VANPY if you use it:
```bibtex
@misc{koushnir2025vanpyvoiceanalysisframework,
  title={VANPY: Voice Analysis Framework},
  author={Gregory Koushnir and Michael Fire and Galit Fuhrmann Alpert and Dima Kagan},
  year={2025},
  eprint={2502.17579},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2502.17579},
}
```