
VANPY

VANPY (Voice Analysis framework in Python) is a flexible and extensible framework for voice analysis, feature extraction, and model inference. It provides a modular pipeline architecture for processing audio segments with near- and state-of-the-art deep learning models.


Architecture

VANPY consists of three optional pipelines that can be used independently or in combination:

  1. Preprocessing Pipeline: Handles audio format conversion and voice segment extraction
  2. Feature Extraction Pipeline: Generates feature/latent vectors from voice segments
  3. Model Inference Pipeline: Applies trained models to the extracted features to produce classification and regression outputs (e.g., gender, emotion, age, height)

You can use these pipelines flexibly based on your needs:

  • Use only preprocessing for voice separation
  • Combine preprocessing and classification for direct audio analysis
  • Use all pipelines for complete feature extraction and classification
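
For example, a complete run chaining all three pipelines could be wired up roughly as follows. This is a minimal sketch: the `Pipeline` entry point, component names, and `process()` call are assumptions based on the architecture described above, not the exact API; see src/run.py for working examples.

```python
# Hypothetical sketch of a full VANPY run -- names are assumptions,
# see src/run.py and src/pipeline.yaml for the actual API and config.
from vanpy.core.Pipeline import Pipeline  # assumed entry point

pipeline = Pipeline(
    components=[
        "file_mapper",            # Filelist-DataFrame Creator
        "wav_converter",          # standardize to 16 kHz mono WAV
        "pyannote_vad",           # keep only speech segments
        "speechbrain_embedding",  # ECAPA-TDNN embeddings
        "vanpy_gender",           # gender classification
        "vanpy_age",              # age estimation
    ],
    config="pipeline.yaml",
)
payload = pipeline.process()  # returns a ComponentPayload
print(payload.df.head())
```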

Models Trained as Part of the VANPY Project

| Task                              | Dataset                    | Performance |
|-----------------------------------|----------------------------|-------------|
| Gender Identification (Accuracy)  | VoxCeleb2                  | 98.9%       |
|                                   | Mozilla Common Voice v10.0 | 92.3%       |
|                                   | TIMIT                      | 99.6%       |
| Emotion Recognition (Accuracy)    | RAVDESS (8-class)          | 84.71%      |
|                                   | RAVDESS (7-class)          | 86.24%      |
| Age Estimation (MAE, years)       | VoxCeleb2                  | 7.88        |
|                                   | TIMIT                      | 4.95        |
|                                   | Combined VoxCeleb2-TIMIT   | 6.93        |
| Height Estimation (MAE, cm)       | VoxCeleb2                  | 6.01        |
|                                   | TIMIT                      | 6.02        |

All of these models can be used as part of the VANPY pipeline or on their own, and are available on 🤗HuggingFace.
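
The exact repository names are not listed here, so the snippet below is a generic sketch using the real huggingface_hub download API with placeholder identifiers; for the SVM-based models, joblib is a plausible (assumed) serialization format.

```python
# Fetching a trained VANPY model from the HuggingFace Hub.
# "<repo-id>" and "<model-file>" are placeholders -- look them up on
# the project's HuggingFace page; joblib.load is an assumption.
import joblib
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(repo_id="<repo-id>", filename="<model-file>")
model = joblib.load(model_path)  # e.g., an SVM gender classifier

# prediction = model.predict(embedding.reshape(1, -1))
```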

Configuration

Environment Setup

  1. Create a pipeline.yaml configuration file. You can use src/pipeline.yaml as a template.
  2. For HuggingFace models (Pyannote components), create a .env file containing your access token:

     huggingface_ACCESS_TOKEN=<your_token>

  3. Pipeline examples are available in src/run.py.
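
Regarding step 2: the token has to reach the Pyannote components at runtime. One common way to load a .env file in Python is python-dotenv, shown below as an assumption; the README does not specify how VANPY reads the file internally.

```python
# Load the HuggingFace token from .env (python-dotenv is an assumption;
# VANPY may read the file through a different mechanism).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
hf_token = os.environ["huggingface_ACCESS_TOKEN"]
```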

Components

Each component receives a ComponentPayload object as input and returns a ComponentPayload object as output.

Each component supports:

  • Batch processing (if applicable)
  • Progress tracking
  • Performance monitoring and logging
  • Incremental processing (skip already processed files)
  • GPU acceleration where applicable
  • Configurable parameters

Preprocessing Components

| Component | Description |
|---|---|
| Filelist-DataFrame Creator | Initializes the data pipeline by creating a DataFrame of audio file paths. Supports both directory scanning and loading from existing CSV files. Manages path metadata for downstream components. |
| WAV Converter | Standardizes audio to WAV with configurable parameters: bit rate (default: 256k), channels (default: mono), sample rate (default: 16 kHz), and codec (default: PCM 16-bit). Uses FFmpeg for robust conversion. |
| WAV Splitter | Splits large audio files into manageable segments based on either duration or file-size limits. Maintains audio quality and creates properly labeled segments with references to the original files. |
| INA Voice Separator | Separates audio into voice and non-voice segments, distinguishing between male and female speakers. Filters out non-speech content while preserving speaker gender information. |
| Pyannote VAD | Voice Activity Detection using Pyannote's state-of-the-art deep learning model. Identifies and extracts speech segments with configurable sensitivity. |
| Silero VAD | Alternative Voice Activity Detection using Silero's efficient model. Optimized for real-time performance with customizable parameters. |
| Pyannote SD | Speaker Diarization component that identifies and separates different speakers in audio. Creates individual segments for each speaker with timing information. Supports overlapping-speech handling. |
| MetricGAN SE | Speech Enhancement using the MetricGAN+ model from SpeechBrain. Reduces background noise and improves speech clarity. |
| SepFormer SE | Speech Enhancement using the SepFormer model, specialized in separating speech from complex background noise. |
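
For reference, the WAV Converter's defaults correspond to an FFmpeg invocation along the lines below. This is a sketch of the equivalent command-line call, not VANPY's internal code; note that PCM 16-bit mono at 16 kHz yields exactly the 256k default bit rate (16000 Hz × 16 bit × 1 channel = 256 kbps).

```python
# Equivalent FFmpeg call for the WAV Converter defaults.
import subprocess

subprocess.run([
    "ffmpeg", "-y",
    "-i", "input.mp3",       # any FFmpeg-readable input
    "-acodec", "pcm_s16le",  # PCM 16-bit
    "-ac", "1",              # mono
    "-ar", "16000",          # 16 kHz sample rate
    "output.wav",
], check=True)
```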

Feature Extraction Components

| Component | Description |
|---|---|
| Librosa Features Extractor | Comprehensive audio feature extraction using the Librosa library. Supports multiple feature types, including MFCC (Mel-frequency cepstral coefficients), delta-MFCC, zero-crossing rate, spectral features (centroid, bandwidth, contrast, flatness), fundamental frequency (F0), and tonnetz. |
| Pyannote Embedding | Generates speaker embeddings using Pyannote's deep learning models. Uses sliding-window analysis with configurable duration and step size. Outputs high-dimensional embeddings optimized for speaker differentiation. |
| SpeechBrain Embedding | Extracts neural embeddings using SpeechBrain's pretrained models, particularly the ECAPA-TDNN architecture (default: spkrec-ecapa-voxceleb). |
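
To illustrate the feature types the Librosa Features Extractor covers, here is plain librosa usage (a sketch of the underlying library calls, not VANPY's wrapper):

```python
# Standalone librosa calls mirroring the listed feature types.
import librosa
import numpy as np

y, sr = librosa.load("segment.wav", sr=16000)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # MFCC
d_mfcc = librosa.feature.delta(mfcc)                # delta-MFCC
zcr = librosa.feature.zero_crossing_rate(y)         # zero-crossing rate
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
tonnetz = librosa.feature.tonnetz(y=y, sr=sr)

# A fixed-size vector is typically obtained by averaging over time:
feature_vector = np.concatenate([
    mfcc.mean(axis=1), d_mfcc.mean(axis=1),
    [zcr.mean(), centroid.mean()], tonnetz.mean(axis=1),
])
```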

Model Inference Components

| Component | Description |
|---|---|
| VanpyGender Classifier | SVM-based binary gender classification using speech embeddings. Supports two models: ECAPA-TDNN (192-dim) and XVECT (512-dim) embeddings from SpeechBrain. Trained on the VoxCeleb2 dataset with optimized hyperparameters. Provides both verbal ('female'/'male') and numeric labels. |
| VanpyAge Regressor | Multi-architecture age estimation supporting SVR and ANN models. Offers multiple variants: pure SpeechBrain embeddings (192-dim), combined SpeechBrain and Librosa features (233-dim), and dataset-specific models (VoxCeleb2/TIMIT). |
| VanpyEmotion Classifier | 7-class SVM emotion classifier trained on the RAVDESS dataset using SpeechBrain embeddings. Classifies emotions into: angry, disgust, fearful, happy, neutral/calm, sad, surprised. |
| IEMOCAP Emotion | SpeechBrain-based emotion classifier trained on the IEMOCAP dataset. Uses Wav2Vec2 for feature extraction. Supports four emotion classes: angry, happy, neutral, sad. |
| Wav2Vec2 ADV | Advanced emotion analysis using Wav2Vec2, providing continuous scores for the arousal, dominance, and valence dimensions. |
| Wav2Vec2 STT | Speech-to-text transcription using Facebook's Wav2Vec2 model. |
| Whisper STT | OpenAI's Whisper model for robust speech recognition. Supports multiple model sizes and languages. Includes automatic language detection. |
| Cosine Distance Clusterer | Clustering method for speaker diarization based on cosine similarity. Groups speech segments by speaker identity using embedding similarity. |
| GMM Clusterer | Gaussian Mixture Model-based speaker clustering. |
| Agglomerative Clusterer | Hierarchical clustering for speaker diarization. Uses distance-based merging with a configurable threshold and maximum number of clusters. |
| YAMNet Classifier | Google's YAMNet model for general audio classification. Supports 521 audio classes from the AudioSet ontology. |
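
To show the idea behind the clustering-based diarization components, here is a minimal stand-in using scikit-learn (not VANPY's implementation): agglomerative clustering over speaker embeddings with a cosine distance threshold.

```python
# Agglomerative clustering of speaker embeddings with cosine distance.
# Requires scikit-learn >= 1.2 (older versions use affinity= instead of metric=).
import numpy as np
from sklearn.cluster import AgglomerativeClustering

embeddings = np.random.rand(10, 192)  # stand-in for ECAPA-TDNN embeddings

clusterer = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.7,  # illustrative value -- tune per dataset
    metric="cosine",
    linkage="average",
)
speaker_labels = clusterer.fit_predict(embeddings)
print(speaker_labels)  # one speaker id per segment
```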

ComponentPayload Structure

The ComponentPayload class manages data flow between pipeline components:

```python
class ComponentPayload:
    metadata: Dict    # pipeline metadata (fields listed below)
    df: pd.DataFrame  # accumulated processing results
```

Metadata fields

  • input_path: Path to the input directory (required for FilelistDataFrameCreator if no df is provided)
  • paths_column: Column name for audio file paths
  • all_paths_columns: List of all path columns
  • feature_columns: List of feature columns
  • meta_columns: List of metadata columns
  • classification_columns: List of classification columns

df fields

  • df: pd.DataFrame

    Accumulates all information collected throughout preprocessing and classification:

    • each preprocessor adds a column of paths to the files it produced
    • embedding/feature extraction components add the embedding/feature columns
    • each model adds a column with its results

Key Methods

  • get_features_df(): returns a DataFrame restricted to the feature columns
  • get_classification_df(): returns a DataFrame restricted to the classification-results columns
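
A sketch of constructing and querying a payload follows; the import path and constructor signature are assumptions, while the metadata keys and method names come from this section.

```python
# Constructing and querying a ComponentPayload (constructor form assumed).
import pandas as pd
from vanpy.core.ComponentPayload import ComponentPayload  # assumed module path

payload = ComponentPayload(
    metadata={
        "input_path": "data/audio/",
        "paths_column": "audio_path",
        "all_paths_columns": ["audio_path"],
        "feature_columns": [],
        "meta_columns": [],
        "classification_columns": [],
    },
    df=pd.DataFrame({"audio_path": ["data/audio/a.wav"]}),
)

features_df = payload.get_features_df()              # feature columns only
classification_df = payload.get_classification_df()  # classification results only
```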

Coming Soon

  • Custom classifier integration guide
  • Additional preprocessing components
  • Extended model support
  • Support for newer Python and dependency versions

Citing VANPY

Please cite VANPY if you use it:

```bibtex
@misc{koushnir2025vanpyvoiceanalysisframework,
      title={VANPY: Voice Analysis Framework},
      author={Gregory Koushnir and Michael Fire and Galit Fuhrmann Alpert and Dima Kagan},
      year={2025},
      eprint={2502.17579},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2502.17579},
}
```
