This model combines the SpeechBrain ECAPA-TDNN speaker embedding model with an SVR regressor to predict speaker height from audio input

griko/voice-height-regression

Height Estimation Model

This model combines the SpeechBrain ECAPA-TDNN speaker embedding model with an SVR regressor to predict speaker height from audio input. The model was trained on VoxCeleb2 and evaluated on the VoxCeleb2 and TIMIT datasets.
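
To make the second stage concrete, here is a minimal sketch of how a trained SVR maps an embedding vector to a single height value. It is not this repository's code: the RBF kernel, the `gamma` value, the toy 3-dimensional vectors (the real embeddings are 192-dimensional), and the names `rbf_kernel` and `svr_predict` are all assumptions made for illustration.

```python
import math
from typing import Sequence


def rbf_kernel(a: Sequence[float], b: Sequence[float], gamma: float) -> float:
    """RBF similarity between two vectors (kernel choice is an assumption)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)


def svr_predict(
    embedding: Sequence[float],
    support_vectors: Sequence[Sequence[float]],
    dual_coefs: Sequence[float],
    intercept: float,
    gamma: float,
) -> float:
    """Kernel SVR decision function: intercept + sum_i coef_i * K(sv_i, x)."""
    return intercept + sum(
        coef * rbf_kernel(sv, embedding, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    )


# Hypothetical toy parameters, not values from the trained model.
toy_support_vectors = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
toy_dual_coefs = [5.0, -5.0]
toy_intercept = 170.0

height_cm = svr_predict([0.0, 0.0, 0.0], toy_support_vectors,
                        toy_dual_coefs, toy_intercept, gamma=1.0)
print(f"Toy prediction: {height_cm:.1f} cm")
```

In the real pipeline the embedding would come from the ECAPA-TDNN encoder and the support vectors and coefficients from the fitted regressor. An input far from every support vector falls back to the intercept.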

Model Details

  • Architecture: SpeechBrain ECAPA-TDNN embeddings (192-dim) + SVR regressor
    • Output: Predicted height in centimeters (continuous value)
  • Training Data:
    • Height labels were obtained by querying Wikidata for the height property of VoxCeleb1 and VoxCeleb2 speakers and converting the values to centimeters.
    • The resulting set contains 1,715 people with height information across both datasets (VoxCeleb1 and VoxCeleb2), 1,621 of whom appear in VoxCeleb2.
    • The code and data can be found in src\voxceleb_height_data_collection.
    • The original VOXCELEB ENRICHMENT FOR AGE AND GENDER RECOGNITION dataset can be found here.
  • Performance:
    • VoxCeleb2 test set: 6.01 cm Mean Absolute Error (MAE)
    • TIMIT test set: 6.02 cm Mean Absolute Error (MAE)
  • Audio Processing:
    • Input format: Any audio file format supported by soundfile
    • Automatically converted to: 16 kHz, mono (single channel), 256 kbps
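
For readers unfamiliar with the metric reported above, Mean Absolute Error is the average of the absolute differences between true and predicted heights. The sketch below shows the computation in plain Python. The function name and the three sample heights are hypothetical and are not taken from the evaluation data.

```python
from typing import Sequence


def mean_absolute_error(true_cm: Sequence[float], predicted_cm: Sequence[float]) -> float:
    """Average absolute difference between true and predicted heights, in cm."""
    if len(true_cm) != len(predicted_cm):
        raise ValueError("true and predicted lists must have the same length")
    if not true_cm:
        raise ValueError("at least one sample is required")
    return sum(abs(t - p) for t, p in zip(true_cm, predicted_cm)) / len(true_cm)


# Made-up example values: absolute errors are 6, 5 and 7 cm.
example_true = [170.0, 180.0, 165.0]
example_pred = [176.0, 175.0, 172.0]
print(f"MAE: {mean_absolute_error(example_true, example_pred):.2f} cm")
```

An MAE of about 6 cm therefore means that predictions are off by roughly 6 cm on average. Individual errors can be smaller or larger.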

Installation

You can install the package directly from GitHub:

pip install git+https://github.com/griko/voice-height-regression.git
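
After installing, a quick way to confirm that the package is importable is to look up its module without importing it. The following is a small sketch using only the standard library. The helper name `is_installed` is hypothetical, and the module name `voice_height_regression` is taken from the import in the Usage section.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the current path."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("voice_height_regression"):
    print("voice_height_regression is available")
else:
    print("voice_height_regression was not found; check the pip install step")
```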

Usage

from voice_height_regression import HeightRegressionPipeline

# Load the pipeline
regressor = HeightRegressionPipeline.from_pretrained(
    "griko/height_reg_svr_ecapa_voxceleb"
)

# Single file prediction
result = regressor("path/to/audio.wav")
print(f"Predicted height: {result[0]:.1f} cm")

# Batch prediction
results = regressor(["audio1.wav", "audio2.wav"])
print(f"Predicted heights: {[f'{h:.1f}' for h in results]} cm")
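
When processing many files, it can help to keep each prediction next to the path it came from. The sketch below assumes that the pipeline returns one height per input, in input order, which this README does not state explicitly. The helper `pair_predictions` is hypothetical, and the numbers are placeholders, not real model output.

```python
from typing import Dict, Sequence


def pair_predictions(paths: Sequence[str], heights_cm: Sequence[float]) -> Dict[str, float]:
    """Map each audio path to its predicted height, rounded to 0.1 cm."""
    if len(paths) != len(heights_cm):
        raise ValueError("expected exactly one prediction per input path")
    return {path: round(float(h), 1) for path, h in zip(paths, heights_cm)}


# Placeholder values standing in for `regressor([...])` output.
paths = ["audio1.wav", "audio2.wav"]
placeholder_results = [171.26, 182.04]
for path, height in pair_predictions(paths, placeholder_results).items():
    print(f"{path}: {height:.1f} cm")
```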

Limitations

  • The model was trained on celebrity voices from YouTube interviews
  • Performance may vary with:
    • Different audio qualities
    • Different recording conditions
    • Multiple simultaneous speakers

Citation

If you use this model in your research, please cite:

TBD

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

Acknowledgments

  • VoxCeleb2 dataset for providing the training data
  • SpeechBrain team for their excellent speech processing toolkit
