This is the repository for the paper *Single and Multi-Speaker Cloned Voice Detection: From Perceptual to Learned Features*, submitted to the 2023 IEEE International Workshop on Information Forensics and Security (WIFS 2023).
The provided source code includes implementations of both the single-speaker and multi-speaker pipelines. Note, however, that the dataset used in the experiments is not included in this repository. To replicate the experiments, you will need to create an analogous dataset of cloned voices generated with different voice-cloning architectures or providers.
The repository does provide code for data generation and adversarial laundering, tailored to one example provider, ElevenLabs. You will need to generate features from your analogous dataset and save them to disk, and you will need to modify the relevant data-handling code so that the pipeline runs against your new dataset.
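As a rough illustration of this feature-generation step, the sketch below uses the openSMILE Python package to extract functionals for a directory of WAV files and save them to disk. It is not the repository's own pipeline (that role is played by the SmileFeatureGenerator class listed in the table below); the eGeMAPSv02 feature set, directory layout, and output path are all assumptions.

```python
# Minimal sketch (not the repository's SmileFeatureGenerator): extract openSMILE
# functionals for every WAV file in a directory and save them to a single CSV.
# The feature set and paths below are illustrative assumptions.
from pathlib import Path

import opensmile
import pandas as pd

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

audio_dir = Path("data/my_cloned_voices")  # hypothetical dataset location
features = [smile.process_file(str(wav)) for wav in sorted(audio_dir.glob("*.wav"))]

# Each call returns a one-row DataFrame indexed by (file, start, end);
# concatenating them gives one row of spectral features per clip.
pd.concat(features).to_csv("spectral_features.csv")
```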
Please refer to the repository and the paper for more detailed instructions on how to use the code and conduct the experiments.
The repository is structured as follows:
Folder | File | Description |
---|---|---|
**Experiment Pipeline** | | |
/src/ | run_pipeline_ljspeech.py | Runs the pipeline for the single-voice (LJSpeech) experiments |
/src/ | run_pipeline_multivoice.py | Runs the pipeline for the multi-voice experiments |
/src/packages/ | ExperimentPipeline.py | Class for running the experiment pipeline and logging results |
/src/packages/ | ModelManager.py | Class for managing the final classification models |
**Feature Generation** | | |
/src/packages/ | AudioEmbeddingsManager.py | Class for managing learned features generated using NVIDIA TitaNet (see the sketch after this table) |
/src/packages/ | SmileFeatureManager.py | Class for managing spectral features generated using openSMILE |
/src/packages/ | SmileFeatureGenerator.py | Class for generating spectral features and saving them to disk for collections of audio files |
/src/packages/ | SmileFeatureSelector.py | Class for selecting spectral features using sklearn.feature_selection |
/src/packages/ | CadenceModelManager.py | Class for managing perceptual features generated using handcrafted techniques |
/src/packages/ | CadenceUtils.py | Utility functions used by CadenceModelManager for generating features |
/src/packages/ | BayesSearch.py | Class implementing Bayesian hyperparameter optimization for the perceptual model |
/src/packages/ | SavedFeatureLoader.py | Helper function for loading the generated features saved to disk during experiments |
**Data Loaders** | | |
/src/packages/ | LJDataLoader.py | Class for loading and handling the LJSpeech data for experiments |
/src/packages/ | TIMITDataLoader.py | Class for loading and handling the TIMIT data for multi-voice experiments |
**Data Generation** | | |
/src/packages/ | BaseDeepFakeGenerator.py | Base class for processing data used for voice cloning |
/src/packages/ | ElevenLabsDeepFakeGenerator.py | Class used to generate deepfakes using the ElevenLabs API |
/src/packages/ | AudioManager.py | Class for resampling audio files and performing adversarial laundering |
**Misc** | | |
. | README.md | Provides an overview of the project |
. | conda_requirements.txt | Dependencies for creating the conda environment |
. | pip_requirements.txt | Dependencies installed with pip |
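The learned features listed above are speaker embeddings from NVIDIA TitaNet. The sketch below shows one way such embeddings can be obtained with the NeMo toolkit; the pretrained model name (titanet_large) and file paths are assumptions, and this is an independent illustration rather than the AudioEmbeddingsManager implementation.

```python
# Minimal sketch (not the repository's AudioEmbeddingsManager): extract a TitaNet
# speaker embedding for a single audio file using NVIDIA NeMo. Paths are hypothetical.
import numpy as np
import nemo.collections.asr as nemo_asr

# Download a pretrained TitaNet speaker-verification model.
speaker_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained(
    model_name="titanet_large"
)

# get_embedding returns a torch tensor with one embedding per audio file.
embedding = speaker_model.get_embedding("data/my_cloned_voices/sample_0001.wav")
np.save("sample_0001_embedding.npy", embedding.detach().cpu().numpy())
```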
An overview of the real and synthetic datasets used in our single-speaker (top) and multi-speaker (bottom) evaluations. The 91,700 WaveFake samples correspond to 13,100 samples for each of seven different vocoder architectures, hence the larger number of clips and longer total duration.
Type | Name | Clips (#) | Duration (sec) |
---|---|---|---|
Real | LJSpeech | 13,100 | 86,117 |
Synthetic | WaveFake | 91,700 | 603,081 |
Synthetic | ElevenLabs | 13,077 | 78,441 |
Synthetic | Uberduck | 13,094 | 83,322 |
Type | Name | Clips (#) | Duration (sec) |
---|---|---|---|
Real | TIMIT | 4,620 | 14,192 |
Synthetic | ElevenLabs | 5,499 | 15,413 |
- The LJ Speech 1.1 Dataset -- Data
- WaveFake: A Data Set to Facilitate Audio Deepfake Detection -- Paper, Data
- TIMIT Acoustic-Phonetic Continuous Speech Corpus -- Data
- ElevenLabs (EL) -- https://beta.elevenlabs.io/
- UberDuck (UD) -- https://app.uberduck.ai/
Accuracies for a personalized, single-speaker classification of unlaundered audio (top) and of audio subjected to adversarial laundering in the form of additive noise and transcoding (bottom; a rough sketch of this laundering step follows the table). Dataset corresponds to ElevenLabs (EL), UberDuck (UD), and WaveFake (WF); Model corresponds to a linear (L) or non-linear (NL) classifier, trained either as a single classifier (real vs. synthetic) or a multi-class classifier (real vs. specific synthesis architecture). Accuracy (%) is reported separately for synthetic and real audio, and the equal error rate (EER) is also reported for the single classifiers.
| | | Synthetic Accuracy (%) | | | Real Accuracy (%) | | | EER (%) | | |
|---|---|---|---|---|---|---|---|---|---|---|
| Dataset | Model | Learned | Spectral | Perceptual | Learned | Spectral | Perceptual | Learned | Spectral | Perceptual |
| **Unlaundered** | | | | | | | | | | |
| *Binary* | | | | | | | | | | |
| EL | single (L) | 100.0 | 99.2 | 78.2 | 100.0 | 99.9 | 72.5 | 0.0 | 0.5 | 24.9 |
| | single (NL) | 100.0 | 99.9 | 82.2 | 100.0 | 100.0 | 80.4 | 0.0 | 0.1 | 18.6 |
| UD | single (L) | 99.8 | 98.9 | 51.9 | 99.9 | 98.9 | 54.0 | 0.1 | 1.1 | 47.2 |
| | single (NL) | 99.7 | 99.2 | 54.4 | 99.9 | 99.0 | 56.5 | 0.2 | 0.9 | 44.5 |
| WF | single (L) | 96.5 | 78.4 | 57.8 | 97.1 | 82.3 | 45.6 | 3.3 | 19.7 | 48.5 |
| | single (NL) | 94.5 | 87.6 | 50.3 | 96.7 | 90.2 | 52.7 | 4.4 | 11.2 | 48.6 |
| EL+UD | single (L) | 99.7 | 94.8 | 63.4 | 99.9 | 97.1 | 60.3 | 0.2 | 4.2 | 37.9 |
| | single (NL) | 99.7 | 99.2 | 57.3 | 99.9 | 99.6 | 69.0 | 0.2 | 0.8 | 37.6 |
| EL+UD+WF | single (L) | 93.2 | 79.7 | 58.4 | 98.7 | 93.0 | 57.6 | 3.6 | 15.9 | 42.1 |
| | single (NL) | 91.2 | 90.6 | 53.1 | 99.0 | 94.1 | 64.7 | 4.1 | 7.9 | 41.6 |
| *Multiclass* | | | | | | | | | | |
| EL+UD | multi (L) | 99.9 | 96.6 | 61.0 | 100.0 | 94.6 | 35.7 | - | - | - |
| | multi (NL) | 99.7 | 98.3 | 65.6 | 100.0 | 97.2 | 43.2 | - | - | - |
| EL+UD+WF | multi (L) | 98.8 | 80.2 | 45.1 | 97.3 | 64.3 | 22.9 | - | - | - |
| | multi (NL) | 98.1 | 94.2 | 48.6 | 96.3 | 84.4 | 27.6 | - | - | - |
| **Laundered** | | | | | | | | | | |
| *Binary* | | | | | | | | | | |
| EL | single (L) | 95.5 | 94.3 | 61.1 | 94.5 | 92.6 | 65.2 | 4.9 | 6.7 | 36.6 |
| | single (NL) | 96.0 | 96.2 | 70.4 | 95.4 | 95.6 | 69.6 | 4.1 | 4.1 | 30.1 |
| UD | single (L) | 95.4 | 81.1 | 61.4 | 91.8 | 84.3 | 44.7 | 6.3 | 17.3 | 46.7 |
| | single (NL) | 95.4 | 86.8 | 52.9 | 93.3 | 86.1 | 55.9 | 5.5 | 13.6 | 45.6 |
| WF | single (L) | 87.6 | 60.7 | 59.6 | 85.0 | 70.4 | 42.5 | 13.9 | 34.4 | 49.4 |
| | single (NL) | 83.6 | 77.1 | 51.4 | 85.6 | 76.7 | 53.9 | 15.3 | 23.1 | 47.3 |
| EL+UD | single (L) | 95.2 | 79.1 | 54.0 | 91.7 | 78.4 | 59.8 | 6.2 | 21.3 | 43.1 |
| | single (NL) | 94.8 | 86.1 | 55.2 | 93.3 | 90.0 | 62.4 | 6.0 | 12.0 | 41.4 |
| EL+UD+WF | single (L) | 83.7 | 70.9 | 50.6 | 88.6 | 72.9 | 59.7 | 13.2 | 28.2 | 44.8 |
| | single (NL) | 83.4 | 79.2 | 53.0 | 90.7 | 85.1 | 60.7 | 12.5 | 17.9 | 43.6 |
| *Multiclass* | | | | | | | | | | |
| EL+UD | multi (L) | 94.2 | 85.6 | 50.9 | 91.0 | 77.1 | 29.1 | - | - | - |
| | multi (NL) | 94.5 | 91.7 | 53.2 | 90.3 | 82.9 | 41.3 | - | - | - |
| EL+UD+WF | multi (L) | 89.8 | 65.4 | 35.3 | 83.1 | 44.3 | 26.2 | - | - | - |
| | multi (NL) | 88.8 | 78.8 | 39.8 | 82.1 | 63.0 | 28.6 | - | - | - |
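The laundering summarized in the bottom half of this table (additive noise followed by transcoding) can be approximated with a short script like the one below. It assumes ffmpeg and the soundfile package are available, and the 20 dB SNR and 64 kb/s MP3 settings are illustrative values, not the exact parameters used in the paper or in the repository's AudioManager.

```python
# Minimal laundering sketch (not the repository's AudioManager): add white noise
# at a target SNR, then transcode the result to MP3 with ffmpeg.
import subprocess

import numpy as np
import soundfile as sf


def add_noise(in_wav: str, out_wav: str, snr_db: float = 20.0) -> None:
    """Add white Gaussian noise at the given signal-to-noise ratio (in dB)."""
    audio, sr = sf.read(in_wav)
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noisy = audio + np.random.normal(0.0, np.sqrt(noise_power), audio.shape)
    sf.write(out_wav, np.clip(noisy, -1.0, 1.0), sr)


def transcode_to_mp3(in_wav: str, out_mp3: str, bitrate: str = "64k") -> None:
    """Transcode a WAV file to MP3 at the given bitrate (requires ffmpeg)."""
    subprocess.run(["ffmpeg", "-y", "-i", in_wav, "-b:a", bitrate, out_mp3], check=True)


add_noise("sample.wav", "sample_noisy.wav")
transcode_to_mp3("sample_noisy.wav", "sample_laundered.mp3")
```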
Accuracies for a non-personalized, multi-speaker classification of unlaundered audio. Dataset corresponds to ElevenLabs (EL); Model corresponds to a linear (L) or non-linear (NL) classifier, trained either as a single classifier (real vs. synthetic) or a multi-class classifier (real vs. specific synthesis architecture). Accuracy (%) is reported separately for synthetic and real audio, and the equal error rate (EER) is also reported for the single classifiers (a minimal sketch of the EER computation follows this table).
| | | Synthetic Accuracy (%) | | | Real Accuracy (%) | | | EER (%) | | |
|---|---|---|---|---|---|---|---|---|---|---|
| Dataset | Model | Learned | Spectral | Perceptual | Learned | Spectral | Perceptual | Learned | Spectral | Perceptual |
| EL | single (L) | 100.0 | 94.2 | 83.8 | 99.9 | 98.3 | 86.9 | 0.0 | 3.0 | 1.3 |
| | single (NL) | 92.3 | 96.3 | 82.2 | 100.0 | 99.7 | 87.7 | 0.1 | 1.6 | 1.4 |
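The equal error rates reported above can be computed from classifier scores in the standard way. The sketch below trains a simple linear classifier on pre-computed features and derives the EER from its ROC curve; the feature and label files are hypothetical, and the logistic-regression model is a stand-in rather than the exact classifiers used in the paper (which are managed by ModelManager and ExperimentPipeline).

```python
# Minimal sketch: train a linear classifier on saved features and compute the EER
# (the operating point where the false-positive and false-negative rates are equal).
# The .npy feature/label files are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

X = np.load("features.npy")  # shape: (n_clips, n_features)
y = np.load("labels.npy")    # 1 = synthetic, 0 = real

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

fpr, tpr, _ = roc_curve(y_te, scores)
fnr = 1.0 - tpr
eer = fpr[np.nanargmin(np.abs(fnr - fpr))]
print(f"EER: {100 * eer:.1f}%")
```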
- Sarah Barrington<sup>1</sup> -- [email protected]
- Romit Barua<sup>1</sup> -- [email protected]
- Gautham Koorma<sup>1</sup> -- [email protected]
- Hany Farid<sup>1,2</sup> -- [email protected]
<sup>1</sup>School of Information and <sup>2</sup>Electrical Engineering and Computer Sciences at the University of California, Berkeley
This work was partially funded by a grant from the UC Berkeley Center For Long-Term Cybersecurity (CLTC), an award for open-source innovation from the Digital Public Goods Alliance and United Nations Development Program, and an unrestricted gift from Meta.
Please cite the following paper if you use this code:
@misc{barrington2023single,
title={Single and Multi-Speaker Cloned Voice Detection: From Perceptual to Learned Features},
author={Sarah Barrington and Romit Barua and Gautham Koorma and Hany Farid},
year={2023},
eprint={2307.07683},
archivePrefix={arXiv},
primaryClass={cs.SD}
}