A curated, but probably biased and incomplete, list of awesome DeepFake resources, focused on audio. Even though a selected collection like this cannot cover all relevant materials, we try to make this repo as self-contained as possible.
This collection is inspired by a general deepfake material collection (https://github.com/datamllab/awesome-deepfakes-materials). As both deep learning and deepfakes are evolving fast these days, we feel it would be helpful to maintain a separate repo collecting materials on DeepFake audio. If you want to contribute to this list, feel free to open a pull request.
In December 2017, a user named “DeepFakes” posted realistic-looking videos of famous celebrities on Reddit. These fake videos were generated with deep learning, by swapping the faces in adult movies with celebrities’ faces. Since then, the topic of DeepFakes has gone viral on the internet.
Here, we use DeepFakes to denote any fake content generated by deep learning techniques. DeepFakes come in different forms; perhaps the most typical are: 1) videos and images, 2) texts, and 3) voices. Different deep learning techniques are used to generate them. For instance, videos and images are usually created by generative adversarial networks (GANs), while texts are mostly generated by deep language models based on Transformers.
It is worth pointing out that the first type, videos and images, is the most well-known form of deepfakes. This is largely due to how quickly videos and images spread on social media and multimedia websites like YouTube. This repo, in contrast, aims to collect materials on the threat of fake audio. As Eric Haller, Global EVP of Identity, Fraud, and DataLabs at Experian, points out, your voice footprint has become a significant part of your digital identity (https://www.youtube.com/watch?v=dqhi930ksK8).
Voice Verification: the biometric technology that lets you use your unique voiceprint to access services such as bank accounts. Voice verification is simple because it makes your voice your password (https://www.wellsfargo.com/privacy-security/voice-verification/). Verification outputs a binary result: whether or not the voice belongs to the claimed speaker. If you need to know exactly who the speaker is, that is the related concept of 'voice identification' (a minimal sketch contrasting the two follows these definitions).
Voice Conversion: modifying the speech of a source speaker so that the output sounds like that of a target speaker, without changing the original linguistic content. You can simply think of it as style transfer on audio.
Voice Cloning: often used interchangeably with 'voice conversion'. More strictly, it is the subarea of voice conversion that aims simply to copy a target's voice, without merging it with a source voice.
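To make the verification vs. identification distinction concrete, here is a minimal sketch in Python. It assumes a generic `embed(wav)` function mapping a waveform to a fixed-size speaker embedding (e.g., from a pretrained speaker encoder); the function names and the threshold value are illustrative, not from any particular library.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two speaker embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(claimed_embedding, test_wav, embed, threshold=0.75):
    """Verification: binary decision, does the voice match the claimed speaker?"""
    return cosine(claimed_embedding, embed(test_wav)) >= threshold

def identify(enrolled, test_wav, embed):
    """Identification: which enrolled speaker does the voice most resemble?"""
    test_embedding = embed(test_wav)
    scores = {name: cosine(e, test_embedding) for name, e in enrolled.items()}
    return max(scores, key=scores.get)
```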
For the task of fake audio generation, especially deepfake audio generation, there are three main types of models involved (definitions extracted from the paper Who Are You (I Really Wanna Know)? Detecting Audio DeepFakes Through Vocal Tract Reconstruction):
(1) Encoder: The encoder learns a unique representation of the speaker’s voice, known as the speaker embedding. The embedding is derived from a short utterance of the target speaker’s voice. Its accuracy can be increased by giving the encoder more utterances, with diminishing returns. The output embedding from the encoder stage can then be passed as an input to the following synthesizer stage.
(2) Synthesizer: A synthesizer generates a Mel Spectrogram from a given text and the speaker embedding. A Mel Spectrogram is a spectrogram whose frequencies are scaled using the Mel scale, which is designed to model the audio perception of the human ear. Some synthesizers, such as Tacotron, can produce spectrograms solely from a sequence of characters or phonemes.
(3) Vocoder: The vocoder converts the Mel Spectrogram back into the corresponding audio waveform. This newly generated waveform will ideally sound like the target individual uttering the given sentence. A commonly used vocoder is some variation of WaveNet, a deep convolutional neural network that exploits surrounding contextual information to generate its waveform.
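To tie the three stages together, here is a minimal sketch of the encoder -> synthesizer -> vocoder data flow. The `encoder`, `synthesizer`, and `vocoder` interfaces below are hypothetical stand-ins for real models such as GE2E, Tacotron 2, and WaveNet; only the pipeline structure is the point.

```python
import numpy as np

def clone_voice(reference_wav: np.ndarray, text: str,
                encoder, synthesizer, vocoder) -> np.ndarray:
    """Hypothetical three-stage deepfake audio generation pipeline."""
    # (1) Encoder: derive a fixed-size speaker embedding from a short
    #     reference utterance of the target speaker.
    embedding = encoder.embed(reference_wav)       # e.g. a (256,) vector
    # (2) Synthesizer: condition on the text and the embedding to produce
    #     a Mel spectrogram (frequency axis warped by the Mel scale).
    mel = synthesizer.synthesize(text, embedding)  # shape (n_mels, n_frames)
    # (3) Vocoder: invert the Mel spectrogram into a raw waveform that
    #     ideally sounds like the target speaker uttering `text`.
    return vocoder.infer(mel)
```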
A figure from the paper Neural Voice Cloning with a Few Samples nicely illustrates the two main technical paths for deepfake voice cloning. We attach it here to further explain the process.
From the same paper, the ideas behind the two paths are:
(1) The idea of speaker adaptation is to fine-tune a trained multi-speaker model for an unseen speaker using a few audio-text pairs. Fine-tuning can be applied to either the speaker embedding or the whole model.
(2) The idea of the speaker encoding method is to directly estimate the speaker embedding from audio samples of an unseen speaker. Such a model does not require any fine-tuning during voice cloning. Thus, the same model can be used for all unseen speakers.
Typically, speaker adaptation achieves better performance than speaker encoding, especially when more data samples are available for the tuning step. However, the computational cost of speaker encoding is much lower, because it requires no further model tuning (see the sketch below).
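The contrast between the two paths can be sketched as follows. This is a schematic under assumed interfaces (`model.reconstruction_loss`, `model.embed_dim`, and the pretrained `encoder` are hypothetical), not any specific implementation from the paper.

```python
import torch

def speaker_adaptation(model, audio_text_pairs, steps=100, lr=1e-4):
    """Path 1: fine-tune a new speaker embedding on a few (audio, text) pairs."""
    embedding = torch.zeros(1, model.embed_dim, requires_grad=True)
    optimizer = torch.optim.Adam([embedding], lr=lr)
    for _ in range(steps):
        for wav, text in audio_text_pairs:
            loss = model.reconstruction_loss(wav, text, embedding)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return embedding.detach()   # per-speaker training is needed for each clone

def speaker_encoding(encoder, wavs):
    """Path 2: predict the embedding directly from audio; no fine-tuning at all."""
    with torch.no_grad():
        return torch.stack([encoder(w) for w in wavs]).mean(dim=0)
```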
- Deepfake: The Good, The Bad and the Ugly - Medium
- DeepFakes: AI-powered deception machines
- Exploring DeepFakes - KDnuggets
- Family fun with deepfakes. Or how I got my wife onto the Tonight Show - Towards Data Science
- Deepfake propaganda is not a real problem - The Verge
Deepfake Voices:
- You can now speak using someone else’s voice with Deep Learning
- A Voice Deepfake Was Used To Scam A CEO Out Of $243,000
- Voice cloning technology may have been used in Pearland scam
Even though we focus on DeepFake audio attacks, it is worth pointing out that audio-based attacks can be more varied. Typical types of audio attacks (definitions extracted from the ASVspoof Challenge):
- Logical Access (LA): bona fide and spoofed utterances generated using text-to-speech (TTS) and voice conversion (VC) algorithms are communicated across telephony and VoIP networks with various coding and transmission effects;
- Physical Access (PA): bona fide utterances are made in a real, physical space in which spoofing attacks are captured and then replayed within the same physical space using replay devices of varying quality;
- Speech Deepfake (DF): a fake audio detection task comprising bona fide and spoofed utterances generated using TTS and VC algorithms. Similar to the LA task (includes compressed data) but without speaker verification.
The LA and PA tasks are based on the classical setting of automatic speaker verification (ASV). In ASVspoof contests, the metric for these two tasks is the minimum tandem detection cost function (min t-DCF; details can be found in the link above). The newer DF task has a fake media / fake audio / deepfake flavour in which there is no ASV system. The metric for the DF condition reverts to the equal error rate (EER): the operating point at which the false acceptance rate and false rejection rate are equal.
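For reference, EER can be computed from detector scores in a few lines of Python. A minimal sketch using scikit-learn's ROC utilities, assuming labels of 1 for bona fide and 0 for spoofed, with higher scores meaning "more likely bona fide":

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the operating point where false acceptance == false rejection."""
    fpr, tpr, _ = roc_curve(labels, scores)  # false/true positive rates per threshold
    fnr = 1 - tpr                            # false rejection rate
    idx = np.nanargmin(np.abs(fnr - fpr))    # threshold where FAR ~= FRR
    return (fpr[idx] + fnr[idx]) / 2         # average out the discretization gap
```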
Papers:
- Tacotron: Towards End-to-End Speech Synthesis. [Paper]
- Neural Voice Cloning with a Few Samples. [Paper]
- Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning. [Paper]
- Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention. [Paper]
- A unified system for voice cloning and voice conversion through diffusion probabilistic modeling. [Paper] [Demo]
- Coqui-TTS (https://github.com/coqui-ai/tts). A TTS toolkit with voice cloning support (see the usage sketch after this list).
- Real-Time-Voice-Cloning (https://github.com/CorentinJ/Real-Time-Voice-Cloning). Even though this tool seems to be the one with the highest number of stars on GitHub, the project itself has been archived because the development team migrated to the startup Resemble AI (https://www.resemble.ai/). It is nevertheless the simplest voice cloning tool and worth a trial (https://www.youtube.com/watch?v=1WN8Jhfd4uM). Video intro from the author: https://www.youtube.com/watch?v=-O_hYhToKoA.
- Tacotron 2 (https://github.com/NVIDIA/tacotron2). Voice Clone Tutorial: https://www.youtube.com/watch?v=b1fzyM0VhhI
- deep-voice-conversion (https://github.com/andabi/deep-voice-conversion)
- Tacotron (https://github.com/keithito/tacotron)
- MyCroftAI Mimic2 (https://github.com/MycroftAI/mimic2)
- Neural Voice Cloning with Few Samples. (https://github.com/Sharad24/Neural-Voice-Cloning-with-Few-Samples)
- Deepvoice3_pytorch. (https://github.com/r9y9/deepvoice3_pytorch)
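As a quick taste of these toolkits, below is a hedged usage example for Coqui-TTS's Python API (the model name and keyword arguments follow the documented YourTTS zero-shot cloning recipe at the time of writing; check the repo's README for the current interface):

```python
from TTS.api import TTS

# YourTTS supports zero-shot voice cloning from a short reference clip.
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")
tts.tts_to_file(
    text="This is a cloned voice sample.",
    speaker_wav="reference.wav",   # a few seconds of the target speaker
    language="en",
    file_path="cloned.wav",
)
```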
Even though deepfake audio is developing fast, real-world damage is usually not caused by the fake audio alone. In current fake audio detection research, one natural metric is the Mean Opinion Score (MOS) given by human listeners. Generally speaking, currently generated voice-clone audio clips are still distinguishable by the following cues:
- Inconsistent sentences
- Flat or abruptly varying tone in speech
- Phrasing – consider the habit of the speaker
- Context of speech – Is it really the discussion topic that should be conducted?
The most important point is to hold a zero-trust mindset for sensitive cases: always be alert, always verify!
- Deep Learning for Deepfakes Creation and Detection (https://arxiv.org/abs/1909.11573).
- DeepFakes and Beyond: A Survey of Face Manipulation and Fake Detection (https://arxiv.org/abs/2001.00179)
- Media Forensics and DeepFakes: an overview (https://arxiv.org/abs/2001.06564).
- Will Deepfakes Do Deep Damage? (https://cacm.acm.org/magazines/2020/1/241708-will-deepfakes-do-deep-damage/fulltext).
- Tutorials/Papers from NII Yamagishi Lab team. (https://nii-yamagishilab.github.io/, http://tonywangx.github.io/slide.html)
- FastAudio: A Learnable Audio Front-End for Spoof Speech Detection. [Paper] [Code]
- Complementing Handcrafted Features with Raw Waveform Using a Light-weight Auxiliary Model. [Paper] [Code]
- An Initial Investigation for Detecting Partially Spoofed Audio. [Paper]
- Defending Against Deepfakes Using Adversarial Attacks on Conditional Image Translation Networks. [Paper] [Code]
- Language-Independent Speaker Anonymization Approach using Self-Supervised Pre-Trained Models. [Paper] [Code]
- ASVspoof Challenge (https://www.asvspoof.org/) ASVspoof Challenge: a contest for fake audio-based attack detection that has been held several times since 2015. It defines three types of audio attacks, including deepfake audio.
- ICASSP ADD 2022 (https://arxiv.org/abs/2202.08433) ICASSP ADD 2022: the first audio deep synthesis detection challenge (Mandarin-language dataset).
- Voice Privacy Challenge (https://www.voiceprivacychallenge.org) VoicePrivacy: develop anonymisation and pseudonymisation solutions which suppress personally identifiable information while preserving linguistic content and speech quality/naturalness. It is worth pointing out that privacy is a very broad concept. According to the VoicePrivacy 2020 Summary Report, this could include, e.g., age, gender, ethnic origin, geographical background, health or emotional state, political orientation, and religious beliefs. Speaker recognition systems can also reveal the speaker’s identity. So DeepFake audio can be regarded as a kind of voice privacy leakage.
- Kaggle Deepfake Detection Challenge (https://www.kaggle.com/c/deepfake-detection-challenge/overview) This contest was conducted in 2019. It mainly focused on deepfake video detection; however, its dataset includes some modified audio parts.
- CSIG Contest (http://fmfcc.net/#introduction) A Chinese audio deepfake contest. They provide a reference list of the tools used for dataset generation: http://fmfcc.net/contest-introduction
- Fake-or-Real (FoR) Dataset (https://bil.eecs.yorku.ca/datasets/) A dataset of real and synthetic speech for training classifiers to detect synthetic speech.
We observe enormous growth in AI applications for audio, and many startups are emerging in this area. We include some of them below. Please note that some are still in a beta stage and have no official products released yet. This is only a selected collection to show the current status; we have no financial relationship with any of these companies.
- Resemble AI (https://www.resemble.ai/andy-warhol/) Resemble AI is a startup that clones a voice from about three minutes of audio. The team stems from the developers of the Real-Time-Voice-Cloning toolbox mentioned above. One interesting feature they introduce is the emotion masks that can be applied to the generated voice: https://www.youtube.com/watch?v=cXtgdsWw1xI
- Descript Overdub (https://www.descript.com/overdub) An online review of its voice cloning function: https://www.youtube.com/watch?v=H_SVsTOZHeM
- Respeecher (https://www.respeecher.com/) Speaking in the voice of Obama (voice cloning): https://www.youtube.com/watch?v=t5yw5cR79VA
- Altered Studio (www.altered.ai) Demo video: https://www.youtube.com/watch?v=-plK4bL2-2E
- Voice.AI (https://voice.ai/) Demo video: https://www.youtube.com/watch?v=r5ljaht4Qso
- VoiceMod (https://www.voicemod.net/) Demo video: https://www.youtube.com/watch?v=yUZ64EXYvO4
- ID R&D (https://www.idrnd.ai/voice-anti-spoofing/) Speech conversion, replay attacks, and TTS create certain signal artifacts that are often imperceptible to the human ear. ID R&D's algorithms try to find and identify such artifacts to determine the liveness of the sound source.
- FBI Warning (https://www.ic3.gov/Media/Y2022/PSA220628) Deepfakes and Stolen PII Utilized to Apply for Remote Work Positions
- DHS Report (https://www.dhs.gov/sites/default/files/publications/increasing_threats_of_deepfake_identities_0.pdf) Department of Homeland Security provides a comprehensive report on the topic of deepfake threats. A relevant report: https://datasociety.net/library/deepfakes-and-cheap-fakes/
- DARPA MediFor Program (https://www.darpa.mil/program/media-forensics) DARPA’s Media Forensics (MediFor) program builds algorithms to detect manipulated images and videos. (This seems mainly relevant to visual media, i.e., images and videos, rather than to audio directly.)
- Bank voice verification
- Scam calls
- Fake news generation