Selected-Deepfakes-Audio-Materials

A curated, though inevitably biased and incomplete, list of Deepfakes resources focused on audio. Although a selected collection cannot cover every relevant material, we try to make this repo as self-contained as possible.

This collection is inspired by a general deepfake material collection (https://github.com/datamllab/awesome-deepfakes-materials). Since both deep learning and deepfakes are evolving quickly these days, we feel it is helpful to maintain a separate repo collecting materials on Deepfake audio. If you want to contribute to this list, feel free to open a pull request.

What are Deepfakes?

In December 2017, a Reddit user named "deepfakes" posted realistic-looking videos of famous celebrities. These fake videos were generated with deep learning, by swapping the faces in adult movies with celebrities' faces. Since then, the topic of DeepFakes has gone viral on the internet.

Here, we use DeepFakes to denote any fake content generated by deep learning techniques. DeepFakes come in different forms; the most typical are 1) videos and images, 2) texts, and 3) voices. Different deep learning techniques are used to generate them: videos and images are usually created with generative adversarial networks (GANs), while texts are mostly generated by deep language models based on Transformers.

It is worth pointing out that the first type, videos and images, are the best-known deepfakes. This is tied to how quickly videos and images spread on social media and multimedia websites like YouTube. This repo instead aims to collect materials on the threat of fake audio. As Eric Haller, Global EVP of Identity, Fraud, and DataLabs at Experian, points out, your voice footprint has become a significant part of your digital identity (https://www.youtube.com/watch?v=dqhi930ksK8).


Introduction

Key Concepts

Voice Verification: the biometric technology that lets you use your unique voiceprint to access services such as bank accounts. Voice verification is simple because it makes your voice your password (https://www.wellsfargo.com/privacy-security/voice-verification/). Verification outputs a binary result: whether or not the voice belongs to the claimed speaker. If you need to know exactly who the speaker is, that is the related concept of 'voice identification' (see the sketch after these definitions).

Voice Conversion: modifying the speech of a source speaker so that the output sounds like that of another target speaker, without changing the original linguistic content. You can simply regard it as style transfer on audio.

Voice Cloning: in many cases interchangeable with 'voice conversion' above. More strictly, you can say it is the sub-area of voice conversion that aims simply to copy a target's voice, without merging it with a source voice.
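
To make the distinction between verification and identification concrete, here is a minimal Python sketch. The random vectors, the 0.75 threshold, and the speaker names are hypothetical placeholders; a real system would compare embeddings produced by a trained speaker encoder (e.g., d-vectors or x-vectors).

```python
# Minimal sketch: voice verification (binary decision) vs. voice identification
# (which enrolled speaker?). Random vectors stand in for real speaker embeddings.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(claimed_embedding: np.ndarray, test_embedding: np.ndarray,
           threshold: float = 0.75) -> bool:
    """Verification: is this the claimed speaker or not?"""
    return cosine_similarity(claimed_embedding, test_embedding) >= threshold

def identify(enrolled: dict, test_embedding: np.ndarray) -> str:
    """Identification: which enrolled speaker is this? (argmax over similarities)"""
    return max(enrolled, key=lambda name: cosine_similarity(enrolled[name], test_embedding))

rng = np.random.default_rng(0)
enrolled = {"alice": rng.normal(size=256), "bob": rng.normal(size=256)}
probe = enrolled["alice"] + 0.1 * rng.normal(size=256)  # noisy sample of alice

print(verify(enrolled["alice"], probe))  # True  -> access granted
print(identify(enrolled, probe))         # "alice"
```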

General Architecture

For the task of fake audio generation, especially deepfake audio generation, there are three main types of models involved (definitions are extracted from the paper Who Are You (I Really Wanna Know)? Detecting Audio DeepFakes Through Vocal Tract Reconstruction):

(1) Encoder: The encoder learns the unique representation of the speaker’s voice, known as the speaker embedding. The embedding is derived from a short utterance of the target speaker’s voice. The accuracy of the embedding can be increased by giving the encoder more utterances, with diminishing returns. The output embedding from the encoder stage can then be passed as an input into the following synthesizer stage.

(2) Synthesizer: A synthesizer generates a Mel Spectrogram from a given text and the speaker embedding. A Mel Spectrogram is a spectrogram whose frequencies are scaled using the Mel scale, which is designed to model the audio perception of the human ear. Some synthesizers, such as Tacotron, can produce spectrograms solely from a sequence of characters or phonemes.

(3) Vocoder: The vocoder converts the Mel Spectrogram into the corresponding audio waveform. This newly generated waveform will ideally sound like the target individual uttering a specific sentence. A commonly used vocoder is some variation of WaveNet, a deep convolutional neural network that uses surrounding contextual information to generate its waveform.
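
To make the data flow between these three stages explicit, here is a hedged Python sketch. Every function body below is a random-tensor placeholder, not a real model; only the interfaces (what each stage consumes and produces, with the common 80-bin mel convention) reflect the pipeline described above.

```python
# Sketch of the encoder -> synthesizer -> vocoder pipeline. In a real system
# these would be trained networks (e.g., a speaker encoder, a Tacotron-style
# synthesizer, and a WaveNet-style vocoder); here they are shape-correct stubs.
import numpy as np

def encoder(reference_utterance: np.ndarray) -> np.ndarray:
    """Stage 1: map a short reference utterance to a fixed-size speaker embedding."""
    return np.random.normal(size=256)  # placeholder for a learned embedding

def synthesizer(text: str, speaker_embedding: np.ndarray) -> np.ndarray:
    """Stage 2: produce a mel spectrogram (n_mels x frames) conditioned on
    both the text and the speaker embedding."""
    n_frames = 20 * len(text)          # rough stand-in for attention alignment
    return np.random.normal(size=(80, n_frames))

def vocoder(mel: np.ndarray, hop_length: int = 256) -> np.ndarray:
    """Stage 3: invert the mel spectrogram back to a time-domain waveform."""
    return np.random.normal(size=mel.shape[1] * hop_length)

reference = np.random.normal(size=16000 * 5)   # ~5 s of target-speaker audio
embedding = encoder(reference)
mel = synthesizer("open the vault", embedding)
waveform = vocoder(mel)
print(embedding.shape, mel.shape, waveform.shape)
```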

There is a good figure illustrating the two technique paths for deepfake voice-clone generation in the paper Neural Voice Cloning with a Few Samples. We attach it here to further explain the process.

Two Technique Paths for DeepFake Voice Clone

From the same paper, we get the ideas behind these two paths:

(1) The idea of speaker adaptation is to fine-tune a trained multi-speaker model for an unseen speaker using a few audio-text pairs. Fine-tuning can be applied to either the speaker embedding or the whole model.

(2) The idea of the speaker encoding method is to directly estimate the speaker embedding from audio samples of an unseen speaker. Such a model does not require any fine-tuning during voice cloning. Thus, the same model can be used for all unseen speakers.

Typically, speaker adaptation achieves better performance than speaker encoding, especially when more data samples are available for the tuning step. But the computational cost of speaker encoding is much lower, because it requires no further model tuning.
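
The following PyTorch sketch contrasts the two paths under toy assumptions: the one-layer "synthesizer" and "encoder" stand in for real trained networks, and only the control flow is the point. Path 1 runs a per-speaker inner optimization loop; path 2 is a single forward pass shared across all unseen speakers.

```python
# Toy contrast of speaker adaptation (fine-tune an embedding) vs. speaker
# encoding (estimate the embedding directly with one forward pass).
import torch
import torch.nn as nn

synthesizer = nn.Linear(256, 80)   # frozen stand-in for a multi-speaker TTS model
for p in synthesizer.parameters():
    p.requires_grad_(False)

def speaker_adaptation(target_mels: torch.Tensor, steps: int = 100) -> torch.Tensor:
    """Path 1: fine-tune a speaker embedding on a few samples of the new speaker."""
    embedding = nn.Parameter(torch.zeros(256))
    opt = torch.optim.Adam([embedding], lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((synthesizer(embedding) - target_mels) ** 2).mean()
        loss.backward()   # gradients flow to the embedding; synthesizer stays frozen
        opt.step()
    return embedding.detach()

encoder = nn.Linear(80, 256)       # stand-in for a trained speaker encoder

def speaker_encoding(target_mels: torch.Tensor) -> torch.Tensor:
    """Path 2: one forward pass, no per-speaker optimization at all."""
    with torch.no_grad():
        return encoder(target_mels)

target = torch.randn(80)                  # mel statistics of an unseen speaker
emb_adapt = speaker_adaptation(target)    # slower, usually better with more data
emb_enc = speaker_encoding(target)        # cheap, same model for all speakers
print(emb_adapt.shape, emb_enc.shape)
```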

Online Articles of Deepfakes

Deepfake Voices:

Attack Generation

Even though we focus on DeepFake audio attacks, it is worth pointing out that audio-based attacks can be more varied. Typical types of audio attacks (definitions extracted from the ASVspoof Challenge):

  • Logical Access (LA): bona fide and spoofed utterances generated using text-to-speech (TTS) and voice conversion (VC) algorithms are communicated across telephony and VoIP networks with various coding and transmission effects;
  • Physical Access (PA): bona fide utterances are made in a real, physical space in which spoofing attacks are captured and then replayed within the same physical space using replay devices of varying quality;
  • Speech Deepfake (DF): a fake audio detection task comprising bona fide and spoofed utterances generated using TTS and VC algorithms. Similar to the LA task (includes compressed data) but without speaker verification.

The LA and PA tasks are based on the classical setting of automatic speaker verification (ASV). In ASVspoof contests, the metric for these two tasks is the minimum tandem detection cost function (min t-DCF; details can be found in the doc linked above). The newer DF task has a fake-media / fake-audio / deepfake flavour in which there is no ASV system. The metric for the DF condition reverts to the equal error rate (EER): the operating point at which the false acceptance rate and the false rejection rate are equal.
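
As a worked example, here is a small self-contained Python function that estimates the EER from detector scores by sweeping a decision threshold; the Gaussian score distributions at the bottom are made up purely for illustration.

```python
# Worked example of the equal error rate (EER): find the threshold where the
# false acceptance rate (FAR) equals the false rejection rate (FRR).
import numpy as np

def compute_eer(bona_fide_scores: np.ndarray, spoof_scores: np.ndarray) -> float:
    """Higher score = more likely bona fide. Returns EER in [0, 1]."""
    thresholds = np.sort(np.concatenate([bona_fide_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])      # spoof accepted
    frr = np.array([(bona_fide_scores < t).mean() for t in thresholds])   # genuine rejected
    idx = np.argmin(np.abs(far - frr))   # threshold where the two rates cross
    return float((far[idx] + frr[idx]) / 2)

rng = np.random.default_rng(0)
bona_fide = rng.normal(2.0, 1.0, size=1000)   # genuine utterances score higher
spoof = rng.normal(0.0, 1.0, size=1000)       # spoofed utterances score lower
print(f"EER = {compute_eer(bona_fide, spoof):.3f}")   # roughly 0.16 for these stats
```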


Fake Audio Generation

Papers:

  • Tacotron: Towards End-to-End Speech Synthesis. [Paper]
  • Neural Voice Cloning with a Few Samples. [Paper]
  • Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning. [Paper]
  • Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention. [Paper]
  • A unified system for voice cloning and voice conversion through diffusion probabilistic modeling. [Paper] [Demo]

Relevant Code and Tools

Defense

Stay Alert In-Person

Even though deepfake audio is developing fast, real-world losses are usually not caused by the fake audio alone. In current fake-audio detection research, one natural metric is the Mean Opinion Score (MOS) given by human listeners. Generally speaking, currently generated voice-clone audio clips are still distinguishable by the following cues:

  • Inconsistent sentences
  • A flat tone, or sudden tone changes, in speech
  • Phrasing – does it match the habits of the speaker?
  • Context of speech – is this really a topic the speaker would be discussing?

The most important point is to adopt a zero-trust mindset for sensitive cases: always be alert, always verify!

Deepfake Detection

Review and General Papers:

Detector Design:

  • FastAudio: A Learnable Audio Front-End for Spoof Speech Detection. [Paper] [Code]
  • Complementing Handcrafted Features with Raw Waveform Using a Light-weight Auxiliary Model. [Paper] [Code]
  • An Initial Investigation for Detecting Partially Spoofed Audio. [Paper]

Defensive Audio Processing

Attacking Deepfake Generator:

  • Defending Against Deepfakes Using Adversarial Attacks on Conditional Image Translation Networks. [Paper] [Code]

Anonymization:

  • Language-Independent Speaker Anonymization Approach using Self-Supervised Pre-Trained Models. [Paper] [Code]

Datasets and Challenges

Social Impacts

Enterprise Solutions

We observe enormous growth in AI applications for audio, and many startups are emerging in this area. We include some of them below. Please note that some are still in a beta stage and have not released official products yet; this is only a selected collection to show the current state of the field. We have no financial relationship with any of these companies.

Government Responses

Potential Threat Scenarios

  • Bank voice verification
  • Scamming calls
  • Fake news generation
