Voice Dataset Creation

This repo outlines the steps and scripts necessary to create your own text-to-speech dataset for training a voice model. The final output is in LJSpeech format.

Flow Chart


Purple: Create Your Own Voice Recordings

Requirements

  • Voice Recording Software
  • Omni-directional head-mounted microphone
  • Good quality audio card

Create a Text Corpus of Sentences

  • Create sentences that will take about 3-10 seconds when spoken
  • Use LJSpeech format
    • Pipe ("|") separated values: WAV file ID, then the sentence text
    • 100|this is an example sentence
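The pipe-separated format above is easy to sanity-check before recording. A minimal sketch (a hypothetical helper, not part of this repo) that splits and validates one metadata line:

```python
def parse_metadata_line(line):
    """Split one LJSpeech-style metadata line into (wav_id, text).

    Raises ValueError for lines that are not pipe-separated as expected.
    """
    parts = line.rstrip("\n").split("|")
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError("malformed metadata line: %r" % line)
    return parts[0], parts[1]

# "100|this is an example sentence" -> ("100", "this is an example sentence")
wav_id, text = parse_metadata_line("100|this is an example sentence")
```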

Speak and Record Sentences

  • Speak each sentence as written
  • Sample rate should be 22,050 Hz or greater

Sentence Lengths

Run scripts/wavdurations2csv.sh to chart out sentence length and verify that you have a good distribution of WAV file lengths.
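The duration check boils down to frames divided by frame rate. A stdlib-only Python sketch (hypothetical, not the repo script) that measures a WAV's duration, demonstrated on a generated 2-second silent clip:

```python
import wave

def wav_duration(path):
    # Duration in seconds = number of frames / frame rate.
    with wave.open(path, "rb") as w:
        return w.getnframes() / float(w.getframerate())

# Demo: write a 2-second silent mono WAV at 22,050 Hz, then measure it.
sr = 22050
with wave.open("demo.wav", "wb") as w:
    w.setnchannels(1)          # mono
    w.setsampwidth(2)          # 16-bit samples
    w.setframerate(sr)
    w.writeframes(b"\x00\x00" * sr * 2)

print(round(wav_duration("demo.wav"), 2))  # prints 2.0
```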


Pink: Create a Synthetic Voice Dataset

Requirements

  • Google Cloud Platform Compute Engine Instance
    • Cloud API access scopes select Allow full access to all Cloud APIs
  • Conda

Installation

Create Conda Environment on GCP Instance

conda create -n tts python=3.7
conda activate tts
pip install google-cloud-texttospeech==2.1.0 tqdm pandas

Create a Text Corpus of Sentences

  • Create sentences that will take about 3-10 seconds when spoken
  • Use LJSpeech format
    • Pipe ("|") separated values: WAV file ID, then the sentence text
    • 100|this is an example sentence

Generate Synthetic Voice Dataset

  • python text_to_wav.py tts_generate
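A generator script of this kind iterates the corpus and writes one WAV per line. A hedged sketch of that loop (helper names are hypothetical; the Cloud TTS call is shown only as a comment, since it needs credentials and the google-cloud-texttospeech library):

```python
import csv
import os

def corpus_rows(metadata_path):
    # Yield (wav_id, sentence) pairs from a pipe-separated corpus file.
    with open(metadata_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="|"):
            if row:
                yield row[0], row[1]

def output_wav_path(wav_id, out_dir="wavs"):
    # One synthesized clip per corpus line: <out_dir>/<wav_id>.wav
    return os.path.join(out_dir, wav_id + ".wav")

# For each (wav_id, sentence) the real script would call Cloud TTS, roughly:
#   response = client.synthesize_speech(
#       input=texttospeech.SynthesisInput(text=sentence),
#       voice=voice,
#       audio_config=texttospeech.AudioConfig(
#           audio_encoding=texttospeech.AudioEncoding.LINEAR16,
#           sample_rate_hertz=22050))
#   with open(output_wav_path(wav_id), "wb") as out:
#       out.write(response.audio_content)
```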

Sentence Lengths

Run scripts/wavdurations2csv.sh to chart out sentence length and verify that you have a good distribution of WAV file lengths.


Blue: Create Transcriptions for Existing Voice Recordings

Requirements

  • Adobe Audition or Audacity
  • Google Cloud Platform Compute Engine Instance
    • Cloud API access scopes select Allow full access to all Cloud APIs
  • Conda

Installation

Create Conda Environment on GCP Instance

conda create -n stt python=3.7
conda activate stt
pip install google-cloud-speech tqdm pandas

Fill out a Datasheet for the Voice Dataset

Mark the Speech

In Adobe Audition, open the audio file:

  • Select Diagnostics -> Mark Audio
  • Select the Mark the Speech preset
  • Click Scan
  • Click Find Levels
  • Click Scan again
  • Click Mark All
  • Adjust the audio and silence signal dB and length settings until clips are between 3 and 10 seconds

Or, in Audacity, open the audio file:

  • Select Analyze -> Sound Finder
  • Adjust the sound and silence level dB and duration settings until clips are between 3 and 10 seconds

Adjust Markers or Label Boundaries

In Audition:

  • Open the Markers tab
  • Adjust markers, removing silence and noise, so each clip is between 3 and 10 seconds long

In Audacity:

  • Adjust label boundaries, removing silence and noise, so each clip is between 3 and 10 seconds long

Export Markers/Labels and WAVs

In Audition:

  • Select all markers in list
  • Select Export Selected Markers to CSV and save as Markers.csv
  • Select Preferences -> Media & Disk Cache and Untick Save Peak Files
  • Select Export Audio of Selected Range Markers with the following options:
    • Check Use marker names in filenames
    • Update Format to WAV PCM
    • Update Sample Type 22050 Hz Mono, 16-bit
    • Use folder wavs_export

Or, in Audacity:

  • Select Export multiple...
    • Format: WAV
    • Options: Signed 16-bit PCM
    • Split files based on Labels
    • Name files using Label/Track Name
    • Use folder wavs_export
  • Select Export labels to Label Track.txt

Analyze WAVs with the Signal-to-Noise Ratio Colab

Create Initial Transcriptions with STT

For Audition, using the exported Markers.csv and the wavs_export folder, run:

cd scripts
python wav_to_text.py audition

The script generates a new file, Markers_STT.csv.

For Audacity, using the exported Label Track.txt and the wavs_export folder, run:

cd scripts
python wav_to_text.py audacity

The script generates a new file, Label Track STT.txt.
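Audacity label tracks are plain tab-separated text: start time, end time, then the label. A small sketch (hypothetical helper) of parsing such a file, useful when post-processing the STT output:

```python
def parse_label_track(lines):
    """Parse Audacity label-track lines: tab-separated start, end, label."""
    labels = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        start, end, text = line.split("\t", 2)
        labels.append((float(start), float(end), text))
    return labels

labels = parse_label_track(["0.000000\t4.100000\thello world\n"])
```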

Fine-tune Transcriptions

For Audition:

  • Delete all markers
  • Select Import Markers from File and select file with STT transcriptions: Markers_STT.csv
  • Fine-tune the Description field in Markers to exactly match the words spoken

For Audacity:

  • Open Label Track STT.txt in a text editor.
  • Fine-tune the Labels field in the text file to exactly match the words spoken

Export Markers (Audition only) and WAVs

For Audition:

  • Select all markers in list
  • Select Export Selected Markers to CSV and save as Markers.csv
  • Select Export Audio of Selected Range Markers with the following options:
    • Check Use marker names in filenames
    • Update Format to WAV PCM
    • Update Sample Type 22050 Hz Mono, 16-bit
    • Use folder wavs_export

For Audacity:

  • Select Export multiple...
    • Format: WAV
    • Options: Signed 16-bit PCM
    • Split files based on Labels
    • Name files using Label/Track Name
    • Use folder wavs_export

Convert Markers (Audition) or Labels (Audacity) into LJSpeech format

Using the exported Markers.csv (Audition) or Label Track STT.txt (Audacity) and the WAVs in wavs_export, scripts/markersfile_to_metadata.py will create a metadata.csv and a folder of WAVs to train your TTS model:

For Audition:

python markersfile_to_metadata.py audition

For Audacity:

python markersfile_to_metadata.py audacity
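The conversion itself is essentially a column remap. A simplified sketch of the Audition case (the tab delimiter and the "Name"/"Description" column names are assumptions about the Audition export; the real script handles more cases):

```python
import csv

def markers_to_metadata(markers_path, metadata_path):
    # Assumes a tab-delimited Audition marker export with "Name" (clip id)
    # and "Description" (transcription) columns.
    with open(markers_path, newline="", encoding="utf-8") as src, \
         open(metadata_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src, delimiter="\t"):
            # Emit LJSpeech-style lines: <wav id>|<sentence text>
            dst.write("%s|%s\n" % (row["Name"], row["Description"]))
```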

Sentence Lengths

Run scripts/wavdurations2csv.sh to chart out sentence length and verify that you have a good distribution of WAV file lengths.


Other Utilities

Upsample WAV file

We tested three methods to upsample WAV files from 16,000 Hz to 22,050 Hz. After reviewing the spectrograms, we selected ffmpeg for upsampling, as it retains another 2 kHz of high-end information compared to resampy.

scripts/resamplewav.sh
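If you script the resampling yourself, the core ffmpeg invocation can be built like this (a sketch; run the command only where ffmpeg is on PATH):

```python
def build_resample_cmd(src, dst, rate=22050):
    # ffmpeg -y -i <src> -ar <rate> <dst> : resample the audio to <rate> Hz.
    return ["ffmpeg", "-y", "-i", src, "-ar", str(rate), dst]

cmd = build_resample_cmd("16k.wav", "22k.wav")
# subprocess.run(cmd, check=True)  # requires ffmpeg installed
```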
