This repo outlines the steps and scripts necessary to create your own text-to-speech dataset for training a voice model. The final output is in LJSpeech format.
- Create Your Own Voice Recordings
- Create a Synthetic Voice Dataset
- Create Transcriptions for Existing Voice Recordings
- Other Utilities
- Voice Recording Software
- Omni-directional head-mounted microphone
- Good quality audio card
- Create sentences that will be about 3-10 seconds when spoken
- Use LJSpeech format
- "|" separated values, wav file id then sentence text
100|this is an example sentence
- Speak each sentence as written
- Sample rate should be 22050 Hz or greater
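The "|" separated format above can be checked before recording. The following is a minimal sketch, assuming each line holds a WAV file id and the sentence text; the function name `parse_metadata_line` is hypothetical and not part of this repo.

```python
def parse_metadata_line(line):
    """Split one 'id|text' line into (wav_id, text); raise ValueError if malformed."""
    # partition() splits at the first "|" only, so the text itself may contain "|".
    wav_id, sep, text = line.rstrip("\n").partition("|")
    if not sep or not wav_id.strip() or not text.strip():
        raise ValueError(f"malformed metadata line: {line!r}")
    return wav_id.strip(), text.strip()


if __name__ == "__main__":
    print(parse_metadata_line("100|this is an example sentence"))
```

Running a check like this over the whole sentence file catches missing ids or empty sentences early.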
Run scripts/wavdurations2csv.sh to chart out sentence length and verify that you have a good distribution of WAV file lengths.
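For readers who prefer Python, a similar duration listing can be sketched with the standard library. This is an illustration under assumptions, not the implementation of scripts/wavdurations2csv.sh: the function names and the CSV column names are hypothetical.

```python
import csv
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def write_durations_csv(wav_dir, csv_path):
    """Write one (filename, duration) row per WAV in wav_dir and return the rows."""
    rows = [
        (p.name, round(wav_duration_seconds(p), 3))
        for p in sorted(Path(wav_dir).glob("*.wav"))
    ]
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "duration_seconds"])  # assumed header names
        writer.writerows(rows)
    return rows
```

The resulting CSV can be charted in a spreadsheet to see whether most clips fall in the 3-10 second range.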
- Google Cloud Platform Compute Engine Instance
  - For Cloud API access scopes, select Allow full access to all Cloud APIs
- Conda
Create Conda Environment on GCP Instance
conda create -n tts python=3.7
conda activate tts
pip install google-cloud-texttospeech==2.1.0 tqdm pandas
- Create sentences that will be about 3-10 seconds when spoken
- Use LJSpeech format
- "|" separated values, wav file id then sentence text
100|this is an example sentence
python text_to_wav.py tts_generate
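The general shape of such a generation step can be sketched with the google-cloud-texttospeech 2.x client. This is not the code of text_to_wav.py: the function names, the voice name `en-US-Wavenet-D`, and the file layout are assumptions made for illustration, and the call requires valid Google Cloud credentials.

```python
from pathlib import Path


def read_metadata(path):
    """Yield (wav_id, text) pairs from a file of 'id|text' lines."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                wav_id, _, text = line.partition("|")
                yield wav_id, text


def synthesize_all(metadata_path, out_dir, voice_name="en-US-Wavenet-D"):
    """Synthesize one 22050 Hz WAV per metadata line (voice name is an assumption)."""
    # Imported lazily so read_metadata() works without the library installed.
    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()
    voice = texttospeech.VoiceSelectionParams(language_code="en-US", name=voice_name)
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.LINEAR16,
        sample_rate_hertz=22050,
    )
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for wav_id, text in read_metadata(metadata_path):
        response = client.synthesize_speech(
            input=texttospeech.SynthesisInput(text=text),
            voice=voice,
            audio_config=audio_config,
        )
        # LINEAR16 responses already include a WAV header.
        (out / f"{wav_id}.wav").write_bytes(response.audio_content)
```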
Run scripts/wavdurations2csv.sh to chart out sentence length and verify that you have a good distribution of WAV file lengths.
- Adobe Audition or Audacity
- Google Cloud Platform Compute Engine Instance
  - For Cloud API access scopes, select Allow full access to all Cloud APIs
- Conda
Create Conda Environment on GCP Instance
conda create -n stt python=3.7
conda activate stt
pip install google-cloud-speech tqdm pandas
- Review Datasheets for Datasets by Gebru et al.: https://arxiv.org/pdf/1803.09010.pdf
- Markdown Datasheet: https://github.com/JRMeyer/markdown-datasheet-for-datasets/blob/master/DATASHEET.md
In Adobe Audition, open audio file:
- Select Diagnostics -> Mark Audio
- Select the Mark the Speech preset
- Click Scan
- Click Find Levels
- Click Scan again
- Click Mark All
- Adjust audio and silence signal dB and length until clips are between 3-10 seconds
Or, in Audacity, open audio file:
- Select Analyze -> Sound Finder
- Adjust audio and silence signal dB and length until clips are between 3-10 seconds
In Audition:
- Open the Markers tab
- Adjust markers, removing silence and noise, so that each clip is between 3 and 10 seconds long
In Audacity:
- Adjust label boundaries, removing silence and noise, so that each clip is between 3 and 10 seconds long
In Audition:
- Select all markers in list
- Select Export Selected Markers to CSV and save as Markers.csv
- Select Preferences -> Media & Disk Cache and untick Save Peak Files
- Select Export Audio of Selected Range Markers with the following options:
  - Check Use marker names in filenames
  - Update Format to WAV PCM
  - Update Sample Type to 22050 Hz Mono, 16-bit
  - Use folder wavs_export
Or, in Audacity:
- Select Export multiple...
  - Format: WAV
  - Options: Signed 16-bit PCM
  - Split files based on Labels
  - Name files using Label/Track Name
  - Use folder wavs_export
- Select Export labels to Label Track.txt
- Run colabs/voice_dataset_SNR.ipynb
- Clean or remove noisy files
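To give a rough idea of what a noise check measures, here is a crude energy-based signal-to-noise estimate using only the standard library. It is a sketch under stated assumptions, not the method used in colabs/voice_dataset_SNR.ipynb: it assumes 16-bit mono WAVs, treats the quietest 10% of 20 ms frames as noise and the loudest 10% as signal, and the function names are hypothetical.

```python
import math
import struct
import wave


def frame_rms(path, frame_ms=20):
    """Return the RMS level of each non-overlapping frame of a 16-bit mono WAV."""
    with wave.open(str(path), "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected a 16-bit mono WAV file")
        rate = wav.getframerate()
        data = wav.readframes(wav.getnframes())
    samples = struct.unpack("<%dh" % (len(data) // 2), data)
    size = max(1, rate * frame_ms // 1000)
    return [
        math.sqrt(sum(s * s for s in samples[i:i + size]) / size)
        for i in range(0, len(samples) - size + 1, size)
    ]


def estimate_snr_db(path):
    """Crude SNR in dB: loudest 10% of frames versus quietest 10% of frames."""
    rms = sorted(frame_rms(path))
    if not rms:
        raise ValueError("file is shorter than one frame")
    k = max(1, len(rms) // 10)
    noise = sum(rms[:k]) / k
    signal = sum(rms[-k:]) / k
    # Clamp both levels to 1.0 so digital silence does not break the logarithm.
    return 20 * math.log10(max(signal, 1.0) / max(noise, 1.0))
```

Files with a noticeably lower estimate than the rest of the dataset are candidates for cleaning or removal.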
For Audition, using the exported Markers.csv and wavs folder, run:
cd scripts
python wav_to_text.py audition
The script generates a new file, Markers_STT.csv.
For Audacity, using the exported Label Track.txt and wavs folder, run:
cd scripts
python wav_to_text.py audacity
The script generates a new file, Label Track STT.txt.
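The speech-to-text call behind a step like this can be sketched with the google-cloud-speech 2.x client. This is an illustration, not the code of wav_to_text.py: the function names, the `en-US` language code, and the 22050 Hz LINEAR16 settings are assumptions, and the call requires valid Google Cloud credentials.

```python
def join_transcripts(results):
    """Concatenate the top alternative of each recognition result into one string."""
    return " ".join(
        r.alternatives[0].transcript.strip() for r in results if r.alternatives
    )


def transcribe_wav(wav_path, language_code="en-US", sample_rate_hertz=22050):
    """Transcribe one short WAV clip with synchronous recognition."""
    # Imported lazily so join_transcripts() works without the library installed.
    from google.cloud import speech

    client = speech.SpeechClient()
    with open(wav_path, "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=sample_rate_hertz,
        language_code=language_code,
    )
    response = client.recognize(config=config, audio=audio)
    return join_transcripts(response.results)
```

Synchronous recognition suits the 3-10 second clips produced earlier; the transcripts still need manual review, which is the next step.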
For Audition:
- Delete all markers
- Select Import Markers from File and select the file with STT transcriptions: Markers_STT.csv
- Fine-tune the Description field in Markers to exactly match the words spoken
For Audacity:
- Open Label Track STT.txt in a text editor
- Fine-tune the Labels field in the text file to exactly match the words spoken
For Audition:
- Select all markers in list
- Select Export Selected Markers to CSV and save as Markers.csv
- Select Export Audio of Selected Range Markers with the following options:
  - Check Use marker names in filenames
  - Update Format to WAV PCM
  - Update Sample Type to 22050 Hz Mono, 16-bit
  - Use folder wavs_export
For Audacity:
- Select Export multiple...
  - Format: WAV
  - Options: Signed 16-bit PCM
  - Split files based on Labels
  - Name files using Label/Track Name
  - Use folder wavs_export
Using the exported Markers.csv (Audition) or Label Track STT.txt (Audacity) and the WAVs in wavs_export, scripts/markersfile_to_metadata.py creates a metadata.csv and a folder of WAVs for training your TTS model:
For Audition:
python markersfile_to_metadata.py audition
For Audacity:
python markersfile_to_metadata.py audacity
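As an illustration of the Audacity case, the sketch below turns label-track lines into "id|text" metadata rows. It relies on Audacity's label export format (tab-separated start time, end time, and label text per line). It is not the implementation of markersfile_to_metadata.py: the function name and the sequential numbering of ids are assumptions.

```python
def labels_to_metadata(label_lines, first_id=1):
    """Convert Audacity label lines (start<TAB>end<TAB>label) to 'id|text' rows."""
    rows = []
    for line in label_lines:
        parts = line.rstrip("\n").split("\t")
        # Skip blank lines and any line without a non-empty label column.
        if len(parts) < 3 or not parts[2].strip():
            continue
        # Hypothetical id scheme: number the clips in file order.
        rows.append(f"{first_id + len(rows)}|{parts[2].strip()}")
    return rows


if __name__ == "__main__":
    example = ["0.000000\t3.500000\tthis is an example sentence\n"]
    print(labels_to_metadata(example, first_id=100))
```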
Run scripts/wavdurations2csv.sh to chart out sentence length and verify that you have a good distribution of WAV file lengths.
We tested three methods to upsample WAV files from 16,000 to 22,050 Hz. After reviewing the spectrograms, we selected ffmpeg for upsampling, as it retains about 2 kHz more high-frequency information than resampy. See scripts/resamplewav.sh.
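An ffmpeg resampling call can also be driven from Python. The following is a minimal sketch, assuming ffmpeg is installed and on the PATH; it is not the content of scripts/resamplewav.sh, and the function names are hypothetical.

```python
import subprocess


def build_resample_cmd(src, dst, rate=22050):
    """Build an ffmpeg command that resamples src to the given rate and writes dst."""
    # -y overwrites dst if it exists; -ar sets the output audio sample rate.
    return ["ffmpeg", "-y", "-i", str(src), "-ar", str(rate), str(dst)]


def resample(src, dst, rate=22050):
    """Run ffmpeg and raise CalledProcessError if it fails."""
    subprocess.run(build_resample_cmd(src, dst, rate), check=True)
```

Keeping command construction separate from execution makes it easy to inspect the exact ffmpeg invocation before running it over a whole folder.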
- Mozilla TTS: https://github.com/mozilla/TTS
- Automating alignment, including segmenting audio on silence, the Google Speech API, and recognition alignment: https://github.com/carpedm20/multi-Speaker-tacotron-tensorflow#2-2-generate-korean-datasets
- Pretraining on large synthetic corpora and fine-tuning on specific ones: https://twitter.com/garygarywang
- Datasheets for Datasets: https://arxiv.org/abs/1803.09010