To develop a robust end-to-end Transformer-based Text-to-Speech (TTS) model that efficiently converts textual input into natural, high-quality speech. The model leverages the self-attention mechanism to capture long-range dependencies in text sequences, enabling more accurate prosody, intonation, and contextual understanding than traditional models. The goal is a system that generalizes well across languages and speaking styles, producing smooth, realistic voice synthesis with minimal preprocessing and training time.
📘Details
A PyTorch implementation of end-to-end speech synthesis using a Transformer network.
This model can be trained roughly 3 to 4 times faster than most autoregressive models, since the Transformer computes attention over the whole sequence in parallel during training.
We trained the post network using the CBHG (Convolutional Bank + Highway network + GRU) module from Tacotron and converted the spectrogram into a raw waveform using the Griffin-Lim algorithm. In the future, we want to use a pre-trained HiFi-GAN vocoder for generating raw audio.
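As a rough illustration of the Griffin-Lim step (not the exact code used in this repo), a linear-magnitude spectrogram can be inverted to a waveform with librosa; the sample rate and STFT settings below are assumptions and should match the values in hyperparams.py:

```python
import librosa
import soundfile as sf

# Assumed settings -- match these to hyperparams.py in practice.
SR = 22050          # LJSpeech sample rate
N_FFT = 1024
HOP_LENGTH = 256

def griffin_lim_to_wav(linear_mag, out_path="sample.wav", n_iter=60):
    """Invert a linear-magnitude spectrogram of shape (n_fft//2 + 1, frames) to audio."""
    wav = librosa.griffinlim(linear_mag, n_iter=n_iter,
                             hop_length=HOP_LENGTH, win_length=N_FFT)
    sf.write(out_path, wav, SR)
    return wav
```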
We used the LJSpeech dataset (aka LJSpeech-1.1), a speech dataset consisting of pairs of text transcripts and short audio clips (wavs) from a single speaker. The complete dataset (13,100 pairs) can be downloaded from either Kaggle or Keithito.
This is the raw data, which will be prepared further for training.
✅Pretrained Model Checkpoints
You can download the pretrained model checkpoints from Checkpoints (50k steps for the Transformer model / 45k steps for the Postnet).
You can load the checkpoints for the respective models, for example as sketched below.
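A minimal loading sketch, assuming the `Model` and `ModelPostNet` classes from network.py and standard `torch.save`-style checkpoint dictionaries (the file names and the `'model'` key are assumptions):

```python
import torch
from network import Model, ModelPostNet  # class names assumed from network.py

device = "cuda" if torch.cuda.is_available() else "cpu"

model = Model().to(device)
postnet = ModelPostNet().to(device)

# Checkpoint file names and dict keys are assumptions, not the repo's exact layout.
ckpt_t = torch.load("checkpoint_transformer_50k.pth", map_location=device)
ckpt_p = torch.load("checkpoint_postnet_45k.pth", map_location=device)
model.load_state_dict(ckpt_t["model"])
postnet.load_state_dict(ckpt_p["model"])
model.eval()
postnet.eval()
```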
☢️Attention Plots
The attention plots visualize the multi-head attention of all layers; num_heads=4 is used for the three types of attention layers.
Only a few heads showed diagonal alignment; a diagonal pattern in an attention plot typically indicates that the model has learned to align the two sequences effectively.
Self Attention Encoder
Self Attention Decoder
Attention Encoder-Decoder
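For reference, attention maps like the ones above can be drawn from the saved attention weights with matplotlib; the tensor layout (num_heads, decoder_steps, encoder_steps) is an assumption, not necessarily what this repo saves:

```python
import matplotlib.pyplot as plt

def plot_attention(attn, title="Encoder-Decoder Attention"):
    """attn: array of shape (num_heads, dec_len, enc_len) -- assumed layout."""
    num_heads = attn.shape[0]
    fig, axes = plt.subplots(1, num_heads, figsize=(4 * num_heads, 4))
    for h, ax in enumerate(axes):
        ax.imshow(attn[h], aspect="auto", origin="lower")
        ax.set_title(f"head {h}")
        ax.set_xlabel("encoder steps")
        ax.set_ylabel("decoder steps")
    fig.suptitle(title)
    plt.tight_layout()
    plt.show()
```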
📈Learning curves & Alphas
I used Noam-style warmup and decay. This is a learning rate schedule commonly used when training deep learning models, particularly Transformers (as introduced in the "Attention Is All You Need" paper).
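Concretely, the Noam schedule scales the learning rate as lr = d_model^-0.5 · min(step^-0.5, step · warmup^-1.5): it rises linearly during warmup and then decays with the inverse square root of the step. A minimal sketch, with assumed d_model and warmup values:

```python
def noam_lr(step, d_model=512, warmup_steps=4000):
    """Noam learning-rate schedule from 'Attention Is All You Need'."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Example usage with a PyTorch optimizer (values are illustrative):
# optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda s: noam_lr(s))
```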
The image below shows the alphas of the scaled positional encoding. The encoder alpha stays roughly constant for about the first 15k steps and then increases for the rest of training, while the decoder alpha decreases slightly over the first 2k steps and then remains almost constant.
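The alpha here is the learnable scale applied to the sinusoidal positional encoding before it is added to the embeddings. A minimal sketch of such a module, assuming standard sinusoidal encodings (defaults are assumptions, not the exact code in module.py):

```python
import math
import torch
import torch.nn as nn

class ScaledPositionalEncoding(nn.Module):
    """Sinusoidal positional encoding with a learnable scale (alpha)."""
    def __init__(self, d_model=512, max_len=2048):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))  # the value tracked in the plot above
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        return x + self.alpha * self.pe[:, : x.size(1)]
```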
🗒Experimental Notes
We didn't use the stop token in this implementation, since the model did not train well when it was used.
For the Transformer model, it is very important to concatenate the input and context vectors in order to make correct use of the attention mechanism (see the sketch below).
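As an illustration of that note, one way to do this is to run attention and then project the concatenation of the query input with the attention context; this is a hedged sketch, not the exact code in module.py:

```python
import torch
import torch.nn as nn

class ConcatAttentionBlock(nn.Module):
    """Illustrative block: attend, then concatenate the query input with the context."""
    def __init__(self, d_model=512, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.proj = nn.Linear(d_model * 2, d_model)  # consumes [input; context]

    def forward(self, query, memory):
        context, weights = self.attn(query, memory, memory)
        out = self.proj(torch.cat([query, context], dim=-1))
        return out, weights
```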
hyperparams.py contains all the hyperparams that are required in this Project.
prepare_data.py prepares the data by converting the raw audio into mel and linear spectrograms ahead of time for faster training (see the sketch after this file list). The scripts for preprocessing the text data are in the ./text/ directory.
prepare_data.ipynb is the notebook to be run for preparing the data for further training.
preprocess.py contains all the methods for loading the dataset.
module.py contains the building blocks such as the Encoder Prenet, Feed-Forward Network (FFN), Post-Convolutional Network, Multi-Head Attention, Attention, Prenet, CBHG (Convolutional Bank + Highway network + GRU), etc.
network.py contains Encoder, MelDecoder, Model and Model Postnet networks.
train_transformer.py contains the script for training the autoregressive attention network. (text --> mel)
Text-to-Speech-Training-Transformer.ipynb is the notebook to be run for training the transformer network.
train_postnet.py contains the script for training the PostConvolutional network. (mel --> linear)
Text-to-Speech-Training-Postnet.ipynb is the notebook to be run for training the PostConvolutional network.
synthesis.py contains the script to generate audio samples with the trained Text-to-Speech model.
Text-to-Speech-Audio-Generation.ipynb is the notebook to be run for generating audio samples by loading the trained model checkpoints.
utils.py contains the methods for detailed preprocessing, particularly for mel spectrograms and audio waveforms.
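As referenced in the prepare_data.py description above, the audio preprocessing roughly amounts to the following; the STFT and mel parameters are assumptions and should mirror hyperparams.py, and the actual scripts may apply additional steps such as normalization:

```python
import librosa
import numpy as np

SR, N_FFT, HOP, N_MELS = 22050, 1024, 256, 80  # assumed values

def wav_to_spectrograms(wav_path):
    """Return (mel, linear) magnitude spectrograms for one LJSpeech clip."""
    wav, _ = librosa.load(wav_path, sr=SR)
    linear = np.abs(librosa.stft(wav, n_fft=N_FFT, hop_length=HOP))
    mel = librosa.feature.melspectrogram(S=linear ** 2, sr=SR, n_mels=N_MELS)
    return mel.astype(np.float32), linear.astype(np.float32)
```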
🤖Training the Network
Preparing Data
STEP 1. Download and extract LJSpeech-1.1 data at any directory you want.
STEP 2. Change these two paths in hyperparams.py according to your system paths for preparing data locally.
```python
# For local use: (prepare_data.ipynb)
# Raw strings avoid backslash escapes in Windows-style paths.
data_path_used_for_prepare_data = r'your\path\to\LJSpeech-1.1'
output_path_used_for_prepare_data = r'your\path\to\LJSpeech-1.1'
```
STEP 3. Run prepare_data.ipynb after correctly assigning the paths.
STEP 4. The prepared data will be stored in the form: