To develop a robust end-to-end Transformer-based Text-to-Speech (TTS) model that efficiently converts textual input into natural, high-quality speech. The model leverages the self-attention mechanism to capture long-range dependencies in text sequences, enabling more accurate prosody, intonation, and contextual understanding than traditional models. The goal is a system that generalizes well across languages and speaking styles, producing smooth, realistic voice synthesis with minimal preprocessing and training time.
📘Details
A PyTorch implementation of end-to-end speech synthesis using a Transformer network.
This model can be trained roughly 3 to 4 times faster than most autoregressive models, since the Transformer computes attention over the whole sequence in parallel during training.
We trained the post network using the CBHG (Convolutional Bank + Highway network + GRU) module from Tacotron and converted the spectrogram into a raw waveform using the Griffin-Lim algorithm. In the future, we want to use a pre-trained HiFi-GAN vocoder for generating raw audio.
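As a rough illustration of the Griffin-Lim step (not the exact code used in this repo), a linear-magnitude spectrogram can be inverted to a waveform with librosa; the sample rate and STFT settings below are assumptions and should match the values in hyperparams.py:

```python
import librosa
import soundfile as sf

# Assumed settings -- match these to hyperparams.py in practice.
SR = 22050          # LJSpeech sample rate
N_FFT = 1024
HOP_LENGTH = 256

def griffin_lim_to_wav(linear_mag, out_path="sample.wav", n_iter=60):
    """Invert a linear-magnitude spectrogram of shape (n_fft//2 + 1, frames) to audio."""
    wav = librosa.griffinlim(linear_mag, n_iter=n_iter,
                             hop_length=HOP_LENGTH, win_length=N_FFT)
    sf.write(out_path, wav, SR)
    return wav
```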
We used the LJSpeech dataset (aka LJSpeech-1.1), a speech dataset consisting of pairs of text transcripts and short audio clips (wavs) from a single speaker. The complete dataset (13,100 pairs) can be downloaded from either Kaggle or Keithito.
This is the raw data, which will be prepared further for training.
✅Pretrained Model Checkpoints
You can download the pretrained model checkpoints from Checkpoints (50k steps for the Transformer model / 45k steps for the Postnet).
You can load the checkpoints for the respective models, for example as sketched below.
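A minimal loading sketch, assuming the `Model` and `ModelPostNet` classes from network.py and standard `torch.save`-style checkpoint dictionaries (the file names and the `'model'` key are assumptions):

```python
import torch
from network import Model, ModelPostNet  # class names assumed from network.py

device = "cuda" if torch.cuda.is_available() else "cpu"

model = Model().to(device)
postnet = ModelPostNet().to(device)

# Checkpoint file names and dict keys are assumptions, not the repo's exact layout.
ckpt_t = torch.load("checkpoint_transformer_50k.pth", map_location=device)
ckpt_p = torch.load("checkpoint_postnet_45k.pth", map_location=device)
model.load_state_dict(ckpt_t["model"])
postnet.load_state_dict(ckpt_p["model"])
model.eval()
postnet.eval()
```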
☢️Attention Plots
The attention plots visualize the multi-head attention of all layers; num_heads=4 is used for the three types of attention layers.
Only a few heads showed diagonal alignment; a diagonal pattern in an attention plot typically indicates that the model has learned to align the two sequences effectively.
Self Attention Encoder
Self Attention Decoder
Attention Encoder-Decoder
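For reference, attention maps like the ones above can be drawn from the saved attention weights with matplotlib; the tensor layout (num_heads, decoder_steps, encoder_steps) is an assumption, not necessarily what this repo saves:

```python
import matplotlib.pyplot as plt

def plot_attention(attn, title="Encoder-Decoder Attention"):
    """attn: array of shape (num_heads, dec_len, enc_len) -- assumed layout."""
    num_heads = attn.shape[0]
    fig, axes = plt.subplots(1, num_heads, figsize=(4 * num_heads, 4))
    for h, ax in enumerate(axes):
        ax.imshow(attn[h], aspect="auto", origin="lower")
        ax.set_title(f"head {h}")
        ax.set_xlabel("encoder steps")
        ax.set_ylabel("decoder steps")
    fig.suptitle(title)
    plt.tight_layout()
    plt.show()
```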
📈Learning curves & Alphas
I used Noam-style warmup and decay. This is a learning rate schedule commonly used when training deep learning models, particularly Transformers (as introduced in the "Attention Is All You Need" paper).
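Concretely, the Noam schedule scales the learning rate as lr = d_model^-0.5 · min(step^-0.5, step · warmup^-1.5): it rises linearly during warmup and then decays with the inverse square root of the step. A minimal sketch, with assumed d_model and warmup values:

```python
def noam_lr(step, d_model=512, warmup_steps=4000):
    """Noam learning-rate schedule from 'Attention Is All You Need'."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Example usage with a PyTorch optimizer (values are illustrative):
# optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda s: noam_lr(s))
```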
The image below shows the alphas of the scaled positional encoding. The encoder alpha stays roughly constant for about the first 15k steps and then increases for the rest of training, while the decoder alpha decreases slightly over the first 2k steps and then remains almost constant.
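The alpha here is the learnable scale applied to the sinusoidal positional encoding before it is added to the embeddings. A minimal sketch of such a module, assuming standard sinusoidal encodings (defaults are assumptions, not the exact code in module.py):

```python
import math
import torch
import torch.nn as nn

class ScaledPositionalEncoding(nn.Module):
    """Sinusoidal positional encoding with a learnable scale (alpha)."""
    def __init__(self, d_model=512, max_len=2048):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))  # the value tracked in the plot above
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        return x + self.alpha * self.pe[:, : x.size(1)]
```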
🗒Experimental Notes
We didn't use the stop token in this implementation, since the model did not train well when it was used.
For the Transformer model, it is very important to concatenate the input and context vectors in order to make correct use of the attention mechanism (see the sketch below).
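As an illustration of that note, one way to do this is to run attention and then project the concatenation of the query input with the attention context; this is a hedged sketch, not the exact code in module.py:

```python
import torch
import torch.nn as nn

class ConcatAttentionBlock(nn.Module):
    """Illustrative block: attend, then concatenate the query input with the context."""
    def __init__(self, d_model=512, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.proj = nn.Linear(d_model * 2, d_model)  # consumes [input; context]

    def forward(self, query, memory):
        context, weights = self.attn(query, memory, memory)
        out = self.proj(torch.cat([query, context], dim=-1))
        return out, weights
```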
hyperparams.py contains all the hyperparams that are required in this Project.
prepare_data.py prepares the data by converting the raw audio into mel and linear spectrograms ahead of time for faster training (see the sketch after this file list). The scripts for preprocessing the text data are in the ./text/ directory.
prepare_data.ipynb is the notebook to be run for preparing the data for further training.
preprocess.py contains all the methods for loading the dataset.
module.py contains the building blocks such as the Encoder Prenet, Feed-Forward Network (FFN), Post-Convolutional Network, Multi-Head Attention, Attention, Prenet, CBHG (Convolutional Bank + Highway network + GRU), etc.
network.py contains Encoder, MelDecoder, Model and Model Postnet networks.
train_transformer.py contains the script for training the autoregressive attention network. (text --> mel)
Text-to-Speech-Training-Transformer.ipynb is the notebook to be run for training the transformer network.
train_postnet.py contains the script for training the PostConvolutional network. (mel --> linear)
Text-to-Speech-Training-Postnet.ipynb is the notebook to be run for training the PostConvolutional network.
synthesis.py contains the script to generate audio samples with the trained Text-to-Speech model.
Text-to-Speech-Audio-Generation.ipynb is the notebook to be run for generating audio samples by loading the trained model checkpoints.
utils.py contains the methods for detailed preprocessing, particularly for mel spectrograms and audio waveforms.
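As referenced in the prepare_data.py description above, the audio preprocessing roughly amounts to the following; the STFT and mel parameters are assumptions and should mirror hyperparams.py, and the actual scripts may apply additional steps such as normalization:

```python
import librosa
import numpy as np

SR, N_FFT, HOP, N_MELS = 22050, 1024, 256, 80  # assumed values

def wav_to_spectrograms(wav_path):
    """Return (mel, linear) magnitude spectrograms for one LJSpeech clip."""
    wav, _ = librosa.load(wav_path, sr=SR)
    linear = np.abs(librosa.stft(wav, n_fft=N_FFT, hop_length=HOP))
    mel = librosa.feature.melspectrogram(S=linear ** 2, sr=SR, n_mels=N_MELS)
    return mel.astype(np.float32), linear.astype(np.float32)
```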
🤖Training the Network
Preparing Data
STEP 1. Download and extract LJSpeech-1.1 data at any directory you want.
STEP 2. Change these two paths in hyperparams.py according to your system paths for preparing data locally.
```python
# For local use: (prepare_data.ipynb)
# Raw strings avoid backslash escapes in Windows-style paths.
data_path_used_for_prepare_data = r'your\path\to\LJSpeech-1.1'
output_path_used_for_prepare_data = r'your\path\to\LJSpeech-1.1'
```
STEP 3. Run prepare_data.ipynb after correctly assigning the paths.
STEP 4. The prepared data will be stored in the form: