Text-to-Speech
▶• ılıılıılıılıılıılı

🎯Aim

To develop a robust end-to-end Transformer-based Text-to-Speech (TTS) model that efficiently converts textual input into natural, high-quality speech output. The model aims to leverage the self-attention mechanism to capture long-range dependencies in text sequences, enabling more accurate prosody, intonation, and contextual understanding compared to traditional models. The goal is to create a system that can generalize well across various languages and speaking styles, ensuring smooth, realistic voice synthesis with minimal preprocessing and training time.

📘Details

  • A PyTorch implementation of end-to-end speech synthesis using a Transformer network.
  • This model can be trained roughly 3 to 4 times faster than most autoregressive models, since Transformers are among the fastest autoregressive architectures to compute.
  • We trained the post network using the CBHG (Convolutional Bank + Highway network + GRU) module from Tacotron and converted the spectrogram into a raw waveform using the Griffin-Lim algorithm. In the future, we want to use a pre-trained HiFi-GAN vocoder to generate raw audio.

🦾Transformer Architecture

⚙️Tech Stack

| Category | Technologies |
|---|---|
| Programming Languages | Python |
| Frameworks | PyTorch |
| Libraries | falcon, inflect, librosa, scipy, Unidecode, pandas, numpy, tqdm, torchvision, torchaudio |
| Deep Learning Models | Transformers, CBHG, CNN |
| Dataset | LJSpeech |
| Tools | Git, Google Colab, Kaggle |
| Visualization & Analysis | Matplotlib |

📁File Structure


Text-to-Speech/
│
├── README.md
├── Text-to-Speech-Audio-Generation.ipynb
├── Text-to-Speech-Training-Postnet.ipynb
├── Text-to-Speech-Training-Transformer.ipynb
├── hyperparams.py
├── module.py
├── network.py
├── prepare_data.ipynb
├── prepare_data.py
├── preprocess.py
├── requirements.txt
├── synthesis.py
├── train_postnet.py
├── train_transformer.py
├── utils.py
│
├── __pycache__/
│   ├── hyperparams.cpython-311.pyc
│   └── utils.cpython-311.pyc
│
├── png/
│   ├── alphas.png
│   ├── attention.gif
│   ├── attention_encoder.gif
│   ├── attention_decoder.gif
│   ├── model.png
│   ├── test_loss_per_epoch.png
│   ├── training_loss.png
│   └── training_loss_per_epoch.png
│
└── text/
    ├── __init__.py
    ├── cleaners.py
    ├── cmudict.py
    ├── numbers.py
    └── symbols.py

📝Requirements

  • Install python==3.11.10
  • Install requirements:
pip install -r requirements.txt

📊Data

  • We used the LJSpeech dataset (aka LJSpeech-1.1), a speech dataset consisting of pairs of text transcripts and short audio clips (wavs) from a single speaker. The complete dataset (13,100 pairs) can be downloaded from either Kaggle or Keithito.
  • This is the raw data which will be prepared further for training.

✅Pretrained Model Checkpoints

  • You can download the pretrained model checkpoints from Checkpoints (50k steps for the Transformer model, 45k steps for the Postnet).
  • You can load the checkpoints for the respective models.

☢️Attention Plots

  • The attention plots show the multi-head attention of all layers; num_heads=4 is used for the three attention layers.
  • Only a few heads showed diagonal alignment. Diagonal alignment in an attention plot typically suggests that the model is learning to align tokens in a sequence effectively.
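
One rough way to tell which heads are "diagonal" without inspecting every plot is to measure how much attention weight lies close to the diagonal. The sketch below is a hypothetical heuristic, not part of this repository: the function name `diagonal_mass` and the `band` parameter are assumptions made for illustration, and it expects a single head's attention matrix as a NumPy array of shape (output steps, input steps).

```python
import numpy as np

def diagonal_mass(attn: np.ndarray, band: float = 0.1) -> float:
    """Fraction of total attention weight lying within `band` of the diagonal.

    Row and column indices are rescaled to [0, 1] so that the "diagonal"
    is also defined for non-square matrices (e.g. text length != mel length).
    """
    t_out, t_in = attn.shape
    rows = np.arange(t_out)[:, None] / max(t_out - 1, 1)
    cols = np.arange(t_in)[None, :] / max(t_in - 1, 1)
    near_diagonal = np.abs(rows - cols) <= band
    return float((attn * near_diagonal).sum() / attn.sum())

# Two synthetic heads: one perfectly diagonal, one completely uniform.
diagonal_head = np.eye(50)
uniform_head = np.full((50, 50), 1.0 / 50)

print("diagonal head:", diagonal_mass(diagonal_head))
print("uniform head: ", diagonal_mass(uniform_head))
```

A head that scores close to 1 concentrates its weight along the diagonal, while a diffuse head scores near the fraction of the matrix covered by the band.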

Self Attention Encoder

Self Attention Decoder

Attention Encoder-Decoder

📈Learning curves & Alphas

  • We used Noam-style warmup and decay. This is a learning rate schedule commonly used when training deep learning models, particularly Transformer models (as introduced in the "Attention Is All You Need" paper).
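
As a minimal sketch of the Noam schedule, the function below implements the formula from "Attention Is All You Need": the rate grows linearly for `warmup_steps` steps and then decays with the inverse square root of the step. The defaults `d_model=512` and `warmup_steps=4000` are the paper's values and are assumptions here; the values actually used in this project are set in hyperparams.py and may differ.

```python
def noam_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Noam learning rate: linear warmup, then inverse-square-root decay."""
    step = max(step, 1)  # guard against 0 ** -0.5 at the very first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate rises until `warmup_steps` and falls afterwards.
for step in (1, 1000, 4000, 16000, 64000):
    print(f"step {step:>6}: lr = {noam_lr(step):.6f}")
```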

  • The image below shows the alphas of the scaled positional encoding. The encoder alpha stays constant for almost the first 15k steps and then increases for the rest of training. The decoder alpha decreases slightly over the first 2k steps and then stays almost constant for the rest of training.
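
For readers unfamiliar with the term, scaled positional encoding adds a sinusoidal position table to the input after multiplying it by a scalar alpha. The NumPy sketch below illustrates the idea only; in the model alpha is a trainable parameter, whereas here it is a plain float, and the helper names `sinusoid_table` and `add_scaled_positions` are hypothetical.

```python
import numpy as np

def sinusoid_table(n_positions: int, d_model: int) -> np.ndarray:
    """Standard sinusoidal position table of shape (n_positions, d_model)."""
    positions = np.arange(n_positions)[:, None]   # (T, 1)
    dims = np.arange(d_model)[None, :]            # (1, D)
    angles = positions / np.power(10000.0, 2 * (dims // 2) / d_model)
    table = np.zeros((n_positions, d_model))
    table[:, 0::2] = np.sin(angles[:, 0::2])      # even dimensions
    table[:, 1::2] = np.cos(angles[:, 1::2])      # odd dimensions
    return table

def add_scaled_positions(x: np.ndarray, alpha: float) -> np.ndarray:
    """Return x + alpha * PE, where x has shape (T, D)."""
    return x + alpha * sinusoid_table(x.shape[0], x.shape[1])

x = np.zeros((5, 8))
print(add_scaled_positions(x, alpha=0.5)[0])
```

A larger alpha makes position information weigh more heavily relative to the content of the input, which is what the alpha curves track over training.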

🗒Experimental Notes

  1. We didn't use the stop token in the implementation, since the model failed to train when it was used.
  2. For the Transformer model, it is very important to concatenate the input and context vectors to make correct use of the attention mechanism.
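
Note 2 can be read as follows: the vectors fed into an attention layer are concatenated with the context vectors it returns, and the result is projected back to the model dimension. The NumPy sketch below shows that reading with random placeholder data. The variable names, the sizes and the feature-axis concatenation are assumptions for illustration, not a copy of module.py.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8  # illustrative sizes only

attn_input = rng.normal(size=(seq_len, d_model))  # vectors fed into attention
context = rng.normal(size=(seq_len, d_model))     # attention output, same positions

# Concatenate along the feature axis: each position keeps its own input
# next to the context gathered for it.
combined = np.concatenate([attn_input, context], axis=-1)  # (seq_len, 2 * d_model)

# A final linear layer (random weights here) maps back to d_model.
w_out = rng.normal(size=(2 * d_model, d_model)) * 0.1
output = combined @ w_out                                   # (seq_len, d_model)

print(combined.shape, output.shape)
```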

🔊Generated Samples

📋File Description

  • hyperparams.py contains all the hyperparams that are required in this Project.
  • prepare_data.py prepares the data by converting raw audio to mel and linear spectrograms, which speeds up training. The scripts for preprocessing text data are in the ./text/ directory.
  • prepare_data.ipynb is the notebook to be run for preparing the data for further training.
  • preprocess.py contains all the methods for loading the dataset.
  • module.py contains all the modules, such as Encoder Prenet, Feed Forward Network (FFN), PostConvolutional Network, MultiHeadAttention, Attention, Prenet, CBHG (Convolutional Bank + Highway network + GRU), etc.
  • network.py contains Encoder, MelDecoder, Model and Model Postnet networks.
  • train_transformer.py contains the script for training the autoregressive attention network. (text --> mel)
  • Text-to-Speech-Training-Transformer.ipynb is the notebook to be run for training the transformer network.
  • train_postnet.py contains the script for training the PostConvolutional network. (mel --> linear)
  • Text-to-Speech-Training-Postnet.ipynb is the notebook to be run for training the PostConvolutional network.
  • synthesis.py contains the script to generate the audio samples by the trained Text-to-Speech model.
  • Text-to-Speech-Audio-Generation.ipynb is the notebook to be run for generating audio samples by loading trained model checkpoints.
  • utils.py contains the methods for detailed preprocessing particularly for mel spectrogram and audio waveforms.

🤖Training the Network

  1. Preparing Data
    • STEP 1. Download and extract the LJSpeech-1.1 data to any directory you want.
    • STEP 2. Change these two paths in hyperparams.py according to your system paths for preparing data locally.

      # For local use: (prepare_data.ipynb)
      # Raw strings (r'...') stop backslashes such as \t from being read as escapes.
      data_path_used_for_prepare_data = r'your\path\to\LJSpeech-1.1'
      output_path_used_for_prepare_data = r'your\path\to\LJSpeech-1.1'
    • STEP 3. Run the prepare_data.ipynb after correctly assigning the paths.
    • STEP 4. The prepared data will be stored in the form:

      LJSpeech-1.1/
      │
      ├── README.md
      ├── metadata.csv
      ├── wavs/
      │   ├── LJ001-0001.wav
      │   ├── LJ001-0001.mag.npy
      │   ├── LJ001-0001.pt.npy
      │   ├── LJ001-0002.wav
      │   ├── LJ001-0002.mag.npy
      │   ├── LJ001-0002.pt.npy
      │   └── ...
    • The prepared data has also been uploaded to Kaggle Datasets for direct use.

  2. Training Transformer
    • STEP 1. To train the Transformer, adjust these paths in hyperparams.py.

      # General:
      data_path = r'your\path\to\LJSpeech-1.1'
      checkpoint_path = r'your\path\to\outputdir'
    • STEP 2. Run the Text-to-Speech-Training-Transformer.ipynb after correctly assigning the paths.

  3. Training Postnet
    • STEP 1. To train the Postnet, adjust these paths in hyperparams.py.

      # General:
      data_path = r'your\path\to\LJSpeech-1.1'
      checkpoint_path = r'your\path\to\outputdir'
    • STEP 2. Run the Text-to-Speech-Training-Postnet.ipynb after correctly assigning the paths.
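
Before starting either training run, it can help to confirm that data preparation produced both spectrogram files for every clip. The sketch below assumes the prepared-data layout from the Preparing Data step, with a `.mag.npy` and a `.pt.npy` file next to each `.wav`. The helper `find_unprepared` is hypothetical and not part of this repository.

```python
from pathlib import Path

def find_unprepared(wav_dir):
    """Return the names of .wav files that lack a .mag.npy or .pt.npy companion."""
    wav_dir = Path(wav_dir)
    missing = []
    for wav in sorted(wav_dir.glob("*.wav")):
        companions = [
            wav.with_name(wav.stem + ".mag.npy"),  # linear (magnitude) spectrogram
            wav.with_name(wav.stem + ".pt.npy"),   # mel spectrogram
        ]
        if not all(path.exists() for path in companions):
            missing.append(wav.name)
    return missing
```

Calling `find_unprepared` on the `wavs` directory inside your LJSpeech-1.1 folder returns an empty list when every clip has been prepared.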

📻Generate Audio Samples

  • STEP 1. Change the audio sample output path in hyperparams.py:

    sample_path = r'your\path\to\outputdir\of\samples'
  • STEP 2. Run the Text-to-Speech-Audio-Generation.ipynb but make sure to run with correct arguments:

    --transformer_checkpoint your\path\to\checkpoint_transformer_50000.pth.tar 
    --postnet_checkpoint your\path\to\checkpoint_postnet_45000.pth.tar 
    --max_len 400 
    --text "Your Text Input"
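
How synthesis.py reads these arguments is not shown here, so the following is only a sketch of one way the four flags could be declared with `argparse`. The function name `build_parser`, the help strings, and the choices of `required` and default values are assumptions. Passing an explicit list to `parse_args` is convenient inside a notebook, where `sys.argv` belongs to the kernel rather than to your script.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Declare the four synthesis flags listed above (hypothetical parser)."""
    parser = argparse.ArgumentParser(description="Generate audio from text.")
    parser.add_argument("--transformer_checkpoint", type=str, required=True,
                        help="path to the trained Transformer checkpoint")
    parser.add_argument("--postnet_checkpoint", type=str, required=True,
                        help="path to the trained Postnet checkpoint")
    parser.add_argument("--max_len", type=int, default=400,
                        help="maximum number of decoder steps to generate")
    parser.add_argument("--text", type=str, required=True,
                        help="sentence to synthesize")
    return parser

# Placeholder paths copied from the arguments above; replace with your own.
args = build_parser().parse_args([
    "--transformer_checkpoint", r"your\path\to\checkpoint_transformer_50000.pth.tar",
    "--postnet_checkpoint", r"your\path\to\checkpoint_postnet_45000.pth.tar",
    "--max_len", "400",
    "--text", "Your Text Input",
])
print(args.max_len, repr(args.text))
```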

🤝Acknowledgements

  • We are grateful to CoC VJTI and the Project X programme.
  • Special thanks to our mentor Warren Jacinto for perfectly mentoring and supporting us throughout.
  • Additionally, we are thankful to all the Project X mentors for their inputs and advice on our project.