OptiSpeech is meant to be an efficient, lightweight and fast text-to-speech model for on-device use.
I would like to thank Pneuma Solutions for providing GPU resources for training this model. Their support significantly accelerated my development process.
Demo videos: optispeech-mike.mp4 and optispeech-demo.mp4
Note that this is still a work in progress (WIP). Final model design decisions are still being made.
If you want an inference-only, minimal-dependency package that doesn't require PyTorch, you can use ospeech.
We use uv to manage the Python runtime and dependencies.
Install uv first, then run the following:
$ git clone https://github.com/mush42/optispeech
$ cd optispeech
$ uv sync
$ python3 -m optispeech.infer --help
usage: infer.py [-h] [--d-factor D_FACTOR] [--p-factor P_FACTOR] [--e-factor E_FACTOR] [--cuda]
checkpoint text output_dir
Speaking text using OptiSpeech
positional arguments:
checkpoint Path to OptiSpeech checkpoint
text Text to synthesise
output_dir Directory to write generated audio to.
options:
-h, --help show this help message and exit
--d-factor D_FACTOR Scale to control speech rate
--p-factor P_FACTOR Scale to control pitch
--e-factor E_FACTOR Scale to control energy
--cuda Use GPU for inference
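The command-line interface above can also be driven from a script, for example to synthesise many texts in a loop. The following is a minimal sketch that only uses the flags and positional arguments listed in the help output. The helper names `build_infer_command` and `synthesize` are hypothetical and not part of OptiSpeech, and creating the output directory beforehand is a precaution rather than a documented requirement.

```python
import subprocess
import sys
from pathlib import Path


def build_infer_command(checkpoint, text, output_dir,
                        d_factor=None, p_factor=None, e_factor=None, cuda=False):
    """Build the argv list for `python -m optispeech.infer` (hypothetical helper)."""
    cmd = [sys.executable, "-m", "optispeech.infer"]
    # Optional flags are only added when set, so the CLI's own defaults apply otherwise.
    for flag, value in (("--d-factor", d_factor),
                        ("--p-factor", p_factor),
                        ("--e-factor", e_factor)):
        if value is not None:
            cmd += [flag, str(value)]
    if cuda:
        cmd.append("--cuda")
    # Positional arguments come last, in the order shown in the usage line.
    cmd += [str(checkpoint), text, str(output_dir)]
    return cmd


def synthesize(checkpoint, text, output_dir, **options):
    """Run the CLI and raise CalledProcessError if it exits with a non-zero status."""
    # Assumption: creating the directory up front is harmless if the CLI also does it.
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_infer_command(checkpoint, text, output_dir, **options), check=True)
```

Calling `synthesize("/path/to/checkpoint", "Hello world.", "out", d_factor=1.2)` would then run the same command you would otherwise type by hand.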
import soundfile as sf
import torch
from optispeech.model import OptiSpeech
# Load model
device = torch.device("cpu")
ckpt_path = "/path/to/checkpoint"
model = OptiSpeech.load_from_checkpoint(ckpt_path, map_location="cpu")
model = model.to(device)
model = model.eval()
# Text preprocessing and phonemization
sentence = "A rainbow is a meteorological phenomenon that is caused by reflection, refraction and dispersion of light in water droplets resulting in a spectrum of light appearing in the sky."
inference_inputs = model.prepare_input(sentence)
inference_outputs = model.synthesize(inference_inputs)
inference_outputs = inference_outputs.as_numpy()
wav = inference_outputs.wav
sf.write("output.wav", wav.squeeze(), model.sample_rate)
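To synthesise several sentences in one go, the same calls can be wrapped in a loop. This is a sketch that relies only on the methods used in the snippet above (`prepare_input`, `synthesize`, `as_numpy`, `wav`, `sample_rate`). The function name `synthesize_sentences`, the `write_wav` parameter and the numbered file names are hypothetical choices made for illustration.

```python
from pathlib import Path


def synthesize_sentences(model, sentences, output_dir, write_wav=None):
    """Synthesise each sentence to output_dir/NNNN.wav and return the written paths."""
    if write_wav is None:
        # Imported lazily so the function can be exercised without soundfile installed.
        import soundfile as sf
        write_wav = sf.write
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, sentence in enumerate(sentences):
        # Same three steps as the single-sentence example above.
        inputs = model.prepare_input(sentence)
        outputs = model.synthesize(inputs).as_numpy()
        path = output_dir / f"{index:04d}.wav"
        write_wav(str(path), outputs.wav.squeeze(), model.sample_rate)
        paths.append(path)
    return paths
```

With a loaded model, `synthesize_sentences(model, ["First sentence.", "Second sentence."], "out")` would write `out/0000.wav` and `out/0001.wav`.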
Since this code uses Lightning-Hydra-Template, you have all the powers that come with it.
Training is as easy as 1, 2, 3.
Prepare a dataset organized as follows:
├── train
│   ├── metadata.csv
│   └── wav
│       ├── aud-00001-0003.wav
│       └── ...
└── val
    ├── metadata.csv
    └── wav
        ├── aud-00764.wav
        └── ...
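Before preprocessing, it can help to confirm that a dataset directory matches the layout above. The following is a small sketch of such a check. `check_dataset_layout` is a hypothetical helper and not part of OptiSpeech. It only looks for the files and directories shown in the tree.

```python
from pathlib import Path


def check_dataset_layout(root):
    """Return a list of problems found in a dataset laid out as train/ and val/."""
    problems = []
    for split in ("train", "val"):
        split_dir = Path(root) / split
        if not (split_dir / "metadata.csv").is_file():
            problems.append(f"{split}/metadata.csv is missing")
        wav_dir = split_dir / "wav"
        if not wav_dir.is_dir():
            problems.append(f"{split}/wav directory is missing")
        elif not any(wav_dir.glob("*.wav")):
            problems.append(f"{split}/wav contains no .wav files")
    return problems
```

An empty list means both splits have a `metadata.csv` and at least one `.wav` file.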
The metadata.csv file can contain 2, 3 or 4 columns delimited by | (the bar character), in one of the following formats:
- 2 columns: file_id|text
- 3 columns: file_id|speaker_id|text
- 4 columns: file_id|speaker_id|language_id|text
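The three formats above can be told apart by the number of columns. Here is a minimal parsing sketch. `MetadataRow` and `parse_metadata_line` are hypothetical names used for illustration, and the sketch assumes the text itself contains no `|` character.

```python
from typing import NamedTuple, Optional


class MetadataRow(NamedTuple):
    file_id: str
    speaker_id: Optional[str]
    language_id: Optional[str]
    text: str


def parse_metadata_line(line):
    """Parse one `|`-delimited metadata line with 2, 3 or 4 columns."""
    fields = line.rstrip("\r\n").split("|")
    if len(fields) == 2:
        file_id, text = fields
        return MetadataRow(file_id, None, None, text)
    if len(fields) == 3:
        file_id, speaker_id, text = fields
        return MetadataRow(file_id, speaker_id, None, text)
    if len(fields) == 4:
        file_id, speaker_id, language_id, text = fields
        return MetadataRow(file_id, speaker_id, language_id, text)
    raise ValueError(f"expected 2, 3 or 4 columns, got {len(fields)}: {line!r}")
```

Columns that a format does not include are returned as `None`, so downstream code can handle single-speaker and multi-speaker datasets the same way.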
Use the preprocess_dataset script to prepare the dataset for training:
$ python3 -m optispeech.tools.preprocess_dataset --help
usage: preprocess_dataset.py [-h] [--format {ljspeech}] dataset input_dir output_dir
positional arguments:
dataset dataset config relative to `configs/data/` (without the suffix)
input_dir original data directory
output_dir Output directory to write datafiles + train.txt and val.txt
options:
-h, --help show this help message and exit
--format {ljspeech} Dataset format.
If you are training on a new dataset, you must calculate and add **data_statistics** using the following script:
$ python3 -m optispeech.tools.generate_data_statistics --help
usage: generate_data_statistics.py [-h] [-b BATCH_SIZE] [-f] [-o OUTPUT_DIR] input_config
positional arguments:
input_config The name of the yaml config file under configs/data
options:
-h, --help show this help message and exit
-b BATCH_SIZE, --batch-size BATCH_SIZE
Can have increased batch size for faster computation
-f, --force force overwrite the file
-o OUTPUT_DIR, --output-dir OUTPUT_DIR
Output directory to save the data statistics
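This README does not state which statistics the script computes. Purely as an illustration, and assuming they are normalization statistics such as a mean and standard deviation accumulated batch by batch, the idea can be sketched in plain Python. `RunningStats` is a hypothetical helper and not part of OptiSpeech.

```python
import math


class RunningStats:
    """Accumulate mean and standard deviation over batches without storing all values."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.total_sq = 0.0

    def update(self, values):
        # Each call adds one batch of scalar feature values.
        for v in values:
            self.count += 1
            self.total += v
            self.total_sq += v * v

    @property
    def mean(self):
        return self.total / self.count

    @property
    def std(self):
        # Population standard deviation; clamp tiny negative values caused by rounding.
        variance = self.total_sq / self.count - self.mean ** 2
        return math.sqrt(max(variance, 0.0))
```

Because only three running totals are kept, the result does not depend on how the values are split into batches, which is why a larger batch size affects speed but not the outcome in a scheme like this.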
OptiSpeech provides interchangeable backbones for the model's encoder and decoder. You can choose a backbone based on your target performance profile.
To help you choose, here's a quick computational-complexity analysis of the available backbones:
Backbone | Config File | FLOPs | MACs | #Params |
---|---|---|---|---|
ConvNeXt | optispeech.yaml | 10.57 GFLOPs | 5.27 GMACs | 15.89 M |
Light | light.yaml | 7.88 GFLOPs | 3.93 GMACs | 10.74 M |
Transformer | transformer.yaml | 14.15 GFLOPs | 7.06 GMACs | 17.98 M |
Conformer | conformer.yaml | 20.42 GFLOPs | 10.19 GMACs | 24.35 M |
The default backbone is ConvNeXt, but you can change it by editing your experiment config.
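To compare the backbones against the default, the figures from the table above can be turned into ratios. This is a small sketch. The numbers are copied from the table, while `BACKBONES` and `relative_cost` are hypothetical names used only for this illustration.

```python
# Figures copied from the table above: (config file, GFLOPs, GMACs, parameters in millions).
BACKBONES = {
    "ConvNeXt": ("optispeech.yaml", 10.57, 5.27, 15.89),
    "Light": ("light.yaml", 7.88, 3.93, 10.74),
    "Transformer": ("transformer.yaml", 14.15, 7.06, 17.98),
    "Conformer": ("conformer.yaml", 20.42, 10.19, 24.35),
}


def relative_cost(name, baseline="ConvNeXt"):
    """Return (FLOPs ratio, parameter ratio) of a backbone relative to the baseline."""
    _, flops, _, params = BACKBONES[name]
    _, base_flops, _, base_params = BACKBONES[baseline]
    return flops / base_flops, params / base_params


for name in BACKBONES:
    flops_ratio, params_ratio = relative_cost(name)
    print(f"{name:<12} FLOPs x{flops_ratio:.2f}  params x{params_ratio:.2f}")
```

By these figures, the Light backbone needs roughly three quarters of the FLOPs of ConvNeXt, while Conformer needs nearly twice as many.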
To start training, run the following command. This run uses the hfc_female-en_US experiment config. To train with your own settings, copy that config, update its values, and pass the name of your custom config file (without extension) instead.
$ python3 -m optispeech.train experiment=hfc_female-en_us
$ python3 -m optispeech.onnx.export --help
usage: export.py [-h] [--opset OPSET] [--seed SEED] checkpoint_path output
Export OptiSpeech checkpoints to ONNX
positional arguments:
checkpoint_path Path to the model checkpoint
output Path to output `.onnx` file
options:
-h, --help show this help message and exit
--opset OPSET ONNX opset version to use (default 15)
--seed SEED Random seed
$ python3 -m optispeech.onnx.infer --help
usage: infer.py [-h] [--d-factor D_FACTOR] [--p-factor P_FACTOR] [--e-factor E_FACTOR] [--cuda]
onnx_path text output_dir
ONNX inference of OptiSpeech
positional arguments:
onnx_path Path to the exported LeanSpeech ONNX model
text Text to speak
output_dir Directory to write generated audio to.
options:
-h, --help show this help message and exit
--d-factor D_FACTOR Scale to control speech rate.
--p-factor P_FACTOR Scale to control pitch.
--e-factor E_FACTOR Scale to control energy.
--cuda Use GPU for inference
Repositories I would like to acknowledge:
- BetterFastspeech2: for the repo backbone
- LightSpeech: for the transformer backbone
- JETS: for the phoneme-mel alignment framework
- Vocos: for pioneering the use of ConvNeXt in TTS
- Piper-TTS: for leading the charge in on-device TTS, and for the great phonemizer
@inproceedings{luo2021lightspeech,
title={Lightspeech: Lightweight and fast text to speech with neural architecture search},
author={Luo, Renqian and Tan, Xu and Wang, Rui and Qin, Tao and Li, Jinzhu and Zhao, Sheng and Chen, Enhong and Liu, Tie-Yan},
booktitle={ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={5699--5703},
year={2021},
organization={IEEE}
}
@article{siuzdak2023vocos,
title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},
author={Siuzdak, Hubert},
journal={arXiv preprint arXiv:2306.00814},
year={2023}
}
@INPROCEEDINGS{10446890,
author={Okamoto, Takuma and Ohtani, Yamato and Toda, Tomoki and Kawai, Hisashi},
booktitle={ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={Convnext-TTS And Convnext-VC: Convnext-Based Fast End-To-End Sequence-To-Sequence Text-To-Speech And Voice Conversion},
year={2024},
volume={},
number={},
pages={12456-12460},
keywords={Vocoders;Neural networks;Signal processing;Transformers;Real-time systems;Acoustics;Decoding;ConvNeXt;JETS;text-to-speech;voice conversion;WaveNeXt},
doi={10.1109/ICASSP48485.2024.10446890}
}
Copyright (c) Musharraf Omer. MIT Licence. See LICENSE for more details.