
tts_uk

Text-to-Speech for Ukrainian

PyPI Version License MIT PyPI Downloads DOI

High-fidelity speech synthesis for Ukrainian using modern neural networks.

Statuses

CI Pipeline Dependabot Updates Snyk Security

Demo

HF Space Google Colab

Check out our demo on the Hugging Face Space, or just listen to samples here.

Features

  • Multi-speaker model: 2 female (Tetiana, Lada) + 1 male (Mykyta) voices;
  • Fine-grained control over speech parameters, including duration, fundamental frequency (F0), and energy;
  • High-fidelity speech generation using the RAD-TTS++ acoustic model;
  • Fast vocoding using Vocos;
  • Synthesizes long sentences effectively;
  • Supports a sampling rate of 44.1 kHz;
  • Tested on Linux environments and Windows/WSL;
  • Python API (requires Python 3.9 or later);
  • CUDA-enabled for GPU acceleration.

Installation

```shell
# Install from PyPI
pip install tts-uk

# OR install the latest development version:
pip install git+https://github.com/egorsmkv/tts_uk

# OR clone the repository and set it up locally:
git clone https://github.com/egorsmkv/tts_uk
cd tts_uk
uv sync  # uv will create and manage the virtual environment
```

Read uv's installation section.

Also, you can download the repository as a ZIP archive.

Getting started

Code example:

```python
import torchaudio

from tts_uk.inference import synthesis

sampling_rate = 44_100

# Perform the synthesis. The `synthesis` function returns:
# - mels: Mel spectrograms of the generated audio.
# - wave: The waveform synthesized by the vocoder, as a PyTorch tensor.
# - stats: A dictionary of synthesis statistics (processing time, duration, speech rate, etc.).
mels, wave, stats = synthesis(
    text="Ви можете протестувати синтез мовлення українською мовою. Просто введіть текст, який ви хочете прослухати.",
    voice="tetiana",  # tetiana, mykyta, lada
    n_takes=1,
    use_latest_take=False,
    token_dur_scaling=1,
    f0_mean=0,
    f0_std=0,
    energy_mean=0,
    energy_std=0,
    sigma_decoder=0.8,
    sigma_token_duration=0.666,
    sigma_f0=1,
    sigma_energy=1,
)

print(stats)

# Save the generated audio to a WAV file.
torchaudio.save("audio.wav", wave.cpu(), sampling_rate, encoding="PCM_S")
```
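The exact keys of the `stats` dictionary are not documented here, but since it reports processing time and audio duration, you can derive a real-time factor (RTF) from it. Below is a minimal sketch; the key names `time_s` and `duration_s` are illustrative assumptions, not the library's actual schema:

```python
def real_time_factor(stats: dict) -> float:
    """Compute RTF: processing time divided by synthesized audio duration.
    Values below 1.0 mean faster-than-real-time synthesis.
    The key names used here ("time_s", "duration_s") are assumptions for
    illustration, not the documented schema of tts_uk's stats dict."""
    return stats["time_s"] / stats["duration_s"]

# Example with made-up numbers: 1.2 s of compute for 4.8 s of audio.
example_stats = {"time_s": 1.2, "duration_s": 4.8}
print(f"RTF: {real_time_factor(example_stats):.2f}")  # RTF: 0.25
```

Inspect the printed `stats` from your own run to see the real field names before adapting this.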

You can also run the example in Google Colab (see the Demo section above), or run synthesis in a terminal:

```shell
uv run example.py
```

If you need to synthesize whole articles, we recommend wtpsplit for splitting the text into sentences first.

Get help and support

Please feel free to connect with us using the Issues section.

License

The code is released under the MIT license; the bundled RADTTS and Vocos license files are MIT as well.

Model authors

Acoustic

Vocoder

Community

Discord

Also, follow our Speech-UK initiative on Hugging Face!

Acknowledgements