Add support for `text-to-speech` (w/ Speecht5) #345

xenova · 2023-10-03T17:59:50Z

This PR adds text-to-speech support to Transformers.js, starting with SpeechT5. We will add support for Bark and other models in future updates, once Optimum supports those exports.

closes #59, #279, #315

Example usage:

import { pipeline } from '@xenova/transformers';

// Choose speaker embeddings
let speaker_embeddings = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/speaker_embeddings.bin';

// Create pipeline (NOTE: unquantized)
let synthesizer = await pipeline('text-to-speech', 'Xenova/speecht5_tts', { quantized: false });

// Generate audio from text
let out = await synthesizer('Hello, my dog is cute', { speaker_embeddings });
// {
//   audio: Float32Array(28672) [-0.0004919943166896701, -0.00023953932395670563, ...],
//   sampling_rate: 16000
// }

// (Optional) Write to .wav file
import fs from 'fs';
import wavefile from 'wavefile';

let wav = new wavefile.WaveFile();
wav.fromScratch(1, out.sampling_rate, '32f', out.audio);
fs.writeFileSync('out.wav', wav.toBuffer());

fixed.mp4

(converted to mp4 since GH doesn't allow wav)

Notes:

- If you use the quantized versions, you'll get poor results. I believe this is because the mel-spectrogram calculations require high precision. (cc @Vaibhavs10)
- ~~There are minor artifacts in the output, so I just need to check this (cc @fxmarty). E.g., here is the Python output:~~

python.mp4

TODO:

- Create server-side audio processing guide (Node.js/Deno)
- Create in-browser audio processing guide (w/ Web Audio API)
- Experiment with different quantization settings
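
For the in-browser item, one possible starting point is sketched below. It assumes the pipeline output has the `{ audio, sampling_rate }` shape shown in the example usage and uses only standard Web Audio API calls (`createBuffer`, `copyToChannel`, `createBufferSource`). `playWaveform` is a hypothetical helper name, not something this PR adds, and the caller is assumed to supply the `AudioContext`:

```javascript
// Hypothetical helper (not part of Transformers.js): play a mono
// Float32Array through the Web Audio API.
// `audioContext` is assumed to be an AudioContext created by the caller
// (browsers usually require a user gesture before audio can start).
function playWaveform(audioContext, audio, samplingRate) {
  // 1 channel, one frame per sample, at the model's sampling rate
  const buffer = audioContext.createBuffer(1, audio.length, samplingRate);
  buffer.copyToChannel(audio, 0);

  // Wrap the buffer in a one-shot source node and route it to the speakers
  const source = audioContext.createBufferSource();
  source.buffer = buffer;
  source.connect(audioContext.destination);
  source.start();
  return source;
}
```

In a browser this might be called as `playWaveform(new AudioContext(), out.audio, out.sampling_rate)`.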

HuggingFaceDocBuilderDev · 2023-10-03T18:05:14Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

xenova · 2023-10-04T10:35:53Z

My generate_speech function is based on this code block by @fxmarty; however, that code also produces artifacts in the output:

onnx_py.mp4

(Running that code in Python with the same speaker embeddings.)

My hunch is that the problem is either:

related to dropout, or

related to this missing section:

# downsample encoder attention mask
if isinstance(model.speecht5.encoder, SpeechT5EncoderWithSpeechPrenet):
    encoder_attention_mask = model.speecht5.encoder.prenet._get_feature_vector_attention_mask(
        encoder_out[0].shape[1], encoder_attention_mask
    )

Will continue investigating.

Due to bug in transformers: huggingface/transformers#26547

xenova · 2023-10-05T12:17:36Z

@fxmarty The dropout patch fixed it! 🥳

fixed.mp4

xenova · 2023-10-05T13:37:54Z

And the volume difference can be fixed by multiplying the waveform by some constant factor. I don't see any post-processing in the generate_speech function in transformers, so I assume the JS library I used to export (wavefile) and the Python library I used to export (soundfile) have different scaling defaults for 1-channel audio. Either way, not an issue here.

Python

py.mp4

JavaScript

no scaling

js_no_scaling.mp4

scale by sqrt(2)

js_scaled.mp4

scale by 2

js_scaled2.mp4
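
The constant-factor scaling compared above can be sketched in a few lines. This is only an illustration, assuming the waveform is a `Float32Array` like `out.audio` from the example usage. `scaleWaveform` is a hypothetical helper (not part of Transformers.js or wavefile), and the clamp to [-1, 1] is an added assumption to keep samples in range after scaling:

```javascript
// Hypothetical helper: multiply every sample by a constant gain and
// clamp the result to [-1, 1] (the clamp is an assumption, not something
// the PR does).
function scaleWaveform(audio, gain) {
  const scaled = new Float32Array(audio.length);
  for (let i = 0; i < audio.length; ++i) {
    scaled[i] = Math.max(-1, Math.min(1, audio[i] * gain));
  }
  return scaled;
}

// Stand-in for `out.audio` from the pipeline example
const audio = new Float32Array([0.25, -0.5, 0.75]);
const louder = scaleWaveform(audio, 2); // 0.75 * 2 is clamped to 1
```

A gain of `Math.SQRT2` or `2` would correspond to the two scaled clips above; the scaled array can then be passed to `wav.fromScratch` in place of `out.audio`.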

flatsiedatsie · 2024-07-30T15:43:06Z

I tried to use WebGPU in v3 to increase the generation speed, but got an error. Would it be difficult to add WebGPU support for TextToSpeech?

xenova added 10 commits October 2, 2023 15:50

Add vocoder to export

9533c69

Add tokenizer.json export for speecht5 models

08966a2

Update speecht5 supported models

535574d

Create SpeechT5Tokenizer

2ed7d87

Add ones and ones_like tensor functions

8f5fef3

Add support for speecht5 text-to-speech

cfdda6d

Disambiguate SpeechSeq2Seq and Seq2SeqLM

098088e

Create TextToAudioPipeline

578f2b8

Add listed support for text-to-audio / text-to-speech

72473da

Use unquantized vocoder by default

2d7f9b3

xenova added 2 commits October 4, 2023 13:20

Skip speecht5 unit tests for now

93b697a

Due to bug in transformers: huggingface/transformers#26547

Merge branch 'main' into speecht5

9dbe086

Update example pipeline output

4fba3bd

xenova mentioned this pull request Oct 6, 2023

SpeechT5 ONNX support huggingface/optimum#1404

Merged

xenova added 9 commits October 22, 2023 19:12

Create simple in-browser TTS demo

0a35dff

Add template README

ed78b4b

Delete package-lock.json

5e9a180

Update required transformers.js version

928282e

Add link to Transformers.js

c44228e

Double -> Single quotes

206ae64

Merge branch 'main' into speecht5

718ef5f

Add link to text-to-speech demo

7d9fadf

Update sample speaker embeddings

b1c33b2

xenova linked an issue Oct 23, 2023 that may be closed by this pull request

[Feature request] Text to Speach #315

Closed

xenova merged commit 4a991bd into main Oct 23, 2023
4 checks passed

xenova deleted the speecht5 branch November 21, 2023 01:44
