Add support for text-to-speech (w/ Speecht5) #345

Merged
merged 22 commits into from
Oct 23, 2023


xenova
Collaborator

@xenova xenova commented Oct 3, 2023

This PR adds text-to-speech support to Transformers.js, starting with SpeechT5. We will add support for Bark and other models in future updates (once Optimum supports those exports).

closes #59, #279, #315

Example usage:

import { pipeline } from '@xenova/transformers'

// Choose speaker embeddings
let speaker_embeddings = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/speaker_embeddings.bin';

// Create pipeline (NOTE: unquantized)
let synthesizer = await pipeline('text-to-speech', 'Xenova/speecht5_tts', { quantized: false });

// Generate audio from text
let out = await synthesizer('Hello, my dog is cute', { speaker_embeddings });
// {
//   audio: Float32Array(28672) [-0.0004919943166896701, -0.00023953932395670563, ...],
//   sampling_rate: 16000
// }

// (Optional) Write to .wav file
import fs from 'fs';
import wavefile from 'wavefile';

let wav = new wavefile.WaveFile();
wav.fromScratch(1, out.sampling_rate, '32f', out.audio);
fs.writeFileSync('out.wav', wav.toBuffer());
fixed.mp4

(converted to mp4 since GH doesn't allow wav)

Notes:

  • If you use the quantized versions, you'll get poor results. I believe this is because the mel-spectrogram calculations require high precision. (cc @Vaibhavs10)
  • There are minor artifacts in the output, which still need to be investigated (cc @fxmarty). For comparison, here is the Python output:
python.mp4

TODO:

  • Create server-side audio processing guide (node.js/deno)
  • Create in-browser audio processing guide (w/ web audio api)
  • Experiment with different quantization settings
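
For the in-browser item, a minimal playback sketch with the Web Audio API might look like the following. `playSynthesizedAudio` is a hypothetical helper name, not part of Transformers.js; it assumes `out` has the `{ audio, sampling_rate }` shape shown in the example above and that `audioCtx` is a standard `AudioContext`.

```javascript
// Hypothetical helper (not part of Transformers.js).
// `out` is assumed to be { audio: Float32Array, sampling_rate: number },
// and `audioCtx` is assumed to be a Web Audio API AudioContext.
function playSynthesizedAudio(out, audioCtx) {
    // One mono channel holding out.audio.length frames at the model's sampling rate.
    const buffer = audioCtx.createBuffer(1, out.audio.length, out.sampling_rate);
    buffer.copyToChannel(out.audio, 0);

    // Buffer sources are single-use: create one per playback.
    const source = audioCtx.createBufferSource();
    source.buffer = buffer;
    source.connect(audioCtx.destination);
    source.start();
    return source;
}
```

In a page this would be called as `playSynthesizedAudio(out, new AudioContext())`, typically from a click handler, since browsers may block audio that starts without a user gesture.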

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@xenova
Collaborator Author

xenova commented Oct 4, 2023

My generate_speech function is based on this code block by @fxmarty; however, that code also produces artifacts in the output:

onnx_py.mp4

(This is from running that code in Python with the same speaker embeddings.)

My hunch is that the problem is either:

  • related to dropout, or
  • related to this missing section:
    # downsample encoder attention mask
    if isinstance(model.speecht5.encoder, SpeechT5EncoderWithSpeechPrenet):
        encoder_attention_mask = model.speecht5.encoder.prenet._get_feature_vector_attention_mask(
            encoder_out[0].shape[1], encoder_attention_mask
        )

I will continue investigating.

@xenova
Collaborator Author

xenova commented Oct 5, 2023

@fxmarty The dropout patch fixed it! 🥳

fixed.mp4

@xenova
Collaborator Author

xenova commented Oct 5, 2023

And the volume difference can be fixed by multiplying the waveform by a constant factor. I don't see any post-processing in the generate_speech function in transformers, so I assume the JS library I used to export (wavefile) and the Python library I used to export (soundfile) have different scaling defaults for 1-channel audio. Either way, it's not an issue here.

Python

py.mp4

JavaScript

  • no scaling

    js_no_scaling.mp4
  • scale by sqrt(2)

    js_scaled.mp4
  • scale by 2

    js_scaled2.mp4
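
The constant-factor scaling described above could be sketched roughly as follows. `scaleWaveform` is a hypothetical helper, not part of Transformers.js or wavefile, and the clamp to [-1, 1] is an added assumption to keep samples inside the nominal float range; the small array stands in for `out.audio`.

```javascript
// Hypothetical helper (not part of Transformers.js or wavefile).
// Returns a new Float32Array with every sample multiplied by `gain`.
function scaleWaveform(audio, gain) {
    const scaled = new Float32Array(audio.length);
    for (let i = 0; i < audio.length; ++i) {
        // Assumption: clamp to [-1, 1] so a large gain cannot push samples out of range.
        scaled[i] = Math.max(-1, Math.min(1, audio[i] * gain));
    }
    return scaled;
}

// Stand-in for `out.audio` returned by the synthesizer.
const audio = new Float32Array([0.25, -0.5, 0.75]);
const louder = scaleWaveform(audio, Math.SQRT2); // "scale by sqrt(2)" variant
```

The result can then be passed to `wav.fromScratch(...)` in place of `out.audio`.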

@xenova xenova linked an issue Oct 23, 2023 that may be closed by this pull request
@xenova xenova merged commit 4a991bd into main Oct 23, 2023
4 checks passed
@xenova xenova deleted the speecht5 branch November 21, 2023 01:44
@flatsiedatsie
Contributor

I tried to use WebGPU in v3 to increase the generation speed, but got an error. Would it be difficult to add WebGPU support for TextToSpeech?

Development

Successfully merging this pull request may close these issues.

  • [Feature request] Text to Speach
  • [Feature request] Add text-to-speech with SpeechT5
3 participants