Add support for text-to-speech (w/ Speecht5) #345

Merged
merged 22 commits into from
Oct 23, 2023
3 changes: 2 additions & 1 deletion README.md
@@ -116,6 +116,7 @@ Want to jump straight in? Get started with one of our sample applications/templa
| Semantic Image Search (server-side) | Search for images with text (Supabase) | [code](./examples/semantic-image-search/), [demo](https://huggingface.co/spaces/Xenova/semantic-image-search) |
| Vanilla JavaScript | In-browser object detection | [video](https://scrimba.com/scrim/cKm9bDAg), [code](./examples/vanilla-js/), [demo](https://huggingface.co/spaces/Scrimba/vanilla-js-object-detector) |
| React | Multilingual translation website | [code](./examples/react-translator/), [demo](https://huggingface.co/spaces/Xenova/react-translator) |
| Text to speech (client-side) | In-browser speech synthesis | [code](./examples/text-to-speech-client/), [demo](https://huggingface.co/spaces/Xenova/text-to-speech-client) |
| Browser extension | Text classification extension | [code](./examples/extension/) |
| Electron | Text classification application | [code](./examples/electron/) |
| Next.js (client-side) | Sentiment analysis (in-browser inference) | [code](./examples/next-client/), [demo](https://huggingface.co/spaces/Xenova/next-example-app) |
@@ -222,7 +223,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
| [Audio Classification](https://huggingface.co/tasks/audio-classification) | `audio-classification` | Assigning a label or class to a given audio. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.AudioClassificationPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=audio-classification&library=transformers.js) |
| [Audio-to-Audio](https://huggingface.co/tasks/audio-to-audio) | n/a | Generating audio from an input audio source. | ❌ |
| [Automatic Speech Recognition](https://huggingface.co/tasks/automatic-speech-recognition) | `automatic-speech-recognition` | Transcribing a given audio into text. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.AutomaticSpeechRecognitionPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&library=transformers.js) |
| [Text-to-Speech](https://huggingface.co/tasks/text-to-speech) | n/a | Generating natural-sounding speech given text input. | |
| [Text-to-Speech](https://huggingface.co/tasks/text-to-speech) | `text-to-speech` or `text-to-audio` | Generating natural-sounding speech given text input. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.TextToAudioPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=text-to-audio&library=transformers.js) |
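
The new `text-to-speech` task in the row above might be used roughly as follows. This is a minimal sketch rather than the library's documented example: the helper name `synthesize`, the model id `Xenova/speecht5_tts`, the `quantized: false` option, and the `speaker_embeddings` option are assumptions to check against the TextToAudioPipeline docs linked in the table. The `pipeline` function is passed in as a parameter so that the sketch can be exercised without downloading a model.

```javascript
// Hypothetical helper (not part of this PR): wraps a text-to-speech pipeline call.
// `pipeline` is expected to be the function exported by @xenova/transformers.
async function synthesize(pipeline, text, speakerEmbeddings) {
  // Assumed task id, model id, and option; verify against the library docs.
  const synthesizer = await pipeline('text-to-speech', 'Xenova/speecht5_tts', {
    quantized: false,
  });

  // Assumed result shape: { audio: Float32Array, sampling_rate: number }.
  const { audio, sampling_rate } = await synthesizer(text, {
    speaker_embeddings: speakerEmbeddings,
  });

  // Duration in seconds follows directly from sample count and sample rate.
  return { audio, sampling_rate, seconds: audio.length / sampling_rate };
}
```

In an application, `pipeline` would come from `import { pipeline } from '@xenova/transformers'`, and `speakerEmbeddings` would point at the embedding data for the chosen voice.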


#### Tabular
1 change: 1 addition & 0 deletions docs/snippets/3_examples.snippet
@@ -9,6 +9,7 @@ Want to jump straight in? Get started with one of our sample applications/templa
| Semantic Image Search (server-side) | Search for images with text (Supabase) | [code](./examples/semantic-image-search/), [demo](https://huggingface.co/spaces/Xenova/semantic-image-search) |
| Vanilla JavaScript | In-browser object detection | [video](https://scrimba.com/scrim/cKm9bDAg), [code](./examples/vanilla-js/), [demo](https://huggingface.co/spaces/Scrimba/vanilla-js-object-detector) |
| React | Multilingual translation website | [code](./examples/react-translator/), [demo](https://huggingface.co/spaces/Xenova/react-translator) |
| Text to speech (client-side) | In-browser speech synthesis | [code](./examples/text-to-speech-client/), [demo](https://huggingface.co/spaces/Xenova/text-to-speech-client) |
| Browser extension | Text classification extension | [code](./examples/extension/) |
| Electron | Text classification application | [code](./examples/electron/) |
| Next.js (client-side) | Sentiment analysis (in-browser inference) | [code](./examples/next-client/), [demo](https://huggingface.co/spaces/Xenova/next-example-app) |
2 changes: 1 addition & 1 deletion docs/snippets/5_supported-tasks.snippet
@@ -38,7 +38,7 @@
| [Audio Classification](https://huggingface.co/tasks/audio-classification) | `audio-classification` | Assigning a label or class to a given audio. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.AudioClassificationPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=audio-classification&library=transformers.js) |
| [Audio-to-Audio](https://huggingface.co/tasks/audio-to-audio) | n/a | Generating audio from an input audio source. | ❌ |
| [Automatic Speech Recognition](https://huggingface.co/tasks/automatic-speech-recognition) | `automatic-speech-recognition` | Transcribing a given audio into text. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.AutomaticSpeechRecognitionPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&library=transformers.js) |
| [Text-to-Speech](https://huggingface.co/tasks/text-to-speech) | n/a | Generating natural-sounding speech given text input. | |
| [Text-to-Speech](https://huggingface.co/tasks/text-to-speech) | `text-to-speech` or `text-to-audio` | Generating natural-sounding speech given text input. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.TextToAudioPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=text-to-audio&library=transformers.js) |


#### Tabular
20 changes: 20 additions & 0 deletions examples/text-to-speech-client/.eslintrc.cjs
@@ -0,0 +1,20 @@
module.exports = {
root: true,
env: { browser: true, es2020: true },
extends: [
'eslint:recommended',
'plugin:react/recommended',
'plugin:react/jsx-runtime',
'plugin:react-hooks/recommended',
],
ignorePatterns: ['dist', '.eslintrc.cjs'],
parserOptions: { ecmaVersion: 'latest', sourceType: 'module' },
settings: { react: { version: '18.2' } },
plugins: ['react-refresh'],
rules: {
'react-refresh/only-export-components': [
'warn',
{ allowConstantExport: true },
],
},
}
24 changes: 24 additions & 0 deletions examples/text-to-speech-client/.gitignore
@@ -0,0 +1,24 @@
# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*
pnpm-debug.log*
lerna-debug.log*

node_modules
dist
dist-ssr
*.local

# Editor directories and files
.vscode/*
!.vscode/extensions.json
.idea
.DS_Store
*.suo
*.ntvs*
*.njsproj
*.sln
*.sw?
8 changes: 8 additions & 0 deletions examples/text-to-speech-client/README.md
@@ -0,0 +1,8 @@
# React + Vite

This template provides a minimal setup to get React working in Vite with HMR and some ESLint rules.

Currently, two official plugins are available:

- [@vitejs/plugin-react](https://github.com/vitejs/vite-plugin-react/blob/main/packages/plugin-react/README.md) uses [Babel](https://babeljs.io/) for Fast Refresh
- [@vitejs/plugin-react-swc](https://github.com/vitejs/vite-plugin-react-swc) uses [SWC](https://swc.rs/) for Fast Refresh
12 changes: 12 additions & 0 deletions examples/text-to-speech-client/index.html
@@ -0,0 +1,12 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Transformers.js - Text-to-speech demo</title>
</head>
<body>
<div id="root"></div>
<script type="module" src="/src/main.jsx"></script>
</body>
</html>
30 changes: 30 additions & 0 deletions examples/text-to-speech-client/package.json
@@ -0,0 +1,30 @@
{
"name": "text-to-speech-client",
"private": true,
"version": "0.0.0",
"type": "module",
"scripts": {
"dev": "vite",
"build": "vite build",
"lint": "eslint . --ext js,jsx --report-unused-disable-directives --max-warnings 0",
"preview": "vite preview"
},
"dependencies": {
"@xenova/transformers": "^2.7.0",
"react": "^18.2.0",
"react-dom": "^18.2.0"
},
"devDependencies": {
"@types/react": "^18.2.15",
"@types/react-dom": "^18.2.7",
"@vitejs/plugin-react": "^4.0.3",
"autoprefixer": "^10.4.16",
"eslint": "^8.45.0",
"eslint-plugin-react": "^7.32.2",
"eslint-plugin-react-hooks": "^4.6.0",
"eslint-plugin-react-refresh": "^0.4.3",
"postcss": "^8.4.31",
"tailwindcss": "^3.3.3",
"vite": "^4.4.5"
}
}
6 changes: 6 additions & 0 deletions examples/text-to-speech-client/postcss.config.js
@@ -0,0 +1,6 @@
export default {
plugins: {
tailwindcss: {},
autoprefixer: {},
},
}
162 changes: 162 additions & 0 deletions examples/text-to-speech-client/src/App.jsx
@@ -0,0 +1,162 @@
import React, { useState, useEffect, useRef } from 'react';

import AudioPlayer from './components/AudioPlayer';
import Progress from './components/Progress';
import { SPEAKERS, DEFAULT_SPEAKER } from './constants';

const App = () => {

// Model loading
const [ready, setReady] = useState(null);
const [disabled, setDisabled] = useState(false);
const [progressItems, setProgressItems] = useState([]);

// Inputs and outputs
const [text, setText] = useState('I love Hugging Face!');
const [selectedSpeaker, setSelectedSpeaker] = useState(DEFAULT_SPEAKER);
const [output, setOutput] = useState(null);

// Create a reference to the worker object.
const worker = useRef(null);

// We use the `useEffect` hook to setup the worker as soon as the `App` component is mounted.
useEffect(() => {
if (!worker.current) {
// Create the worker if it does not yet exist.
worker.current = new Worker(new URL('./worker.js', import.meta.url), {
type: 'module'
});
}

// Create a callback function for messages from the worker thread.
const onMessageReceived = (e) => {
switch (e.data.status) {
case 'initiate':
// Model file started loading: add a new progress item to the list.
setReady(false);
setProgressItems(prev => [...prev, e.data]);
break;

case 'progress':
// Model file progress: update one of the progress items.
setProgressItems(
prev => prev.map(item => {
if (item.file === e.data.file) {
return { ...item, progress: e.data.progress }
}
return item;
})
);
break;

case 'done':
// Model file loaded: remove the progress item from the list.
setProgressItems(
prev => prev.filter(item => item.file !== e.data.file)
);
break;

case 'ready':
// Pipeline ready: the worker is ready to accept messages.
setReady(true);
break;

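case 'complete': {
// Generation complete: re-enable the "Generate" button and expose the audio.
setDisabled(false);

// The braces scope `blobUrl` to this case (avoids ESLint's no-case-declarations error).
const blobUrl = URL.createObjectURL(e.data.output);
setOutput(blobUrl);
break;
}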
}
};

// Attach the callback function as an event listener.
worker.current.addEventListener('message', onMessageReceived);

// Define a cleanup function for when the component is unmounted.
return () => worker.current.removeEventListener('message', onMessageReceived);
}, []); // Empty dependency array: attach the listener once on mount instead of on every render.


const handleGenerateSpeech = () => {
setDisabled(true);
worker.current.postMessage({
text,
speaker_id: selectedSpeaker,
});
};

const isLoading = ready === false;
return (
<div className='min-h-screen flex items-center justify-center bg-gray-100'>
<div className='absolute gap-1 z-50 top-0 left-0 w-full h-full transition-all px-8 flex flex-col justify-center text-center' style={{
opacity: isLoading ? 1 : 0,
pointerEvents: isLoading ? 'all' : 'none',
background: 'rgba(0, 0, 0, 0.9)',
backdropFilter: 'blur(8px)',
}}>
{isLoading && (
<label className='text-white text-xl p-3'>Loading models... (only run once)</label>
)}
{progressItems.map(data => (
<div key={`${data.name}/${data.file}`}>
<Progress text={`${data.name}/${data.file}`} percentage={data.progress} />
</div>
))}
</div>
<div className='bg-white p-8 rounded-lg shadow-lg w-full max-w-xl m-2'>
<h1 className='text-3xl font-semibold text-gray-800 mb-1 text-center'>In-browser Text to Speech</h1>
<h2 className='text-base font-medium text-gray-700 mb-2 text-center'>Made with <a href='https://huggingface.co/docs/transformers.js'>🤗 Transformers.js</a></h2>
<div className='mb-4'>
<label htmlFor='text' className='block text-sm font-medium text-gray-600'>
Text
</label>
<textarea
id='text'
className='border border-gray-300 rounded-md p-2 w-full'
rows='4'
placeholder='Enter text here'
value={text}
onChange={(e) => setText(e.target.value)}
></textarea>
</div>
<div className='mb-4'>
<label htmlFor='speaker' className='block text-sm font-medium text-gray-600'>
Speaker
</label>
<select
id='speaker'
className='border border-gray-300 rounded-md p-2 w-full'
value={selectedSpeaker}
onChange={(e) => setSelectedSpeaker(e.target.value)}
>
{Object.entries(SPEAKERS).map(([key, value]) => (
<option key={key} value={value}>
{key}
</option>
))}
</select>
</div>
<div className='flex justify-center'>
<button
className={`${disabled
? 'bg-gray-400 cursor-not-allowed'
: 'bg-blue-500 hover:bg-blue-600'
} text-white rounded-md py-2 px-4`}
onClick={handleGenerateSpeech}
disabled={disabled}
>
{disabled ? 'Generating...' : 'Generate'}
</button>
</div>
{output && <AudioPlayer
audioUrl={output}
mimeType={'audio/wav'}
/>}
</div>
</div>
);
};

export default App;
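
The progress handling inside `onMessageReceived` above can also be read as a pure state transition on the list of progress items. The sketch below restates that logic outside React so it can be checked in isolation; the function name `applyProgressEvent` is hypothetical and does not appear in the PR.

```javascript
// Hypothetical pure version of the progress-list updates in App.jsx.
// `items` is the current list; `event` is the `e.data` payload from the worker.
function applyProgressEvent(items, event) {
  switch (event.status) {
    case 'initiate':
      // A model file started loading: track it.
      return [...items, event];

    case 'progress':
      // Update the matching file's progress; leave the others untouched.
      return items.map(item =>
        item.file === event.file ? { ...item, progress: event.progress } : item
      );

    case 'done':
      // The file finished loading: stop tracking it.
      return items.filter(item => item.file !== event.file);

    default:
      // 'ready', 'complete', and unknown statuses do not affect the list.
      return items;
  }
}
```

Inside the component, the same function could be used as `setProgressItems(prev => applyProgressEvent(prev, e.data))`, which keeps the `switch` in the message handler focused on the `ready` and `complete` cases.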
26 changes: 26 additions & 0 deletions examples/text-to-speech-client/src/components/AudioPlayer.jsx
@@ -0,0 +1,26 @@
import { useEffect, useRef } from "react";

export default function AudioPlayer({ audioUrl, mimeType }) {
const audioPlayer = useRef(null);
const audioSource = useRef(null);

// Updates src when url changes
useEffect(() => {
if (audioPlayer.current && audioSource.current) {
audioSource.current.src = audioUrl;
audioPlayer.current.load();
}
}, [audioUrl]);

return (
<div className='flex relative z-10 my-4 w-full'>
<audio
ref={audioPlayer}
controls
className='w-full h-14 rounded-lg bg-white shadow-xl shadow-black/5 ring-1 ring-slate-700/10'
>
<source ref={audioSource} type={mimeType}></source>
</audio>
</div>
);
}
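
`AudioPlayer` receives a blob URL together with the `audio/wav` MIME type, so whatever produces `e.data.output` has to emit WAV-formatted bytes. The worker is not shown in this part of the diff; the following is a minimal sketch of one way to do that, assuming mono audio as a `Float32Array` in the range [-1, 1] and encoding it as 16-bit PCM. The name `encodeWav16` is hypothetical, and the PR's own encoder may use a different sample format.

```javascript
// Hypothetical helper: encode mono Float32 samples as a 16-bit PCM WAV file.
function encodeWav16(samples, sampleRate) {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize); // 44-byte header + sample data
  const view = new DataView(buffer);

  const writeString = (offset, s) => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };

  writeString(0, 'RIFF');
  view.setUint32(4, 36 + dataSize, true);                 // file size minus 8 bytes
  writeString(8, 'WAVE');
  writeString(12, 'fmt ');
  view.setUint32(16, 16, true);                           // fmt chunk size
  view.setUint16(20, 1, true);                            // audio format: PCM
  view.setUint16(22, 1, true);                            // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true);  // byte rate
  view.setUint16(32, bytesPerSample, true);               // block align
  view.setUint16(34, 16, true);                           // bits per sample
  writeString(36, 'data');
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1], then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
```

The returned buffer can be wrapped as `new Blob([buffer], { type: 'audio/wav' })` and posted to the main thread, where `URL.createObjectURL` turns it into the `audioUrl` prop.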
12 changes: 12 additions & 0 deletions examples/text-to-speech-client/src/components/Progress.jsx
@@ -0,0 +1,12 @@

export default function Progress({ text, percentage }) {
percentage ??= 0;
return (
<div className="relative text-black bg-white rounded-lg text-left overflow-hidden">
<div className='px-2 w-[1%] h-full bg-blue-500 whitespace-nowrap' style={{ width: `${percentage}%` }}>
{text} ({`${percentage.toFixed(2)}%`})
</div>
</div>
);
}

11 changes: 11 additions & 0 deletions examples/text-to-speech-client/src/constants.js
@@ -0,0 +1,11 @@
export const SPEAKERS = {
"US female 1": "cmu_us_slt_arctic-wav-arctic_a0001",
"US female 2": "cmu_us_clb_arctic-wav-arctic_a0001",
"US male 1": "cmu_us_bdl_arctic-wav-arctic_a0003",
"US male 2": "cmu_us_rms_arctic-wav-arctic_a0003",
"Canadian male": "cmu_us_jmk_arctic-wav-arctic_a0002",
"Scottish male": "cmu_us_awb_arctic-wav-arctic_b0002",
"Indian male": "cmu_us_ksp_arctic-wav-arctic_a0007",
}

export const DEFAULT_SPEAKER = "cmu_us_slt_arctic-wav-arctic_a0001";
21 changes: 21 additions & 0 deletions examples/text-to-speech-client/src/index.css
@@ -0,0 +1,21 @@
@tailwind base;
@tailwind components;
@tailwind utilities;

:root {
font-family: Inter, system-ui, Avenir, Helvetica, Arial, sans-serif;
line-height: 1.5;
font-weight: 400;
color: #213547;
background-color: #ffffff;

font-synthesis: none;
text-rendering: optimizeLegibility;
-webkit-font-smoothing: antialiased;
-moz-osx-font-smoothing: grayscale;
-webkit-text-size-adjust: 100%;
}

audio::-webkit-media-controls-panel {
background-color: white;
}
10 changes: 10 additions & 0 deletions examples/text-to-speech-client/src/main.jsx
@@ -0,0 +1,10 @@
import React from 'react'
import ReactDOM from 'react-dom/client'
import App from './App.jsx'
import './index.css'

ReactDOM.createRoot(document.getElementById('root')).render(
<React.StrictMode>
<App />
</React.StrictMode>,
)