-
Notifications
You must be signed in to change notification settings - Fork 9.8k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
d6d466b
commit 108f28a
Showing
7 changed files
with
218 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,203 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Steering Text-to-Speech for more dynamic audio generation\n", | ||
"\n", | ||
"Our traditional [TTS APIs](https://platform.openai.com/docs/guides/text-to-speech) don't have the ability to steer the voice of the generated audio. For example, if you wanted to convert a paragraph of text to audio, you would not be able to give any specific instructions on audio generation.\n", | ||
"\n", | ||
"With [audio chat completions](https://platform.openai.com/docs/guides/audio/quickstart), you can give specific instructions before generating the audio. This allows you to tell the API to speak at different speeds, tones, and accents. With appropriate instructions, these voices can be more dynamic, natural, and context-appropriate." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Traditional TTS\n", | ||
"\n", | ||
"Traditional TTS can specify voices, but not the tone, accent, or any other contextual audio parameters." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from openai import OpenAI\n", | ||
"client = OpenAI()\n", | ||
"\n", | ||
"tts_text = \"\"\"\n", | ||
"Once upon a time, Leo the lion cub woke up to the smell of pancakes and scrambled eggs.\n", | ||
"His tummy rumbled with excitement as he raced to the kitchen. Mama Lion had made a breakfast feast!\n", | ||
"Leo gobbled up his pancakes, sipped his orange juice, and munched on some juicy berries.\n", | ||
"\"\"\"\n", | ||
"\n", | ||
"speech_file_path = \"./sounds/default_tts.mp3\"\n", | ||
"response = client.audio.speech.create(\n", | ||
" model=\"tts-1-hd\",\n", | ||
" voice=\"alloy\",\n", | ||
" input=tts_text,\n", | ||
")\n", | ||
"\n", | ||
"response.write_to_file(speech_file_path)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Chat Completions TTS\n", | ||
"\n", | ||
"With chat completions, you can give specific instructions before generating the audio. In the following example, we generate a British accent in a learning setting for children. This is particularly useful for educational applications where the voice of the assistant is important for the learning experience." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import base64\n", | ||
"\n", | ||
"speech_file_path = \"./sounds/chat_completions_tts.mp3\"\n", | ||
"completion = client.chat.completions.create(\n", | ||
" model=\"gpt-4o-audio-preview\",\n", | ||
" modalities=[\"text\", \"audio\"],\n", | ||
" audio={\"voice\": \"alloy\", \"format\": \"mp3\"},\n", | ||
" messages=[\n", | ||
" {\n", | ||
" \"role\": \"system\",\n", | ||
" \"content\": \"You are a helpful assistant that can generate audio from text. Speak in a British accent and enunciate like you're talking to a child.\",\n", | ||
" },\n", | ||
" {\n", | ||
" \"role\": \"user\",\n", | ||
" \"content\": tts_text,\n", | ||
" }\n", | ||
" ],\n", | ||
")\n", | ||
"\n", | ||
"mp3_bytes = base64.b64decode(completion.choices[0].message.audio.data)\n", | ||
"with open(speech_file_path, \"wb\") as f:\n", | ||
" f.write(mp3_bytes)\n", | ||
"\n", | ||
"speech_file_path = \"./sounds/chat_completions_tts_fast.mp3\"\n", | ||
"completion = client.chat.completions.create(\n", | ||
" model=\"gpt-4o-audio-preview\",\n", | ||
" modalities=[\"text\", \"audio\"],\n", | ||
" audio={\"voice\": \"alloy\", \"format\": \"mp3\"},\n", | ||
" messages=[\n", | ||
" {\n", | ||
" \"role\": \"system\",\n", | ||
" \"content\": \"You are a helpful assistant that can generate audio from text. Speak in a British accent and speak really fast.\",\n", | ||
" },\n", | ||
" {\n", | ||
" \"role\": \"user\",\n", | ||
" \"content\": tts_text,\n", | ||
" }\n", | ||
" ],\n", | ||
")\n", | ||
"\n", | ||
"mp3_bytes = base64.b64decode(completion.choices[0].message.audio.data)\n", | ||
"with open(speech_file_path, \"wb\") as f:\n", | ||
" f.write(mp3_bytes)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Chat Completions Multilingual TTS\n", | ||
"\n", | ||
"We can also generate audio in different language accents. In the following example, we generate audio in a specific Spanish Uruguayan accent." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 12, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Había una vez un leoncito llamado Leo que se despertó con el aroma de panqueques y huevos revueltos. Su pancita gruñía de emoción mientras corría hacia la cocina. ¡Mamá León había preparado un festín de desayuno! Leo devoró sus panqueques, sorbió su jugo de naranja y mordisqueó algunas bayas jugosas.\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"completion = client.chat.completions.create(\n", | ||
" model=\"gpt-4o\",\n", | ||
" messages=[\n", | ||
" {\n", | ||
" \"role\": \"system\",\n", | ||
" \"content\": \"You are an expert translator. Translate any text given into Spanish like you are from Uruguay.\",\n", | ||
" },\n", | ||
" {\n", | ||
" \"role\": \"user\",\n", | ||
" \"content\": tts_text,\n", | ||
" }\n", | ||
" ],\n", | ||
")\n", | ||
"translated_text = completion.choices[0].message.content\n", | ||
"print(translated_text)\n", | ||
"\n", | ||
"speech_file_path = \"./sounds/chat_completions_tts_es_uy.mp3\"\n", | ||
"completion = client.chat.completions.create(\n", | ||
" model=\"gpt-4o-audio-preview\",\n", | ||
" modalities=[\"text\", \"audio\"],\n", | ||
" audio={\"voice\": \"alloy\", \"format\": \"mp3\"},\n", | ||
" messages=[\n", | ||
" {\n", | ||
" \"role\": \"system\",\n", | ||
" \"content\": \"You are a helpful assistant that can generate audio from text. Speak any text that you receive in a Uruguayan spanish accent and more slowly.\",\n", | ||
" },\n", | ||
" {\n", | ||
" \"role\": \"user\",\n", | ||
" \"content\": translated_text,\n", | ||
" }\n", | ||
" ],\n", | ||
")\n", | ||
"\n", | ||
"mp3_bytes = base64.b64decode(completion.choices[0].message.audio.data)\n", | ||
"with open(speech_file_path, \"wb\") as f:\n", | ||
" f.write(mp3_bytes)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Conclusion\n", | ||
"\n", | ||
"The ability to steer the voice of the generated audio opens up a lot of possibilities for richer audio experiences. There are many use cases such as:\n", | ||
"- **Enhanced Expressiveness**: Steerable TTS allows adjustments in tone, pitch, speed, and emotion, enabling the voice to convey different moods (e.g., excitement, calmness, urgency).\n", | ||
"- **Language learning and education**: Steerable TTS can mimic accents, inflections, and pronunciation, which is beneficial for language learners and educational applications where accurate intonation and emphasis are critical.\n", | ||
"- **Contextual Voice**: Steerable TTS adapts the voice to fit the content’s context, such as formal tones for professional documents or friendly, conversational styles for social interactions. This helps create more natural conversations in virtual assistants and chatbots.\n" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.12.6" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters