local audio wip, maybe mvp
still some issues cancelling the bot
but other than that, it works with headphones!

headphone notes
vipyne committed Dec 13, 2024
1 parent 0660836 commit cdf0a78
271 changes: 271 additions & 0 deletions 002-hello-pipecat-nim-local-audio.ipynb
@@ -0,0 +1,271 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "1db60caf-a890-4e62-8255-62fd691cd6e6",
"metadata": {},
"source": [
"# Voice AI Agents: Conversational AI Framework for the Enterprise\n",
"In this notebook, we walk through how to craft and deploy a voice AI bot using Pipecat AI. We illustrate the basic Pipecat flow with the `nvidia/llama-3.1-nemotron-70b-instruct` LLM model and Riva for STT (Speech-To-Text) & TTS (Text-To-Speech). However, Pipecat is not opinionated and other models and STT/TTS services can easily be used. See [Pipecat documentation](https://docs.pipecat.ai/server/services/supported-services#supported-services) for other supported services.\n",
"\n",
"Pipecat AI is an open-source framework for building voice and multimodal conversational agents. Pipecat simplifies the complex voice-to-voice AI pipeline, and lets developers build AI capabilities easily and with Open Source, commercial, and custom models. See [Pipecat's Core Concepts](https://docs.pipecat.ai/getting-started/core-concepts) for a deep dive into how it works.\n",
"\n",
"The framework was developed by Daily, a company that has provided real-time video and audio communication infrastructure since 2016. It is fully vendor neutral and is not tightly coupled to Daily's infrastructure.\n",
"\n",
"> ## 🤖🎧 Use headphones for this demo! 🎧🤖"
]
},
{
"cell_type": "markdown",
"id": "9b4fa7d7-88fb-4b33-8145-ee1a91e58af1",
"metadata": {},
"source": [
"## Step 1 - Install dependencies\n",
"First we set our environment.\n",
"\n",
"We use Daily for transport, OpenAI for context aggregation, Riva for TTS & TTS, and Silero for VAD (Voice Activity Detection). If using different services, for example Cartesia for TTS, one would run `pip install pipecat-ai[cartesia]`.\n",
"\n",
 [Development">
"> [Development note]: We're installing from the GitHub `main` branch here to ensure we have the latest improvements. Once the next Pipecat release is out, we will install only the extras we actually use."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "718d7f76-bb78-4614-ab77-229ed3eea402",
"metadata": {},
"outputs": [],
"source": [
"!pip install python-dotenv\n",
"%load_ext dotenv\n",
"%dotenv\n",
"\n",
"!pip install \"git+https://github.com/pipecat-ai/pipecat.git@main\"\n",
"# !pip install \"pipecat-ai[daily,local,openai,riva,silero]\""
]
},
{
"cell_type": "markdown",
"id": "7979c5d1-97a9-42e7-9de2-88b7d31b1409",
"metadata": {},
"source": [
"## Step 2 - Configure local audio transport for WebRTC communication\n",
"- Enable audio input and output for text-to-speech playback and enable VAD"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6136649d-2d26-4ca0-93da-6f6626e97c32",
"metadata": {},
"outputs": [],
"source": [
"from pipecat.audio.vad.silero import SileroVADAnalyzer\n",
"from pipecat.transports.base_transport import TransportParams\n",
"from pipecat.transports.local.audio import LocalAudioTransport\n",
"\n",
"transport = LocalAudioTransport(\n",
" TransportParams(\n",
" audio_out_enabled=True,\n",
" audio_in_enabled=True,\n",
" vad_enabled=True,\n",
" vad_analyzer=SileroVADAnalyzer(),\n",
" vad_audio_passthrough=True,\n",
" audio_out_is_live=True,\n",
" )\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "8506527e-b84c-49e1-8af4-223fdb33f582",
"metadata": {},
"source": [
"## Step 3 - Initialize LLM, STT, and TTS services\n",
"We can customize options, for example a different LLM `model` or `voice_id` for FastPitch TTS."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "623d77d5-c183-43d0-980d-fd99a2836365",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from pipecat.services.nim import NimLLMService\n",
"from pipecat.services.riva import FastPitchTTSService, ParakeetSTTService\n",
"\n",
"stt = ParakeetSTTService(api_key=os.getenv(\"NVIDIA_API_KEY\"))\n",
"\n",
"llm = NimLLMService(\n",
" api_key=os.getenv(\"NVIDIA_API_KEY\"), model=\"meta/llama-3.1-70b-instruct\"\n",
")\n",
"\n",
"tts = FastPitchTTSService(api_key=os.getenv(\"NVIDIA_API_KEY\"))"
]
},
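{
"cell_type": "markdown",
"id": "f3a1c2d4-5e6b-4a7c-8d9e-0a1b2c3d4e5f",
"metadata": {},
"source": [
"As a sketch of the customization mentioned above: swapping the LLM `model`, or passing a `voice_id` to FastPitch TTS, is just a constructor argument. The `voice_id` value below is a hypothetical example; check the Riva documentation for the voices actually available, then uncomment and edit."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a7b8c9d0-1e2f-4a3b-8c4d-5e6f7a8b9c0d",
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: customizing the services above (the voice_id value is hypothetical)\n",
"# llm = NimLLMService(\n",
"#     api_key=os.getenv(\"NVIDIA_API_KEY\"),\n",
"#     model=\"meta/llama-3.1-70b-instruct\",  # any NIM-hosted model name\n",
"# )\n",
"#\n",
"# tts = FastPitchTTSService(\n",
"#     api_key=os.getenv(\"NVIDIA_API_KEY\"),\n",
"#     voice_id=\"English-US.Female-1\",  # hypothetical; see Riva docs for voices\n",
"# )"
]
},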
{
"cell_type": "markdown",
"id": "ac150732-cbb4-4c70-8d31-cab5ae51b5fb",
"metadata": {},
"source": [
"## Step 4 - Define prompt and initialize context aggregator\n",
"Edit the prompt as desired."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0d884775-c4c0-49eb-b502-d4c855cc8e3b",
"metadata": {},
"outputs": [],
"source": [
"from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext\n",
"\n",
"messages = [\n",
" {\n",
" \"role\": \"system\",\n",
" \"content\": \"You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way that makes a cat pun if it is possible.\",\n",
" },\n",
"]\n",
"\n",
"context = OpenAILLMContext(messages)\n",
"context_aggregator = llm.create_context_aggregator(context)"
]
},
{
"cell_type": "markdown",
"id": "0752c614-a65d-4c61-965f-26d7b46f8153",
"metadata": {},
"source": [
"## Step 5 - Create pipeline\n",
"Here we align the services into a pipeline to process speech into text, send to llm, then turn the llm response text into speech."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f8620a2-4caa-40c5-88d9-8aca2743157e",
"metadata": {},
"outputs": [],
"source": [
"from pipecat.pipeline.pipeline import Pipeline\n",
"\n",
"pipeline = Pipeline(\n",
" [\n",
" transport.input(), # Transport user input\n",
" stt, # STT\n",
" context_aggregator.user(), # User responses\n",
" llm, # LLM\n",
" tts, # TTS\n",
" transport.output(), # Transport bot output\n",
" context_aggregator.assistant(), # Assistant spoken responses\n",
" ]\n",
")"
]
},
{
"cell_type": "markdown",
"id": "ad9c588f-0c00-4414-984a-33da31e2803d",
"metadata": {},
"source": [
"## Step 6 - Create PipelineTask"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9fbadb9a-9778-4f0f-910f-5c53d117e593",
"metadata": {},
"outputs": [],
"source": [
"from pipecat.pipeline.task import PipelineParams, PipelineTask\n",
"\n",
"task = PipelineTask(pipeline, PipelineParams(allow_interruptions=True))"
]
},
{
"cell_type": "markdown",
"id": "b4890ce7-6a1a-4f39-b6af-9a3335ad9fcf",
"metadata": {},
"source": [
"## Step 7 - Create a pipeline runner\n",
"This manages the processing pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "87e504ab-b889-4b6a-96a1-159d42a95833",
"metadata": {},
"outputs": [],
"source": [
"from pipecat.pipeline.runner import PipelineRunner\n",
"\n",
"runner = PipelineRunner()"
]
},
{
"cell_type": "markdown",
"id": "08998f8d-ac33-4b38-b10a-01691f81636a",
"metadata": {},
"source": [
"## Step 8 - Run the bot and say \"hello\"!\n",
"\n",
"The first time you run the bot, it will load weights for a voice activity model into the local Python process. This will take 10-15 seconds. \n",
"The bot will wait for you to speak first. \n",
"\n",
"> ### 🎧 Remember to use headphones!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "92a411cb-d2c8-4446-be69-b391486e853e",
"metadata": {},
"outputs": [],
"source": [
"await runner.run(task)"
]
},
{
"cell_type": "markdown",
"id": "910007a5-7800-493d-b5ec-e3bb1442cac1",
"metadata": {},
"source": [
"## Step 9: Stop the bot"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9a9fe93f-f00f-42b3-b7d3-7497c1649a43",
"metadata": {},
"outputs": [],
"source": [
"await runner.cancel()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python3.12",
"language": "python",
"name": "venv"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
