diff --git a/basics/uninterruptable/README.md b/basics/uninterruptable/README.md
new file mode 100644
index 0000000..eab13e0
--- /dev/null
+++ b/basics/uninterruptable/README.md
@@ -0,0 +1,131 @@
+# Uninterruptable Agent
+
+A voice assistant that demonstrates non-interruptible speech behavior using LiveKit's voice agents, useful for delivering messages that must be heard in full.
+
+## Overview
+
+**Uninterruptable Agent** - A voice-enabled assistant configured to complete its responses without being interrupted by user speech, demonstrating the `allow_interruptions=False` configuration option.
+
+## Features
+
+- **Simple Configuration**: A single parameter controls interruption behavior
+- **Voice-Enabled**: Built using LiveKit's voice capabilities with support for:
+  - Speech-to-Text (STT) using Deepgram
+  - Large Language Model (LLM) using OpenAI GPT-4o
+  - Text-to-Speech (TTS) using OpenAI
+  - Interruption handling disabled while the agent is speaking (`allow_interruptions=False`)
+
+## How It Works
+
+1. User connects to the LiveKit room
+2. Agent automatically starts speaking a long test message
+3. User attempts to interrupt by speaking
+4. Agent continues speaking without stopping
+5. Only after the agent finishes is the user's input processed
+6. Subsequent responses are also uninterruptible
+
+## Prerequisites
+
+- Python 3.10+
+- `livekit-agents`>=1.0
+- LiveKit account and credentials
+- API keys for:
+  - OpenAI (for LLM and TTS capabilities)
+  - Deepgram (for speech-to-text)
+
+## Installation
+
+1. Clone the repository
+
+2. Install dependencies:
+   ```bash
+   pip install -r requirements.txt
+   ```
+
+3. Create a `.env` file in the parent directory with your API credentials:
+   ```
+   LIVEKIT_URL=your_livekit_url
+   LIVEKIT_API_KEY=your_api_key
+   LIVEKIT_API_SECRET=your_api_secret
+   OPENAI_API_KEY=your_openai_key
+   DEEPGRAM_API_KEY=your_deepgram_key
+   ```
+
+## Running the Agent
+
+```bash
+python uninterruptable.py dev
+```
+
+The agent will immediately start speaking a long message. Try interrupting it to observe the non-interruptible behavior.
+
+## Architecture Details
+
+### Key Configuration
+
+The critical setting that makes this agent uninterruptible:
+
+```python
+Agent(
+    instructions="...",
+    stt=deepgram.STT(),
+    llm=openai.LLM(model="gpt-4o"),
+    tts=openai.TTS(),
+    allow_interruptions=False  # This prevents interruptions
+)
+```
+
+### Behavior Comparison
+
+| Setting | User Speaks While Agent Talks | Result |
+|---------|------------------------------|---------|
+| `allow_interruptions=True` (default) | Agent stops mid-sentence | User input processed immediately |
+| `allow_interruptions=False` | Agent continues speaking | User input queued until agent finishes |
+
+### Testing Approach
+
+The agent automatically generates a long response on entry to facilitate testing:
+
+```python
+self.session.generate_reply(user_input="Say something somewhat long and boring so I can test if you're interruptable.")
+```
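+
+Putting the pieces together, here is a minimal sketch of the whole agent. It follows the entrypoint pattern used by the other examples in this repo; treat the class name and instructions as illustrative rather than a verbatim copy of `uninterruptable.py`:
+
+```python
+from pathlib import Path
+
+from dotenv import load_dotenv
+from livekit.agents import Agent, AgentSession, JobContext, WorkerOptions, cli
+from livekit.plugins import deepgram, openai
+
+load_dotenv(dotenv_path=Path(__file__).parent.parent / '.env')
+
+class UninterruptableAgent(Agent):
+    def __init__(self) -> None:
+        super().__init__(
+            instructions="You are a helpful agent.",
+            stt=deepgram.STT(),
+            llm=openai.LLM(model="gpt-4o"),
+            tts=openai.TTS(),
+            allow_interruptions=False,  # user speech never cuts off agent speech
+        )
+
+    async def on_enter(self):
+        # Kick off a long utterance so the behavior is easy to test
+        self.session.generate_reply(user_input="Say something somewhat long and boring so I can test if you're interruptable.")
+
+async def entrypoint(ctx: JobContext):
+    session = AgentSession()
+    await session.start(agent=UninterruptableAgent(), room=ctx.room)
+
+if __name__ == "__main__":
+    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
+```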
+
+## Use Cases
+
+### When to Use Uninterruptible Agents
+
+1. **Legal Disclaimers**: Must be read in full without interruption
+2. **Emergency Instructions**: Critical safety information
+3. **Tutorial Steps**: Sequential instructions that shouldn't be skipped
+4. **Terms and Conditions**: Required complete playback
+
+## Implementation Patterns
+
+### Selective Non-Interruption
+
+```python
+# Make only critical messages uninterruptible by overriding the
+# default for a single utterance
+async def say_critical(self, message: str):
+    await self.session.say(message, allow_interruptions=False)
+```
+
+## Important Considerations
+
+- **User Experience**: Non-interruptible agents can be frustrating if overused
+- **Message Length**: Keep uninterruptible segments reasonably short
+- **Clear Indication**: Consider informing users when interruption is disabled
+- **Fallback Options**: Provide alternative ways to skip or pause if needed
+
+## Example Interaction
+
+```
+Agent: [Starts long message] "I'm going to tell you a very long and detailed story about..."
+User: "Stop!" [Agent continues]
+Agent: "...and that's why the chicken crossed the road. The moral of the story is..."
+User: "Hey, wait!" [Agent still continues]
+Agent: "...patience is a virtue." [Finally finishes]
+User: "Finally! Can you hear me now?"
+Agent: "Yes, I can hear you now. How can I help?"
+```
\ No newline at end of file
diff --git a/basics/uninterruptable.py b/basics/uninterruptable/uninterruptable.py
similarity index 100%
rename from basics/uninterruptable.py
rename to basics/uninterruptable/uninterruptable.py
diff --git a/pipeline-stt/keyword-detection/README.md b/pipeline-stt/keyword-detection/README.md
new file mode 100644
index 0000000..270dc07
--- /dev/null
+++ b/pipeline-stt/keyword-detection/README.md
@@ -0,0 +1,89 @@
+# Keyword Detection Agent
+
+A voice agent that monitors user speech for predefined keywords and logs detections in real time, built with LiveKit's voice agents.
+
+## Overview
+
+**Keyword Detection Agent** - A voice-enabled agent that monitors user speech for predefined keywords and logs when they are detected.
+
+## Features
+
+- **Real-time Keyword Detection**: Monitors speech for specific keywords as users talk
+- **Custom STT Pipeline**: Intercepts the speech-to-text pipeline to detect keywords
+- **Logging System**: Logs detected keywords with proper formatting
+- **Voice-Enabled**: Built using voice capabilities with support for:
+  - Speech-to-Text (STT) using Deepgram
+  - Large Language Model (LLM) using OpenAI
+  - Text-to-Speech (TTS) using OpenAI
+  - Voice Activity Detection (VAD) using Silero
+
+## How It Works
+
+1. User connects to the LiveKit room
+2. Agent greets the user and starts a conversation
+3. As the user speaks, the custom STT pipeline monitors for keywords
+4. When keywords like "Shane", "hello", "thanks", or "bye" are detected, they are logged
+5. The agent continues normal conversation while monitoring in the background
+6. All speech continues to be processed by the LLM for responses
+
+## Prerequisites
+
+- Python 3.10+
+- `livekit-agents`>=1.0
+- LiveKit account and credentials
+- API keys for:
+  - OpenAI (for LLM and TTS capabilities)
+  - Deepgram (for speech-to-text)
+
+## Installation
+
+1. Clone the repository
+
+2. Install dependencies:
+   ```bash
+   pip install -r requirements.txt
+   ```
+
+3. Create a `.env` file in the parent directory with your API credentials:
+   ```
+   LIVEKIT_URL=your_livekit_url
+   LIVEKIT_API_KEY=your_api_key
+   LIVEKIT_API_SECRET=your_api_secret
+   OPENAI_API_KEY=your_openai_key
+   DEEPGRAM_API_KEY=your_deepgram_key
+   ```
+
+## Running the Agent
+
+```bash
+python keyword_detection.py console
+```
+
+The agent will start a conversation and monitor for keywords in the background. Try using words like "hello", "thanks", or "bye" in your speech and watch them appear in the logs.
+
+## Architecture Details
+
+### Main Classes
+
+- **KeywordDetectionAgent**: Custom agent class that extends the base Agent with keyword detection
+- **stt_node**: Overridden method that intercepts the STT pipeline to monitor for keywords
+
+### Keyword Detection Pipeline
+
+The agent overrides the `stt_node` method to create a custom processing pipeline (sketched below) that:
+1. Receives the parent STT stream
+2. Monitors final transcripts for keywords
+3. Logs detected keywords
+4. Passes all events through unchanged for normal processing
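+
+A hedged reconstruction of that override: the signature and outer fragments match the diff further down, but the diff omits the middle of the method, so the event-filtering loop here is an assumption about how it works (it also assumes `from livekit.agents import stt` alongside the file's other imports):
+
+```python
+async def stt_node(self, text: AsyncIterable[str], model_settings: Optional[dict] = None) -> Optional[AsyncIterable[rtc.AudioFrame]]:
+    keywords = ["Shane", "hello", "thanks", "bye"]
+    parent_stream = super().stt_node(text, model_settings)
+
+    if parent_stream is None:
+        return None
+
+    async def process_stream():
+        async for event in parent_stream:
+            # Only inspect final transcripts; interim results are noisy
+            if event.type == stt.SpeechEventType.FINAL_TRANSCRIPT and event.alternatives:
+                transcript = event.alternatives[0].text
+                for keyword in keywords:
+                    if keyword.lower() in transcript.lower():
+                        logger.info(f"Keyword detected: '{keyword}'")
+            # Pass every event through unchanged for normal processing
+            yield event
+
+    return process_stream()
+```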
+
+### Current Keywords
+
+The agent monitors for these keywords (case-insensitive):
+- "Shane"
+- "hello"
+- "thanks"
+- "bye"
+
+### Logging Output
+
+When keywords are detected, you'll see log messages like:
+```
+INFO:keyword-detection:Keyword detected: 'hello'
+INFO:keyword-detection:Keyword detected: 'thanks'
+```
\ No newline at end of file
diff --git a/pipeline-stt/keyword_detection.py b/pipeline-stt/keyword-detection/keyword_detection.py
similarity index 87%
rename from pipeline-stt/keyword_detection.py
rename to pipeline-stt/keyword-detection/keyword_detection.py
index 183aeaf..2d6f24a 100644
--- a/pipeline-stt/keyword_detection.py
+++ b/pipeline-stt/keyword-detection/keyword_detection.py
@@ -9,14 +9,14 @@
 load_dotenv(dotenv_path=Path(__file__).parent.parent / '.env')
 
-logger = logging.getLogger("listen-and-respond")
+logger = logging.getLogger("keyword-detection")
 logger.setLevel(logging.INFO)
 
-class SimpleAgent(Agent):
+class KeywordDetectionAgent(Agent):
     def __init__(self) -> None:
         super().__init__(
             instructions="""
-                You are a helpful agent.
+                You are a helpful agent that detects keywords in user speech.
             """,
             stt=deepgram.STT(),
             llm=openai.LLM(),
@@ -28,7 +28,7 @@ async def on_enter(self):
     async def on_enter(self):
         self.session.generate_reply()
 
     async def stt_node(self, text: AsyncIterable[str], model_settings: Optional[dict] = None) -> Optional[AsyncIterable[rtc.AudioFrame]]:
-        keywords = ["Shane", "hello", "thanks"]
+        keywords = ["Shane", "hello", "thanks", "bye"]
         parent_stream = super().stt_node(text, model_settings)
 
         if parent_stream is None:
@@ -53,7 +53,7 @@
     session = AgentSession()
 
     await session.start(
-        agent=SimpleAgent(),
+        agent=KeywordDetectionAgent(),
         room=ctx.room
     )
diff --git a/pipeline-stt/transcriber/README.md b/pipeline-stt/transcriber/README.md
new file mode 100644
index 0000000..5b3b66d
--- /dev/null
+++ b/pipeline-stt/transcriber/README.md
@@ -0,0 +1,85 @@
+# Transcriber Agent
+
+A speech-to-text logging agent that transcribes user speech and saves it to a file using LiveKit's voice agents.
+
+## Overview
+
+**Transcriber Agent** - A voice-enabled agent that listens to user speech, transcribes it using Deepgram STT, and logs all transcriptions with timestamps to a local file.
+
+## Features
+
+- **Real-time Transcription**: Converts speech to text as users speak
+- **Persistent Logging**: Saves all transcriptions to `user_speech_log.txt` with timestamps
+- **Voice-Enabled**: Built using LiveKit's voice capabilities with support for:
+  - Speech-to-Text (STT) using Deepgram
+  - Minimal agent configuration without LLM or TTS
+- **Event-Based Processing**: Uses the `user_input_transcribed` event for efficient transcript handling
+- **Automatic Timestamping**: Each transcription entry includes date and time
+
+## How It Works
+
+1. User connects to the LiveKit room
+2. Agent starts listening for speech input
+3. Deepgram STT processes the audio stream in real-time
+4. When a final transcript is ready, it triggers the `user_input_transcribed` event
+5. The transcript is appended to `user_speech_log.txt` with a timestamp
+6. The process continues for all subsequent speech
+
+## Prerequisites
+
+- Python 3.10+
+- `livekit-agents`>=1.0
+- LiveKit account and credentials
+- API keys for:
+  - Deepgram (for speech-to-text)
+
+## Installation
+
+1. Clone the repository
+
+2. Install dependencies:
+   ```bash
+   pip install -r requirements.txt
+   ```
+
+3. Create a `.env` file in the parent directory with your API credentials:
+   ```
+   LIVEKIT_URL=your_livekit_url
+   LIVEKIT_API_KEY=your_api_key
+   LIVEKIT_API_SECRET=your_api_secret
+   DEEPGRAM_API_KEY=your_deepgram_key
+   ```
+
+## Running the Agent
+
+```bash
+python transcriber.py console
+```
+
+The agent will start listening for speech and logging transcriptions to `user_speech_log.txt` in the current directory.
+
+## Architecture Details
+
+### Main Components
+
+- **AgentSession**: Manages the agent lifecycle and event handling
+- **user_input_transcribed Event**: Fired when Deepgram completes a transcription
+- **Transcript Object**: Contains the transcript text and finality status (both used in the sketch below)
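+
+A sketch of how these components fit together. The handler body is reconstructed from the event and log-format descriptions in this README, so field names like `is_final` and `transcript` should be checked against `transcriber.py`:
+
+```python
+from datetime import datetime
+
+from livekit.agents import Agent, AgentSession, JobContext
+from livekit.plugins import deepgram
+
+async def entrypoint(ctx: JobContext):
+    session = AgentSession()
+
+    @session.on("user_input_transcribed")
+    def on_transcript(event):
+        # Append only final transcripts, each with a timestamp
+        if event.is_final:
+            timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
+            with open("user_speech_log.txt", "a") as f:
+                f.write(f"[{timestamp}] {event.transcript}\n")
+
+    await session.start(
+        agent=Agent(
+            instructions="You are a helpful assistant that transcribes user speech to text.",
+            stt=deepgram.STT()
+        ),
+        room=ctx.room
+    )
+```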
+
+### Log File Format
+
+Transcriptions are saved in the following format:
+```
+[2024-01-15 14:30:45] Hello, this is my first transcription
+[2024-01-15 14:30:52] Testing the speech to text functionality
+```
+
+### Minimal Agent Configuration
+
+This agent uses a minimal configuration without LLM or TTS:
+```python
+Agent(
+    instructions="You are a helpful assistant that transcribes user speech to text.",
+    stt=deepgram.STT()
+)
+```
\ No newline at end of file
diff --git a/pipeline-stt/transcriber.py b/pipeline-stt/transcriber/transcriber.py
similarity index 100%
rename from pipeline-stt/transcriber.py
rename to pipeline-stt/transcriber/transcriber.py
diff --git a/pipeline-tts/changing_language/README.md b/pipeline-tts/changing_language/README.md
new file mode 100644
index 0000000..cf5bc02
--- /dev/null
+++ b/pipeline-tts/changing_language/README.md
@@ -0,0 +1,135 @@
+# ElevenLabs Language Switcher Agent
+
+A multilingual voice assistant that dynamically switches between languages using ElevenLabs TTS and LiveKit's voice agents.
+
+## Overview
+
+**Language Switcher Agent** - A voice-enabled assistant that can seamlessly switch between multiple languages during a conversation, demonstrating dynamic TTS and STT configuration.
+
+## Features
+
+- **Dynamic Language Switching**: Change languages mid-conversation without restarting
+- **Synchronized STT/TTS**: Both speech recognition and synthesis switch together
+- **Multiple Language Support**: English, Spanish, French, German, and Italian
+- **Native Pronunciation**: Each language uses ElevenLabs' native language models
+- **Contextual Greetings**: Language-specific welcome messages after switching
+- **Voice-Enabled**: Built using LiveKit's voice capabilities with support for:
+  - Speech-to-Text (STT) using Deepgram (multilingual)
+  - Large Language Model (LLM) using OpenAI GPT-4o
+  - Text-to-Speech (TTS) using ElevenLabs Turbo v2.5
+  - Voice Activity Detection (VAD) using Silero
+
+## How It Works
+
+1. User connects and hears a greeting in English
+2. User can ask the agent to switch to any supported language
+3. The agent updates both TTS and STT language settings dynamically
+4. A confirmation message is spoken in the new language
+5. All subsequent conversation happens in the selected language
+6. User can switch languages again at any time during the conversation
+
+## Prerequisites
+
+- Python 3.10+
+- `livekit-agents`>=1.0
+- LiveKit account and credentials
+- API keys for:
+  - OpenAI (for LLM capabilities)
+  - Deepgram (for multilingual speech-to-text)
+  - ElevenLabs (for multilingual text-to-speech)
+
+## Installation
+
+1. Clone the repository
+
+2. Install dependencies:
+   ```bash
+   pip install -r requirements.txt
+   ```
+
+3. Create a `.env` file in the parent directory with your API credentials:
+   ```
+   LIVEKIT_URL=your_livekit_url
+   LIVEKIT_API_KEY=your_api_key
+   LIVEKIT_API_SECRET=your_api_secret
+   OPENAI_API_KEY=your_openai_key
+   DEEPGRAM_API_KEY=your_deepgram_key
+   ELEVENLABS_API_KEY=your_elevenlabs_key
+   ```
+
+## Running the Agent
+
+```bash
+python elevenlabs_change_language.py dev
+```
+
+The agent will start in English. Try saying:
+- "Switch to Spanish"
+- "Can you speak French?"
+- "Let's talk in German"
+- "Change to Italian"
+
+## Architecture Details
+
+### Language Configuration
+
+The agent maintains mappings for:
+- **Language codes**: Standard two-letter codes (en, es, fr, de, it)
+- **Language names**: Human-readable names for user feedback
+- **Deepgram codes**: Some languages use region-specific codes (e.g., fr-CA for French)
+- **Greetings**: Native-language welcome messages
+
+### Dynamic Updates
+
+Language switching involves:
+1. **TTS Update**: `self.tts.update_options(language=language_code)`
+2. **STT Update**: `self.stt.update_options(language=deepgram_language)`
+3. **State tracking**: The current language is stored to prevent duplicate switches
+4. **Confirmation**: A native-language greeting confirms the switch
+
+### Function Tools
+
+Each language has a dedicated function tool:
+- `switch_to_english()`
+- `switch_to_spanish()`
+- `switch_to_french()`
+- `switch_to_german()`
+- `switch_to_italian()`
+
+This approach allows the LLM to understand natural-language requests like "habla español" or "parlez-vous français?" and call the matching tool. A sketch of one such tool follows.
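+
+The sketch below shows how one of these tools might combine the mappings and dynamic updates described above. The mapping values come from this README's tables; the class name, attribute names, and method body are illustrative, not copied from `elevenlabs_change_language.py`:
+
+```python
+from livekit.agents import Agent, function_tool
+
+# Illustrative mappings; the actual file may organize these differently
+DEEPGRAM_CODES = {"en": "en", "es": "es", "fr": "fr-CA", "de": "de", "it": "it"}
+GREETINGS = {"es": "¡Hola! Ahora estoy hablando en español."}
+
+class LanguageSwitcherAgent(Agent):  # hypothetical name
+    @function_tool
+    async def switch_to_spanish(self) -> str:
+        """Switch the conversation language to Spanish."""
+        if self.current_language == "es":
+            return "Already speaking Spanish."  # duplicate prevention
+        self.tts.update_options(language="es")                   # ElevenLabs TTS
+        self.stt.update_options(language=DEEPGRAM_CODES["es"])   # Deepgram STT
+        self.current_language = "es"
+        await self.session.say(GREETINGS["es"])                  # confirm in the new language
+        return "Switched to Spanish."
+```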
+
+## Supported Languages
+
+| Language | Code | Deepgram Code | Example Phrase |
+|----------|------|---------------|----------------|
+| English | en | en | "Hello! How can I help you?" |
+| Spanish | es | es | "¡Hola! ¿Cómo puedo ayudarte?" |
+| French | fr | fr-CA | "Bonjour! Comment puis-je vous aider?" |
+| German | de | de | "Hallo! Wie kann ich Ihnen helfen?" |
+| Italian | it | it | "Ciao! Come posso aiutarti?" |
+
+## Possible Customizations
+
+1. **Add More Languages**: Extend the language mappings and add corresponding function tools
+2. **Voice Selection**: Use different ElevenLabs voices for different languages
+3. **Regional Variants**: Add support for regional dialects (e.g., Mexican Spanish, British English)
+4. **Language Detection**: Implement automatic language detection from user speech
+5. **Model Selection**: Use different ElevenLabs models for specific language pairs
+
+## Extra Notes
+
+- **ElevenLabs Model**: Uses `eleven_turbo_v2_5`, which supports multiple languages
+- **Deepgram Model**: Uses `nova-2-general` with language-specific parameters
+- **Language Persistence**: The current language is maintained throughout the session
+
+## Example Conversation
+
+```
+Agent: "Hi there! I can speak in multiple languages..."
+User: "Can you speak Spanish?"
+Agent: "¡Hola! Ahora estoy hablando en español. ¿Cómo puedo ayudarte hoy?"
+User: "¿Cuál es el clima?"
+Agent: [Responds in Spanish about the weather]
+User: "Now switch to French"
+Agent: "Bonjour! Je parle maintenant en français. Comment puis-je vous aider aujourd'hui?"
+```
\ No newline at end of file
diff --git a/pipeline-tts/elevenlabs_change_language.py b/pipeline-tts/changing_language/elevenlabs_change_language.py
similarity index 100%
rename from pipeline-tts/elevenlabs_change_language.py
rename to pipeline-tts/changing_language/elevenlabs_change_language.py
diff --git a/pipeline-tts/tts_comparison/README.md b/pipeline-tts/tts_comparison/README.md
new file mode 100644
index 0000000..ee22ee0
--- /dev/null
+++ b/pipeline-tts/tts_comparison/README.md
@@ -0,0 +1,154 @@
+# TTS Provider Comparison Agent
+
+A voice assistant that allows real-time switching between different Text-to-Speech providers to compare voice quality, latency, and characteristics using LiveKit's voice agents.
+
+## Overview
+
+**TTS Comparison Agent** - A voice-enabled assistant that dynamically switches between multiple TTS providers (Rime, ElevenLabs, Cartesia, and PlayAI) during a conversation, allowing direct comparison of different voice-synthesis technologies.
+
+## Features
+
+- **Multiple TTS Providers**: Compare 4 different TTS services in one session
+- **Dynamic Provider Switching**: Change voices mid-conversation via agent transfer
+- **Consistent Sample Rate**: All providers use 44.1kHz for fair comparison
+- **Provider Awareness**: Agent knows which TTS it's using and can discuss differences
+- **Voice-Enabled**: Built using LiveKit's voice capabilities with support for:
+  - Speech-to-Text (STT) using Deepgram
+  - Large Language Model (LLM) using OpenAI GPT-4o
+  - Text-to-Speech (TTS) using multiple providers
+  - Voice Activity Detection (VAD) using Silero
+
+## TTS Providers Included in This Comparison
+
+### 1. Rime
+- **Model**: MistV2
+- **Voice**: Abbie
+- **Sample Rate**: 44.1kHz
+- **Characteristics**: Natural conversational voice
+
+### 2. ElevenLabs
+- **Model**: Eleven Multilingual V2
+- **Sample Rate**: Default (provider-managed)
+- **Characteristics**: High-quality multilingual support
+
+### 3. Cartesia
+- **Model**: Sonic Preview
+- **Voice**: Custom voice ID
+- **Sample Rate**: 44.1kHz
+- **Characteristics**: Fast, low-latency synthesis
+
+### 4. PlayAI
+- **Model**: PlayDialog
+- **Voice**: Custom cloned voice
+- **Sample Rate**: 44.1kHz
+- **Characteristics**: Voice cloning capabilities
+
+## How It Works
+
+1. Session starts with the Rime TTS provider
+2. Agent introduces itself using the current voice
+3. User can request to switch providers (e.g., "Switch to ElevenLabs")
+4. Agent transfers to a new agent instance with the requested TTS
+5. New agent greets the user with the new voice
+6. Process repeats for any provider comparison
+
+## Prerequisites
+
+- Python 3.10+
+- `livekit-agents`>=1.0
+- LiveKit account and credentials
+- API keys for:
+  - OpenAI (for LLM capabilities)
+  - Deepgram (for speech-to-text)
+  - Rime (for Rime TTS)
+  - ElevenLabs (for ElevenLabs TTS)
+  - Cartesia (for Cartesia TTS)
+  - PlayAI (for PlayAI TTS)
+
+## Installation
+
+1. Clone the repository
+
+2. Install dependencies:
+   ```bash
+   pip install -r requirements.txt
+   ```
+
+3. Create a `.env` file in the parent directory with your API credentials:
+   ```
+   LIVEKIT_URL=your_livekit_url
+   LIVEKIT_API_KEY=your_api_key
+   LIVEKIT_API_SECRET=your_api_secret
+   OPENAI_API_KEY=your_openai_key
+   DEEPGRAM_API_KEY=your_deepgram_key
+   RIME_API_KEY=your_rime_key
+   ELEVENLABS_API_KEY=your_elevenlabs_key
+   CARTESIA_API_KEY=your_cartesia_key
+   PLAYAI_API_KEY=your_playai_key
+   ```
+
+## Running the Agent
+
+```bash
+python tts_comparison.py dev
+```
+
+Try these commands to switch between providers:
+- "Switch to ElevenLabs"
+- "Use the Cartesia voice"
+- "Let me hear PlayAI"
+- "Go back to Rime"
+
+## Architecture Details
+
+### Agent Transfer Pattern
+
+Each TTS provider has its own agent class:
+- `RimeAgent`
+- `ElevenLabsAgent`
+- `CartesiaAgent`
+- `PlayAIAgent`
+
+Switching providers involves four steps (sketched below):
+1. A function tool detects the switch request
+2. The tool returns a new agent instance
+3. The session transfers to the new agent
+4. The new agent's `on_enter()` method provides audio confirmation
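+
+A minimal sketch of this hand-off. The class shapes follow the list above, and the Rime configuration matches the snippet in the next section; the tool name, instructions, and the ElevenLabs model ID are assumptions rather than verbatim code from `tts_comparison.py`:
+
+```python
+from livekit.agents import Agent, function_tool
+from livekit.plugins import deepgram, elevenlabs, openai, rime
+
+class RimeAgent(Agent):
+    def __init__(self) -> None:
+        super().__init__(
+            instructions="You are a TTS comparison assistant currently speaking with the Rime voice.",
+            stt=deepgram.STT(),
+            llm=openai.LLM(model="gpt-4o"),
+            tts=rime.TTS(sample_rate=44100, model="mistv2", speaker="abbie"),
+        )
+
+    async def on_enter(self):
+        # Audio confirmation in the newly active voice
+        await self.session.say("Hello! I'm now using the Rime TTS voice. How does it sound?")
+
+    @function_tool
+    async def switch_to_elevenlabs(self) -> Agent:
+        """Switch the conversation to the ElevenLabs voice."""
+        # Returning a new agent instance transfers the session to it
+        return ElevenLabsAgent()
+
+class ElevenLabsAgent(Agent):
+    def __init__(self) -> None:
+        super().__init__(
+            instructions="You are a TTS comparison assistant currently speaking with the ElevenLabs voice.",
+            stt=deepgram.STT(),
+            llm=openai.LLM(model="gpt-4o"),
+            tts=elevenlabs.TTS(model="eleven_multilingual_v2"),  # model ID is an assumption
+        )
+    # on_enter and switch_to_* tools mirror RimeAgent
+```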
+
+### Sample Rate Consistency
+
+All providers are configured to use a 44.1kHz sample rate (where configurable) to ensure a fair comparison. This prevents audio-quality differences caused by sample-rate mismatches.
+
+### Provider Configuration
+
+Each agent maintains its own TTS configuration:
+```python
+tts=rime.TTS(
+    sample_rate=44100,
+    model="mistv2",
+    speaker="abbie"
+)
+```
+
+## Comparison Criteria
+
+When testing different providers, consider:
+
+1. **Voice Quality**: Naturalness, clarity, pronunciation
+2. **Latency**: Time from request to first audio
+3. **Expressiveness**: Emotion and intonation range
+4. **Language Support**: Accent and multilingual capabilities
+5. **Consistency**: Voice stability across utterances
+6. **Cost**: Per-character or per-second pricing
+
+## Example Conversation
+
+```
+Agent (Rime): "Hello! I'm now using the Rime TTS voice. How does it sound?"
+User: "It sounds good. Can I hear ElevenLabs?"
+Agent (ElevenLabs): "Hello! I'm now using the ElevenLabs TTS voice. What do you think of how I sound?"
+User: "Very natural! Now try Cartesia"
+Agent (Cartesia): "Hello! I'm now using the Cartesia TTS voice. How do I sound to you?"
+User: "Fast response! What about PlayAI?"
+Agent (PlayAI): "Hello! I'm now using the PlayAI TTS voice. What are your thoughts on how I sound?"
+```
\ No newline at end of file
diff --git a/pipeline-tts/tts_comparison.py b/pipeline-tts/tts_comparison/tts_comparison.py
similarity index 97%
rename from pipeline-tts/tts_comparison.py
rename to pipeline-tts/tts_comparison/tts_comparison.py
index c220427..93d7bc7 100644
--- a/pipeline-tts/tts_comparison.py
+++ b/pipeline-tts/tts_comparison/tts_comparison.py
@@ -11,9 +11,6 @@
 load_dotenv(dotenv_path=Path(__file__).parent.parent / '.env')
 
-### We're not including the OpenAI TTS provider here, since it uses a different sample rate.
-### See openai_tts.py for an example of how to use it.
-
 class RimeAgent(Agent):
     def __init__(self) -> None:
         super().__init__(