Remove created LLMFullResponseEndFrames from async TTS providers #1127

Open

chadbailey59 wants to merge 1 commit into main
Conversation

@chadbailey59 (Contributor) commented Feb 3, 2025

Storybot uses ElevenLabs' websocket TTS generation. It generates audio one "story page" at a time. There are three story pages and one 'story prompt' in each LLM response.

If the delay between story pages is longer than stop_frame_timeout_s (because of image generation, for example), the ElevenLabs WordTTSService will emit a TTSStoppedFrame, and then when TTS starts again, it emits a TTSStartedFrame.

But that TTSStoppedFrame also causes the creation of an LLMFullResponseEndFrame. This breaks the assistant context aggregator running downstream.

I suppose I could increase my stop_frame_timeout_s value to accommodate the image-generation pauses, but that WordTTSService property isn't exposed in the subclasses. Regardless, I think it's incorrect to generate an LLMFullResponseEndFrame here, though I admittedly don't know enough about the surrounding context.
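For illustration, the only way I can see to bump that timeout today is to reach into the service's internals, something like the sketch below. The subclass name, constructor parameter, and private attribute name are my guesses, which is sort of the point: there's no supported knob for this.

```python
from pipecat.services.elevenlabs import ElevenLabsTTSService


class PatientElevenLabsTTSService(ElevenLabsTTSService):
    """Hypothetical workaround: stretch the word-TTS stop-frame timeout so that
    image-generation pauses don't trigger a TTSStoppedFrame (and the extra
    LLMFullResponseEndFrame that comes with it)."""

    def __init__(self, *, stop_frame_timeout_s: float = 10.0, **kwargs):
        super().__init__(**kwargs)
        # Assumed private attribute on WordTTSService; fragile by design.
        self._stop_frame_timeout_s = stop_frame_timeout_s
```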

EDIT:

My LLM's full output looks like this literal string: "This is sentence one. [break] This is sentence two. [break] This is sentence three. [break] This is the last sentence." A processor right after the LLM processor breaks it into multiple text frames, which should result in this frame sequence:

```
LLMFullResponseStartFrame()
TextFrame("This is sentence one.")
  <4-8 second delay to generate an image based on that sentence>
TextFrame("This is sentence two.")
  <4-8 second delay to generate an image based on that sentence>
TextFrame("This is sentence three.")
  <4-8 second delay to generate an image based on that sentence>
TextFrame("This is the last sentence.")
LLMFullResponseEndFrame()
```

But the extra LLMFullResponseEndFrames from the TTS service make the actual sequence look like this:

```
LLMFullResponseStartFrame()
TextFrame("This is sentence one.")
  <4-8 second delay to generate an image based on that sentence>
LLMFullResponseEndFrame()   <inserted by TTS 2 seconds into image gen>
TextFrame("This is sentence two.")
  <4-8 second delay to generate an image based on that sentence>
LLMFullResponseEndFrame()   <inserted by TTS 2 seconds into image gen>
TextFrame("This is sentence three.")
  <4-8 second delay to generate an image based on that sentence>
LLMFullResponseEndFrame()   <inserted by TTS 2 seconds into image gen>
TextFrame("This is the last sentence.")
LLMFullResponseEndFrame()
```

The fact that we have a mismatched number of LLMFullResponseStartFrames and LLMFullResponseEndFrames is what makes me call it a bug.
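For completeness, the splitting processor I mentioned is roughly this shape. This is a simplified sketch against pipecat's FrameProcessor and TextFrame APIs; the class name is made up, it assumes the whole response arrives as a single TextFrame, and the real version also has to deal with streamed tokens and the image-generation delays.

```python
from pipecat.frames.frames import Frame, TextFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class StoryPageSplitter(FrameProcessor):
    """Splits the LLM output on "[break]" so each story page becomes its own TextFrame."""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        if isinstance(frame, TextFrame):
            # Emit one TextFrame per "[break]"-delimited story page.
            for page in frame.text.split("[break]"):
                page = page.strip()
                if page:
                    await self.push_frame(TextFrame(text=page), direction)
        else:
            # Pass everything else (including LLMFullResponseStart/EndFrame) through untouched.
            await self.push_frame(frame, direction)
```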

@aconchillo (Contributor) commented
I believe the problem with this approach is that the assistant context aggregator is expecting LLMFullResponseEndFrame. I need to think about this.
