Remove created LLMFullResponseEndFrames from async TTS providers #1127

Open

chadbailey59 wants to merge 1 commit into main
Conversation

@chadbailey59 (Contributor) commented Feb 3, 2025

Storybot uses ElevenLabs' websocket TTS generation. It generates audio one "story page" at a time. There are three story pages and one 'story prompt' in each LLM response.

If the delay between story pages is longer than stop_frame_timeout_s (because of image generation, for example), the ElevenLabs WordTTSService will emit a TTSStoppedFrame, and then when TTS starts again, it emits a TTSStartedFrame.

But that TTSStoppedFrame also causes the creation of an LLMFullResponseEndFrame. This breaks the assistant context aggregator running downstream.

I suppose I could increase my stop_frame_timeout_s value to accommodate the image-generation pauses, but that WordTTSService property isn't exposed in the subclasses. Regardless, I think it's incorrect to generate an LLMFullResponseEndFrame here, though I admittedly don't know enough about the surrounding context.
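For illustration, the only way I can see to bump that timeout today is to reach into the service's internals, something like the sketch below. The subclass name, constructor parameter, and private attribute name are my guesses, which is sort of the point: there's no supported knob for this.

```python
from pipecat.services.elevenlabs import ElevenLabsTTSService


class PatientElevenLabsTTSService(ElevenLabsTTSService):
    """Hypothetical workaround: stretch the word-TTS stop-frame timeout so that
    image-generation pauses don't trigger a TTSStoppedFrame (and the extra
    LLMFullResponseEndFrame that comes with it)."""

    def __init__(self, *, stop_frame_timeout_s: float = 10.0, **kwargs):
        super().__init__(**kwargs)
        # Assumed private attribute on WordTTSService; fragile by design.
        self._stop_frame_timeout_s = stop_frame_timeout_s
```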

EDIT:

My LLM's full output looks like this literal string: "This is sentence one. [break] This is sentence two. [break] This is sentence three. [break] This is the last sentence." A processor right after the LLM processor breaks it into multiple text frames, which should result in this frame sequence:

```
LLMFullResponseStartFrame()
TextFrame("This is sentence one.")
  <4-8 second delay to generate an image based on that sentence>
TextFrame("This is sentence two.")
  <4-8 second delay to generate an image based on that sentence>
TextFrame("This is sentence three.")
  <4-8 second delay to generate an image based on that sentence>
TextFrame("This is the last sentence.")
LLMFullResponseEndFrame()
```

But the extra LLMFullResponseEndFrames from the TTS service make the actual sequence look like this:

```
LLMFullResponseStartFrame()
TextFrame("This is sentence one.")
  <4-8 second delay to generate an image based on that sentence>
LLMFullResponseEndFrame()   <inserted by TTS 2 seconds into image gen>
TextFrame("This is sentence two.")
  <4-8 second delay to generate an image based on that sentence>
LLMFullResponseEndFrame()   <inserted by TTS 2 seconds into image gen>
TextFrame("This is sentence three.")
  <4-8 second delay to generate an image based on that sentence>
LLMFullResponseEndFrame()   <inserted by TTS 2 seconds into image gen>
TextFrame("This is the last sentence.")
LLMFullResponseEndFrame()
```

The fact that we have a mismatched number of LLMFullResponseStartFrames and LLMFullResponseEndFrames is what makes me call it a bug.
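For completeness, the splitting processor I mentioned is roughly this shape. This is a simplified sketch against pipecat's FrameProcessor and TextFrame APIs; the class name is made up, it assumes the whole response arrives as a single TextFrame, and the real version also has to deal with streamed tokens and the image-generation delays.

```python
from pipecat.frames.frames import Frame, TextFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class StoryPageSplitter(FrameProcessor):
    """Splits the LLM output on "[break]" so each story page becomes its own TextFrame."""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        if isinstance(frame, TextFrame):
            # Emit one TextFrame per "[break]"-delimited story page.
            for page in frame.text.split("[break]"):
                page = page.strip()
                if page:
                    await self.push_frame(TextFrame(text=page), direction)
        else:
            # Pass everything else (including LLMFullResponseStart/EndFrame) through untouched.
            await self.push_frame(frame, direction)
```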

@aconchillo (Contributor) commented
I believe the problem with this approach is that the assistant context aggregator is expecting LLMFullResponseEndFrame. I need to think about this.
