Remove created LLMFullResponseEndFrames from async TTS providers #1127
Storybot uses ElevenLabs' websocket TTS generation. It generates audio one "story page" at a time. There are three story pages and one "story prompt" in each LLM response.
If the delay between story pages is longer than `stop_frame_timeout_s` (because of image generation, for example), the `ElevenLabsWordTTSService` will emit a `TTSStoppedFrame`, and then when TTS starts again, it emits a `TTSStartedFrame`. But that `TTSStoppedFrame` also causes the creation of an `LLMFullResponseEndFrame`. This breaks the assistant context aggregator running downstream.

I suppose I could increase my `stop_frame_timeout_s` value to accommodate the image-generation pauses, but that `WordTTSService` property isn't exposed in the subclasses. Regardless, I think it's incorrect to generate an `LLMFullResponseEndFrame` here, though I admittedly don't know enough about the surrounding context.

EDIT:
My LLM's full output looks like this literal string:
"This is sentence one. [break] This is sentence two. [break] This is sentence three. [break] This is the last sentence."
A processor right after the LLM processor breaks it into multiple text frames, resulting in the behavior described above. But those extra `LLMFullResponseEndFrame`s leave the stream unbalanced: we end up with a mismatched number of `LLMFullResponseStartFrame`s and `LLMFullResponseEndFrame`s, which is what makes me call it a bug.
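To make the mismatch concrete, here is a minimal, self-contained sketch (not the real Pipecat classes — the frame types, splitter, and aggregator below are toy stand-ins) showing why an aggregator that pairs start/end frames breaks when a spurious `LLMFullResponseEndFrame` is injected mid-response:

```python
from dataclasses import dataclass


# Toy stand-ins for the real frame types.
class LLMFullResponseStartFrame: pass
class LLMFullResponseEndFrame: pass

@dataclass
class TextFrame:
    text: str


def split_on_breaks(llm_output: str):
    """Stand-in for the processor after the LLM: one TextFrame per story page."""
    return [TextFrame(part.strip()) for part in llm_output.split("[break]")]


def aggregate(frames):
    """Toy assistant aggregator: collects text between start/end pairs.

    Raises on an end frame with no matching start, which is the failure mode
    the extra LLMFullResponseEndFrames trigger downstream.
    """
    depth = 0
    current, responses = [], []
    for frame in frames:
        if isinstance(frame, LLMFullResponseStartFrame):
            depth += 1
        elif isinstance(frame, LLMFullResponseEndFrame):
            if depth == 0:
                raise RuntimeError("unmatched LLMFullResponseEndFrame")
            depth -= 1
            responses.append(" ".join(current))
            current = []
        elif isinstance(frame, TextFrame):
            current.append(frame.text)
    return responses


story = ("This is sentence one. [break] This is sentence two. [break] "
         "This is sentence three. [break] This is the last sentence.")
pages = split_on_breaks(story)

# Balanced stream: one start, the page frames, one end -> one aggregated response.
ok = [LLMFullResponseStartFrame(), *pages, LLMFullResponseEndFrame()]
print(aggregate(ok))

# Stream with an extra end frame injected mid-response (as the TTS timeout
# does): the first end frame closes the response early, and the final,
# legitimate end frame is now unmatched.
bad = [LLMFullResponseStartFrame(), *pages[:2],
       LLMFullResponseEndFrame(),  # spurious, created from the TTSStoppedFrame
       *pages[2:], LLMFullResponseEndFrame()]
try:
    aggregate(bad)
except RuntimeError as e:
    print("aggregator broke:", e)
```

The point of the sketch is only the bookkeeping: any downstream consumer that treats start/end frames as paired delimiters will either truncate the response or fail once the counts stop matching.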