Adding support for new bot-output RTVI Message: #2899

mattieruth · 2025-10-21T18:52:48Z

TTSTextFrames now include metadata about whether the text was spoken or not along with a type string to describe what the text represents: ex. "sentence", "word", "custom aggregation"
Expanded how aggregators work so that the aggregate method returns aggregated text along with the type of aggregation used to create it
Deprecated the RTVI bot-transcription event in lieu of...
Introduced support for a new bot-output event. This event is meant to be the one stop shop for communicating what the bot actually "says". It is based off TTSTextFrames to communicate both sentence by sentence (or whatever aggregation is used) as well as word by word. In addition, it will include LLMTextFrames, aggregated by sentence when tts is turned off (i.e. skip_tts is true).

Resolves pipecat-ai/pipecat-client-web#158

To help with reviewing this PR, I've written up some of the use cases we were trying to solve with these changes and how you would do so after these changes go in:

Use-Cases:

Identify code blocks, keeping them in the context, but not speaking them and treating them as a unique type so that they can be formatted or treated accordingly.

This can be accomplished by providing a custom aggregator to the TTS service that looks for text inside a set of delimeters, along with updating the LLM's system prompt to specify that code should be demarcated with these same delimeters. The TTS service should be configured to
skip any aggregate found with the given code type.

tts_aggregator = PatternPairAggregator()
tts_aggregator.add_pattern_pair(
    pattern_id="code",
    start_pattern="<code>",
    end_pattern="</code>",
    type="code",
    action=MatchAction.AGGREGATE,
)
tts = MyTTSService(
    ...
    text_aggregator=tts_aggregator,
    skip_aggregator_types=["code"],
)
...
context = LLMContext([
    {
        "role": "system",
        "content": "You are a friendly AI assistant.. yada yada.. If the user makes a request that results in a code snippet, then wrap the entire snippet in <code></code> blocks.",
    },
])

Identify speaker changes to change the tts voice. Speaker tags should be removed, but the speaker should be left in the context and not spoken.

This is essentially the same as above, but you would add a callback for when the speaker aggregate is found:
```
tts_aggregator.on_pattern_match("speaker", switchSpeaker)
```
Pipeline can switch between voice and text modes and no matter the mode, the context and the resulting bot-output is updated accordingly, flowing seamlessly and represents the full picture. The aggregation of the output should also be customizable in both modes, so that, for example, if you have customized the tts aggregation to treat a code block separately, code blocks can also be treated separately when the tts is skipped.

This is accomplished the same as above but with the addition of the new ability to set a custom aggregator on the assistant context. The assistant context now looks for LLMTextFrames with skip_tts set to True and aggregates those separately. By default it uses a SimpleTextAggregator to segregate sentence-by-sentence. Bot code should look the same as above but providing a copy of the same aggregator to the assistant context:
```
llm_aggregator = PatternPairAggregator()
llm_aggregator.add_pattern_pair(
    pattern_id="code",
    start_pattern="<code>",
    end_pattern="</code>",
    type="code",
    action=MatchAction.AGGREGATE,
)
context = LLMContext(...)
context_aggregator = LLMContextAggregatorPair(
    context=context,
    assistant_params=LLMAssistantAggregatorParams(llm_text_aggregator=llm_aggregator),
)
```
The pipeline does not have a TTS. The context and resulting bot-output should be generated from the raw llm output and the aggregation of that output should be customizable.

Maybe hacky, but you can solve this by simply setting the tts_skip flag on the llm service to true. This will cause all LLMTextFrames to be marked as tts_skip=True and thus be aggregated by the assistant context's llm_text_aggregator. See above for how to customize that aggregator.

Copilot

Pull Request Overview

This PR adds support for the new bot-output RTVI message, which consolidates bot output communication by including metadata about whether text was spoken and how it was aggregated. The changes expand the text aggregation framework to return both aggregated text and its type, deprecate the bot-transcription event in favor of bot-output, and enhance TTSTextFrame to carry spoken status and aggregation type metadata.

Key changes:

Enhanced TTSTextFrame with spoken and aggregated_by metadata fields
Modified text aggregators to return Aggregation objects containing text and type information
Introduced new RTVIBotOutputMessage to represent bot output with metadata about speech and aggregation

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
src/pipecat/frames/frames.py	Added `aggregated_by` and `spoken` fields to `TTSTextFrame`
src/pipecat/utils/text/base_text_aggregator.py	Introduced `Aggregation` dataclass and updated aggregator interface to return `Aggregation` objects
src/pipecat/utils/text/simple_text_aggregator.py	Updated to return `Aggregation` objects instead of strings
src/pipecat/utils/text/skip_tags_aggregator.py	Updated to return `Aggregation` objects instead of strings
src/pipecat/utils/text/pattern_pair_aggregator.py	Extended `PatternMatch` from `Aggregation` and updated to handle aggregation types
src/pipecat/services/tts_service.py	Modified to handle `Aggregation` objects and pass metadata to `TTSTextFrame`
src/pipecat/services/openai_realtime_beta/openai.py	Updated `TTSTextFrame` creation with new parameters
src/pipecat/services/openai/realtime/llm.py	Updated `TTSTextFrame` creation with new parameters
src/pipecat/services/google/gemini_live/llm.py	Updated `TTSTextFrame` creation with new parameters
src/pipecat/services/aws/nova_sonic/llm.py	Updated `TTSTextFrame` creation with new parameters
src/pipecat/processors/frameworks/rtvi.py	Introduced `RTVIBotOutputMessage` and updated LLM text handling logic
tests/test_transcript_processor.py	Updated test cases to include `aggregated_by` parameter in `TTSTextFrame`

Comments suppressed due to low confidence (1)

src/pipecat/utils/text/pattern_pair_aggregator.py:1

Corrected spelling of 'message' to 'messages'.

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

Copilot · 2025-10-21T18:53:43Z

src/pipecat/frames/frames.py

 class TTSTextFrame(TextFrame):
    """Text frame generated by Text-to-Speech services."""

+    aggregated_by: Literal["sentence", "word"] | str


The aggregated_by field should have a default value to maintain backward compatibility with existing code that creates TTSTextFrame without this parameter. Consider adding a default like aggregated_by: Literal[\"sentence\", \"word\"] | str = \"word\".

Suggested change

aggregated_by: Literal["sentence", "word"] | str

aggregated_by: Literal["sentence", "word"] | str = "word"

good point. not sure what the default should be, though. would love any of the reviewers to weigh in here :)

I vote for "word"
TTSTextFrames of the most common TTS services tend to have a word in .text value.

per discussion in maintainers meeting, i think we decided it was ok NOT to have a default.

Yes, I think that's fine. We want to force the developer specify what type of aggregation is this frame.

Instead of Literal["sentence", "word"] should we have an enum?

class AggregationType(Enum): SENTENCE = "sentence" WORD = "word"

Copilot · 2025-10-21T18:53:44Z

src/pipecat/utils/text/pattern_pair_aggregator.py

+                return [start_index, pattern_info["type"]]

-        return False
+        return None, None


The return type annotation indicates Optional[Tuple[int, str]], but this returns a tuple of two None values which doesn't match the expected single None for the Optional type. Change to return (None, None) or update the return type to Optional[Tuple[Optional[int], Optional[str]]].

Suggested change

return None, None

return None

uh. when I returned just None, my handling of the return value struggled and changing to None,None fixed it. i'll have to up my python game i guess 🤔

Copilot · 2025-10-21T18:53:44Z

src/pipecat/services/tts_service.py

+                await self.push_frame(TTSTextFrame(text, spoken=True, aggregated_by=aggregated_by))
+                await self.process_generator(self.run_tts(text))


The frame is pushed before TTS processing begins, but if TTS processing fails, the frame will have already been marked as spoken=True. Consider pushing this frame after successful TTS processing or handling potential failures appropriately.

Suggested change

await self.push_frame(TTSTextFrame(text, spoken=True, aggregated_by=aggregated_by))

await self.process_generator(self.run_tts(text))

try:

await self.process_generator(self.run_tts(text))

except Exception as e:

logger.error(f"TTS processing failed: {e}")

# Optionally, push an error frame here if desired:

# await self.push_frame(ErrorFrame(str(e)))

else:

await self.push_frame(TTSTextFrame(text, spoken=True, aggregated_by=aggregated_by))

good point. HOWEVER. timing-wise, the goal is to get this out before the tts starts spewing the words as they are spoken. For this reason, I believe we want to go ahead and push the frame first. If the run_tts fails, we have bigger problems than our "spoken=True" flag not being right.

codecov · 2025-10-23T16:27:46Z

Codecov Report

❌ Patch coverage is 69.71429% with 53 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/pipecat/processors/frameworks/rtvi.py	33.33%	18 Missing ⚠️
...t/processors/aggregators/llm_response_universal.py	60.60%	13 Missing ⚠️
src/pipecat/services/tts_service.py	38.09%	13 Missing ⚠️
src/pipecat/utils/text/pattern_pair_aggregator.py	94.54%	3 Missing ⚠️
src/pipecat/services/aws/nova_sonic/llm.py	0.00%	2 Missing ⚠️
src/pipecat/services/google/gemini_live/llm.py	0.00%	1 Missing ⚠️
src/pipecat/services/openai/realtime/llm.py	0.00%	1 Missing ⚠️
...rc/pipecat/services/openai_realtime_beta/openai.py	0.00%	1 Missing ⚠️
src/pipecat/utils/text/base_text_aggregator.py	88.88%	1 Missing ⚠️

Files with missing lines	Coverage Δ
src/pipecat/extensions/ivr/ivr_navigator.py	`82.70% <100.00%> (ø)`
src/pipecat/frames/frames.py	`88.08% <100.00%> (+0.11%)`	⬆️
src/pipecat/processors/aggregators/llm_response.py	`73.06% <100.00%> (+0.11%)`	⬆️
src/pipecat/utils/text/simple_text_aggregator.py	`95.23% <100.00%> (ø)`
src/pipecat/utils/text/skip_tags_aggregator.py	`96.15% <100.00%> (ø)`
src/pipecat/services/google/gemini_live/llm.py	`20.71% <0.00%> (ø)`
src/pipecat/services/openai/realtime/llm.py	`0.00% <0.00%> (ø)`
...rc/pipecat/services/openai_realtime_beta/openai.py	`0.00% <0.00%> (ø)`
src/pipecat/utils/text/base_text_aggregator.py	`78.26% <88.88%> (+3.26%)`	⬆️
src/pipecat/services/aws/nova_sonic/llm.py	`0.00% <0.00%> (ø)`
... and 4 more

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot

Pull Request Overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 9 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src/pipecat/utils/text/base_text_aggregator.py

src/pipecat/utils/text/pattern_pair_aggregator.py

src/pipecat/processors/frameworks/rtvi.py

src/pipecat/frames/frames.py

tests/test_pattern_pair_aggregator.py

markbackman · 2025-10-29T18:33:31Z

src/pipecat/utils/text/pattern_pair_aggregator.py

+        start_pattern: str,
+        end_pattern: str,
+        type: str,
+        action: MatchAction = MatchAction.REMOVE,


Removing remove_match is a breaking change. Maybe you can keep it, mark it as deprecated, warn when using it, and map the value to MatchAction.REMOVE?

Is there a reason we can't go with this approach?

Ohh is it because it's not strictly a keyword argument, it's positional?

Copilot

Pull Request Overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 5 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-10-31T19:02:50Z

src/pipecat/processors/frameworks/rtvi.py

+    """
+
+    spoken: bool = True  # Indicates if the text has been spoken by TTS
+    aggregated_by: Optional[Literal["word", "sentence"] | str] = None


The union syntax Literal['word', 'sentence'] | str is redundant since str already includes all possible literal values. Consider simplifying to just Optional[str] = None.

Suggested change

aggregated_by: Optional[Literal["word", "sentence"] | str] = None

aggregated_by: Optional[str] = None

while it may be redundant, it makes it clear that there's a set of options that can definitely be expected.

src/pipecat/processors/aggregators/llm_response_universal.py

src/pipecat/frames/frames.py

src/pipecat/processors/aggregators/llm_response_universal.py

src/pipecat/utils/text/pattern_pair_aggregator.py

markbackman · 2025-11-03T15:22:32Z

src/pipecat/processors/aggregators/llm_response_universal.py

+        self._llm_text_aggregator: BaseTextAggregator = (
+            self._params.llm_text_aggregator or SimpleTextAggregator()
+        )
+        self._skip_tts: Optional[bool] = None


I think you can drop the Optional here. Optional is more commonly used for __init__ or function args.

Suggested change

self._skip_tts: Optional[bool] = None

self._skip_tts: bool = None

markbackman · 2025-11-03T15:25:44Z

src/pipecat/processors/aggregators/llm_response_universal.py

        self._started += 1
+        if self._skip_tts is None:
+            self._skip_tts = frame.skip_tts
+        await self._maybe_push_llm_aggregation(frame)


Is this needed? The LLMFullResponseStartFrame doesn't contain text to aggregate. Perhaps the LLMFullResponseStartFrame is for initializing state for aggregation only?

You know what, you're right. I thought I needed it for the scenario where there is dangling text that might need to be sent from a previous llm output. But sense we always send whatever is left on LLMFullResponseEndFrame, there shouldn't ever be dangling text. I'll leave in the initialization of self._skip_tts though (and add a comment).

actually, to confirm what i said above. If you get an interrupt, do you still get an LLMFullResponseEndFrame?

i guess even then, the next text frame will have the same affect of flipping and sending any dangling text... all that to say: you're right. we don't have to maybe push here :)

markbackman · 2025-11-03T15:30:03Z

src/pipecat/processors/aggregators/llm_response_universal.py

+    ):
+        aggregate = None
+        should_reset_aggregator = False
+        if self._skip_tts and not frame.skip_tts:


Of these if/else conditions, you only want one to run at any given time, right? And the ordering would be to execute with the first one encountered? If so, this might require a slightly different logic to ensure that's the case. If not, then, it's fine as is.

i'm not sure i follow, but there is definitely only one path for any given frame here 🤔

markbackman · 2025-11-03T15:39:27Z

src/pipecat/processors/frameworks/rtvi.py

+        """Handle aggregated LLM text output frames."""
+        isTTS = isinstance(frame, TTSTextFrame)
+        message = RTVIBotOutputMessage(
+            data=RTVIBotOutputMessageData(


Do we need to add a bot_output_enabled flag to turn this RTVI event on and off?

ah, probably

actually... i don't know. this is more of a replacement for bot-transcription which didn't have a flag 🤔 -- what do you think?

oh actually, it looks like it was behind the bot_tts_enabled flag (weird). and everything else has a flag. i'll add it.

markbackman · 2025-11-03T15:41:50Z

src/pipecat/processors/frameworks/rtvi.py

+        # TODO: Remove all this logic when we fully deprecate bot-transcription messages.
        self._bot_transcription += frame.text
-        if match_endofsentence(self._bot_transcription):
-            await self._push_bot_transcription()


I think you can remove the corresponding _push_bot_transcription function now that it's part of _handle_llm_text_frame. Not sure if it's used elsewhere, but it doesn't seem to be.

markbackman · 2025-11-03T15:46:35Z

src/pipecat/services/tts_service.py

            # Store if we were processing text or not so we can set it back.
            processing_text = self._processing_text
-            await self._push_tts_frames(frame.text)
+            await self._push_tts_frames(frame.text, should_speak=True, aggregated_by="word")


Is word the correct aggregated_by field here? TTSSpeakFrame is usually provided as a complete phrase, e.g.:

TTSSpeakFrame("Hello there, how are you?")

ah! makes sense. not sure why i (or maybe copilot) put "word". Changing to "sentence"

markbackman · 2025-11-03T15:55:20Z

src/pipecat/services/tts_service.py

+                    # before the audio so downstream processors know what text
+                    # is being spoken. Here, we assume this flag is used when the TTS
+                    # provider supports word timestamps and the TTSTextFrames will be
+                    # generated in the word_task_handler.


Just to clarify:

Suggested change

# generated in the word_task_handler.

# generated in the words_task_handler in the WordTTSService subclass.

markbackman · 2025-11-03T16:00:40Z

src/pipecat/services/tts_service.py

+
+            if text:
+                if not self._push_text_frames:
+                    # If we are not pushing text frames, we send a TTSTextFrame


This is tricky. In this case, the TTSService is pushing AggregatedLLMTextFrame, which means that it's pushing both AggregatedLLMTextFrame and TTSTextFrame. But, the AggregatedLLMTextFrame is ignored in the LLMAssistantAggregator, meaning that two TextFrame subclasses are pushed, only one results in assistant messages being added to the context.

And, the AggregatedLLMTextFrame is pushed by the TTSService so that the RTVIObserver can see the frame in order to emit the messages to the client.

This is pretty complicated; do I have it right? We should capture this somewhere in a comment, so that it's documented.

Now I'm questioning my understanding of what this is doing. Can you explain? Why are we pushing the AggregatedLLMTextFrame in this case`?

ok. i've updated the comment here. let me know if that's good enough or still confusing.

markbackman · 2025-11-03T16:05:40Z

src/pipecat/processors/aggregators/llm_response.py

    OpenAILLMContextFrame,
 )
 from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
+from pipecat.utils.text.base_text_aggregator import BaseTextAggregator


Does llm_response.py need the same changes from llm_response_universal.py to be backwards compatible? Or do we have a way to ensure that only the newest aggregator can use these features?

not sure i follow...

i think this is the last comment I have not addressed. happy to chat about it so i can understand.

per an offline discussion... no. we are only going to support the feature of being able to generate bot-output when skip_tts is True in the new universal context.

markbackman · 2025-11-03T16:06:51Z

src/pipecat/services/tts_service.py


-        if self._push_text_frames:
-            # We send the original text after the audio. This way, if we are
+        if not should_speak:


I want to confirm my understanding. You can reach this case if you're using an aggregator that aggregates text that should not be spoken right? This is different than skip_tts being set in a TextFrame. A clarifying comment could be helpful for our future selves.

correct, this essentially maps to text that was aggregated into a a type listed in the new _skip_aggregator_types list. but i think i can do better than a comment...

markbackman · 2025-11-03T16:23:16Z

src/pipecat/utils/text/pattern_pair_aggregator.py

+        return Aggregation(self._text, "sentence")

    def add_pattern_pair(
-        self, pattern_id: str, start_pattern: str, end_pattern: str, remove_match: bool = True


This is a breaking change. Instead, you should:

Leave the remove_match arg in the add_pattern_pair function, marking it as deprecated

Add a warning about the deprecation

Use the remove_match to set the REMOVE MatchAction

Is that doable?

yes. and no. the issue is that type doesn't have a default, so it has to be listed in the args first. And that in and of itself is breaking.

That leads to a follow-up question of whether "type" is required or if we can just use the existing "pattern_id", but i sure like the name "type" better for how it's used...

Per an offline chat, there is not great way to retain backwards compatibility without introducing ugly API. So instead, I leaned into the breakage and not only kept remove_match as a removed field, but replaced pattern_id with the new type -- In practice, these essentially are the same thing; a way to identify the pattern type. The key difference of thinking of it less as a unique id vs identifying the conceptual way the pattern will aggregate text (which should still be unique)

TBH, type confuses me a bit, but pattern_id makes sense. In the docstring description of it, it's even described as "Identifier for this pattern pair".

Seems like if the rename doesn't make things obviously clearer, it'd be preferable to default to breaking fewer things?

markbackman

Looks really good! Added a few comments and asked questions. This is a pretty complex addition, so we'll want to make sure there are good docstrings and comments throughout explaining how it works.

This allows any given TextFrame to be marked in a way such that it does not get added to the context. Specifically, this fixes a problem with the new AggregatedTextFrames where we need to send LLM text both in an aggregated form as well as word-by-word but avoid duplicating the text in the context.

…s LLMTextFrames when tts_skip is on

1. Added support for turning off bot-output messages with the bot_output_enabled flag 2. Cleaned up logic and comments around TTSService:_push_tts_frames to hopefully make it easier to understand 3. Other minor cleanup

… to simplify the API

kompfner · 2025-11-05T21:21:34Z

src/pipecat/frames/frames.py

+
+
+@dataclass
+class TTSTextFrame(AggregatedLLMTextFrame):


Technically you could use a TTS service without an LLM service. In which case it might just be ingesting TextFrames and not LLMTextFrames. In which case...should this base class be called AggregatedTextFrame?

Yeah, I think AggregatedTextFrame makes sense as it could be used by something that is not an LLM.

kompfner · 2025-11-05T21:38:55Z

src/pipecat/utils/text/pattern_pair_aggregator.py

-            remove_match: Whether to remove the matched content from the text.
+            action: What to do when a complete pattern is matched:
+                    - MatchAction.REMOVE: Remove the matched pattern from the text.
+                    - MatchAction.KEEP: Keep the matched pattern in the text and treat it as


kompfner · 2025-11-05T21:51:37Z

src/pipecat/utils/text/pattern_pair_aggregator.py

+            # If the pattern found is set to be aggregated, return it
+            action = self._patterns[patterns[0].type].get("action", MatchAction.REMOVE)
+            if action == MatchAction.AGGREGATE:
+                self._text = ""


Wait so if we had so far buffered something like "Here's some code <code>foo = 3</code>" then we'd only return "<code>foo = 3</code>" and lose "Here's some code"?

Or does it treat the start of a match as the end of a regular sentence, so my example should be impossible?

Ah OK I think I see that logic below.

kompfner · 2025-11-05T21:56:17Z

src/pipecat/utils/text/pattern_pair_aggregator.py

+    Parameters:
+        REMOVE: Remove the matched pattern from the text.
+        KEEP: Keep the matched pattern in the text as normal text.
+        AGGREGATE: Return the matched pattern as a separate aggregation object.


This is a really tricky concept. Might make sense to elaborate what "separate aggregation" means?

kompfner · 2025-11-05T22:15:06Z

src/pipecat/services/tts_service.py


        if text:
+            if not self._push_text_frames:
+                # In a typical pipeline, there is an assistant context aggregator


Thanks for the thorough explanation here! 👏

Still hurts my brain, though.

kompfner · 2025-11-05T22:18:13Z

src/pipecat/services/openai/realtime/llm.py

        # the output modality is "audio" (the default)
        if evt.delta:
-            await self.push_frame(TTSTextFrame(evt.delta))
+            await self.push_frame(TTSTextFrame(evt.delta, aggregated_by="sentence"))


Hm, I could've sworn that either OpenAI Realtime or Gemini Live provided output in chunks smaller than sentences...

(But maybe I'm thinking of user transcriptions)

kompfner · 2025-11-06T14:19:02Z

CHANGELOG.md

+                Sentence aggregation will continue on as if this text did not exist.
+    - `KEEP`: The delimiters will be removed, but the content between them will be kept.
+              Sentence aggregation will continue on with the internal text included.
+    - `AGGREGATE`: The delimiters will be removed and the content between will be treated


great description!

kompfner · 2025-11-06T14:21:40Z

CHANGELOG.md

+
+- Added support for aggregating `LLMTextFrame`s from within the assistant `LLMAssistantAggregator`
+  when `skip_tts` is set to `True`, generating `AggregatedLLMTextFrame`s, therefore supporting
+  `bot-output` even when TTS is turned off. You can customize the aggregator used using the new


Suggested change

`bot-output` even when TTS is turned off. You can customize the aggregator used using the new

the new RTVI `bot-output` event even when TTS is turned off. You can customize the aggregator used using the new

(as a way to gently remind the reader what 'bot-output' is here)

kompfner · 2025-11-06T14:22:22Z

CHANGELOG.md

+  when `skip_tts` is set to `True`, generating `AggregatedLLMTextFrame`s, therefore supporting
+  `bot-output` even when TTS is turned off. You can customize the aggregator used using the new
+  `llm_text_aggregator` field in the `LLMAssistantAggregatorParams`. NOTE: This feature is only
+  supported when using the new universal context.


Suggested change

supported when using the new universal context.

supported when using the new `LLMContext`.

kompfner · 2025-11-06T14:23:58Z

CHANGELOG.md

+    now deprecated in lieu of this new, more thorough, message).
+  - The new `RTVIBotOutputMessage` includes the fields:
+    - `spoken`: A boolean indicating whether the text was spoken by TTS
+    - `aggregated_by`: A string representing how the text was aggregated ("sentence", "word",


Actually the literal "custom" or "<my custom aggregation>", like in Aggregation.type? I know I can read on to get the answer, but just making a note here.

Ah, got it. It's the latter. So maybe here use "my custom aggregation", like you do elsewhere to indicate that the string is the developer-provided aggregation type string

kompfner · 2025-11-06T14:31:03Z

CHANGELOG.md

+
+- TTS flow respects aggregation metadata
+  - `TTSService` accepts a new `skip_aggregator_types` to avoid speaking certain aggregation types
+    (asnow determined/returned by the aggregator)


Suggested change

(asnow determined/returned by the aggregator)

(now determined/returned by the aggregator)

or

Suggested change

(asnow determined/returned by the aggregator)

(as now determined/returned by the aggregator)

kompfner · 2025-11-06T14:41:54Z

src/pipecat/processors/aggregators/llm_response_universal.py

            await self.push_frame(frame, direction)
        elif isinstance(frame, LLMFullResponseStartFrame):
            await self._handle_llm_start(frame)
+        elif isinstance(frame, LLMTextFrame):


Maybe worth a comment that this must appear in this if/elif chain above TextFrame handling.

kompfner · 2025-11-06T14:53:36Z

src/pipecat/processors/aggregators/llm_response_universal.py

+            aggregate = self._llm_text_aggregator.text
+            should_reset_aggregator = True
+        self._skip_tts = frame.skip_tts
+        if self._skip_tts:


For my understanding: if we're in a state where TTS is being bypassed, we're doing the aggregation work here (into sentences, etc) that we would otherwise have done in the TTS service, due to the fact that in our architecture the TTS service is usually responsible for that kind of aggregation.

Is that understanding right?

If so: I really really really wish we had landed on some other name for the work of aggregating text into sentences, etc (like TTS does) and aggregating those text chunks into context messages (the primary job of the assistant aggregator).

Calling them both "aggregation" hurts the reader's brain.

But maybe this is just the world we're in. Too many things have "aggregator" in their names already and we don't want to unleash an avalanche of renames...

Given that, maybe this file could use a little more explanatory commenting to provide an overview of this mechanism and make distinction clear between the two kinds of aggregation at play here.

first of all: your understanding is correct

second: i agree, it's confusing. unfortunately, from various discussions there's really not one single good place to say "here is where in the pipeline we should aggregate". different things want different aggregations at different times for different reasons. it was a conscious decision to have the aggregators not be processors, but rather simple utilities that can be used in any given processor. so i think we're stuck in a world with overlapping aggregators. I'll take a stab at a better name and comments...

oh. semi-related but important to note: i didn't add it for this PR, but i imagined solving filipi's use case of "what if there is no TTS in the pipeline" with the changes in this file. I figured it should be as easy as adding a new flag that's like "no_tts" that would set to turn this aggregator on for all LLMTextFrames.

though, TBH you can also do this today with this PR by simply setting the skip_tts flag to true on the llm and leaving it.

kompfner · 2025-11-06T15:06:35Z

src/pipecat/processors/frameworks/rtvi.py

        # Data frames
        elif isinstance(frame, RTVIActionFrame):
            await self._action_queue.put(frame)
        elif isinstance(frame, LLMConfigureOutputFrame):


Here we rely on LLMConfigureOutputFrame to determine if the pipeline is skipping TTS, but elsewhere (the assistant aggregator, I believe?) we were just inspecting text frames as they were coming in and determining on the fly, based on their skip_tts values, whether TTS was being skipped. Why the different approaches?

kompfner · 2025-11-06T16:33:44Z

Maybe hacky, but you can solve this by simply setting the tts_skip flag on the llm service to true.

This does seem like a tough thing for someone who's building a text-output-only pipeline to have to discover and figure out how to do...skip_tts would be an unintuitive thing to set since: a) there's no TTS service in the pipeline anyway, and b) its effect isn't to skip TTS but something unrelated, which is to make bot-output work properly...right?

kompfner · 2025-11-06T16:49:37Z

Oh just thought of one more thing: is there an onus on the developer to pass the same text aggregator to both their assistant aggregator and TTS service, or risk behavior changing when toggling between TTS and no-TTS modes?

Is this kind of misconfiguration something we could or should protect against or help the developer get right?

mattieruth requested review from aconchillo, Copilot and markbackman October 21, 2025 18:52

Copilot AI reviewed Oct 21, 2025

View reviewed changes

mattieruth force-pushed the bot-output branch 4 times, most recently from a778db7 to 276fd1b Compare October 23, 2025 16:25

mattieruth force-pushed the bot-output branch 2 times, most recently from 9a11460 to 89c7277 Compare October 28, 2025 19:57

mattieruth requested a review from Copilot October 28, 2025 20:50

Copilot AI reviewed Oct 28, 2025

View reviewed changes

markbackman reviewed Oct 29, 2025

View reviewed changes

mattieruth force-pushed the bot-output branch from 3d79718 to a0da0a5 Compare October 31, 2025 15:28

mattieruth requested a review from Copilot October 31, 2025 18:59

Copilot AI reviewed Oct 31, 2025

View reviewed changes

mattieruth force-pushed the bot-output branch from a0da0a5 to 3b711c5 Compare October 31, 2025 19:35

markbackman reviewed Nov 3, 2025

View reviewed changes

mattieruth added 8 commits November 5, 2025 14:26

Make the PatternPair action an Enum

7703b4f

codepilot review fixes

3835cd2

Move aggregation logic when skip_tts is on to the assistant aggregator

0e6aee3

Support customization over the way the assistant aggregator aggregate…

26f7b88

…s LLMTextFrames when tts_skip is on

Update Changelog

23ba588

various PR Review fixes:

10e41c7

1. Added support for turning off bot-output messages with the bot_output_enabled flag 2. Cleaned up logic and comments around TTSService:_push_tts_frames to hopefully make it easier to understand 3. Other minor cleanup

Update PatternPairAggregator patterns to replace pattern_id with type…

4dbd0ce

… to simplify the API

mattieruth force-pushed the bot-output branch 3 times, most recently from a411dbf to bf59aee Compare November 5, 2025 21:01

mattieruth added 2 commits November 5, 2025 16:02

Fix new test and str version of PatternMatch

8cc76a1

CHANGELOG improvements

cd26439

mattieruth force-pushed the bot-output branch from bf59aee to cd26439 Compare November 5, 2025 21:03

kompfner reviewed Nov 5, 2025

View reviewed changes

kompfner reviewed Nov 6, 2025

View reviewed changes

	aggregated_by: Literal["sentence", "word"] \| str
	aggregated_by: Literal["sentence", "word"] \| str = "word"

		await self.push_frame(TTSTextFrame(text, spoken=True, aggregated_by=aggregated_by))
		await self.process_generator(self.run_tts(text))

	aggregated_by: Optional[Literal["word", "sentence"] \| str] = None
	aggregated_by: Optional[str] = None

	self._skip_tts: Optional[bool] = None
	self._skip_tts: bool = None

	# generated in the word_task_handler.
	# generated in the words_task_handler in the WordTTSService subclass.

	`bot-output` even when TTS is turned off. You can customize the aggregator used using the new
	the new RTVI `bot-output` event even when TTS is turned off. You can customize the aggregator used using the new

	supported when using the new universal context.
	supported when using the new `LLMContext`.

	(asnow determined/returned by the aggregator)
	(now determined/returned by the aggregator)

	(asnow determined/returned by the aggregator)
	(as now determined/returned by the aggregator)

Adding support for new bot-output RTVI Message: #2899

Are you sure you want to change the base?

Adding support for new bot-output RTVI Message: #2899

Uh oh!

Conversation

mattieruth commented Oct 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Oct 21, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 21, 2025

Choose a reason for hiding this comment

Uh oh!

mattieruth Oct 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 21, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Oct 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

markbackman Oct 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Copilot AI Oct 31, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

markbackman Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mattieruth commented Oct 21, 2025 •

edited

Loading

mattieruth Oct 21, 2025 •

edited

Loading

codecov bot commented Oct 23, 2025 •

edited

Loading

markbackman Oct 29, 2025 •

edited

Loading

markbackman Nov 3, 2025 •

edited

Loading

mattieruth Nov 3, 2025 •

edited

Loading

markbackman Nov 3, 2025 •

edited

Loading

mattieruth Nov 3, 2025 •

edited

Loading

markbackman Nov 3, 2025 •

edited

Loading

markbackman Nov 3, 2025 •

edited

Loading