
ReasoningItem of RunStreamEvents are getting emitted out of order #1767

@waziers

Description


Please read this first

  • Have you read the docs? Agents SDK docs - Yes
  • Have you searched for related issues? Others may have faced similar issues. - Yes

Describe the bug

ReasoningItem events are getting emitted out of order. For example, we get Tool Call Initiated, Reasoning Item, Tool Call Output, which is not in line with the spec.
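To make the ordering easier to see at a glance, here is a condensed variant of the repro below that only records the order of run_item_stream_event item types (a minimal sketch reusing the same agent and tool; the Observed/Expected lists in the comments are taken from the full output further down):

import asyncio
import random

from agents import Agent, ModelSettings, Runner, function_tool
from openai.types.shared import Reasoning


@function_tool
def how_many_jokes() -> int:
    """Return a random number of jokes to tell, between 1 and 10 (inclusive)."""
    return random.randint(1, 10)


async def main():
    agent = Agent(
        name="Joker",
        model="gpt-5",
        model_settings=ModelSettings(reasoning=Reasoning(effort="high", summary="auto")),
        instructions="First call the `how_many_jokes` tool, then tell that many jokes.",
        tools=[how_many_jokes],
    )
    result = Runner.run_streamed(agent, input="Hello")

    # Record only the run item stream events, in the order they are emitted.
    order = []
    async for event in result.stream_events():
        if event.type == "run_item_stream_event":
            order.append(event.item.type)
    print(order)
    # Observed: ['tool_call_item', 'reasoning_item', 'tool_call_output_item',
    #            'reasoning_item', 'message_output_item']
    # Expected: ['reasoning_item', 'tool_call_item', 'tool_call_output_item',
    #            'reasoning_item', 'message_output_item']


if __name__ == "__main__":
    asyncio.run(main())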

Debug information

  • Agents SDK version: (e.g. v0.0.3)
  • Python version (e.g. Python 3.10)

Repro steps

Take the stock streaming example script from the library and add some additional print statements.

import asyncio
import random

from agents import Agent, ItemHelpers, ModelSettings, Runner, function_tool
from openai.types.shared import Reasoning


@function_tool
def how_many_jokes() -> int:
    """Return a random integer of jokes to tell between 1 and 10 (inclusive)."""
    return random.randint(1, 10)


async def main():
    agent = Agent(
        name="Joker",
        model="gpt-5",
        model_settings=ModelSettings(
            reasoning=Reasoning(
                effort="high",
                summary="auto"
            )
        ),
        instructions="First call the `how_many_jokes` tool, then tell that many jokes.",
        tools=[how_many_jokes],
    )

    result = Runner.run_streamed(
        agent,
        input="Hello",
    )

    print("\n" + "="*60)
    print("🚀 AGENT STREAM STARTING")
    print("="*60 + "\n")

    previous_data_type = None
    reasoning_buffer = []

    async for event in result.stream_events():
        # Handle raw response events with progress indicators
        if event.type == "raw_response_event":
            data_type = event.data.type

            # Special handling for different event types
            if data_type == "reasoning":
                if data_type != previous_data_type:
                    print("🧠 [REASONING] ", end='', flush=True)
                print('▸', end='', flush=True)
            elif data_type == "tool_calls":
                if data_type != previous_data_type:
                    print("\n🔧 [TOOL CALLS] ", end='', flush=True)
                print('▸', end='', flush=True)
            elif data_type == "content":
                if data_type != previous_data_type:
                    print("\n💬 [CONTENT] ", end='', flush=True)
                print('▸', end='', flush=True)
            elif "response.output_item.done" in data_type.lower():
                if data_type != previous_data_type:
                    print(f"\n✅ [RESPONSE.OUTPUT_ITEM.DONE] ", end='', flush=True)
                    # Try to extract what kind of output item from the event
                    if hasattr(event, 'data') and hasattr(event.data, 'item'):
                        item_type = getattr(event.data.item, 'type', 'unknown')
                        print(f"({item_type}) ", end='', flush=True)
                print('▸', end='', flush=True)
            else:
                if data_type != previous_data_type:
                    print(f"\n📊 [{data_type.upper()}] ", end='', flush=True)
                print('▸', end='', flush=True)

            previous_data_type = data_type
            continue

        elif event.type == "agent_updated_stream_event":
            print(f"\n\n✅ Agent Updated: {event.new_agent.name}")
            print("-" * 40)
            continue

        elif event.type == "run_item_stream_event":
            if event.item.type == "tool_call_item":
                print(f"\n\n🛠️  TOOL CALL INITIATED")
                print("   └─ Function: ", end='')
                if hasattr(event.item, 'function_call'):
                    print(f"{event.item.function_call.name}")
                    if hasattr(event.item.function_call, 'arguments'):
                        print(f"   └─ Arguments: {event.item.function_call.arguments}")
                else:
                    print("(details pending)")

            elif event.item.type == "tool_call_output_item":
                print(f"\n📤 TOOL OUTPUT")
                print(f"   └─ Result: {event.item.output}")

            elif event.item.type == "message_output_item":
                message_text = ItemHelpers.text_message_output(event.item)
                print(f"\n\n📝 MESSAGE OUTPUT")
                print("   " + "─" * 37)
                for line in message_text.split('\n'):
                    print(f"   {line}")
                print("   " + "─" * 37)

            elif event.item.type == "reasoning_output_item":
                print(f"\n\n🤔 REASONING OUTPUT")
                if hasattr(event.item, 'reasoning'):
                    print(f"   └─ {event.item.reasoning}")
                else:
                    print("   └─ (reasoning content)")

            else:
                print(f"\n\n⚡ EVENT: {event.item.type}")
                if hasattr(event.item, '__dict__'):
                    for key, value in event.item.__dict__.items():
                        if not key.startswith('_'):
                            print(f"   └─ {key}: {value}")

        # Handle response output item done events
        elif event.type == "response_output_item_done" or "output_item_done" in str(event.type):
            # Try to determine the item type from various possible locations
            item_type = 'unknown'

            if hasattr(event, 'item') and hasattr(event.item, 'type'):
                item_type = event.item.type
            elif hasattr(event, 'data') and hasattr(event.data, 'item') and hasattr(event.data.item, 'type'):
                item_type = event.data.item.type
            elif hasattr(event, 'output_item') and hasattr(event.output_item, 'type'):
                item_type = event.output_item.type

            print(f"\n✔️  OUTPUT ITEM COMPLETE: {item_type}")

            # Provide specific details based on the output item type
            if "message" in item_type.lower():
                print("   └─ Message delivery completed")
            elif "tool" in item_type.lower() and "output" in item_type.lower():
                print("   └─ Tool execution result delivered")
            elif "tool" in item_type.lower() and "call" in item_type.lower():
                print("   └─ Tool call completed")
            elif "reasoning" in item_type.lower():
                print("   └─ Reasoning step completed")
            else:
                print(f"   └─ {item_type} completed")

        # Handle other event types
        else:
            print(f"\n📌 {event.type.upper()}")
            if hasattr(event, '__dict__'):
                for key, value in event.__dict__.items():
                    if not key.startswith('_') and key != 'type':
                        print(f"   └─ {key}: {str(value)[:100]}")  # Truncate long values

    print("\n\n" + "="*60)
    print("✨ AGENT STREAM COMPLETE")
    print("="*60 + "\n")


if __name__ == "__main__":
    asyncio.run(main())

    # Example output:
    #
    # ============================================================
    # 🚀 AGENT STREAM STARTING
    # ============================================================
    #
    # 🧠 [REASONING] ▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸
    # 🔧 [TOOL CALLS] ▸▸▸▸▸▸
    #
    # ✅ Agent Updated: Joker
    # ----------------------------------------
    #
    # 🛠️  TOOL CALL INITIATED
    #    └─ Function: how_many_jokes
    #    └─ Arguments: {}
    #
    # 📤 TOOL OUTPUT
    #    └─ Result: 4
    #
    # 💬 [CONTENT] ▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸
    #
    # 📝 MESSAGE OUTPUT
    #    ─────────────────────────────────────
    #    Sure, here are four jokes for you:
    #
    #    1. **Why don't skeletons fight each other?**
    #       They don't have the guts!
    #
    #    2. **What do you call fake spaghetti?**
    #       An impasta!
    #
    #    3. **Why did the scarecrow win an award?**
    #       Because he was outstanding in his field!
    #
    #    4. **Why did the bicycle fall over?**
    #       Because it was two-tired!
    #    ─────────────────────────────────────
    #
    # ============================================================
    # ✨ AGENT STREAM COMPLETE
    # ============================================================

This is the actual output:

============================================================
🚀 AGENT STREAM STARTING
============================================================



✅ Agent Updated: Joker
----------------------------------------

📊 [RESPONSE.CREATED] ▸
📊 [RESPONSE.IN_PROGRESS] ▸
📊 [RESPONSE.OUTPUT_ITEM.ADDED] ▸
📊 [RESPONSE.REASONING_SUMMARY_PART.ADDED] ▸
📊 [RESPONSE.REASONING_SUMMARY_TEXT.DELTA] ▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸
📊 [RESPONSE.REASONING_SUMMARY_TEXT.DONE] ▸
📊 [RESPONSE.REASONING_SUMMARY_PART.DONE] ▸
📊 [RESPONSE.REASONING_SUMMARY_PART.ADDED] ▸
📊 [RESPONSE.REASONING_SUMMARY_TEXT.DELTA] ▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸
📊 [RESPONSE.REASONING_SUMMARY_TEXT.DONE] ▸
📊 [RESPONSE.REASONING_SUMMARY_PART.DONE] ▸
✅ [RESPONSE.OUTPUT_ITEM.DONE] (reasoning) ▸
📊 [RESPONSE.OUTPUT_ITEM.ADDED] ▸
📊 [RESPONSE.FUNCTION_CALL_ARGUMENTS.DELTA] ▸
📊 [RESPONSE.FUNCTION_CALL_ARGUMENTS.DONE] ▸

🛠️  TOOL CALL INITIATED
   └─ Function: (details pending)

✅ [RESPONSE.OUTPUT_ITEM.DONE] (function_call) ▸
📊 [RESPONSE.COMPLETED] ▸

⚡ EVENT: reasoning_item
   └─ agent: Agent(name='Joker', handoff_description=None, tools=[FunctionTool(name='how_many_jokes', description='Return a random integer of jokes to tell between 1 and 10 (inclusive).', params_json_schema={'properties': {}, 'title': 'how_many_jokes_args', 'type': 'object', 'additionalProperties': False, 'required': []}, on_invoke_tool=<function function_tool.<locals>._create_function_tool.<locals>._on_invoke_tool at 0x10f7b2f20>, strict_json_schema=True, is_enabled=True)], mcp_servers=[], mcp_config={}, instructions='First call the `how_many_jokes` tool, then tell that many jokes.', prompt=None, handoffs=[], model='gpt-5', model_settings=ModelSettings(temperature=None, top_p=None, frequency_penalty=None, presence_penalty=None, tool_choice=None, parallel_tool_calls=None, truncation=None, max_tokens=None, reasoning=Reasoning(effort='high', generate_summary=None, summary='auto'), verbosity=None, metadata=None, store=None, include_usage=None, response_include=None, top_logprobs=None, extra_query=None, extra_body=None, extra_headers=None, extra_args=None), input_guardrails=[], output_guardrails=[], output_type=None, hooks=None, tool_use_behavior='run_llm_again', reset_tool_choice=True)
   └─ raw_item: ResponseReasoningItem(id='rs_68cb60cf85f88190b23bba85c89d5f0903d11b9d3562e6b1', summary=[Summary(text='**Planning to tell jokes**\n\nI need to follow the developer’s instruction: first, I should call the how_many_jokes tool, then share that many jokes after getting a random number between 1 and 10. The user greeted me with "Hello," but my focus is on executing the joke request properly. I need to avoid heavy formatting and just provide short, appropriate jokes. So, I’ll call the tool now and keep the jokes general and light-hearted!', type='summary_text'), Summary(text='**Preparing for the joke call**\n\nThe tool is set to return a number based on the instructions, which says it returns a random integer between 1 and 10. Once I get that number, I’ll parse the result to determine how many jokes to tell. I think it’s crucial to queue the call properly, so I’m ready to proceed with the tool now. This way, I can ensure everything runs smoothly for sharing those jokes!', type='summary_text')], type='reasoning', content=None, encrypted_content=None, status=None)
   └─ type: reasoning_item

📤 TOOL OUTPUT
   └─ Result: 8

📊 [RESPONSE.CREATED] ▸
📊 [RESPONSE.IN_PROGRESS] ▸
📊 [RESPONSE.OUTPUT_ITEM.ADDED] ▸
📊 [RESPONSE.REASONING_SUMMARY_PART.ADDED] ▸
📊 [RESPONSE.REASONING_SUMMARY_TEXT.DELTA] ▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸
📊 [RESPONSE.REASONING_SUMMARY_TEXT.DONE] ▸
📊 [RESPONSE.REASONING_SUMMARY_PART.DONE] ▸
📊 [RESPONSE.REASONING_SUMMARY_PART.ADDED] ▸
📊 [RESPONSE.REASONING_SUMMARY_TEXT.DELTA] ▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸
📊 [RESPONSE.REASONING_SUMMARY_TEXT.DONE] ▸
📊 [RESPONSE.REASONING_SUMMARY_PART.DONE] ▸
✅ [RESPONSE.OUTPUT_ITEM.DONE] (reasoning) ▸
📊 [RESPONSE.OUTPUT_ITEM.ADDED] ▸
📊 [RESPONSE.CONTENT_PART.ADDED] ▸
📊 [RESPONSE.OUTPUT_TEXT.DELTA] ▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸▸
📊 [RESPONSE.OUTPUT_TEXT.DONE] ▸
📊 [RESPONSE.CONTENT_PART.DONE] ▸
✅ [RESPONSE.OUTPUT_ITEM.DONE] (message) ▸
📊 [RESPONSE.COMPLETED] ▸

⚡ EVENT: reasoning_item
   └─ agent: Agent(name='Joker', handoff_description=None, tools=[FunctionTool(name='how_many_jokes', description='Return a random integer of jokes to tell between 1 and 10 (inclusive).', params_json_schema={'properties': {}, 'title': 'how_many_jokes_args', 'type': 'object', 'additionalProperties': False, 'required': []}, on_invoke_tool=<function function_tool.<locals>._create_function_tool.<locals>._on_invoke_tool at 0x10f7b2f20>, strict_json_schema=True, is_enabled=True)], mcp_servers=[], mcp_config={}, instructions='First call the `how_many_jokes` tool, then tell that many jokes.', prompt=None, handoffs=[], model='gpt-5', model_settings=ModelSettings(temperature=None, top_p=None, frequency_penalty=None, presence_penalty=None, tool_choice=None, parallel_tool_calls=None, truncation=None, max_tokens=None, reasoning=Reasoning(effort='high', generate_summary=None, summary='auto'), verbosity=None, metadata=None, store=None, include_usage=None, response_include=None, top_logprobs=None, extra_query=None, extra_body=None, extra_headers=None, extra_args=None), input_guardrails=[], output_guardrails=[], output_type=None, hooks=None, tool_use_behavior='run_llm_again', reset_tool_choice=True)
   └─ raw_item: ResponseReasoningItem(id='rs_68cb60d854f08190a20f5645cb2533d803d11b9d3562e6b1', summary=[Summary(text="**Crafting family-friendly jokes**\n\nI’m thinking about how to create 8 clean, short, family-friendly jokes. The interface suggests avoiding heavy formatting, so I’ll consider using bullet points or numbers to keep it organized. I want to make sure there's a good variety, including puns and classic dad jokes, while keeping everything safe and avoiding any offense. Here are some candidate jokes I came up with to fit those guidelines. Let’s get creative!", type='summary_text'), Summary(text='**Finalizing family-friendly jokes**\n\nI’ve created the last two jokes to finish my list of 8. The seventh joke is, “Parallel lines have so much in common… it’s a shame they’ll never meet.” For the eighth, I decided to go with, “I ordered a chicken and an egg from Amazon. I’ll let you know which comes first.” Now I’m ready to present these jokes simply, using a numbered list without any heavy formatting. I just need to keep things friendly and straightforward!', type='summary_text')], type='reasoning', content=None, encrypted_content=None, status=None)
   └─ type: reasoning_item


📝 MESSAGE OUTPUT
   ─────────────────────────────────────
   1) I told my computer I needed a break—now it won’t stop sending me KitKat ads.
   2) Why did the scarecrow win an award? He was outstanding in his field.
   3) I used to play piano by ear, but now I use my hands.
   4) Why don’t skeletons fight each other? They don’t have the guts.
   5) I’m reading a book about anti-gravity—it's impossible to put down.
   6) Why did the math book look sad? It had too many problems.
   7) Parallel lines have so much in common… it’s a shame they’ll never meet.
   8) I ordered a chicken and an egg from Amazon. I’ll let you know which comes first.
   ─────────────────────────────────────


============================================================
✨ AGENT STREAM COMPLETE
============================================================

Expected behavior

The ReasoningItem should be emitted as soon as the corresponding RESPONSE.OUTPUT_ITEM.DONE event arrives. Instead, it gets emitted sometime between TOOL CALL INITIATED and TOOL OUTPUT.

This causes many problems:

  • any UI that depends on these events will render items in the wrong order
  • if you replay the items through the API you'll get errors, because a reasoning item can't sit between a tool call and its output (see the sketch below)
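A rough sketch of that constraint as we understand it (item shapes abbreviated; the ids and call_id values are placeholders for illustration, not taken from the run above). Rebuilding the model input from the run items in the emitted order places the reasoning item after its function call, which the Responses API rejects:

# Order as emitted today (reasoning item arrives after the tool call it belongs to):
replayed_input = [
    {"type": "function_call", "call_id": "call_123", "name": "how_many_jokes", "arguments": "{}"},
    {"type": "reasoning", "id": "rs_...", "summary": []},
    {"type": "function_call_output", "call_id": "call_123", "output": "8"},
]

# Order the API expects (reasoning item immediately before the function call it produced):
valid_input = [
    {"type": "reasoning", "id": "rs_...", "summary": []},
    {"type": "function_call", "call_id": "call_123", "name": "how_many_jokes", "arguments": "{}"},
    {"type": "function_call_output", "call_id": "call_123", "output": "8"},
]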
