
[Bug] stream_chat() does not trigger tool calling, though chat() does #11013

Open
tslmy opened this issue Feb 20, 2024 · 18 comments
Labels
bug (Something isn't working), triage (Issue needs to be triaged/prioritized)

Comments

@tslmy
Contributor

tslmy commented Feb 20, 2024

Bug Description

I have a ReAct Agent (never tested it with an OpenAI Agent). I used to interact with it using .chat(). It was able to wield tools.

Today, I felt curious and replaced it with .stream_chat(). It stopped calling functions / using tools. It didn't even seem to bother extracting a thought-action-input triplet from the LLM generation.

Version

0.10.7; 0.9.43

Steps to Reproduce

I wrote a minimal reproducing script here. It uses identical settings, the same query, and a fixed temperature & seed for the LLM. It runs agent.stream_chat(QUERY) first and then agent.chat(QUERY), so you can compare their behavior.

When you run:

PYTHONPATH=. python demo_for_issue.py

you'll see:

>>>>>>>> With stream_chat:
[00:37:16] INFO     HTTP Request: POST http://localhost:11434/api/chat    _client.py:1013
                    "HTTP/1.1 200 OK"
>>>>>>>> Response:
>>>>>>>> With chat:
[00:37:18] INFO     HTTP Request: POST http://localhost:11434/api/chat    _client.py:1013
                    "HTTP/1.1 200 OK"
Thought: I need to use two tools to answer your question.
Action: about_the_user
Action Input: {'input': 'favorite drink'}

We can see that stream_chat didn't trigger the "parse a tool use" procedure, while chat did.

Update: Also observed this with OpenAILike.

Relevant Logs/Tracebacks

No response

tslmy added the bug and triage labels on Feb 20, 2024

dosubot bot commented Feb 20, 2024

🤖

Hey @tslmy, great to see you diving into the intricacies of LlamaIndex again! Hope all is well on your side. 😊

Thank you for bringing this to our attention. The difference in behavior between stream_chat() and chat() is due to the different ChatResponseMode used in each function.

In the chat() function, the ChatResponseMode is set to WAIT, which means the function will wait for the complete response before returning. This mode allows for the extraction of a thought-action-input triplet and the triggering of tool calling.

On the other hand, in the stream_chat() function, the ChatResponseMode is set to STREAM. This mode is designed to return responses as they are generated, without waiting for the complete response. This streaming nature of the function does not allow for the extraction of a thought-action-input triplet or the triggering of tool calling, as these require a complete response.

Here is the relevant code:

def chat(
    self,
    message: str,
    chat_history: Optional[List[ChatMessage]] = None,
    tool_choice: Union[str, dict] = "auto",
) -> AgentChatResponse:
    chat_response = self._chat(
        message, chat_history, tool_choice, mode=ChatResponseMode.WAIT
    )
    return chat_response

def stream_chat(
    self,
    message: str,
    chat_history: Optional[List[ChatMessage]] = None,
    tool_choice: Union[str, dict] = "auto",
) -> StreamingAgentChatResponse:
    chat_response = self._chat(
        message, chat_history, tool_choice, mode=ChatResponseMode.STREAM
    )
    return chat_response

In the chat() function, mode=ChatResponseMode.WAIT allows for tool calling and triplet extraction, while in the stream_chat() function, mode=ChatResponseMode.STREAM does not.

You can find this code in the llama-index-integrations/agent/llama-index-agent-openai-legacy/llama_index/agent/openai_legacy/openai_agent.py file.

I hope this clarifies the difference between the two functions. If you have any further questions, feel free to ask.


@logan-markewich
Collaborator

hmm, very sus, will take a look

@logan-markewich
Collaborator

logan-markewich commented Feb 21, 2024

Hey @tslmy -- I tried with both OpenAI (which worked as expected) and Ollama

Code below (for testing with OpenAI, I just commented out the line with Settings)

from llama_index.core import Settings
from llama_index.core.agent import ReActAgent
from llama_index.core.callbacks import CallbackManager, LlamaDebugHandler
from llama_index.core.tools import FunctionTool
from llama_index.llms.ollama import Ollama

Settings.llm = Ollama(model="starling-lm", request_timeout=300)


def multiply(a: int, b: int) -> int:
    """Multiply two numbers."""
    return a * b


def add(a: int, b: int) -> int:
    """Add two numbers."""
    return a + b


all_tools = [
    FunctionTool.from_defaults(fn=multiply),
    FunctionTool.from_defaults(fn=add),
]

QUERY = "What is (1242 * (5 + 3))?"

print(">>>>>>>> With stream_chat:")
agent = ReActAgent.from_tools(
    tools=all_tools,
    verbose=True,
)
response = agent.stream_chat(QUERY)
print(f">>>>>>>> Response: ", end="", flush=True)
for token in response.response_gen:
    print(token, end="", flush=True)
print()

agent = ReActAgent.from_tools(
    tools=all_tools,
    verbose=True,
)
print(">>>>>>>> With chat:")
response = agent.chat(QUERY)
print(f">>>>>>>> Response: {response.response}")

You need to iterate over the stream_chat response; passing it straight into print() as you did will not consume the generator.
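
To make the contrast concrete, here is a minimal sketch of the difference, using the same agent and QUERY as in the script above:

```python
# Not consumed: passing the StreamingAgentChatResponse straight to print()
# does not pull any tokens from the underlying generator.
response = agent.stream_chat(QUERY)
print(response)

# Consumed: iterating response_gen actually drives the stream to completion.
response = agent.stream_chat(QUERY)
for token in response.response_gen:
    print(token, end="", flush=True)
print()
```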

@tslmy
Contributor Author

tslmy commented Feb 21, 2024

@logan-markewich , I'm sorry, I still can't achieve consistent behaviors across stream_chat and chat.

I used the script you shared, with only the following changes for consistency's sake:

Settings.llm = Ollama(
    model="starling-lm",
    request_timeout=300,
+    temperature=0.01,
+    seed=42,
+    additional_kwargs={"stop": ["Observation:"]},
)

And this is what I got:

[screenshot of the console output from the run]

Note that, with stream_chat, the LLM generation:

Thought: I need to use a tool to help me answer the question.
Action: multiply
Action Input: {"a": 1242, "b": {add: [5, 3]}

wasn't extracted as a step.

Could you share a screenshot of running this on your side, so that we can compare the color-coded console output?

Also, could you check whether you're using the same dependency versions as in my https://github.com/tslmy/agent/blob/main/poetry.lock file? For Ollama, the version is 0.1.25.
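
For comparing environments, a small standard-library sketch can print the relevant versions (the package names below are just examples; adjust them to your setup):

```python
import importlib.metadata

# Example package names only -- adjust to the packages you actually have installed.
for pkg in ("llama-index-core", "llama-index-llms-ollama", "ollama"):
    try:
        print(pkg, importlib.metadata.version(pkg))
    except importlib.metadata.PackageNotFoundError:
        print(pkg, "not installed")
```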

@logan-markewich
Collaborator

logan-markewich commented Feb 22, 2024

Seems like in this case the LLM hallucinated the function call and result? I'll give it a try. It might just be a difference in how tracebacks end up getting handled.

@sahilshaheen

sahilshaheen commented Feb 23, 2024

I'm having the same issue - the agent returns its initial internal thought as the response. I used the create-llama template with a FastAPI backend.

Relevant code snippet (inside /chat endpoint):

    response = await chat_engine.astream_chat(lastMessage.content, messages) # chat_engine is a ReActAgent

    # stream response
    async def event_generator():
        async for token in response.async_response_gen():
            # If client closes connection, stop sending events
            if await request.is_disconnected():
                break
            yield token

    return StreamingResponse(event_generator(), media_type="text/plain")

versions: llama-index==0.9.48

@savanth14

@logan-markewich your code worked for me. I used Mistral 7B and here's the response:

With chat:
Thought: I need to use both the add and multiply tools to answer the question.
Action: add
Action Input: {"a": 5, "b": 3}
Observation: 8
Action: multiply
Action Input: {"a": 1242, "b": 8}
Observation: 9936
Thought: I can answer without using any more tools.
Answer: The result of 1242 * (5 + 3) is 9936.
Response: The result of 1242 * (5 + 3) is 9936.

I made a small change to the QUERY by removing the outer parentheses, QUERY = "What is 1242 * (5 + 3)?", and it worked.

I am new to this, so can you clarify a few things about function calling? Can we take any open-source language model and use it to build agents for function-calling purposes? I previously thought this was only possible with OpenAI models, since all the documentation on LlamaIndex agents points towards how to build or modify OpenAI agents, and I couldn't find any for open-source models.
Does the open-source model have to be trained for function calling in order to handle it effectively during inference? I understand that the size of the model plays a big role too.
Is there any relation between function calling and frameworks like Ollama and Hugging Face? In the LlamaIndex sec-insights GitHub repo, all the function calling was done via OpenAI instances. I tried changing them to open-source models, but I realised that the source code files in the llms subfolder of llama-index contain different integrations like OpenAI, TogetherAI, and so on, and out of all these, only openai has tool functions properly defined. Other files like ollama.py don't have these functions and have a lot of methods missing too. Does this mean I cannot use function calling with these models? But then again, you provided the code for a ReAct Agent above. I am super confused. Can you please help me understand?

@No41Name

No41Name commented May 29, 2024

I'm facing a similar issue on a simple RAG implementation; the LLM is Claude 3 Haiku with temperature=0:

[screenshot of the console output]

Why is the behaviour so different?

@omrihar

omrihar commented May 30, 2024

I can also confirm that with a very simple RAG setup using Bedrock and a VectorStoreIndex, .chat runs the ReAct agent without a problem, but .stream_chat does not use the context at all. If I switch the mode to "context" in index.as_chat_engine, I do get answers based on the context for both .chat and .stream_chat (a sketch of that workaround follows the output below).

Code to reproduce:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.bedrock import BedrockEmbedding
from llama_index.llms.bedrock import Bedrock

llm = Bedrock(
    temperature=0,
    model='anthropic.claude-3-sonnet-20240229-v1:0',
    region_name='us-east-1',
)
embed_model = BedrockEmbedding(
    model_id="amazon.titan-embed-text-v2:0",
    region_name="us-east-1"
)

reader = SimpleDirectoryReader(
    input_dir="./data_bob",
    recursive=True,
)

all_docs = []
for docs in reader.iter_data():
    for doc in docs:       
        all_docs.append(doc)

splitter = SentenceSplitter(chunk_size=1024)
index = VectorStoreIndex.from_documents(
    all_docs, transformations=[splitter], embed_model=embed_model
)

chat_engine = index.as_chat_engine(llm=llm)

question = "Who is bob?"

response = chat_engine.chat(question)
print("Chat response\n*******")
print(response)

stream_response = chat_engine.stream_chat(question)
print("\nStream chat response\n*******")
stream_response.print_response_stream()

The data_bob directory contains a single file whose contents are:

Bob is a civil engineer whose expertise is in creating interesting projects for other people.
He has three children and a cat.

The output of the code above:

Chat response
*******
Based on the information provided, Bob is a civil engineer who specializes in creating interesting projects. He has a family with three children and also owns a pet cat.

Stream chat response
*******
 I'm sorry, but I don't have enough context to determine exactly who "Bob" is referring to. Bob is a very common name, so without any additional details about the person, I cannot provide specifics about their identity, background, occupation, etc. If you could provide some more context about which Bob you are asking about, that would help me better understand and answer your question.
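
For completeness, the workaround I mentioned above is just a different chat_mode on the same index. A sketch, reusing the index, llm, and question from the script above:

```python
# Same index as above, but using the "context" chat mode instead of the
# default agent-based engine; with this, both chat() and stream_chat()
# answered from the retrieved context in my setup.
chat_engine = index.as_chat_engine(chat_mode="context", llm=llm)

response = chat_engine.chat(question)
print(response)

stream_response = chat_engine.stream_chat(question)
stream_response.print_response_stream()
```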

@omrihar

omrihar commented Jun 3, 2024

@logan-markewich Hi Logan, did you see our new comments about this issue? Perhaps this can help pin-point the issue?

@logan-markewich
Collaborator

@omrihar try with the latest LlamaIndex; it could have been an issue with some pydantic class under the hood consuming the first token of the stream 🤷🏻

But it works fine for me

@No41Name

No41Name commented Jun 6, 2024

@logan-markewich I upgraded to the latest version of llama-index (0.10.43) but I'm still not able to make it work.
Did you run the code snippet that @omrihar provided without any modification, or did you change something?

@logan-markewich
Collaborator

I cannot test Bedrock. But using OpenAI, Anthropic, and Ollama, it works fine.

@No41Name

@logan-markewich That may be the point. It works for me too when I use OpenAI, but not with Bedrock. Is there any other developer who can test and possibly debug the code using Bedrock? Unfortunately, it's a requirement for the application I'm developing.

@garritfra
Contributor

garritfra commented Jun 28, 2024

EDIT: I can confirm that it's an issue with the model. I was using Mistral 7B, which gave me faulty results. Llama 3 and gpt-3.5-turbo generate proper responses.

Original:

To bring some awareness, I'm also encountering this issue using a more or less stock version of the FastAPI template generated by create-llama, like in this comment.

Template-Code: https://github.com/run-llama/create-llama/blob/main/templates/types/streaming/fastapi/app/api/routers/chat.py#L41_L45

Calling .achat in the /chat/request endpoint of the template works without issues. The reasoning is printed in the logs and only the answer and annotations are returned.

.astream_chat, on the other hand, includes the reasoning and the answer in the response, without logging the reasoning. Additionally, I found that I never get source nodes when streaming the response. Are these two symptoms related to the issue?

llama-index = "0.10.50"
llama-index-core = "0.10.50"
llama-index-embeddings-ollama = "^0.1.2"
llama-index-llms-ollama = "^0.1.5"
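
Roughly, the two call paths I'm comparing inside the template's /chat endpoint look like this. This is a simplified sketch, not the exact template code; chat_engine, lastMessage, and messages come from the template/request as in the snippet quoted earlier in this thread:

```python
from fastapi.responses import StreamingResponse

async def non_streaming_chat(chat_engine, lastMessage, messages):
    # Works as expected: the ReAct reasoning shows up in the server logs,
    # and only the final answer is returned to the client.
    response = await chat_engine.achat(lastMessage.content, messages)
    return response.response

async def streaming_chat(chat_engine, lastMessage, messages):
    # Problematic here: the reasoning is streamed to the client together with
    # the answer, it is not logged server-side, and no source nodes arrive.
    response = await chat_engine.astream_chat(lastMessage.content, messages)

    async def event_generator():
        async for token in response.async_response_gen():
            yield token

    return StreamingResponse(event_generator(), media_type="text/plain")
```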

@jp-kh-kim

@sahilshaheen Hi, did you find a solution for this issue?

I'm having the same issue - the agent returns its initial internal thought as the response. I used the create-llama template with a FastAPI backend.

Relevant code snippet (inside /chat endpoint):

    response = await chat_engine.astream_chat(lastMessage.content, messages) # chat_engine is a ReActAgent

    # stream response
    async def event_generator():
        async for token in response.async_response_gen():
            # If client closes connection, stop sending events
            if await request.is_disconnected():
                break
            yield token

    return StreamingResponse(event_generator(), media_type="text/plain")

versions: llama-index==0.9.48

@garritfra
Contributor

@jp-kh-kim I found that in my case it was a problem with the system prompt. I opened a PR that got released in the latest version: #14814

You may want to see if the newest version fixes the issue for you?

@jp-kh-kim

@garritfra Thanks! Same here. The issue was resolved when I updated the version! Thanks a lot :)

tslmy changed the title from "[Bug]: ReAct Agent stream_chat() does not trigger tool calling, though chat() does" to "[Bug] stream_chat() does not trigger tool calling, though chat() does" on Oct 12, 2024