
[Bug] stream_chat() does not trigger tool calling, though chat() does #11013

Open
tslmy opened this issue Feb 20, 2024 · 18 comments
Labels
bug (Something isn't working), triage (Issue needs to be triaged/prioritized)

Comments

@tslmy
Contributor

tslmy commented Feb 20, 2024

Bug Description

I have a ReAct Agent (never tested it with an OpenAI Agent). I used to interact with it using .chat(). It was able to wield tools.

Today, I felt curious and replaced it with .stream_chat(). It stopped calling functions / using tools. It didn't even seem to bother extracting a thought-action-input triplet from the LLM generation.

Version

0.10.7; 0.9.43

Steps to Reproduce

I wrote a minimal reproducing script here. It uses identical settings, the same query, and a fixed temperature & seed for the LLM. It runs agent.stream_chat(QUERY) first and then agent.chat(QUERY), so you can compare their behavior.

When you run:

PYTHONPATH=. python demo_for_issue.py

you'll see:

>>>>>>>> With stream_chat:
[00:37:16] INFO     HTTP Request: POST http://localhost:11434/api/chat    _client.py:1013
                    "HTTP/1.1 200 OK"
>>>>>>>> Response:
>>>>>>>> With chat:
[00:37:18] INFO     HTTP Request: POST http://localhost:11434/api/chat    _client.py:1013
                    "HTTP/1.1 200 OK"
Thought: I need to use two tools to answer your question.
Action: about_the_user
Action Input: {'input': 'favorite drink'}

We can see that stream_chat didn't trigger the "parse a tool use" procedure, while chat did.

Update: Also observed this with OpenAILike.

Relevant Logs/Tracebacks

No response

tslmy added the bug and triage labels on Feb 20, 2024

dosubot bot commented Feb 20, 2024

🤖

Hey @tslmy, great to see you diving into the intricacies of LlamaIndex again! Hope all is well on your side. 😊

Thank you for bringing this to our attention. The difference in behavior between stream_chat() and chat() is due to the different ChatResponseMode used in each function.

In the chat() function, the ChatResponseMode is set to WAIT, which means the function will wait for the complete response before returning. This mode allows for the extraction of a thought-action-input triplet and the triggering of tool calling.

On the other hand, in the stream_chat() function, the ChatResponseMode is set to STREAM. This mode is designed to return responses as they are generated, without waiting for the complete response. This streaming nature of the function does not allow for the extraction of a thought-action-input triplet or the triggering of tool calling, as these require a complete response.

Here is the relevant code:

def chat(
    self,
    message: str,
    chat_history: Optional[List[ChatMessage]] = None,
    tool_choice: Union[str, dict] = "auto",
) -> AgentChatResponse:
    chat_response = self._chat(
        message, chat_history, tool_choice, mode=ChatResponseMode.WAIT
    )
    return chat_response

def stream_chat(
    self,
    message: str,
    chat_history: Optional[List[ChatMessage]] = None,
    tool_choice: Union[str, dict] = "auto",
) -> StreamingAgentChatResponse:
    chat_response = self._chat(
        message, chat_history, tool_choice, mode=ChatResponseMode.STREAM
    )
    return chat_response

In the chat() function, mode=ChatResponseMode.WAIT allows for tool calling and triplet extraction, while in the stream_chat() function, mode=ChatResponseMode.STREAM does not.

You can find this code in the llama-index-integrations/agent/llama-index-agent-openai-legacy/llama_index/agent/openai_legacy/openai_agent.py file.

I hope this clarifies the difference between the two functions. If you have any further questions, feel free to ask.


@logan-markewich
Collaborator

hmm, very sus, will take a look

@logan-markewich
Collaborator

logan-markewich commented Feb 21, 2024

Hey @tslmy -- I tried with both OpenAI (which worked as expected) and Ollama

Code below (for testing with OpenAI, I just commented out the line with Settings)

from llama_index.core import Settings
from llama_index.core.agent import ReActAgent
from llama_index.core.callbacks import CallbackManager, LlamaDebugHandler
from llama_index.core.tools import FunctionTool
from llama_index.llms.ollama import Ollama

Settings.llm = Ollama(model="starling-lm", request_timeout=300)


def multiply(a: int, b: int) -> int:
    """Multiply two numbers."""
    return a * b


def add(a: int, b: int) -> int:
    """Add two numbers."""
    return a + b


all_tools = [
    FunctionTool.from_defaults(fn=multiply),
    FunctionTool.from_defaults(fn=add),
]

QUERY = "What is (1242 * (5 + 3))?"

print(">>>>>>>> With stream_chat:")
agent = ReActAgent.from_tools(
    tools=all_tools,
    verbose=True,
)
response = agent.stream_chat(QUERY)
print(f">>>>>>>> Response: ", end="", flush=True)
for token in response.response_gen:
    print(token, end="", flush=True)
print()

agent = ReActAgent.from_tools(
    tools=all_tools,
    verbose=True,
)
print(">>>>>>>> With chat:")
response = agent.chat(QUERY)
print(f">>>>>>>> Response: {response.response}")

You need to iterate over the stream_chat response; passing it straight into print() as you did will not consume the generator.
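
To make the contrast concrete, here is a minimal sketch of the difference, using the same agent and QUERY as in the script above:

```python
# Not consumed: passing the StreamingAgentChatResponse straight to print()
# does not pull any tokens from the underlying generator.
response = agent.stream_chat(QUERY)
print(response)

# Consumed: iterating response_gen actually drives the stream to completion.
response = agent.stream_chat(QUERY)
for token in response.response_gen:
    print(token, end="", flush=True)
print()
```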

@tslmy
Contributor Author

tslmy commented Feb 21, 2024

@logan-markewich , I'm sorry, I still can't achieve consistent behaviors across stream_chat and chat.

I used the script you shared, with only the following changes for consistency's sake:

Settings.llm = Ollama(
    model="starling-lm",
    request_timeout=300,
+    temperature=0.01,
+    seed=42,
+    additional_kwargs={"stop": ["Observation:"]},
)

And this is what I got:

[screenshot of the console output from the run]

Note that, with stream_chat, the LLM generation:

Thought: I need to use a tool to help me answer the question.
Action: multiply
Action Input: {"a": 1242, "b": {add: [5, 3]}

wasn't extracted as a step.

Could you share a screenshot of running this on your side, so that we can compare the color-coded console output?

Also, could you check whether you're using the same dependency versions as in my https://github.com/tslmy/agent/blob/main/poetry.lock file? For Ollama, the version is 0.1.25.
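
For comparing environments, a small standard-library sketch can print the relevant versions (the package names below are just examples; adjust them to your setup):

```python
import importlib.metadata

# Example package names only -- adjust to the packages you actually have installed.
for pkg in ("llama-index-core", "llama-index-llms-ollama", "ollama"):
    try:
        print(pkg, importlib.metadata.version(pkg))
    except importlib.metadata.PackageNotFoundError:
        print(pkg, "not installed")
```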

@logan-markewich
Collaborator

logan-markewich commented Feb 22, 2024

Seems like in this case the LLM hallucinated the function call and result? I'll give it a try. It might just be a difference in how tracebacks end up getting handled.

@sahilshaheen

sahilshaheen commented Feb 23, 2024

I'm having the same issue - the agent returns its initial internal thought as the response. I used the create-llama template with a FastAPI backend.

Relevant code snippet (inside /chat endpoint):

    response = await chat_engine.astream_chat(lastMessage.content, messages) # chat_engine is a ReActAgent

    # stream response
    async def event_generator():
        async for token in response.async_response_gen():
            # If client closes connection, stop sending events
            if await request.is_disconnected():
                break
            yield token

    return StreamingResponse(event_generator(), media_type="text/plain")

versions: llama-index==0.9.48

@savanth14

@logan-markewich your code worked for me. I used Mistral 7B and here's the response:

With chat:
Thought: I need to use both the add and multiply tools to answer the question.
Action: add
Action Input: {"a": 5, "b": 3}
Observation: 8
Action: multiply
Action Input: {"a": 1242, "b": 8}
Observation: 9936
Thought: I can answer without using any more tools.
Answer: The result of 1242 * (5 + 3) is 9936.
Response: The result of 1242 * (5 + 3) is 9936.

I made a small change to the QUERY by removing the outer parentheses, QUERY = "What is 1242 * (5 + 3)?", and it worked.

I am new to this, so can you clarify a few things about function calling? Can we take any open-source language model and use it to build agents for function-calling purposes? I previously thought this was only possible with OpenAI models, since all the documentation on LlamaIndex agents points towards how to build or modify OpenAI agents, and I couldn't find any for open-source models.
Does the open-source model have to be trained for function calling in order to handle it effectively during inference? I understand that the size of the model plays a big role too.
Is there any relation between function calling and frameworks like Ollama and Hugging Face? In the LlamaIndex sec-insights GitHub repo, all the function calling was done via OpenAI instances. I tried changing them to open-source models, but I realised that the source code files in the llms subfolder of llama-index contain different integrations like OpenAI, TogetherAI, and so on, and out of all these, only openai has tool functions properly defined. Other files like ollama.py don't have these functions and have a lot of methods missing too. Does this mean I cannot use function calling with these models? But then again, you provided the code for a ReAct Agent above. I am super confused. Can you please help me understand?

@No41Name

No41Name commented May 29, 2024

I'm facing a similar issue on a simple RAG implementation; the LLM is Claude 3 Haiku with temperature=0:

[screenshot of the console output]

Why is the behaviour so different?

@omrihar

omrihar commented May 30, 2024

I can also confirm that with a very simple RAG setup using Bedrock and a VectorStoreIndex, .chat runs the ReAct agent without a problem, but .stream_chat does not use the context at all. If I switch the mode to "context" in index.as_chat_engine, I do get answers based on the context for both .chat and .stream_chat (a sketch of that workaround follows the output below).

Code to reproduce:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.bedrock import BedrockEmbedding
from llama_index.llms.bedrock import Bedrock

llm = Bedrock(
    temperature=0,
    model='anthropic.claude-3-sonnet-20240229-v1:0',
    region_name='us-east-1',
)
embed_model = BedrockEmbedding(
    model_id="amazon.titan-embed-text-v2:0",
    region_name="us-east-1"
)

reader = SimpleDirectoryReader(
    input_dir="./data_bob",
    recursive=True,
)

all_docs = []
for docs in reader.iter_data():
    for doc in docs:       
        all_docs.append(doc)

splitter = SentenceSplitter(chunk_size=1024)
index = VectorStoreIndex.from_documents(
    all_docs, transformations=[splitter], embed_model=embed_model
)

chat_engine = index.as_chat_engine(llm=llm)

question = "Who is bob?"

response = chat_engine.chat(question)
print("Chat response\n*******")
print(response)

stream_response = chat_engine.stream_chat(question)
print("\nStream chat response\n*******")
stream_response.print_response_stream()

The data_bob directory contains a single file whose contents are:

Bob is a civil engineer whose expertise is in creating interesting projects for other people.
He has three children and a cat.

The output of the code above:

Chat response
*******
Based on the information provided, Bob is a civil engineer who specializes in creating interesting projects. He has a family with three children and also owns a pet cat.

Stream chat response
*******
 I'm sorry, but I don't have enough context to determine exactly who "Bob" is referring to. Bob is a very common name, so without any additional details about the person, I cannot provide specifics about their identity, background, occupation, etc. If you could provide some more context about which Bob you are asking about, that would help me better understand and answer your question.
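
For completeness, the workaround I mentioned above is just a different chat_mode on the same index. A sketch, reusing the index, llm, and question from the script above:

```python
# Same index as above, but using the "context" chat mode instead of the
# default agent-based engine; with this, both chat() and stream_chat()
# answered from the retrieved context in my setup.
chat_engine = index.as_chat_engine(chat_mode="context", llm=llm)

response = chat_engine.chat(question)
print(response)

stream_response = chat_engine.stream_chat(question)
stream_response.print_response_stream()
```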

@omrihar

omrihar commented Jun 3, 2024

@logan-markewich Hi Logan, did you see our new comments about this issue? Perhaps this can help pin-point the issue?

@logan-markewich
Collaborator

@omrihar try with the latest LlamaIndex; it could have been an issue with some pydantic class under the hood consuming the first token of the stream 🤷🏻

But it works fine for me

@No41Name

No41Name commented Jun 6, 2024

@logan-markewich I upgraded to the latest version of llama-index (0.10.43) but I'm still not able to make it work.
Did you run the code snippet that @omrihar provided without any modification, or did you change something?

@logan-markewich
Collaborator

I cannot test Bedrock. But using OpenAI, Anthropic, and Ollama, it works fine.

@No41Name

@logan-markewich That may be the point. It works for me too when I use OpenAI, but not with Bedrock. Is there any other developer who can test and possibly debug the code using Bedrock? Unfortunately, it's a requirement for the application I'm developing.

@garritfra
Contributor

garritfra commented Jun 28, 2024

EDIT: I can confirm that it's an issue with the model. I was using Mistral 7B, which gave me faulty results. Llama 3 and gpt-3.5-turbo generate proper responses.

Original:

To bring some awareness, I'm also encountering this issue using a more or less stock version of the FastAPI template generated by create-llama, like in this comment.

Template-Code: https://github.com/run-llama/create-llama/blob/main/templates/types/streaming/fastapi/app/api/routers/chat.py#L41_L45

Calling .achat in the /chat/request endpoint of the template works without issues. The reasoning is printed in the logs and only the answer and annotations are returned.

.astream_chat, on the other hand, includes the reasoning and the answer in the response, without logging the reasoning. Additionally, I found that I never get source nodes when streaming the response. Are these two symptoms related to the issue?

llama-index = "0.10.50"
llama-index-core = "0.10.50"
llama-index-embeddings-ollama = "^0.1.2"
llama-index-llms-ollama = "^0.1.5"
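
Roughly, the two call paths I'm comparing inside the template's /chat endpoint look like this. This is a simplified sketch, not the exact template code; chat_engine, lastMessage, and messages come from the template/request as in the snippet quoted earlier in this thread:

```python
from fastapi.responses import StreamingResponse

async def non_streaming_chat(chat_engine, lastMessage, messages):
    # Works as expected: the ReAct reasoning shows up in the server logs,
    # and only the final answer is returned to the client.
    response = await chat_engine.achat(lastMessage.content, messages)
    return response.response

async def streaming_chat(chat_engine, lastMessage, messages):
    # Problematic here: the reasoning is streamed to the client together with
    # the answer, it is not logged server-side, and no source nodes arrive.
    response = await chat_engine.astream_chat(lastMessage.content, messages)

    async def event_generator():
        async for token in response.async_response_gen():
            yield token

    return StreamingResponse(event_generator(), media_type="text/plain")
```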

@jp-kh-kim

@sahilshaheen Hi, did you find a solution for this issue?

I'm having the same issue - the agent returns its initial internal thought as the response. I used the create-llama template with a FastAPI backend.

Relevant code snippet (inside /chat endpoint):

    response = await chat_engine.astream_chat(lastMessage.content, messages) # chat_engine is a ReActAgent

    # stream response
    async def event_generator():
        async for token in response.async_response_gen():
            # If client closes connection, stop sending events
            if await request.is_disconnected():
                break
            yield token

    return StreamingResponse(event_generator(), media_type="text/plain")

versions: llama-index==0.9.48

@garritfra
Contributor

@jp-kh-kim I found that in my case it was a problem with the system prompt. I opened a PR that got released in the latest version: #14814

You may want to see if the newest version fixes the issue for you?

@jp-kh-kim

@garritfra Thanks! Same here. The issue was resolved when I updated the version! Thanks a lot :)

tslmy changed the title from "[Bug]: ReAct Agent stream_chat() does not trigger tool calling, though chat() does" to "[Bug] stream_chat() does not trigger tool calling, though chat() does" on Oct 12, 2024