
feat: add streaming tool use to llama-cpp-python #71

Merged · 6 commits merged into main from ls-streaming-tools on Jan 5, 2025
Conversation

@lsorber (Member) commented on Dec 25, 2024

Changes:

  1. ✨ Enable streaming tool use for llama-cpp-python models (see below for details). The result is a simpler rag and async_rag implementation that opens the door to Agentic RAG and user-defined tools (a consumption sketch follows this list).
  2. ✨ Update the query parameter's description to require a single-faceted (i.e., non-compound) question, which encourages parallel function calling for compound questions.
  3. ✅ Add fairly thorough tests for both rag and the improved chatml-function-calling chat handler.
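
A minimal sketch of consuming the streamed tool calls this PR enables, assuming a llama_cpp.Llama model loaded with the chatml-function-calling chat format. The model path and the get_weather tool are illustrative placeholders, not part of this PR:

```python
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", chat_format="chatml-function-calling")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

for chunk in llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the weather in Ghent and in Leuven?"}],
    tools=tools,
    tool_choice="auto",
    stream=True,  # With this PR, tool calls arrive as OpenAI-style streamed deltas.
):
    delta = chunk["choices"][0]["delta"]
    for tool_call in delta.get("tool_calls") or []:
        # Each delta carries the tool call's index plus incremental
        # name/argument fragments to be accumulated by the caller.
        function = tool_call["function"]
        print(tool_call["index"], function.get("name"), function.get("arguments"))
```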

Changes to llama-cpp-python's chatml-function-calling chat handler:

  1. General:
    a. ✨ If no system message is supplied, add an empty system message to hold the tool metadata.
    b. ✨ Add function descriptions to the system message so that tool use is better informed (fixes "chatml-function-calling not adding tool description to the prompt", abetlen/llama-cpp-python#1869); a sketch of the augmented system message follows this list.
    c. ✨ Replace print statements relating to JSON grammars with RuntimeWarning warnings.
    d. ✅ Add tests with fairly broad coverage of the different scenarios.
  2. Case "Tool choice by user":
    a. ✨ Add support for more than one function call by making this a special case of "Automatic tool choice" with a single tool (subsumes "Support parallel function calls with tool_choice", abetlen/llama-cpp-python#1503).
  3. Case "Automatic tool choice -> respond with a message":
    a. ✨ Honor the user-supplied stop and max_tokens parameters.
    b. 🐛 Replace the incorrect use of the follow-up grammar with the user-defined grammar.
  4. Case "Automatic tool choice -> one or more function calls":
    a. ✨ Add support for streaming the function calls (fixes "Feature request: add support for streaming tool use", abetlen/llama-cpp-python#1883).
    b. ✨ Make tool calling more robust by wrapping the calls in a <function_calls></function_calls> block, giving the LLM an explicit way to terminate its tool calls (see the loop sketch after this list).
    c. 🐛 Add the missing ":" stop token used to decide whether to continue with another tool call; its absence prevented parallel function calling (fixes "chatml-function-calling chat format fails to generate multi calls to the same tool", abetlen/llama-cpp-python#1756).
    d. ✨ Sample at temperature=0 when deciding whether to continue with another tool call, mirroring the initial decision on whether to call a tool at all.
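
A hedged sketch of how the system message is augmented with tool metadata (changes 1.a and 1.b). The build_system_message helper and the exact prompt wording are illustrative assumptions, not the handler's literal strings:

```python
import json

# Illustrative tool definition (same shape as in the sketch above).
tools = [{"type": "function", "function": {
    "name": "get_weather",
    "description": "Get the current weather in a given city.",
    "parameters": {"type": "object", "properties": {"city": {"type": "string"}}},
}}]

def build_system_message(system_content: str, tools: list[dict]) -> str:
    """Append each tool's schema and, per change 1.b, its description."""
    lines = [system_content, "", "You have access to the following functions:"]
    for tool in tools:
        function = tool["function"]
        lines.append(f"\nfunctions.{function['name']}:")
        if function.get("description"):  # Previously omitted from the prompt (abetlen/llama-cpp-python#1869).
            lines.append(function["description"])
        lines.append(json.dumps(function.get("parameters", {})))
    return "\n".join(lines)

# Change 1.a: if no system message was supplied, insert an empty one so there
# is always somewhere to put the tool metadata.
messages = [{"role": "user", "content": "What is the weather in Ghent?"}]
if not any(message["role"] == "system" for message in messages):
    messages.insert(0, {"role": "system", "content": ""})
messages[0]["content"] = build_system_message(messages[0]["content"], tools)
```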

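And a minimal sketch of the resulting parallel tool call loop (changes 4.b through 4.d), assuming a llama_cpp.Llama instance and a prompt that already ends with the opening <function_calls> tag. The generate_tool_calls function and its stop strings and token budgets are illustrative, not the PR's literal code; notably, the real handler constrains the arguments to each tool's JSON schema with a grammar, for which plain greedy sampling stands in here:

```python
from llama_cpp import Llama

def generate_tool_calls(llm: Llama, prompt: str) -> list[dict]:
    """Generate tool calls until the model closes the <function_calls> block."""
    tool_calls = []
    while True:
        # Greedy decision (change 4.d, temperature=0): the model either writes
        # the next header "functions.<name>" or closes the block. Stopping on
        # ":" (change 4.c) is what permits repeat calls to the same function.
        header = llm.create_completion(
            prompt=prompt, stop=[":", "</function_calls>"], temperature=0.0, max_tokens=32
        )["choices"][0]["text"]
        if "functions." not in header:
            break  # The model chose to emit </function_calls> instead (change 4.b).
        name = header.rsplit("functions.", 1)[-1].strip()
        # Grammar-constrained in the real handler; greedy sampling stands in here.
        arguments = llm.create_completion(
            prompt=prompt + header + ":", stop=["\n"], temperature=0.0, max_tokens=512
        )["choices"][0]["text"]
        tool_calls.append({"name": name, "arguments": arguments})
        prompt += f"{header}:{arguments}\n"
    return tool_calls
```
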
@lsorber requested a review from undo76 on December 25, 2024
@lsorber self-assigned this on Dec 25, 2024
@lsorber (Member, Author) commented on Dec 26, 2024

Upstream PR to bring these improvements to llama-cpp-python: abetlen/llama-cpp-python#1884

@lsorber merged commit c57aac1 into main on Jan 5, 2025
2 checks passed
@lsorber deleted the ls-streaming-tools branch on January 5, 2025