Tool call support (generic + native for Llama, Functionary, Hermes, Mistral, Firefunction, DeepSeek) w/ lazy grammars #9639

Merged on Jan 30, 2025 (375 commits)

Conversation

ochafik (Collaborator) commented Sep 25, 2024

This supersedes #6389 (now using a fully C++ approach), #5695 (first attempt at supporting Functionary) and #9592 (more recent Python wrapper).

Which models are supported (in their native style)?

While any model should work (w/ generic fallback using JSON schema constraints), this PR supports the native call style of a few models:

All templates supported by minja, and the handler each uses:
Template Format
CohereForAI-c4ai-command-r-plus-default.jinja generic tool calls
CohereForAI-c4ai-command-r-plus-rag.jinja generic tool calls
CohereForAI-c4ai-command-r-plus-tool_use.jinja generic tool calls
MiniMaxAI-MiniMax-Text-01.jinja generic tool calls
NexaAIDev-Octopus-v2.jinja generic tool calls
NousResearch-Hermes-2-Pro-Llama-3-8B-default.jinja generic tool calls
NousResearch-Hermes-2-Pro-Llama-3-8B-tool_use.jinja hermes 2 pro tool calls
NousResearch-Hermes-2-Pro-Mistral-7B-default.jinja generic tool calls
NousResearch-Hermes-2-Pro-Mistral-7B-tool_use.jinja hermes 2 pro tool calls
NousResearch-Hermes-3-Llama-3.1-70B-default.jinja generic tool calls
NousResearch-Hermes-3-Llama-3.1-70B-tool_use.jinja hermes 2 pro tool calls
OrionStarAI-Orion-14B-Chat.jinja generic tool calls
Qwen-QwQ-32B-Preview.jinja hermes 2 pro tool calls
Qwen-Qwen2-7B-Instruct.jinja generic tool calls
Qwen-Qwen2-VL-7B-Instruct.jinja generic tool calls
Qwen-Qwen2.5-7B-Instruct.jinja hermes 2 pro tool calls
Qwen-Qwen2.5-Math-7B-Instruct.jinja hermes 2 pro tool calls
TheBloke-FusionNet_34Bx2_MoE-AWQ.jinja generic tool calls
abacusai-Fewshot-Metamath-OrcaVicuna-Mistral.jinja generic tool calls
bofenghuang-vigogne-2-70b-chat.jinja generic tool calls
databricks-dbrx-instruct.jinja generic tool calls
deepseek-ai-DeepSeek-Coder-V2-Instruct.jinja generic tool calls
deepseek-ai-DeepSeek-R1-Distill-Llama-8B.jinja deepseek r1 tool calls
deepseek-ai-DeepSeek-R1-Distill-Qwen-32B.jinja deepseek r1 tool calls
deepseek-ai-DeepSeek-R1-Distill-Qwen-7B.jinja deepseek r1 tool calls
deepseek-ai-DeepSeek-V2.5.jinja deepseek r1 tool calls
deepseek-ai-deepseek-coder-33b-instruct.jinja generic tool calls
google-gemma-2-2b-it.jinja generic tool calls
google-gemma-7b-it.jinja generic tool calls
indischepartij-MiniCPM-3B-OpenHermes-2.5-v2.jinja generic tool calls
mattshumer-Reflection-Llama-3.1-70B.jinja generic tool calls
meetkai-functionary-medium-v3.2.jinja functionary v3.2 tool calls
meta-llama-Llama-3.1-8B-Instruct.jinja llama 3.x tool calls (w/ builtin tools)
meta-llama-Llama-3.2-3B-Instruct.jinja llama 3.x tool calls
meta-llama-Llama-3.3-70B-Instruct.jinja llama 3.x tool calls (w/ builtin tools)
meta-llama-Meta-Llama-3.1-8B-Instruct.jinja llama 3.x tool calls (w/ builtin tools)
microsoft-Phi-3-medium-4k-instruct.jinja generic tool calls
microsoft-Phi-3-mini-4k-instruct.jinja generic tool calls
microsoft-Phi-3-small-8k-instruct.jinja generic tool calls
microsoft-Phi-3.5-mini-instruct.jinja generic tool calls
microsoft-Phi-3.5-vision-instruct.jinja generic tool calls
mistralai-Mistral-7B-Instruct-v0.2.jinja generic tool calls
mistralai-Mistral-Large-Instruct-2407.jinja mistral nemo tool calls
mistralai-Mistral-Large-Instruct-2411.jinja generic tool calls
mistralai-Mistral-Nemo-Instruct-2407.jinja mistral nemo tool calls
mistralai-Mixtral-8x7B-Instruct-v0.1.jinja generic tool calls
mlabonne-AlphaMonarch-7B.jinja generic tool calls
nvidia-Llama-3.1-Nemotron-70B-Instruct-HF.jinja llama 3.x tool calls (w/ builtin tools)
openchat-openchat-3.5-0106.jinja generic tool calls
teknium-OpenHermes-2.5-Mistral-7B.jinja generic tool calls

For natively supported models, it's important to have the right template (it might not be in the GGUF; note that we prefer the tool_use variant of the Jinja template if it's present in the GGUF metadata). You can check which template is in use by inspecting http://localhost:8080/props, and by checking the logs for "Chat format:".
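For instance, a quick way to eyeball the loaded template from Python (a sketch using the requests package; the chat_template field name is an assumption based on recent /props responses, so verify against your build):

# Sanity check of the template the server actually loaded.
# Assumes llama-server runs on localhost:8080; field names may vary by build.
import requests

props = requests.get("http://localhost:8080/props").json()
print(props.get("chat_template", "<no chat_template field>")[:300])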

Any tool_calls field returned by llama-server should always conform to the JSON schema (to the extent that it uses supported features of JSON schemas), so there's no need to use any post-processor.

How to use / test

You can test tool calls as follows:

  • Get and build this PR's branch
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    git remote add ochafik https://github.com/ochafik/llama.cpp
    git fetch ochafik
    git checkout ochafik/tool-call
    cmake -B build -DLLAMA_CURL=1
    cmake --build build -t llama-server --parallel --config Release
    alias llama-server=./build/bin/llama-server
  • Run llama-server w/ any model (Edited: bumped to quants / models that work w/ my agent example):

    # Native support for Llama 3.x, Mistral Nemo, Qwen 2.5, Hermes 3, Functionary 3.x, Firefunction v2...
    
    llama-server --jinja -fa -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M
    
    llama-server --jinja -fa -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q6_K_L
    
    llama-server --jinja -fa -hf bartowski/Llama-3.3-70B-Instruct-GGUF:Q4_K_M
    # Not too strong, but YMMV:
    #   llama-server --jinja -fa -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q6_K
    
    llama-server --jinja -fa -hf bartowski/functionary-small-v3.2-GGUF:Q4_K_M
    
    # Native support requires the right template for these GGUFs:
    
    llama-server --jinja -fa -hf bartowski/Hermes-3-Llama-3.1-8B-GGUF:Q4_K_M \
      --chat-template-file <( python scripts/get_chat_template.py NousResearch/Hermes-3-Llama-3.1-8B tool_use )
    
    llama-server --jinja -fa -hf bartowski/Hermes-2-Pro-Llama-3-8B-GGUF:Q4_K_M \
      --chat-template-file <( python scripts/get_chat_template.py NousResearch/Hermes-2-Pro-Llama-3-8B )
    
    llama-server --jinja -fa -hf bartowski/firefunction-v2-GGUF -hff firefunction-v2-Q5_K_M.gguf \
      --chat-template-file <( python scripts/get_chat_template.py fireworks-ai/firellama-3-firefunction-v2 )
    
    # Generic support for any other models, e.g. Phi, Gemma, really anything goes
    
    llama-server --jinja -fa -hf bartowski/phi-4-GGUF:Q4_0
    ...
  • Call the chat completions endpoint (in non-streamed mode) with any OpenAI-compatible library, or plain curl:

    curl http://localhost:8080/v1/chat/completions -d '{
      "model": "gpt-3.5-turbo",
      "tools": [
        {
          "type": "function",
          "function": {
            "name": "python",
            "description": "Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
            "parameters": {
              "type": "object",
              "properties": {
                "code": {
                  "type": "string",
                  "description": "The code to run in the ipython interpreter."
                }
              },
              "required": ["code"]
            }
          }
        }
      ],
      "messages": [
        {
          "role": "user",
          "content": "Print a hello world message with python."
        }
      ]
    }'

It will output something like this (once piped into jq):

{
  "choices": [
    {
      "finish_reason": "tool_calls",
      "index": 0,
      "message": {
        "content": "",
        "tool_calls": [
          {
            "type": "function",
            "function": {
              "name": "python",
              "arguments": "{\"code\":\"print('Hello, World!')\"}"
            },
            "id": null
          }
        ],
        "role": "assistant"
      }
    }
  ],
  ...
}
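For example, here is a minimal sketch of the same request via the openai Python package (any OpenAI-compatible client works; the base_url and dummy API key are assumptions for a local llama-server, and the model name is ignored by the server):

# Same request as the curl example above, via the OpenAI-compatible endpoint.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="dummy")

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # ignored by llama-server
    tools=[{
        "type": "function",
        "function": {
            "name": "python",
            "description": "Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    }],
    messages=[{"role": "user", "content": "Print a hello world message with python."}],
)

for call in response.choices[0].message.tool_calls or []:
    # Arguments arrive as a JSON string that already conforms to the tool's
    # schema, so json.loads is the only post-processing needed.
    print(call.function.name, json.loads(call.function.arguments))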

I've also created some minimalistic agent loop code in this Gist: it contains a few Python tools & supports running them in a siloed Docker container, along with examples (this used to be part of this PR).

Background

This PR tackles two main problems related to tool calling:

  • Lazy grammars: Helping / forcing the model to follow the tool schemas w/ grammar constraints is tricky, as in most cases the model may also output normal, unconstrained content (except if "tool_choice": "required" is specified in the request). It's not currently possible to say .* "<tool_call>" constrained "</tool_call>", as the leading .* will match eagerly. In [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389 I avoided this issue with the thoughtful_steps style, but the native tool call styles were still problematic.

    • Solved w/ lazy grammars activated by trigger words (similar to stop words, but awaited in the grammar implementation itself; see the sketch after this list). Output is completely unconstrained before triggers, and completely constrained after, which allows for content vs. tool_call outputs, and even mixes of the two (for the few models that support that).

      • For Llama 3.x (cf. these docs: 1, 2, 3), triggers are

        • <|python_tag|> if any of the builtin tools are detected (wolfram_alpha, brave_search / web_search with query param, code_interpreter with code param); NOT for Llama 3.2
        • {"name": "toolN" (for each toolN in the list of tools in the request)
        • Also just {"name": (needed for very small 1B/3B models which get confused very quickly otherwise), and some other variations (to allow the somewhat popular {"type": "function", "name": ...)
      • For Functionary v3.1, we trigger on <function= and <|python_tag|> (NOTE: seems to work well w/ Llama-3.1-Instruct, e.g. it's on together.ai's docs). Note that <|python_tag|> here introduces freeform Python code, whereas for Llama-3.1-Instruct's template it introduces builtin tool calls in Python syntax. Almost the same, but handled quite differently.

      • For Functionary v3.2, it's >>>toolN\n for each toolN (technically also triggering on toolN\n for the first tool call, there's a todo to avoid spurious matches by forcing a match at the very start)

      • For Hermes Pro (cf. Hermes-Function-Calling repo), the trigger is <tool_call>.

      • For Mistral Nemo, the trigger is the special [TOOL_CALLS] token

      • For DeepSeek R1 and its distills, it's <|tool▁calls▁begin|> (Note: DeepSeek-R1 seems more eager to talk than to call tools for now, lemme know if you get it to work)

      • For Firefunction v2, the trigger is functools[

      • For other models ("generic" chat format), no lazy grammars are used, just a normal JSON schema that can contain schema-constrained tool calls or content (unless tool_choice is required)

  • Jinja chat templates for tool-call-able models are getting increasingly complex, and implementing each of them in C++ is a maintenance hazard.

    • Solved by implementing a minimal Jinja engine (minja.hpp), with just enough to render all the templates I could find in the wild. That's still a lot of code (2.5k LOC), but about 10x less than Jinja2Cpp (not even counting its dependencies: it needs a subset of Boost and some C++ backfills). It's trivial to extend (say, to add support for a new filter / test), it comes with decent error reporting and simple tests, and we could always switch to another implementation in the future.
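To make the lazy-grammar idea from the first bullet concrete, here is an illustrative Python sketch (the helper names are hypothetical, not the actual llama.cpp sampler API):

# Hypothetical sketch of lazy-grammar sampling: tokens are sampled freely
# until a trigger word appears in the output, after which the tool-call
# grammar constrains every subsequent token.
def sample_with_lazy_grammar(model, grammar, trigger_words):
    output = ""
    grammar_active = False
    while not model.is_done():
        if grammar_active:
            # Constrained phase: only tokens the grammar accepts are allowed.
            token = model.sample(allowed=grammar.candidate_tokens())
            grammar.accept(token)
        else:
            # Unconstrained phase: free-form content, no grammar applied.
            token = model.sample()
        output += token
        if not grammar_active and any(w in output for w in trigger_words):
            grammar_active = True  # e.g. "<tool_call>" for Hermes 2 Pro
    return output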

With this intro out of the way, here are the main parts of this PR:

  • minja.hpp: minimal Jinja templating engine and its tests against actual templates & a few test contexts

  • Tool call grammar generation + output parsing logic for 8 different tool call styles (covering most of the popular models, incl. Llama 3.x, Functionary 3, Qwen 2.5, DeepSeek R1, Mistral Nemo...), with a generic fallback.

  • Lazy grammar wired into the sampler, using a mix of trigger words and trigger tokens to enable the grammar. Trigger tokens are also used to override printability of special tokens, even when the grammar is not lazy (e.g. when "tool_choice": "required" is passed in the request)

  • Integration with llama-server (full tools & tool_choice support).

TODOs

Blocking:

  • sync: minja #11499 (this PR's diff won't include chat-template.hpp or minja.hpp)
    • Ensure tools aren't described twice in the generic handler (now that Minja does it for us)
  • Add test for lazy grammars (cf. removed test-antiprompts.cpp)
  • Test parsers on corner case inputs (now they're easier to call w/ an enum) and tighten their implementations
  • Drop legacy python_code_argument_name in favour of expect_tool_arguments

Nice to haves:

  • Implement at_first semantics to require trigger word to be at start of output (equiv. to ^ regex behaviour; not using regexes as ^ can't be made to mean "start of entire string" reliably afaict), to reduce spurious triggers w/ Llama 3.x
  • Document llama3.1 builtin tools schemas
  • May want to ping owners of models which GGUF doesn't contain the right chat templates + provide them w/ an easy one-liner to surgically edit the gguf
  • Warning log when using the generic chat format
  • Find examples of tool call w/ DeepSeek-R1-Distill-* (ought to work, but proving elusive / just wants to think, think, think)
  • Implement strftime_now in minja (for Llama 3.2), also update today's date for Llama 3.1 and functionary
See draft-time TODOs:
  • [ ] Support streaming (of content - as long as it doesn't trigger any partial antiprompt match - and of individual tool calls)
  • Fix CI build (non-slow tests take too long?)
  • Functionary v3.2: strip leading "all\n" in non-tool-call outputs
  • Implement builtin_tools for Llama 3.1
  • Support DeepSeek-R1-Distill*
  • Add support for broken templates (GML3..., Command R Plus, DeepSeek)
  • [ ] e2e tests for agent
  • [ ] Add Google search tool as alternative to Brave
  • Simplify stop word / trigger word logic (push down to grammar)
  • Fix regression requiring --special for Nemo since last merge
  • Move minja to its own location w/ fuller testing (fuzzing, etc) or at least its own PR --> https://github.com/google/minja
  • Port former behave / feature tool call tests to new pytest setup (server : replace behave with pytest #10416)
  • Nemo: handle special [TOOL_CALLS] token
  • Qwen2.5-72B-Instruct
  • Llama: suspicious early terminations in hello world tests when using the explicit python tool w/ JSON output (could be a failure to escape strings?). Also, need to keep the special <|python_tag|> token
  • Bring back generic thoughtful_steps tool support from [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389 (using JSON structured output even with models not trained for tool calling)
  • Add support for {"type": "code_interpreter"} (special-cased by functionary-medium-v3.1's template), maybe using ipython automatically for llama 3.1
  • Support jinja templates that explode on system prompts (replicate current chat template handling that puts system in user)
  • Add more tests (heavy e2e w/ actual models, tool_choice = none, parallel tool call, etc)
  • Add configurable network isolation of tools w/ a proxy (also caches pip & deb packages & limits access to host)
  • KV cache saving / reuse (within session & beyond) in agent (--cache-prompt defaults to true; a follow-up will be to allow in-slot restoration and saving of cache, see this branch for instance)
  • Add tool call grammar tests (although indirectly covered by server "required" test cases)
  • Add more tools (brave search) + agent examples
  • Refactorings?
    • Ideally would pass some kind of ChatHandler between OAI init & final callback, and make it handle streaming / non streaming cases? (should parallel tool calls be streamed?)
    • chat_template should maybe be resolved earlier? (now a llama_chat_template class)
    • llama_apply_chat_template would benefit from a massive facelift. Maybe passing in a struct? (have introduced a new C++ API llama_chat_template::apply)
    • llama_token_to_piece(ctx, token) should really take (model, token) instead, but that's a breaking API change
      • calls common-local _llama_token_to_piece that takes model. Moved llama_chat_template_from_model helper to common.cpp
  • Fix functionary-medium-* templates' golden generation
  • Add examples to server readme
  • Support key-value overrides for templates (e.g. builtin_tools and todays_date in llama3.1's template)
    • Done by tool call handler, not user-configurable
  • Unify test-chat-templates & test-minja (write each test case in a .jinja file)
    • Fix a couple of missing bos_token in the current chat template logic
  • Bring back agent / tool call loop example + python tools isolation in docker (examples/tool-call) from [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389
  • Test w/ meetkai/functionary-small-v3.2

Possible follow ups:

  • Add -hft / --hf_template flag to override the GGUF's chat templates from a HF model repo
  • Add agent example w/ isolation in c++ or python (see example/agent moved from this PR to that Gist).
  • Add agent w/ MCP support?
  • Add tool call loop to the default web chat using Pyodide as a python interpreter?
  • Add tool call loop to the CLIs?

The github-actions bot added the testing, examples, python and server labels on Sep 25, 2024
@ochafik ochafik changed the title Tool call support (Llama 3.1, Functionary 3.2, Hermes 2 Pro) & Minimalist Jinja template engine Tool call support (Llama 3.1, Functionary v3, Hermes 2 Pro) & Minimalist Jinja template engine Sep 25, 2024
@ochafik ochafik changed the title Tool call support (Llama 3.1, Functionary v3, Hermes 2 Pro) & Minimalist Jinja template engine Tool call support (Llama 3.1, Functionary v3, Hermes 2 Pro) w/ lazy grammars & minimalist Jinja engine Sep 25, 2024
ochafik (Collaborator, Author) commented Sep 27, 2024

Apologies for this PR being a moving target.

I've now stabilized things (except older gcc giving me sweats), added tests & included basic usage instructions (w/ a tiny agent helper adapted from #6389) for Llama-3.1-8B-Instruct, Hermes-2-Pro-Llama-3-8B and functionary-small-3.2 (which still needs a bit of work).

@ochafik ochafik changed the title Tool call support (Llama 3.1, Functionary v3, Hermes 2 Pro) w/ lazy grammars & minimalist Jinja engine Tool call support (Llama 3.x, Functionary v3, Hermes 2 Pro) w/ lazy grammars & minimalist Jinja engine Sep 28, 2024
rujialiu commented:

@ochafik Your minja.hpp is cool (I like minimalist things), but if for any reason you need a lightweight but more powerful template engine, you can have a look at inja (https://github.com/pantor/inja), which I've used in production for several years. It has a single-file header, and the only dependency is nlohmann json, which is already used in llama.cpp.

BTW: my current tool-calling solution is to write dummy functions in Python and generate grammar files with pydantic, which is awkward and ugly. I'll definitely give it a try when you finish this PR. Exciting work!

ochafik (Collaborator, Author) commented Sep 29, 2024

@ochafik Your minja.hpp is cool (I like minimalist things)

Thanks @rujialiu !

but if for any reason you need a lightweight but more powerful template engine, you can have a look at inja (https://github.com/pantor/inja), which I've used in production for several years. It has a single-file header, and the only dependency is nlohmann json, which is already used in llama.cpp.

Thanks for the pointer! At first glance inja seems too limited to support actual templates (we're at the mercy of each and every model maker, and some use lots of Jinja features, e.g. NousResearch/Hermes-3-Llama-3.1, Cohere/command-r-plus, meetkai/functionary-medium-v3.2). Filters (w/ the pipe syntax, e.g. {{ range(10) | length }}) and macros are glaring omissions, for instance.

BTW: my current tool-calling solution is to write dummy functions in Python and generate grammar files with pydantic, which is awkward and ugly.

Yeah I'm doing the same, that's why I spent so much energy improving the JSON schema support tbh.

I'll definitely give it a try when you finish this PR. Exciting work!

Hopefully soon! (famous last words haha)

rujialiu commented:

Thanks for the pointer, at first glance inja seems too limited to support actual templates (we're at the mercy of each and every model maker, some use lots of jinja features

Ouch, I was not aware of that. That's crazy. Now I'm really impressed that your little code already supports all of these. Maybe I should use your minja.hpp in production in the future 8-)

The github-actions bot added the script label on Oct 2, 2024
Maximilian-Winter (Contributor) commented:

@ochafik I really like your idea of using lazy grammars, and I would love to help. I'm the developer of llama-cpp-agent. Let me know if we can collaborate somehow.

ochafik (Collaborator, Author) commented Oct 17, 2024

@Maximilian-Winter thanks / sorry for the slow reply! (frantically busy few weeks 😅)

I'd love help on this: anything from just testing out the instructions above, to finding cool new examples / bugs, reporting on other models' tool call styles, or new ideas. I'm trying to release minja in its own mini-repo w/ better testing, but the lazy grammar part is probably what needs the most work next.

Depending on your timezone, happy to jump into a video chat too :-) (DM on x?)

(Also, llama-cpp-agent looks suuuper cool! 💜)

Maximilian-Winter (Contributor) commented:

@ochafik Sure, that would be great. I'm living in Germany. I actually tried to get verified on X by buying premium so I could write to you, but I still have to wait for verification. If you want to reach out to me by email or Discord, feel free! My email is [email protected]

@ochafik ochafik changed the title Tool call support (Llama 3.x, Functionary v3, Hermes 2 Pro) w/ lazy grammars & minimalist Jinja engine Tool call support (Llama 3.x, Functionary v3, Hermes 2 Pro, Mistral Nemo, generic) w/ lazy grammars & minimalist Jinja engine Oct 24, 2024
ggerganov (Member) commented:

Add tool call loop to the default web chat using Pyodide as a python interpreter?

@ochafik This functionality would be very cool to explore. I'm not familiar with Pyodide, but if it is something lightweight that would allow tool usage with the existing web UI, it's worth a shot. The tool I am looking forward to the most is one that can OCR my screen and provide the contents to the request.

Kreijstal commented:

Add tool call loop to the default web chat using Pyodide as a python interpreter?

@ochafik This functionality would be very cool to explore. I'm not familiar with Pyodide, but if it is something lightweight that would allow tool usage with the existing web UI, it's worth a shot. The tool I am looking forward to the most is one that can OCR my screen and provide the contents to the request.

Isn't that expensive? The model has to communicate with the browser, the browser executes Pyodide, then the model has to read what Pyodide outputs.

benhaotang commented Feb 3, 2025

Add tool call loop to the default web chat using Pyodide as a python interpreter?

@ochafik This functionality would be very cool to explore. I'm not familiar with Pyodide, but if it is something lightweight that would allow tool usage with the existing web UI, it's worth a shot. The tool I am looking forward to the most is one that can OCR my screen and provide the contents to the request.

I think with tool calling you can achieve this by adding:

async function execute_python({ code, packages }) {
  async function _loadScript(url) {
    if (!window.loadedScripts) {
      window.loadedScripts = {};
    }

    if (window.loadedScripts[url]) {
      return;
    }

    return new Promise((resolve, reject) => {
      const script = document.createElement("script");
      script.src = url;
      script.onload = resolve;
      script.onerror = reject;
      document.head.appendChild(script);
    }).then(() => (window.loadedScripts[url] = true));
  }
  
  await _loadScript("https://cdn.jsdelivr.net/pyodide/v0.26.4/full/pyodide.js");

  // Initialize Pyodide
  let pyodide;
  if (!window.pyodide) {
    pyodide = await loadPyodide();
    window.pyodide = pyodide; // Cache it globally for future use
  } else {
    pyodide = window.pyodide;
  }

  try {
    // Redirect standard output to a variable
    pyodide.runPython(`
      import io
      import sys
      
      sys.stdout = io.StringIO()
    `);

    // Load packages
    if (packages && packages.length > 0) {
      for (const packageName of packages) {
        try {
          await pyodide.loadPackage(packageName);
        } catch (e) {
          // If packages fail to load, notify in the output
          return { error: `Failed to load package ${packageName}. Error: ${e.message}` };
        }
      }
    }

    // Execute the Python code
    pyodide.runPython(code);

    // Get the captured output
    let output = pyodide.runPython("sys.stdout.getvalue()");

    // Reset standard output
    pyodide.runPython(`
        sys.stdout = sys.__stdout__
    `);

    return { output: output };
  } catch (error) {
    return { error: error.message };
  }
}

and it should work for Python execution and package loading, but note that Pyodide has a limited selection of Python packages, and it is very... very slow.

ngxson (Collaborator) commented Feb 3, 2025

@benhaotang @ochafik IMO we should wait a bit more to see if the python tool becomes standardized by new models. My POV is that it is still extremely experimental and only works with a small number of models, so it's probably not worth adding right now, as it may add more headaches.

For now, let's focus mainly on having proper OAI-compatible API support for tool calls, so users can freely use llama.cpp in their existing code bases.

ochafik (Collaborator, Author) commented Feb 3, 2025

still extremely experimental and only works with a small number of models

@ngxson Part of the problem is most tool call formats pass the arguments (code included) as JSON, and some models struggle to properly escape double quotes inside there (part of why Llama 3.1 <=8B struggles with basic hello worlds in the tests - attempts to open a python string cause the entire python code json string to close 🤦‍♂️, although most other models I've tried manage to sort their escapes well).

Formats that pass verbatim code back such as functionary v3.1 (freeform <|python_tag|>...) will probably fare better for more complex examples, but also theoretically it might be possible to special case the python tool's code argument in the grammar to ensure it's not prematurely cut-off python (fancy writing a JSON-escaped Python GBNF grammar, anyone? 🤡)
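To make the escaping failure concrete, here is a toy illustration in plain Python (no model involved):

import json

# What the model should emit: the code payload survives JSON escaping.
print(json.dumps({"code": 'print("Hello, World!")'}))
# -> {"code": "print(\"Hello, World!\")"}

# What a struggling model emits instead: the first unescaped inner quote
# closes the JSON string early, and the whole tool call fails to parse.
broken = '{"code": "print("Hello, World!")"}'
try:
    json.loads(broken)
except json.JSONDecodeError as err:
    print("parse error:", err)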

It would be good to get examples of code that people manage to squeeze out of the known working models (for DeepSeek R1, you need to use this branch: #11607)

ngxson (Collaborator) commented Feb 3, 2025

Part of the problem is most tool call formats pass the arguments (code included) as JSON, and some models struggle to properly escape double quotes inside there

Tbh, that sounds like a bad design to me (if anyone ever actually uses this approach). I can assure you that no company in the world wants to spend time training a model just to write Python wrapped inside JSON.

Indeed, if you think as a model maker, it makes more sense to train the model to give a straightforward tool call request. For example, if I ask the model to turn on the light in my house, I should provide it the function prototype and it should respond with JSON like {"tool": "light", "location": "living_room", "state": "on"}. It makes no sense to have the model return code inside JSON.

And if you still want to return Python code (provided that you have good prompt engineering skills), it's better to just ask the LLM to return it in a markdown code block starting with "``` python-tool".

And now have a look at the <|python_tag|> used by Llama 3.2. Same idea, just in a different form.

But then, let me tell you how bad this design decision is: even Meta doesn't want to use it! Just have a look at the model card; they just do prompt engineering instead of using that <|python_tag|>.

Kreijstal commented:

Unrelated, but I've seen DeepSeek and proprietary models use non-ASCII tags; they use UTF-8 characters, something like <►▼thinking> (not verbatim). Why did the <|tag|> notation get standardized?

ngxson (Collaborator) commented Feb 3, 2025

Why did the <|tag|> notation get standardized?

The simple reason <tag> is bad is that it can be confused with HTML / XML tags. That's why people added the vertical bar: <|tag|>. But unfortunately not all machine learning engineers understand this, so it did not become a standard (ML engineers are not necessarily software engineers, after all).

As for why some models use the fullwidth vertical line, I don't know. If anyone knows, please tell me! A clue I found is that this started with some Chinese models (before DeepSeek).

Kreijstal commented Feb 3, 2025

Why did the <|tag|> notation get standardized?

The simple reason <tag> is bad is that it can be confused with HTML / XML tags. That's why people added the vertical bar: <|tag|>. But unfortunately not all machine learning engineers understand this, so it did not become a standard (ML engineers are not necessarily software engineers, after all).

As for why some models use the fullwidth vertical line, I don't know. If anyone knows, please tell me! A clue I found is that this started with some Chinese models (before DeepSeek).

The why is clear: Unicode symbols are rarer in text, whereas <| is more likely to appear since it's printable ASCII. So they want to make tag collisions very, very unlikely. Why they chose that character specifically, no idea.

ochafik (Collaborator, Author) commented Feb 3, 2025

Tbh, that sounds like a bad design to me (if anyone ever actually uses this approach). I can assure you that no company in the world wants to spend time training a model just to write Python wrapped inside JSON.

@ngxson Absolutely! And unfortunately it's the approach most models seem to be using right now (at least as far as documentation and/or Jinja templates show).

And then there are the inherent risks of prompt injection: it should be easy to SEO a modern "; DROP ALL TABLES; " into Brave search results that would break tool call parsing when injected as a tool call result and cause Python code execution (which is why I spent so much time isolating the execution in my demo agent setup).

DeepSeek has maybe the closest syntax to something good (cf. test-chat), although its template seems terribly broken rn (I'm toying with a revamped version and trying to get it to generate Python code blocks, since it already generates JSON blocks). I plan on writing a Jinja templating good-practices doc in the near future, and on thinking a bit harder about safe escaping options (maybe pass the special pseudo-tokens to minja for escapes).

brucepro (Contributor) commented Feb 3, 2025

Just an FYI: I am working on adding MCP to the web UI client.

teleprint-me (Contributor) commented Feb 3, 2025

The JSON schema is how OpenAI handles calling functions. Not the literal JSON structure, but just metadata describing the call. The model produces the call, but the response is formatted according to the defined schema.

Not sure if there's a sane way, but using HTML would be naive for the reasons stated here. The | makes sense to me because it's a unique format and is unlikely to conflict with markup and markdown standards. Simple and effective, even if ugly and/or cumbersome.

The raw JSON format is more of a hacky solution. Personally, not a fan. Ideally, function calls would be language-agnostic. Not sure why you would have language-specific calls when you just need to describe the function, parameters, and output.

Still working my way there. I'm open to template recommendations.

ochafik (Collaborator, Author) commented Feb 3, 2025

Just an FYI: I am working on adding MCP to the web UI client.

@brucepro lemme know if you need a review, even on draft code

I was thinking of wrapping MCP servers in a siloed environment and exposing them as an OpenAPI endpoint (similar to that), but I guess I could expose the wrapped / federated siloed servers as... an MCP server.

(Also, I'm not sure I understand it all; it looks like Claude Desktop launches its MCP servers as subprocesses using the stdio transport? I didn't see easy prepackaged versions of HTTP / SSE transport servers.)

brucepro (Contributor) commented Feb 3, 2025

Just an FYI: I am working on adding MCP to the web UI client.

@brucepro lemme know if you need a review, even on draft code

I was thinking of wrapping MCP servers in a siloed environment and exposing them as an OpenAPI endpoint (similar to that), but I guess I could expose the wrapped / federated siloed servers as... an MCP server.

(Also, I'm not sure I understand it all; it looks like Claude Desktop launches its MCP servers as subprocesses using the stdio transport? I didn't see easy prepackaged versions of HTTP / SSE transport servers.)

I saw a pretty cool SSE proxy (https://github.com/supercorp-ai/supergateway) recently that allows you to install MCP servers that the MCP client can then just make calls to. Right now I am testing the TypeScript SDK in the web UI using the built-in stdio support. It might be too heavy to bundle, so it might be better to have it as a standalone. I love using the web UI in the server, but honestly, I think the server should just focus on being the best server. As soon as I have it mostly working I will post it.
Edit: this one was pretty cool too: https://github.com/rekog-labs/nest-mcp/tree/main

ngxson (Collaborator) commented Feb 3, 2025

It might be too heavy to bundle

I'm surprised about this. From my POV, MCP is mostly a wrapper that adds the ability to do real-time communication on top of existing application logic, so it should be lightweight. If it ends up being bigger than 10 kB, then I can already smell some over-engineering.

Anyway, I think it could be a cool idea to try out MCP if the adoption is somewhat acceptable (ref. this list). We could probably add this behind an experimental flag, and people who want to try it out can activate it manually. Big scripts can be loaded via CDN instead of being bundled into llama-server.

ochafik (Collaborator, Author) commented Feb 4, 2025

Big scripts can be loaded via CDN instead of bundling into llama-server

It would be good to keep a fully local AI mode tho ;-)

(speaking of which, the model URL loading logic doesn't handle offline mode all that well yet)

ochafik (Collaborator, Author) commented Feb 4, 2025

It would be good to get examples of code that people manage to squeeze out of the known working models (for DeepSeek R1, you need to use this branch: #11607)

Added simple example outputs for the "count times Olivier appears on https://ochafik.com" coding task; DeepSeek R1 Distill Qwen 32B does decently there (and thinks a lot, which will now appear in its own hidden field):

#11607

winstondu commented:

Great work!

Dampfinchen commented:

It seems like it adds a double BOS when using Llama 3.1 models. This doesn't happen without --jinja.

ochafik (Collaborator, Author) commented Feb 9, 2025

It seems like it adds a double BOS when using Llama 3.1 models. This doesn't happen without --jinja.

@Dampfinchen This should be fixed as of #11641 / b4641. Which version of llama-server and exact model did you test this with?

ochafik (Collaborator, Author) commented Feb 11, 2025

@ngxson Part of the problem is most tool call formats pass the arguments (code included) as JSON, and some models struggle to properly escape double quotes inside there (part of why Llama 3.1 <=8B struggles with basic hello worlds in the tests - attempts to open a python string cause the entire python code json string to close 🤦‍♂️, although most other models I've tried manage to sort their escapes well).

Formats that pass verbatim code back such as functionary v3.1 (freeform <|python_tag|>...) will probably fare better for more complex examples, but also theoretically it might be possible to special case the python tool's code argument in the grammar to ensure it's not prematurely cut-off python (fancy writing a JSON-escaped Python GBNF grammar, anyone? 🤡)

So, I've had surprisingly good results with a simple pseudo-Python grammar that ensures code strings are valid structured token soups, guaranteeing string tokens aren't split (restricting allowed nested escapes) & open parentheses / braces / brackets are closed (in this branch).

It makes even Llama 3.x 1B / 3B / 8B super compliant & able to overcome the code-escaping issues, even at very high temperatures (tested up to 5). Once finalized, it may also be a great way to guard against prompt injection (e.g. from tool results) for models that use special unicode tokens to close / open tool calls (if we mandate that unicode be escaped in the code's JSONified string), which could be another reason why unicode symbols may have been chosen (cc @Kreijstal @ngxson re/ discussion above).
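As a rough Python rendering of the invariants that grammar enforces (an after-the-fact check here, whereas the real grammar constrains sampling itself; the function is illustrative only):

# Mirrors what the pseudo-Python grammar enforces during sampling:
# string literals stay intact (escapes consumed), and parentheses /
# brackets / braces are balanced by the time the code ends.
def looks_like_wellformed_python(code: str) -> bool:
    pairs = {")": "(", "]": "[", "}": "{"}
    stack, i, n = [], 0, len(code)
    while i < n:
        c = code[i]
        if c in "'\"":  # consume a whole string literal, honouring escapes
            quote, i = c, i + 1
            while i < n and code[i] != quote:
                i += 2 if code[i] == "\\" else 1
            if i >= n:
                return False  # unterminated string literal
        elif c in "([{":
            stack.append(c)
        elif c in ")]}":
            if not stack or stack.pop() != pairs[c]:
                return False
        i += 1
    return not stack

assert looks_like_wellformed_python('print("Hello, World!")')
assert not looks_like_wellformed_python('print("Hello')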

[image]

NOTE: results above and below are from my tool-bench branch which builds on top of #1160

It would be good to get examples of code that people manage to squeeze out of the known working models

FYI I've looked into benchmark options (cc @Maximilian-Winter):

[image]
