Skip to content

Commit

Permalink
[Frontend] Tool calling parser for Granite 3.0 models (vllm-project#9027
Browse files Browse the repository at this point in the history
)

Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
  • Loading branch information
maxdebayser authored and tlrmchlsmth committed Nov 23, 2024
1 parent ebb06b5 commit 4f240ef
Show file tree
Hide file tree
Showing 6 changed files with 314 additions and 33 deletions.
44 changes: 26 additions & 18 deletions docs/source/serving/openai_compatible_server.md
Original file line number Diff line number Diff line change
Expand Up @@ -160,14 +160,7 @@ this, unless explicitly specified.
:func: create_parser_for_docs
:prog: vllm serve
```
## Tool Calling in the Chat Completion API
### Named Function Calling
vLLM supports only named function calling in the chat completion API by default. It does so using Outlines, so this is
enabled by default, and will work with any supported model. You are guaranteed a validly-parsable function call - not a
high-quality one.

To use a named function, you need to define the functions in the `tools` parameter of the chat completion request, and
specify the `name` of one of the tools in the `tool_choice` parameter of the chat completion request.

### Config file

Expand Down Expand Up @@ -196,12 +189,22 @@ The order of priorities is `command line > config file values > defaults`.
---

## Tool calling in the chat completion API
vLLM supports only named function calling in the chat completion API. The `tool_choice` options `auto` and `required` are **not yet supported** but on the roadmap.

vLLM supports named function calling and `auto` tool choice in the chat completion API. The `tool_choice` options `required` is **not yet supported** but on the roadmap.

It is the callers responsibility to prompt the model with the tool information, vLLM will not automatically manipulate the prompt.


### Named Function Calling
vLLM supports named function calling in the chat completion API by default. It does so using Outlines, so this is
enabled by default, and will work with any supported model. You are guaranteed a validly-parsable function call - not a
high-quality one.

vLLM will use guided decoding to ensure the response matches the tool parameter object defined by the JSON schema in the `tools` parameter.

To use a named function, you need to define the functions in the `tools` parameter of the chat completion request, and
specify the `name` of one of the tools in the `tool_choice` parameter of the chat completion request.


### Automatic Function Calling
To enable this feature, you should set the following flags:
Expand Down Expand Up @@ -275,6 +278,21 @@ it works better with vLLM.

Recommended flags: `--tool-call-parser llama3_json --chat-template examples/tool_chat_template_llama3_json.jinja`

#### IBM Granite

Supported models:
* `ibm-granite/granite-3.0-8b-instruct`

Recommended flags: `--tool-call-parser granite --chat-template examples/tool_chat_template_granite.jinja`

`examples/tool_chat_template_granite.jinja`: this is a modified chat template from the original on Huggingface. Parallel function calls are supported.

* `ibm-granite/granite-20b-functioncalling`

Recommended flags: `--tool-call-parser granite-20b-fc --chat-template examples/tool_chat_template_granite_20b_fc.jinja`

`examples/tool_chat_template_granite_20b_fc.jinja`: this is a modified chat template from the original on Huggingface, which is not vLLM compatible. It blends function description elements from the Hermes template and follows the same system prompt as "Response Generation" mode from [the paper](https://arxiv.org/abs/2407.00121). Parallel function calls are supported.


#### InternLM Models (`internlm`)

Expand All @@ -297,16 +315,6 @@ AI21's Jamba-1.5 models are supported.
Flags: `--tool-call-parser jamba`


#### IBM Granite (`granite-20b-fc`)

Supported models:
* `ibm-granite/granite-20b-functioncalling`

Flags: `--tool-call-parser granite-20b-fc --chat-template examples/tool_chat_template_granite_20b_fc.jinja`

The example chat template deviates slightly from the original on Huggingface, which is not vLLM compatible. It blends function description elements from the Hermes template and follows the same system prompt as "Response Generation" mode from [the paper](https://arxiv.org/abs/2407.00121). Parallel function calls are supported.


### How to write a tool parser plugin

A tool parser plugin is a Python file containing one or more ToolParser implementations. You can write a ToolParser similar to the `Hermes2ProToolParser` in vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py.
Expand Down
40 changes: 40 additions & 0 deletions examples/tool_chat_template_granite.jinja
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
{%- if tools %}
{{- '<|start_of_role|>available_tools<|end_of_role|>
' }}
{%- for tool in tools %}
{{- tool | tojson(indent=4) }}
{%- if not loop.last %}
{{- '

' }}
{%- endif %}
{%- endfor %}
{{- '<|end_of_text|>
' }}
{%- endif %}

{%- for message in messages %}
{%- if message['role'] == 'system' %}
{{- '<|start_of_role|>system<|end_of_role|>' + message['content'] + '<|end_of_text|>
' }}
{%- elif message['role'] == 'user' %}
{{- '<|start_of_role|>user<|end_of_role|>' + message['content'] + '<|end_of_text|>
' }}
{%- elif message['role'] == 'assistant_tool_call' or (message['role'] == 'assistant' and message.tool_calls is defined) %}
{{- '<|start_of_role|>assistant<|end_of_role|>' }}
{% for tc in message.tool_calls %}
{{- '<|tool_call|> ' + {'name': tc.function.name, 'arguments': tc.function.arguments}|tojson }}
{% endfor %}
{{- '<|end_of_text|>
' }}
{%- elif message['role'] == 'assistant' %}
{{- '<|start_of_role|>assistant<|end_of_role|>' + message['content'] + '<|end_of_text|>
' }}
{%- elif message['role'] == 'tool_response' or message['role'] == 'tool' %}
{{- '<|start_of_role|>tool_response<|end_of_role|>' + message['content'] + '<|end_of_text|>
' }}
{%- endif %}
{%- if loop.last and add_generation_prompt %}
{{- '<|start_of_role|>assistant<|end_of_role|>' }}
{%- endif %}
{%- endfor %}
6 changes: 6 additions & 0 deletions tests/tool_use/conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@
from huggingface_hub import snapshot_download

from tests.utils import RemoteOpenAIServer
from vllm.platforms import current_platform

from .utils import ARGS, CONFIGS, ServerConfig

Expand All @@ -11,6 +12,11 @@
@pytest.fixture(scope="session", params=CONFIGS.keys())
def server_config(request):
config = CONFIGS[request.param]

if current_platform.is_rocm() and not config.get("supports_rocm", True):
pytest.skip("The {} model can't be tested on the ROCm platform".format(
config["model"]))

# download model and tokenizer using transformers
snapshot_download(config["model"])
yield CONFIGS[request.param]
Expand Down
37 changes: 24 additions & 13 deletions tests/tool_use/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@ class ServerConfig(TypedDict, total=False):
arguments: List[str]
system_prompt: Optional[str]
supports_parallel: Optional[bool]
supports_rocm: Optional[bool]


def patch_system_prompt(messages: List[Dict[str, Any]],
Expand All @@ -36,7 +37,7 @@ def ensure_system_prompt(messages: List[Dict[str, Any]],

# universal args for all models go here. also good if you need to test locally
# and change type or KV cache quantization or something.
ARGS: List[str] = ["--enable-auto-tool-choice", "--max-model-len", "8096"]
ARGS: List[str] = ["--enable-auto-tool-choice", "--max-model-len", "1024"]

CONFIGS: Dict[str, ServerConfig] = {
"hermes": {
Expand Down Expand Up @@ -88,18 +89,28 @@ def ensure_system_prompt(messages: List[Dict[str, Any]],
"without calling a tool. DO NOT CALL A TOOL THAT IS IRRELEVANT "
"to the user's question - just respond to it normally."
},
## FIXME: temporary disabled due to lack of hardware specification
## for individual runs
#"granite20b": {
# "model":
# "ibm-granite/granite-20b-functioncalling",
# "arguments": [
# "--tool-call-parser", "granite-20b-fc", "--chat-template",
# str(VLLM_PATH / "examples/tool_chat_template_granite_20b_fc.jinja")
# ],
# "supports_parallel":
# False,
#},
"granite20b": {
"model":
"mbayser/granite-20b-functioncalling-FP8-KV",
"arguments": [
"--tool-call-parser", "granite-20b-fc", "--chat-template",
str(VLLM_PATH /
"examples/tool_chat_template_granite_20b_fc.jinja"),
"--max_num_seqs", "1", "--enforce-eager", "--cpu-offload-gb", "20"
],
"supports_parallel":
False,
"supports_rocm":
False,
},
"granite8b": {
"model":
"ibm-granite/granite-3.0-8b-instruct",
"arguments": [
"--tool-call-parser", "granite", "--chat-template",
str(VLLM_PATH / "examples/tool_chat_template_granite.jinja")
],
},
"internlm": {
"model":
"internlm/internlm2_5-7b-chat",
Expand Down
5 changes: 3 additions & 2 deletions vllm/entrypoints/openai/tool_parsers/__init__.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
from .abstract_tool_parser import ToolParser, ToolParserManager
from .granite_20b_fc_tool_parser import Granite20bFCToolParser
from .granite_tool_parser import GraniteToolParser
from .hermes_tool_parser import Hermes2ProToolParser
from .internlm2_tool_parser import Internlm2ToolParser
from .jamba_tool_parser import JambaToolParser
Expand All @@ -8,6 +9,6 @@

__all__ = [
"ToolParser", "ToolParserManager", "Granite20bFCToolParser",
"Hermes2ProToolParser", "MistralToolParser", "Internlm2ToolParser",
"Llama3JsonToolParser", "JambaToolParser"
"GraniteToolParser", "Hermes2ProToolParser", "MistralToolParser",
"Internlm2ToolParser", "Llama3JsonToolParser", "JambaToolParser"
]
Loading

0 comments on commit 4f240ef

Please sign in to comment.