Commit 70d3112: Merge branch 'main' into khluu/gha_workflow_update_ci_deps

khluu committed Oct 30, 2024
2 parents 3367883 + 62fac4b commit 70d3112
Showing 89 changed files with 1,435 additions and 374 deletions.
15 changes: 8 additions & 7 deletions .github/mergify.yml
@@ -13,13 +13,14 @@ pull_request_rules:
- name: label-ci-build
description: Automatically apply ci/build label
conditions:
- files~=^\.github/
- files~=\.buildkite/
- files~=^cmake/
- files=CMakeLists.txt
- files~=^Dockerfile
- files~=^requirements.*\.txt
- files=setup.py
- or:
- files~=^\.github/
- files~=\.buildkite/
- files~=^cmake/
- files=CMakeLists.txt
- files~=^Dockerfile
- files~=^requirements.*\.txt
- files=setup.py
actions:
label:
add:
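A note on this change: Mergify evaluates a rule's top-level `conditions` list as a logical AND, so the old rule only applied the label when a PR touched *every* listed path. Wrapping the file patterns in an `or:` block makes the rule fire when *any* of them matches. A sketch of the resulting rule (abridged; the `ci/build` label name is inferred from the rule's description):

```yaml
pull_request_rules:
  - name: label-ci-build
    description: Automatically apply ci/build label
    conditions:
      - or:                       # label when ANY of these paths change
          - files~=^\.github/
          - files~=\.buildkite/
          - files~=^cmake/
          - files=CMakeLists.txt
          - files~=^Dockerfile
          - files~=^requirements.*\.txt
          - files=setup.py
    actions:
      label:
        add:
          - ci/build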
2 changes: 2 additions & 0 deletions Dockerfile.rocm
@@ -121,6 +121,8 @@ ARG GIT_REPO_CHECK=0
RUN --mount=type=bind,source=.git,target=.git \
if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi

RUN python3 -m pip install --upgrade pip

# Package upgrades for useful functionality or to avoid dependency issues
RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip install --upgrade numba scipy huggingface-hub[cli] pytest-shard
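Two details worth noting in this pattern: the plain `RUN python3 -m pip install --upgrade pip` is an ordinary image layer, so every subsequent install step sees the newer resolver; the `--mount=type=cache,target=/root/.cache/pip` on the next step is a BuildKit cache mount that persists pip's download cache across builds without baking it into the image.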
14 changes: 12 additions & 2 deletions README.md
@@ -13,9 +13,19 @@ Easy, fast, and cheap LLM serving for everyone
| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://vllm.ai"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://discord.gg/jz7wjKhh6g"><b>Discord</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack</b></a> |
</p>

---

**vLLM x Snowflake Meetup (Wednesday, November 13th, 5:30-8PM PT) at Snowflake HQ, San Mateo**

We are excited to announce the last in-person vLLM meetup of the year!
Join the vLLM developers and engineers from Snowflake AI Research to chat about the latest LLM inference optimizations and your 2025 vLLM wishlist!
Register [here](https://lu.ma/h0qvrajz) and be a part of the event!

---


*Latest News* 🔥
- [2024/10] We have just created a developer slack ([slack.vllm.ai](https://slack.vllm.ai)) focusing on coordinating contributions and discussing features. Please feel free to join us there!
- [2024/10] Ray Summit 2024 held a special track for vLLM! Please find the opening talk slides from the vLLM team [here](https://docs.google.com/presentation/d/1B_KQxpHBTRa_mDF-tR6i8rWdOU5QoTZNcEg2MKZxEHM/edit?usp=sharing). Learn more from the [talks](https://raysummit.anyscale.com/flow/anyscale/raysummit2024/landing/page/sessioncatalog?tab.day=20241001&search.sessiontracks=1719251906298001uzJ2) from other vLLM contributors and users!
- [2024/09] We hosted [the sixth vLLM meetup](https://lu.ma/87q3nvnh) with NVIDIA! Please find the meetup slides [here](https://docs.google.com/presentation/d/1wrLGwytQfaOTd5wCGSPNhoaW3nq0E-9wqyP7ny93xRs/edit?usp=sharing).
- [2024/07] We hosted [the fifth vLLM meetup](https://lu.ma/lp0gyjqr) with AWS! Please find the meetup slides [here](https://docs.google.com/presentation/d/1RgUD8aCfcHocghoP3zmXzck9vX3RCI9yfUAB2Bbcl4Y/edit?usp=sharing).
@@ -42,7 +52,7 @@ vLLM is fast with:
- Speculative decoding
- Chunked prefill

**Performance benchmark**: We include a performance benchmark at the end of [our blog post](https://blog.vllm.ai/2024/09/05/perf-update.html). It compares the performance of vLLM against other LLM serving engines ([TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), [SGLang](https://github.com/sgl-project/sglang) and [LMDeploy](https://github.com/InternLM/lmdeploy)). The implementation is under [nightly-benchmarks folder](.buildkite/nightly-benchmarks/) and you can [reproduce](https://github.com/vllm-project/vllm/issues/8176) this benchmark using our one-click runnable script.

vLLM is flexible and easy to use with:

6 changes: 3 additions & 3 deletions benchmarks/kernels/benchmark_layernorm.py
@@ -3,8 +3,8 @@
import torch

from vllm.model_executor.layers.layernorm import RMSNorm
from vllm.utils import (STR_DTYPE_TO_TORCH_DTYPE, FlexibleArgumentParser,
                        seed_everything)
from vllm.platforms import current_platform
from vllm.utils import STR_DTYPE_TO_TORCH_DTYPE, FlexibleArgumentParser


@torch.inference_mode()
@@ -16,7 +16,7 @@ def main(num_tokens: int,
do_profile: bool = False,
num_warmup_iters: int = 5,
num_iters: int = 100) -> None:
seed_everything(seed)
current_platform.seed_everything(seed)
torch.set_default_device("cuda")

layer = RMSNorm(hidden_size).to(dtype=dtype)
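The same mechanical refactor repeats across all five benchmark scripts in this commit: seeding moves from the free function `vllm.utils.seed_everything` to the platform-aware `current_platform.seed_everything`. A minimal sketch of the new call pattern (assuming the post-change API, where the platform object seeds Python's `random`, NumPy, and torch for the active backend):

```python
import torch

from vllm.platforms import current_platform


def setup(seed: int) -> None:
    # current_platform resolves to CUDA, ROCm, CPU, etc. at import time,
    # so one call seeds every RNG relevant to the active backend.
    current_platform.seed_everything(seed)
    torch.set_default_device("cuda")
```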
7 changes: 4 additions & 3 deletions benchmarks/kernels/benchmark_moe.py
@@ -10,7 +10,8 @@
from transformers import AutoConfig

from vllm.model_executor.layers.fused_moe.fused_moe import *
from vllm.utils import FlexibleArgumentParser, seed_everything
from vllm.platforms import current_platform
from vllm.utils import FlexibleArgumentParser


class BenchmarkConfig(TypedDict):
@@ -167,7 +168,7 @@ class BenchmarkWorker:

def __init__(self, seed: int) -> None:
torch.set_default_device("cuda")
seed_everything(seed)
current_platform.seed_everything(seed)
self.seed = seed

def benchmark(
@@ -181,7 +182,7 @@ def benchmark(
use_fp8_w8a8: bool,
use_int8_w8a16: bool,
) -> Tuple[Dict[str, int], float]:
seed_everything(self.seed)
current_platform.seed_everything(self.seed)
dtype_str = get_config_dtype_str(dtype,
use_int8_w8a16=use_int8_w8a16,
use_fp8_w8a8=use_fp8_w8a8)
5 changes: 3 additions & 2 deletions benchmarks/kernels/benchmark_paged_attention.py
@@ -5,8 +5,9 @@
import torch

from vllm import _custom_ops as ops
from vllm.platforms import current_platform
from vllm.utils import (STR_DTYPE_TO_TORCH_DTYPE, FlexibleArgumentParser,
create_kv_caches_with_random, seed_everything)
create_kv_caches_with_random)

NUM_BLOCKS = 1024
PARTITION_SIZE = 512
@@ -28,7 +29,7 @@ def main(
device: str = "cuda",
kv_cache_dtype: Optional[str] = None,
) -> None:
seed_everything(seed)
current_platform.seed_everything(seed)

scale = float(1.0 / (head_size**0.5))
query = torch.empty(num_seqs,
6 changes: 3 additions & 3 deletions benchmarks/kernels/benchmark_quant.py
@@ -3,8 +3,8 @@
import torch

from vllm import _custom_ops as ops
from vllm.utils import (STR_DTYPE_TO_TORCH_DTYPE, FlexibleArgumentParser,
                        seed_everything)
from vllm.platforms import current_platform
from vllm.utils import STR_DTYPE_TO_TORCH_DTYPE, FlexibleArgumentParser


@torch.inference_mode()
@@ -17,7 +17,7 @@ def main(num_tokens: int,
do_profile: bool = False,
num_warmup_iters: int = 5,
num_iters: int = 100) -> None:
seed_everything(seed)
current_platform.seed_everything(seed)
torch.set_default_device("cuda")

x = torch.randn(num_tokens, hidden_size, dtype=dtype)
5 changes: 3 additions & 2 deletions benchmarks/kernels/benchmark_rope.py
@@ -6,7 +6,8 @@

from vllm.model_executor.layers.rotary_embedding import (RotaryEmbedding,
get_rope)
from vllm.utils import FlexibleArgumentParser, seed_everything
from vllm.platforms import current_platform
from vllm.utils import FlexibleArgumentParser


def benchmark_rope_kernels_multi_lora(
@@ -22,7 +23,7 @@ def benchmark_rope_kernels_multi_lora(
max_position: int = 8192,
base: int = 10000,
) -> None:
seed_everything(seed)
current_platform.seed_everything(seed)
torch.set_default_device(device)
if rotary_dim is None:
rotary_dim = head_size
6 changes: 4 additions & 2 deletions docs/source/models/adding_model.rst
@@ -133,7 +133,9 @@ If you are running api server with :code:`vllm serve <args>`, you can wrap the e
from vllm import ModelRegistry
from your_code import YourModelForCausalLM
ModelRegistry.register_model("YourModelForCausalLM", YourModelForCausalLM)
import runpy
runpy.run_module('vllm.entrypoints.openai.api_server', run_name='__main__')
if __name__ == '__main__':
import runpy
runpy.run_module('vllm.entrypoints.openai.api_server', run_name='__main__')
Save the above code in a file and run it with :code:`python your_file.py <args>`.
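The added `if __name__ == '__main__':` guard matters because vLLM can spawn worker processes that re-import the launching script; with an unguarded `runpy.run_module` call, each re-import would start another copy of the API server. Keeping the `ModelRegistry.register_model` call at module level, outside the guard, presumably ensures the out-of-tree model is registered in those re-imported workers as well.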
21 changes: 20 additions & 1 deletion docs/source/serving/openai_compatible_server.md
@@ -185,7 +185,9 @@ from HuggingFace; and you can find an example of this in a `tokenizer_config.jso

If your favorite tool-calling model is not supported, please feel free to contribute a parser & tool use chat template!


#### Hermes Models (`hermes`)

All Nous Research Hermes-series models newer than Hermes 2 Pro should be supported.
* `NousResearch/Hermes-2-Pro-*`
* `NousResearch/Hermes-2-Theta-*`
@@ -197,7 +199,7 @@ step in their creation_.

Flags: `--tool-call-parser hermes`


#### Mistral Models (`mistral`)

Supported models:
* `mistralai/Mistral-7B-Instruct-v0.3` (confirmed)
* Additional mistral function-calling models are compatible as well.
@@ -216,7 +220,7 @@ when tools are provided, that results in much better reliability when working wi

Recommended flags: `--tool-call-parser mistral --chat-template examples/tool_chat_template_mistral_parallel.jinja`


#### Llama Models (`llama3_json`)

Supported models:
* `meta-llama/Meta-Llama-3.1-8B-Instruct`
* `meta-llama/Meta-Llama-3.1-70B-Instruct`
@@ -236,7 +242,7 @@ it works better with vLLM.

Recommended flags: `--tool-call-parser llama3_json --chat-template examples/tool_chat_template_llama3_json.jinja`


#### InternLM Models (`internlm`)

Supported models:
* `internlm/internlm2_5-7b-chat` (confirmed)
* Additional internlm2.5 function-calling models are compatible as well
@@ -246,6 +254,7 @@ Known issues:

Recommended flags: `--tool-call-parser internlm --chat-template examples/tool_chat_template_internlm2_tool.jinja`


#### Jamba Models (`jamba`)
AI21's Jamba-1.5 models are supported.
* `ai21labs/AI21-Jamba-1.5-Mini`
@@ -255,6 +264,16 @@ AI21's Jamba-1.5 models are supported.
Flags: `--tool-call-parser jamba`


#### IBM Granite (`granite-20b-fc`)

Supported models:
* `ibm-granite/granite-20b-functioncalling`

Flags: `--tool-call-parser granite-20b-fc --chat-template examples/tool_chat_template_granite_20b_fc.jinja`

The example chat template deviates slightly from the original on Huggingface, since the original is not vLLM-compatible. It blends function description elements from the Hermes template and follows the same system prompt as "Response Generation" mode from [the paper](https://arxiv.org/abs/2407.00121). Parallel function calls are supported.
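Putting it together, a typical launch would look something like this (a sketch; `--enable-auto-tool-choice` is the same flag used with the other parsers in this document):

```bash
vllm serve ibm-granite/granite-20b-functioncalling \
    --enable-auto-tool-choice \
    --tool-call-parser granite-20b-fc \
    --chat-template examples/tool_chat_template_granite_20b_fc.jinja
```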


### How to write a tool parser plugin

A tool parser plugin is a Python file containing one or more ToolParser implementations. You can write a ToolParser similar to the `Hermes2ProToolParser` in vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py.
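For orientation, a minimal skeleton is sketched below. The class and registration decorator follow the interfaces in `vllm/entrypoints/openai/tool_parsers`, but treat the exact signatures as illustrative rather than authoritative; a real parser would also implement `extract_tool_calls_streaming` for streaming responses.

```python
# example_tool_parser.py -- a minimal (hypothetical) tool parser plugin.
from vllm.entrypoints.openai.protocol import (ChatCompletionRequest,
                                              ExtractedToolCallInformation)
from vllm.entrypoints.openai.tool_parsers import ToolParser, ToolParserManager


@ToolParserManager.register_module(["example"])  # name passed to --tool-call-parser
class ExampleToolParser(ToolParser):

    def extract_tool_calls(
        self,
        model_output: str,
        request: ChatCompletionRequest,
    ) -> ExtractedToolCallInformation:
        # A real parser would detect and decode tool-call markup here;
        # this stub simply reports that no tool was called.
        return ExtractedToolCallInformation(tools_called=False,
                                            tool_calls=[],
                                            content=model_output)
```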
@@ -312,5 +331,5 @@ Then you can use this plugin in the command line like this.
--tool-parser-plugin <absolute path of the plugin file>
--tool-call-parser example \
--chat-template <your chat template> \
```

130 changes: 130 additions & 0 deletions examples/tool_chat_template_granite_20b_fc.jinja
@@ -0,0 +1,130 @@
{%- macro json_to_python_type(json_spec) %}
{%- set basic_type_map = {
"string": "str",
"number": "float",
"integer": "int",
"boolean": "bool"
} %}

{%- if basic_type_map[json_spec.type] is defined %}
{{- basic_type_map[json_spec.type] }}
{%- elif json_spec.type == "array" %}
{{- "list[" + json_to_python_type(json_spec|items) + "]" }}
{%- elif json_spec.type == "object" %}
{%- if json_spec.additionalProperties is defined %}
{{- "dict[str, " + json_to_python_type(json_spec.additionalProperties) + ']' }}
{%- else %}
{{- "dict" }}
{%- endif %}
{%- elif json_spec.type is iterable %}
{{- "Union[" }}
{%- for t in json_spec.type %}
{{- json_to_python_type({"type": t}) }}
{%- if not loop.last %}
{{- "," }}
{%- endif %}
{%- endfor %}
{{- "]" }}
{%- else %}
{{- "Any" }}
{%- endif %}
{%- endmacro %}

{%- if not full_function_description is defined %}
{%- set full_function_description = false %}
{%- endif %}

{%- macro full_description(tool) %}
{{- tool.name + '(' }}
{%- if tool.parameters is defined %}
{%- for param_name, param_fields in tool.parameters.properties|items %}
{{- param_name + ": " + json_to_python_type(param_fields) }}
{%- if not loop.last %}
{{- ", " }}
{%- endif %}
{%- endfor %}
{%- endif %}
{{- ")" }}
{%- if tool.return is defined %}
{{- " -> " + json_to_python_type(tool.return) }}
{%- endif %}
{{- " - " + tool.description + "\n\n" }}
{%- if tool.parameters is defined %}
{%- for param_name, param_fields in tool.parameters.properties|items %}
{%- if loop.first %}
{{- " Args:\n" }}
{%- endif %}
{{- " " + param_name + "(" + json_to_python_type(param_fields) + "): " + param_fields.description|trim }}
{%- endfor %}
{%- endif %}
{%- if tool.return is defined and tool.return.description is defined %}
{{- "\n Returns:\n " + tool.return.description }}
{%- endif %}
{{- '"' }}
{%- endmacro %}

{%- macro simple_description(tool) %}
{{- tool.description }}
{%- endmacro %}

{%- macro function_description(tool) %}
{%- if full_function_description %}
{{- full_description(tool) }}
{%- else %}
{{- simple_description(tool) }}
{%- endif %}
{%- endmacro %}

{%- if messages[0]["role"] == "system" %}
{%- set sys_prompt = messages[0]["content"] %}
{%- set loop_messages = messages[1:] %}
{%- else %}
{%- set loop_messages = messages %}
{% set sys_prompt = 'You are a helpful assistant with access to the following function calls. Your task is to understand the given conversation with function calls and responses and generate natural language response as the ASSISTANT to continue the conversation. You may use the following function calls to understand how to respond to the user query.' %}
{%- endif %}

{{ 'SYSTEM: ' + sys_prompt }}
{% if tools is iterable and tools | length > 0 %}
<|function_call_library|>
{%- for tool in tools %}
{%- if tool.function is defined %}
{%- set tool = tool.function %}
{%- endif %}
{{- '{"name": "' + tool.name + '", ' }}
{{- '"description": "' + function_description(tool) }}
{{- ', "parameters": ' }}
{%- if not tool.parameters is defined or tool.parameters.properties | length == 0 %}
{{- "{}" }}
{%- else %}
{{- tool.parameters|tojson }}
{%- endif %}
{{- "}" }}
{%- if not loop.last %}
{{- "\n" }}
{%- endif %}
{%- endfor %}
If none of the functions are relevant or the given question lacks the parameters required by the function, please output \"<function_call> {\"name\": \"no_function\", \"arguments\": {}}\".
{%- endif %}



{% for message in messages %}
{% if message['role'] == 'user' %}
{{- '\nUSER: ' + message['content'] }}
{% elif message['role'] == 'assistant' and message.tool_calls is defined %}
{{- '\nASSISTANT:' }}
{% for tc in message.tool_calls %}
{{- '<function_call> ' + {'name': tc.function.name, 'arguments': tc.function.arguments}|tojson }}
{% endfor %}
{{- '<|endoftext|>' }}
{% elif message['role'] == 'assistant' %}
{{- '\nASSISTANT: ' + message['content'] + ' <|endoftext|>' }}
{% elif message['role'] == 'tool' %}
{{- '<function_response> ' + message['content'] }}
{%- else %}
{{- raise_exception("Unexpected combination of role and message content") }}
{% endif %}
{% if loop.last and add_generation_prompt %}
{{- '\nASSISTANT: ' }}
{% endif %}
{% endfor %}
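One way to sanity-check a template like this without starting a server is to render it directly with `transformers` (a sketch; the tool definition and user message are placeholders for illustration):

```python
# Render the Granite tool-call template offline to inspect the prompt.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-20b-functioncalling")
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
messages = [{"role": "user", "content": "What's the weather in Boston?"}]

prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    chat_template=open("examples/tool_chat_template_granite_20b_fc.jinja").read(),
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```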