Dec 10 rebase #605

Merged
merged 52 commits on Dec 11, 2024
Commits (52)
8b59631
[Core] Support Lark grammars for XGrammar (#10870)
mgoin Dec 6, 2024
7406274
[Doc] add KubeAI to serving integrations (#10837)
samos123 Dec 6, 2024
c05cfb6
[misc] fix typo (#10960)
youkaichao Dec 6, 2024
dcdc3fa
[ci] fix broken tests (#10956)
youkaichao Dec 6, 2024
69d357b
[Core] Cleanup startup logging a bit (#10961)
russellb Dec 7, 2024
acf092d
[Bugfix] Fix test-pipeline.yaml (#10973)
jeejeelee Dec 7, 2024
955fa95
[3/N] Support and implement merged input processor for LLaVA model (#…
DarkLight1337 Dec 7, 2024
f13cf9a
[Build] Fix for the Wswitch-bool clang warning (#10060)
gshtras Dec 7, 2024
b26b4cd
[Misc][LoRA] Refactor and clean MergedQKVParallelLinearWithLora imple…
Isotr0py Dec 7, 2024
bf0e382
[Model] Composite weight loading for multimodal Qwen2 (#10944)
DarkLight1337 Dec 7, 2024
1c768fe
[Doc] Explicitly state that InternVL 2.5 is supported (#10978)
DarkLight1337 Dec 7, 2024
39e227c
[Model] Update multi-modal processor to support Mantis(LLaVA) model (…
DarkLight1337 Dec 7, 2024
c889d58
[Doc] Explicitly state that PP isn't compatible with speculative deco…
DarkLight1337 Dec 7, 2024
78029b3
[BugFix][Kernel]: fix illegal memory access in causal_conv1d when con…
xffxff Dec 7, 2024
1b62745
[core][executor] simplify instance id (#10976)
youkaichao Dec 7, 2024
7be15d9
[core][misc] remove use_dummy driver for _run_workers (#10920)
youkaichao Dec 7, 2024
fd57d2b
[torch.compile] allow candidate compile sizes (#10984)
youkaichao Dec 8, 2024
a11f326
[V1] Initial support of multimodal models for V1 re-arch (#10699)
ywang96 Dec 8, 2024
43b05fa
[torch.compile][misc] fix comments (#10993)
youkaichao Dec 8, 2024
46004e8
[misc] clean up and unify logging (#10999)
youkaichao Dec 9, 2024
af7c4a9
[Doc][V1] Add V1 support column for multimodal models (#10998)
ywang96 Dec 9, 2024
d1c2e15
[torch.compile] add dynamo time tracking (#11005)
youkaichao Dec 9, 2024
c690357
[V1] Fix Detokenizer loading in `AsyncLLM` (#10997)
ywang96 Dec 9, 2024
e691b26
[Core] Require xgrammar >= 0.1.6 (#11021)
russellb Dec 9, 2024
aea2fc3
[Platform] Move `async output` check to platform (#10768)
wangxiyuan Dec 9, 2024
25b79d9
[V1] Input Batch Relocation (#10962)
varun-sundar-rabindranath Dec 9, 2024
edc4fa3
[ci/build] Recompile CI dependencies list with Python 3.12 (#11013)
khluu Dec 9, 2024
3b61cb4
[V1] Further reduce CPU overheads in flash-attn (#10989)
WoosukKwon Dec 9, 2024
ca87149
[Misc][LoRA] Abstract PunicaWrapper (#10955)
jeejeelee Dec 9, 2024
a811dd6
[Model] merged input processor for Phi-3-Vision models (#10977)
Isotr0py Dec 9, 2024
cbcbdb1
[Bugfix][Hardware][Gaudi] Bump vllm_hpu_extension version (#11028)
kzawora-intel Dec 9, 2024
1a2f8fb
[v1] fix use compile sizes (#11000)
youkaichao Dec 9, 2024
9c6459e
[Neuron] Upgrade neuron to 2.20.2 (#11016)
xendo Dec 9, 2024
b63ba84
[ROCm][bugfix] speculative decoding worker class (#11035)
gshtras Dec 9, 2024
5ed5d5f
Build tpu image in release pipeline (#10936)
richardsliu Dec 9, 2024
6faec54
[V1] Do not store `None` in self.generators (#11038)
WoosukKwon Dec 9, 2024
6d52528
[Docs] Add dedicated tool calling page to docs (#10554)
mgoin Dec 10, 2024
d1f6d1c
[Model] Add has_weight to RMSNorm and re-enable weights loading track…
Isotr0py Dec 10, 2024
391d7b2
[Bugfix] Fix usage of `deprecated` decorator (#11025)
DarkLight1337 Dec 10, 2024
980ad39
[Frontend] Use request id from header (#10968)
joerunde Dec 10, 2024
bc192a2
[Pixtral] Improve loading (#11040)
patrickvonplaten Dec 10, 2024
28b3a1c
[V1] Multiprocessing Tensor Parallel Support for v1 (#9856)
tlrmchlsmth Dec 10, 2024
ebf7780
monitor metrics of tokens per step using cudagraph batchsizes (#11031)
youkaichao Dec 10, 2024
e35879c
[Bugfix] Fix xgrammar failing to read a vocab_size from LlavaConfig o…
sjuxax Dec 10, 2024
bfd6104
Update README.md (#11034)
dmoliveira Dec 10, 2024
82c73fd
[Bugfix] cuda error running llama 3.2 (#11047)
GeneDer Dec 10, 2024
fe2e10c
Add example of helm chart for vllm deployment on k8s (#9199)
mfournioux Dec 10, 2024
2126fd2
Merge remote-tracking branch 'upstream/main' into HEAD
kzawora-intel Dec 10, 2024
89266bc
Merge remote-tracking branch 'origin/habana_main' into private/kzawor…
kzawora-intel Dec 10, 2024
5a166da
Update ray_hpu_executor.py
michalkuligowski Dec 10, 2024
b8fff21
Add PunicaWrapperHPU to handle LoRA computations
SanjuCSudhakaran Dec 11, 2024
381453c
Align LoRA handling in HPU with PunicaWrapper class (#614)
kzawora-intel Dec 11, 2024
16 changes: 16 additions & 0 deletions .buildkite/release-pipeline.yaml
@@ -39,3 +39,19 @@ steps:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.1.0 --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT --target vllm-openai --progress plain ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"

- label: "Build and publish TPU release image"
depends_on: ~
if: build.env("NIGHTLY") == "1"
agents:
queue: tpu_queue_postmerge
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --tag vllm/vllm-tpu:nightly --tag vllm/vllm-tpu:$BUILDKITE_COMMIT --progress plain -f Dockerfile.tpu ."
- "docker push vllm/vllm-tpu:nightly"
- "docker push vllm/vllm-tpu:$BUILDKITE_COMMIT"
plugins:
- docker-login#v3.0.0:
username: vllm
password-env: DOCKERHUB_TOKEN
env:
DOCKER_BUILDKIT: "1"
4 changes: 3 additions & 1 deletion .buildkite/test-pipeline.yaml
@@ -237,7 +237,7 @@ steps:
source_file_dependencies:
- vllm/lora
- tests/lora
command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --ignore lora/test_long_context.py lora/test_chatglm3_tp.py lora/test_llama_tp.py
command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --ignore=lora/test_long_context.py --ignore=lora/test_chatglm3_tp.py --ignore=lora/test_llama_tp.py
parallelism: 4

- label: "PyTorch Fullgraph Smoke Test" # 9min
@@ -362,6 +362,7 @@ steps:
- tests/models/embedding/vision_language
- tests/models/encoder_decoder/vision_language
commands:
- pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
- pytest -v -s models/decoder_only/audio_language -m 'core_model or quant_model'
- pytest -v -s --ignore models/decoder_only/vision_language/test_phi3v.py models/decoder_only/vision_language -m 'core_model or quant_model'
- pytest -v -s models/embedding/vision_language -m core_model
@@ -377,6 +378,7 @@ steps:
- tests/models/embedding/vision_language
- tests/models/encoder_decoder/vision_language
commands:
- pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
- pytest -v -s models/decoder_only/audio_language -m 'not core_model and not quant_model'
# HACK - run phi3v tests separately to sidestep this transformers bug
# https://github.com/huggingface/transformers/issues/34307
81 changes: 81 additions & 0 deletions .github/workflows/lint-and-deploy.yaml
@@ -0,0 +1,81 @@
name: Lint and Deploy Charts

on: pull_request

jobs:
lint-and-deploy:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
fetch-depth: 0

- name: Set up Helm
uses: azure/setup-helm@fe7b79cd5ee1e45176fcad797de68ecaf3ca4814 # v4.2.0
with:
version: v3.14.4

#Python is required because ct lint runs Yamale and yamllint which require Python.
- uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
with:
python-version: '3.13'

- name: Set up chart-testing
uses: helm/chart-testing-action@e6669bcd63d7cb57cb4380c33043eebe5d111992 # v2.6.1
with:
version: v3.10.1

- name: Run chart-testing (lint)
run: ct lint --target-branch ${{ github.event.repository.default_branch }} --chart-dirs examples/chart-helm --charts examples/chart-helm

- name: Setup minio
run: |
docker network create vllm-net
docker run -d -p 9000:9000 --name minio --net vllm-net \
-e "MINIO_ACCESS_KEY=minioadmin" \
-e "MINIO_SECRET_KEY=minioadmin" \
-v /tmp/data:/data \
-v /tmp/config:/root/.minio \
minio/minio server /data
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
export AWS_EC2_METADATA_DISABLED=true
mkdir opt-125m
cd opt-125m && curl -O -Ls "https://huggingface.co/facebook/opt-125m/resolve/main/{pytorch_model.bin,config.json,generation_config.json,merges.txt,special_tokens_map.json,tokenizer_config.json,vocab.json}" && cd ..
aws --endpoint-url http://127.0.0.1:9000/ s3 mb s3://testbucket
aws --endpoint-url http://127.0.0.1:9000/ s3 cp opt-125m/ s3://testbucket/opt-125m --recursive

- name: Create kind cluster
uses: helm/kind-action@0025e74a8c7512023d06dc019c617aa3cf561fde # v1.10.0

- name: Build the Docker image vllm cpu
run: docker buildx build -f Dockerfile.cpu -t vllm-cpu-env .

- name: Configuration of docker images, network and namespace for the kind cluster
run: |
docker pull amazon/aws-cli:2.6.4
kind load docker-image amazon/aws-cli:2.6.4 --name chart-testing
kind load docker-image vllm-cpu-env:latest --name chart-testing
docker network connect vllm-net "$(docker ps -aqf "name=chart-testing-control-plane")"
kubectl create ns ns-vllm

- name: Run chart-testing (install)
run: |
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
helm install --wait --wait-for-jobs --timeout 5m0s --debug --create-namespace --namespace=ns-vllm test-vllm examples/chart-helm -f examples/chart-helm/values.yaml --set secrets.s3endpoint=http://minio:9000 --set secrets.s3bucketname=testbucket --set secrets.s3accesskeyid=$AWS_ACCESS_KEY_ID --set secrets.s3accesskey=$AWS_SECRET_ACCESS_KEY --set resources.requests.cpu=1 --set resources.requests.memory=4Gi --set resources.limits.cpu=2 --set resources.limits.memory=5Gi --set image.env[0].name=VLLM_CPU_KVCACHE_SPACE --set image.env[1].name=VLLM_LOGGING_LEVEL --set-string image.env[0].value="1" --set-string image.env[1].value="DEBUG" --set-string extraInit.s3modelpath="opt-125m/" --set-string 'resources.limits.nvidia\.com/gpu=0' --set-string 'resources.requests.nvidia\.com/gpu=0' --set-string image.repository="vllm-cpu-env"

- name: curl test
run: |
kubectl -n ns-vllm port-forward service/test-vllm-service 8001:80 &
sleep 10
CODE="$(curl -v -f --location http://localhost:8001/v1/completions \
--header "Content-Type: application/json" \
--data '{
"model": "opt-125m",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'):$CODE"
echo "$CODE"
3 changes: 2 additions & 1 deletion Dockerfile.neuron
@@ -1,5 +1,6 @@
# default base image
ARG BASE_IMAGE="public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.20.0-ubuntu20.04"
# https://gallery.ecr.aws/neuron/pytorch-inference-neuronx
ARG BASE_IMAGE="public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.20.2-ubuntu20.04"

FROM $BASE_IMAGE

1 change: 1 addition & 0 deletions README.md
@@ -16,6 +16,7 @@ Easy, fast, and cheap LLM serving for everyone
---

*Latest News* 🔥
- [2024/12] vLLM joins [pytorch ecosystem](https://pytorch.org/blog/vllm-joins-pytorch)! Easy, Fast, and Cheap LLM Serving for Everyone!
- [2024/11] We hosted [the seventh vLLM meetup](https://lu.ma/h0qvrajz) with Snowflake! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1e3CxQBV3JsfGp30SwyvS3eM_tW-ghOhJ9PAJGK6KR54/edit?usp=sharing), and Snowflake team [here](https://docs.google.com/presentation/d/1qF3RkDAbOULwz9WK5TOltt2fE9t6uIc_hVNLFAaQX6A/edit?usp=sharing).
- [2024/10] We have just created a developer slack ([slack.vllm.ai](https://slack.vllm.ai)) focusing on coordinating contributions and discussing features. Please feel free to join us there!
- [2024/10] Ray Summit 2024 held a special track for vLLM! Please find the opening talk slides from the vLLM team [here](https://docs.google.com/presentation/d/1B_KQxpHBTRa_mDF-tR6i8rWdOU5QoTZNcEg2MKZxEHM/edit?usp=sharing). Learn more from the [talks](https://www.youtube.com/playlist?list=PLzTswPQNepXl6AQwifuwUImLPFRVpksjR) from other vLLM contributors and users!
11 changes: 4 additions & 7 deletions csrc/attention/paged_attention_v1.cu
@@ -140,13 +140,10 @@ void paged_attention_v1_launcher(
blocksparse_block_size, blocksparse_head_sliding_step);

#define CALL_V1_LAUNCHER_SPARSITY(T, CACHE_T, BLOCK_SIZE, IS_FP8_KV_CACHE) \
switch (is_block_sparse) { \
case true: \
CALL_V1_LAUNCHER(T, CACHE_T, BLOCK_SIZE, IS_FP8_KV_CACHE, true); \
break; \
case false: \
CALL_V1_LAUNCHER(T, CACHE_T, BLOCK_SIZE, IS_FP8_KV_CACHE, false); \
break; \
if (is_block_sparse) { \
CALL_V1_LAUNCHER(T, CACHE_T, BLOCK_SIZE, IS_FP8_KV_CACHE, true); \
} else { \
CALL_V1_LAUNCHER(T, CACHE_T, BLOCK_SIZE, IS_FP8_KV_CACHE, false); \
}

// NOTE(woosuk): To reduce the compilation time, we omitted block sizes
11 changes: 4 additions & 7 deletions csrc/attention/paged_attention_v2.cu
@@ -147,13 +147,10 @@ void paged_attention_v2_launcher(
blocksparse_head_sliding_step);

#define CALL_V2_LAUNCHER_SPARSITY(T, CACHE_T, BLOCK_SIZE, IS_FP8_KV_CACHE) \
switch (is_block_sparse) { \
case true: \
CALL_V2_LAUNCHER(T, CACHE_T, BLOCK_SIZE, IS_FP8_KV_CACHE, true); \
break; \
case false: \
CALL_V2_LAUNCHER(T, CACHE_T, BLOCK_SIZE, IS_FP8_KV_CACHE, false); \
break; \
if (is_block_sparse) { \
CALL_V2_LAUNCHER(T, CACHE_T, BLOCK_SIZE, IS_FP8_KV_CACHE, true); \
} else { \
CALL_V2_LAUNCHER(T, CACHE_T, BLOCK_SIZE, IS_FP8_KV_CACHE, false); \
}

// NOTE(woosuk): To reduce the compilation time, we omitted block sizes
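The two hunks above (paged_attention_v1.cu and paged_attention_v2.cu) replace a `switch` on a `bool` with an `if`/`else`, which is exactly what clang's `-Wswitch-bool` warning complains about. A minimal, self-contained sketch of the pattern — the names are illustrative, not vLLM code:

```cpp
// Sketch: clang++ -Wswitch-bool -c switch_bool.cpp
void launch(bool is_block_sparse);  // stand-in for the CALL_V*_LAUNCHER macros

void dispatch_old(bool is_block_sparse) {
  // clang: warning: switch condition has boolean value [-Wswitch-bool]
  switch (is_block_sparse) {
    case true:
      launch(true);
      break;
    case false:
      launch(false);
      break;
  }
}

void dispatch_new(bool is_block_sparse) {
  // Same control flow, no warning.
  if (is_block_sparse) {
    launch(true);
  } else {
    launch(false);
  }
}
```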
14 changes: 12 additions & 2 deletions csrc/cache_kernels.cu
@@ -307,10 +307,20 @@ void reshape_and_cache_flash(
torch::Tensor& key_cache, // [num_blocks, block_size, num_heads, head_size]
torch::Tensor&
value_cache, // [num_blocks, block_size, num_heads, head_size]
torch::Tensor& slot_mapping, // [num_tokens]
torch::Tensor& slot_mapping, // [num_tokens] or [num_actual_tokens]
const std::string& kv_cache_dtype, const double k_scale,
const double v_scale) {
int num_tokens = key.size(0);
// NOTE(woosuk): In vLLM V1, key.size(0) can be different from
// slot_mapping.size(0) because of padding for CUDA graphs.
// In vLLM V0, key.size(0) is always equal to slot_mapping.size(0) because
// both include padding.
// In vLLM V1, however, key.size(0) can be larger than slot_mapping.size(0)
// since key includes padding for CUDA graphs, while slot_mapping does not.
// In this case, slot_mapping.size(0) represents the actual number of tokens
// before padding.
// For compatibility with both cases, we use slot_mapping.size(0) as the
// number of tokens.
int num_tokens = slot_mapping.size(0);
int num_heads = key.size(1);
int head_size = key.size(2);
int block_size = key_cache.size(1);
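The NOTE comment in the hunk above carries the reasoning for this change: in V1, `key` is padded for CUDA graphs while `slot_mapping` is not, so `slot_mapping.size(0)` is the reliable token count. A small sketch of that shape relationship — the numbers and the helper are illustrative, not vLLM code:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sizes for one forward pass:
//   V0: both tensors include padding, so the two sizes match
//       (key rows = 512, slot_mapping rows = 512).
//   V1: key is padded up to the captured CUDA-graph batch size, but
//       slot_mapping only covers the real tokens
//       (key rows = 512, slot_mapping rows = 473).
int64_t num_tokens_to_cache(int64_t key_rows, int64_t slot_mapping_rows) {
  // slot_mapping never exceeds key; the padded tail of `key` has no slots
  // assigned and must not be written into the KV cache.
  assert(slot_mapping_rows <= key_rows);
  return slot_mapping_rows;  // correct for both V0 and V1
}
```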
2 changes: 1 addition & 1 deletion csrc/mamba/causal_conv1d/causal_conv1d.cu
@@ -424,7 +424,7 @@ void causal_conv1d_fwd_kernel(ConvParamsBase params) {
// and the one before it (chunk = n_chunks - 1 and chunk = n_chunks - 2),
// (which occurs when `final_state_position` is a non-positive index)
// we load the correct data from smem_exchange from both chunks, the last chunk iteration and the one before it
if (final_state_position < 0 && seqlen > kWidth){
if (conv_states != nullptr && final_state_position < 0 && seqlen > kWidth){
input_t vals_load[kNElts] = {0};
if ((chunk == n_chunks - 2) && (tidx == kNThreads - 1)){
// chunk = n_chunks - 2, a segment of the final state sits in the last index
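The one-line fix above simply guards the final-state store with `conv_states != nullptr`, so the kernel skips the store when the caller did not provide a state buffer. A minimal sketch of that guard pattern, with assumed names and types (not the real kernel):

```cpp
#include <cstddef>

// Sketch only: write the final convolution state if the caller passed a
// buffer for it. Without the nullptr check, the store would dereference a
// null pointer whenever the request does not track conv state.
void maybe_store_final_state(float* conv_states, const float* vals,
                             std::size_t state_len) {
  if (conv_states == nullptr) {
    return;  // no state buffer requested; skip the store entirely
  }
  for (std::size_t i = 0; i < state_len; ++i) {
    conv_states[i] = vals[i];
  }
}
```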
1 change: 1 addition & 0 deletions docs/requirements-docs.txt
@@ -16,5 +16,6 @@ mistral_common >= 1.5.0
aiohttp
starlette
openai # Required by docs/source/serving/openai_compatible_server.md's vllm.entrypoints.openai.cli_args
fastapi # Required by docs/source/serving/openai_compatible_server.md's vllm.entrypoints.openai.cli_args
partial-json-parser # Required by docs/source/serving/openai_compatible_server.md's vllm.entrypoints.openai.cli_args
requests
2 changes: 2 additions & 0 deletions docs/source/index.rst
@@ -82,6 +82,7 @@ Documentation
serving/openai_compatible_server
serving/deploying_with_docker
serving/deploying_with_k8s
serving/deploying_with_helm
serving/deploying_with_nginx
serving/distributed_serving
serving/metrics
@@ -102,6 +103,7 @@

usage/lora
usage/multimodal_inputs
usage/tool_calling
usage/structured_outputs
usage/spec_decode
usage/compatibility_matrix