Upstream merge 24 12 16 #330

Merged (102 commits) on Dec 16, 2024

Commits
5ed5d5f
Build tpu image in release pipeline (#10936)
richardsliu Dec 9, 2024
6faec54
[V1] Do not store `None` in self.generators (#11038)
WoosukKwon Dec 9, 2024
6d52528
[Docs] Add dedicated tool calling page to docs (#10554)
mgoin Dec 10, 2024
d1f6d1c
[Model] Add has_weight to RMSNorm and re-enable weights loading track…
Isotr0py Dec 10, 2024
391d7b2
[Bugfix] Fix usage of `deprecated` decorator (#11025)
DarkLight1337 Dec 10, 2024
980ad39
[Frontend] Use request id from header (#10968)
joerunde Dec 10, 2024
bc192a2
[Pixtral] Improve loading (#11040)
patrickvonplaten Dec 10, 2024
28b3a1c
[V1] Multiprocessing Tensor Parallel Support for v1 (#9856)
tlrmchlsmth Dec 10, 2024
ebf7780
monitor metrics of tokens per step using cudagraph batchsizes (#11031)
youkaichao Dec 10, 2024
e35879c
[Bugfix] Fix xgrammar failing to read a vocab_size from LlavaConfig o…
sjuxax Dec 10, 2024
bfd6104
Update README.md (#11034)
dmoliveira Dec 10, 2024
82c73fd
[Bugfix] cuda error running llama 3.2 (#11047)
GeneDer Dec 10, 2024
fe2e10c
Add example of helm chart for vllm deployment on k8s (#9199)
mfournioux Dec 10, 2024
beb16b2
[Bugfix] Handle <|tool_call|> token in granite tool parser (#11039)
tjohnson31415 Dec 10, 2024
d05f886
[Misc][LoRA] Add PEFTHelper for LoRA (#11003)
jeejeelee Dec 10, 2024
9b9cef3
[Bugfix] Backport request id validation to v0 (#11036)
joerunde Dec 10, 2024
250ee65
[BUG] Remove token param #10921 (#11022)
flaviabeo Dec 10, 2024
e739194
[Core] Update to outlines >= 0.1.8 (#10576)
russellb Dec 10, 2024
75f89dc
[torch.compile] add a flag to track batchsize statistics (#11059)
youkaichao Dec 10, 2024
134810b
[V1][Bugfix] Always set enable_chunked_prefill = True for V1 (#11061)
WoosukKwon Dec 10, 2024
9a93973
[Bugfix] Fix Mamba multistep (#11071)
tlrmchlsmth Dec 11, 2024
d5c5154
[Misc] LoRA + Chunked Prefill (#9057)
aurickq Dec 11, 2024
ffa48c9
[Model] PP support for Mamba-like models (#10992)
mzusman Dec 11, 2024
e39400a
Fix streaming for granite tool call when <|tool_call|> is present (#1…
maxdebayser Dec 11, 2024
2e33fe4
[CI/Build] Check transformers v4.47 (#10991)
DarkLight1337 Dec 11, 2024
3fb4b4f
[ci/build] Fix AMD CI dependencies (#11087)
khluu Dec 11, 2024
9974fca
[ci/build] Fix entrypoints test and pin outlines version (#11088)
khluu Dec 11, 2024
61b1d2f
[Core] v1: Use atexit to handle engine core client shutdown (#11076)
russellb Dec 11, 2024
2e32f5d
[Bugfix] Fix Idefics3 fails during multi-image inference (#11080)
B-201 Dec 11, 2024
40766ca
[Bugfix]: Clamp `-inf` logprob values in prompt_logprobs (#11073)
rafvasq Dec 11, 2024
8f10d5e
[Misc] Split up pooling tasks (#10820)
DarkLight1337 Dec 11, 2024
cad5c0a
[Doc] Update docs to refer to pooling models (#11093)
DarkLight1337 Dec 11, 2024
b2f7754
[CI/Build] Enable prefix caching test for AMD (#11098)
hissu-hyvarinen Dec 11, 2024
fd22220
[Doc] Installed version of llmcompressor for int8/fp8 quantization (#…
bingps Dec 11, 2024
91642db
[torch.compile] use depyf to dump torch.compile internals (#10972)
youkaichao Dec 11, 2024
d643c2a
[V1] Use input_ids as input for text-only models (#11032)
WoosukKwon Dec 11, 2024
66aaa77
[torch.compile] remove graph logging in ci (#11110)
youkaichao Dec 11, 2024
72ff3a9
[core] Bump ray to use _overlap_gpu_communication in compiled graph t…
ruisearch42 Dec 11, 2024
d1e21a9
[CI/Build] Split up VLM tests (#11083)
DarkLight1337 Dec 11, 2024
452a723
[V1][Core] Remove should_shutdown to simplify core process terminatio…
tlrmchlsmth Dec 11, 2024
4e11683
[V1] VLM preprocessor hashing (#11020)
alexm-neuralmagic Dec 12, 2024
7439a8b
[Bugfix] Multiple fixes to tool streaming with hermes and mistral (#1…
cedonley Dec 12, 2024
8fb26da
[Docs] Add media kit (#11121)
simon-mo Dec 12, 2024
24a36d6
Update link to LlamaStack remote vLLM guide in serving_with_llamastac…
terrytangyuan Dec 12, 2024
ccede2b
[Core] cleanup zmq ipc sockets on exit (#11115)
russellb Dec 12, 2024
1da8f0e
[Model] Add support for embedding model GritLM (#10816)
pooyadavoodi Dec 12, 2024
f092153
[V1] Use more persistent buffers to optimize input preparation overhe…
WoosukKwon Dec 12, 2024
8195824
[Hardware][Intel-Gaudi] Enable LoRA support for Intel Gaudi (HPU) (#1…
SanjuCSudhakaran Dec 12, 2024
62de37a
[core][distributed] initialization from StatelessProcessGroup (#10986)
youkaichao Dec 12, 2024
85362f0
[Misc][LoRA] Ensure Lora Adapter requests return adapter name (#11094)
Jeffwan Dec 12, 2024
4816d20
[V1] Fix torch profiling for offline inference (#11125)
ywang96 Dec 12, 2024
d4d5291
fix(docs): typo in helm install instructions (#11141)
ramonziai Dec 12, 2024
5d71257
[Bugfix] Quick fix to make Pixtral-HF load correctly again after 39e2…
sjuxax Dec 12, 2024
2c97eca
[Misc] Validate grammar and fail early (#11119)
comaniac Dec 12, 2024
9f3974a
Fix logging of the vLLM Config (#11143)
JArnoldAMD Dec 12, 2024
db6c264
[Bugfix] Fix value unpack error of simple connector for KVCache trans…
ShangmingCai Dec 12, 2024
78ed8f5
[Misc][V1] Fix type in v1 prefix caching (#11151)
comaniac Dec 13, 2024
30870b4
[torch.compile] Dynamic fp8 + rms_norm fusion (#10906)
ProExpertProg Dec 13, 2024
1efce68
[Bugfix] Use runner_type instead of task in GritLM (#11144)
pooyadavoodi Dec 13, 2024
3989a79
[Bugfix] Update starcoder2 to remap k/v scale names for kv_cache quan…
dsikka Dec 13, 2024
00c1bde
[ROCm][AMD] Disable auto enabling chunked prefill on ROCm (#11146)
gshtras Dec 13, 2024
34f1a80
[Bugfix][V1] Fix 'NoneType' object has no attribute 'hash_value' (#11…
comaniac Dec 13, 2024
be39e3c
[core] clean up cudagraph batchsize padding logic (#10996)
youkaichao Dec 13, 2024
7cd7409
PaliGemma 2 support (#11142)
janimo Dec 13, 2024
f93bf2b
[Bugfix][CI][CPU] add missing datasets package to requirements-cpu.tx…
bigPYJ1151 Dec 13, 2024
eeec9e3
[Frontend] Separate pooling APIs in offline inference (#11129)
DarkLight1337 Dec 13, 2024
969da7d
[V1][VLM] Fix edge case bug for InternVL2 (#11165)
ywang96 Dec 13, 2024
d1fa714
[Refactor]A simple device-related refactor (#11163)
noemotiovon Dec 13, 2024
c31d4a5
[Core] support LoRA and prompt adapter in content-based hashing for B…
llsj14 Dec 13, 2024
5b0ed83
[Bugfix] using len(tokenizer) instead of tokenizer.vocab_size in Allo…
zhangjf-nlp Dec 13, 2024
238c0d9
[Misc] Add tokenizer_mode param to benchmark_serving.py (#11174)
alexm-neuralmagic Dec 13, 2024
0920ab9
[Doc] Reorganize online pooling APIs (#11172)
DarkLight1337 Dec 13, 2024
0a56bcc
[Bugfix][Hardware][CPU] Enable Gemma2 with SDPA on CPU backend (#11169)
janimo Dec 13, 2024
0d8451c
[Distributed] Allow the placement group more time to wait for resourc…
Jeffwan Dec 13, 2024
4863e5f
[Core] V1: Use multiprocessing by default (#11074)
russellb Dec 14, 2024
4b5b8a6
[V1][Bugfix] Fix EngineCoreProc profile (#11185)
tlrmchlsmth Dec 14, 2024
9855aea
[Bugfix][V1] Re-compute an entire block when fully cache hit (#11186)
comaniac Dec 14, 2024
24a3d12
update compressed-tensors to latest version (#11183)
dhuangnm Dec 14, 2024
4825926
[Core] Update outlines and increase its threadpool size (#11140)
russellb Dec 14, 2024
ea7bd68
[V1][Bugfix] Fix V1 TP trust-remote-code (#11182)
tlrmchlsmth Dec 14, 2024
3cb5769
[Misc] Minor improvements to the readability of PunicaWrapperBase (#1…
jeejeelee Dec 14, 2024
9c3dadd
[Frontend] Add `logits_processors` as an extra completion argument (#…
bradhilton Dec 14, 2024
93abf23
[VLM] Fully dynamic prompt replacement in merged input processor (#11…
DarkLight1337 Dec 14, 2024
6d917d0
Enable mypy checking on V1 code (#11105)
markmc Dec 14, 2024
8869368
[Performance][Core] Optimize the performance of evictor v1 and v2 by …
llsj14 Dec 14, 2024
15859f2
[Misc] Upgrade bitsandbytes to the latest version 0.45.0 (#11201)
jeejeelee Dec 15, 2024
a1c0205
[torch.compile] allow tracking forward time (#11081)
youkaichao Dec 15, 2024
b10609e
[Misc] Clean up multi-modal processor (#11207)
DarkLight1337 Dec 15, 2024
96d673e
[Bugfix] Fix error handling of unsupported sliding window (#11213)
DarkLight1337 Dec 15, 2024
38e599d
[Doc] add documentation for disaggregated prefilling (#11197)
KuntaiDu Dec 15, 2024
d263bd9
[Core] Support disaggregated prefill with Mooncake Transfer Engine (#…
ShangmingCai Dec 15, 2024
25ebed2
[V1][Minor] Cache np arange to reduce input preparation overhead (#11…
WoosukKwon Dec 15, 2024
da6f409
Update deploying_with_k8s.rst (#10922)
AlexHe99 Dec 16, 2024
69ba344
[Bugfix] Fix block size validation (#10938)
chenqianfzh Dec 16, 2024
17138af
[Bugfix] Fix the default value for temperature in ChatCompletionReque…
yansh97 Dec 16, 2024
b3b1526
WIP: [CI/Build] simplify Dockerfile build for ARM64 / GH200 (#11212)
cennn Dec 16, 2024
bddbbcb
[Model] Support Cohere2ForCausalLM (Cohere R7B) (#11203)
janimo Dec 16, 2024
d927dbc
[Model] Refactor Ultravox to use merged input processor (#11198)
Isotr0py Dec 16, 2024
2ca830d
[Doc] Reorder vision language examples in alphabet order (#11228)
Isotr0py Dec 16, 2024
1a8e549
Merge remote-tracking branch 'upstream/main'
gshtras Dec 16, 2024
78440dc
Deprecating sync_openai
gshtras Dec 16, 2024
ddec133
Remove new irrelevant action
gshtras Dec 16, 2024
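The last three commits above are the fork-side sync itself: merging upstream/main, deprecating sync_openai, and removing an action that no longer applies. For reference only, a minimal sketch of how such an upstream sync is typically reproduced locally is shown below; the remote names (origin, upstream) and branch name (main) are assumptions, not taken from this PR.

```bash
# Hypothetical reproduction of an upstream sync like this PR.
# Remote and branch names (origin, upstream, main) are assumed, not from the PR.
git remote add upstream https://github.com/vllm-project/vllm.git || true
git fetch upstream
git checkout main
git merge upstream/main   # resolve conflicts, keeping fork-specific changes
git push origin main
```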
16 changes: 16 additions & 0 deletions .buildkite/release-pipeline.yaml
@@ -39,3 +39,19 @@ steps:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.1.0 --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT --target vllm-openai --progress plain ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"

- label: "Build and publish TPU release image"
depends_on: ~
if: build.env("NIGHTLY") == "1"
agents:
queue: tpu_queue_postmerge
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --tag vllm/vllm-tpu:nightly --tag vllm/vllm-tpu:$BUILDKITE_COMMIT --progress plain -f Dockerfile.tpu ."
- "docker push vllm/vllm-tpu:nightly"
- "docker push vllm/vllm-tpu:$BUILDKITE_COMMIT"
plugins:
- docker-login#v3.0.0:
username: vllm
password-env: DOCKERHUB_TOKEN
env:
DOCKER_BUILDKIT: "1"
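For anyone consuming the artifact this new step publishes, a hedged sketch of pulling and running the nightly TPU image follows. The image tags come from the step above; the run flags, serving command, and model are assumptions for a typical TPU VM, not part of this PR.

```bash
# Pull the nightly TPU image published by the step above (tags from the pipeline).
docker pull vllm/vllm-tpu:nightly

# Assumed invocation: --privileged and --net=host are commonly needed for TPU
# access from a container. The serving command and model are illustrative only;
# if the image defines its own entrypoint, drop the explicit command.
docker run --rm -it --privileged --net=host \
    vllm/vllm-tpu:nightly \
    python3 -m vllm.entrypoints.openai.api_server --model facebook/opt-125m
```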
47 changes: 30 additions & 17 deletions .buildkite/test-pipeline.yaml
@@ -181,14 +181,14 @@ steps:
commands:
- VLLM_USE_V1=1 pytest -v -s v1

- label: Examples Test # 15min
- label: Examples Test # 25min
working_dir: "/vllm-workspace/examples"
#mirror_hardwares: [amd]
source_file_dependencies:
- vllm/entrypoints
- examples/
commands:
- pip install awscli tensorizer # for llava example and tensorizer test
- pip install tensorizer # for tensorizer test
- python3 offline_inference.py
- python3 cpu_offload.py
- python3 offline_inference_chat.py
@@ -198,10 +198,13 @@ steps:
- python3 offline_inference_vision_language_multi_image.py
- python3 tensorize_vllm_model.py --model facebook/opt-125m serialize --serialized-directory /tmp/ --suffix v1 && python3 tensorize_vllm_model.py --model facebook/opt-125m deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors
- python3 offline_inference_encoder_decoder.py
- python3 offline_inference_classification.py
- python3 offline_inference_embedding.py
- python3 offline_inference_scoring.py
- python3 offline_profile.py --model facebook/opt-125m

- label: Prefix Caching Test # 9min
#mirror_hardwares: [amd]
mirror_hardwares: [amd]
source_file_dependencies:
- vllm/
- tests/prefix_caching
@@ -321,7 +324,7 @@ steps:

##### models test #####

- label: Basic Models Test # 30min
- label: Basic Models Test # 24min
source_file_dependencies:
- vllm/
- tests/models
@@ -331,7 +334,7 @@ steps:
- pytest -v -s models/test_registry.py
- pytest -v -s models/test_initialization.py

- label: Language Models Test (Standard) # 42min
- label: Language Models Test (Standard) # 32min
#mirror_hardwares: [amd]
source_file_dependencies:
- vllm/
@@ -342,7 +345,7 @@ steps:
- pytest -v -s models/decoder_only/language -m 'core_model or quant_model'
- pytest -v -s models/embedding/language -m core_model

- label: Language Models Test (Extended) # 50min
- label: Language Models Test (Extended) # 1h10min
optional: true
source_file_dependencies:
- vllm/
@@ -353,7 +356,7 @@ steps:
- pytest -v -s models/decoder_only/language -m 'not core_model and not quant_model'
- pytest -v -s models/embedding/language -m 'not core_model'

- label: Multi-Modal Models Test (Standard) # 26min
- label: Multi-Modal Models Test (Standard) # 28min
#mirror_hardwares: [amd]
source_file_dependencies:
- vllm/
@@ -369,7 +372,7 @@ steps:
- pytest -v -s models/encoder_decoder/language -m core_model
- pytest -v -s models/encoder_decoder/vision_language -m core_model

- label: Multi-Modal Models Test (Extended) # 1h15m
- label: Multi-Modal Models Test (Extended) 1 # 1h16m
optional: true
source_file_dependencies:
- vllm/
@@ -380,14 +383,24 @@ steps:
commands:
- pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
- pytest -v -s models/decoder_only/audio_language -m 'not core_model and not quant_model'
- pytest -v -s models/decoder_only/vision_language/test_models.py -m 'split(group=0) and not core_model and not quant_model'
# HACK - run phi3v tests separately to sidestep this transformers bug
# https://github.com/huggingface/transformers/issues/34307
- pytest -v -s models/decoder_only/vision_language/test_phi3v.py
- pytest -v -s --ignore models/decoder_only/vision_language/test_phi3v.py models/decoder_only/vision_language -m 'not core_model and not quant_model'
- pytest -v -s --ignore models/decoder_only/vision_language/test_models.py --ignore models/decoder_only/vision_language/test_phi3v.py models/decoder_only/vision_language -m 'not core_model and not quant_model'
- pytest -v -s models/embedding/vision_language -m 'not core_model'
- pytest -v -s models/encoder_decoder/language -m 'not core_model'
- pytest -v -s models/encoder_decoder/vision_language -m 'not core_model'

- label: Multi-Modal Models Test (Extended) 2 # 38m
optional: true
source_file_dependencies:
- vllm/
- tests/models/decoder_only/vision_language
commands:
- pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
- pytest -v -s models/decoder_only/vision_language/test_models.py -m 'split(group=1) and not core_model and not quant_model'

# This test is used only in PR development phase to test individual models and should never run on main
- label: Custom Models Test
optional: true
@@ -422,11 +435,11 @@ steps:
- tests/distributed/
commands:
- # the following commands are for the first node, with ip 192.168.10.10 (ray environment already set up)
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep -q 'Same node test passed'
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed'
- VLLM_MULTI_NODE=1 pytest -v -s distributed/test_multi_node_assignment.py
- VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py
- # the following commands are for the second node, with ip 192.168.10.11 (ray environment already set up)
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep -q 'Same node test passed'
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed'

- label: Distributed Tests (2 GPUs) # 40min
#mirror_hardwares: [amd]
@@ -445,12 +458,12 @@ steps:
commands:
- pytest -v -s ./compile/test_basic_correctness.py
- pytest -v -s ./compile/test_wrapper.py
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep -q 'Same node test passed'
- TARGET_TEST_SUITE=L4 pytest basic_correctness/ -v -s -m distributed_2_gpus
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same node test passed'
- TARGET_TEST_SUITE=L4 pytest basic_correctness/ -v -s -m 'distributed(num_gpus=2)'
# Avoid importing model tests that cause CUDA reinitialization error
- pytest models/encoder_decoder/language/test_bart.py -v -s -m distributed_2_gpus
- pytest models/encoder_decoder/vision_language/test_broadcast.py -v -s -m distributed_2_gpus
- pytest models/decoder_only/vision_language/test_models.py -v -s -m distributed_2_gpus
- pytest models/encoder_decoder/language/test_bart.py -v -s -m 'distributed(num_gpus=2)'
- pytest models/encoder_decoder/vision_language/test_broadcast.py -v -s -m 'distributed(num_gpus=2)'
- pytest models/decoder_only/vision_language/test_models.py -v -s -m 'distributed(num_gpus=2)'
- pytest -v -s spec_decode/e2e/test_integration_dist_tp2.py
- pip install -e ./plugins/vllm_add_dummy_model
- pytest -v -s distributed/test_distributed_oot.py
@@ -540,7 +553,7 @@ steps:
# see https://github.com/vllm-project/vllm/pull/5689 for details
- pytest -v -s distributed/test_custom_all_reduce.py
- torchrun --nproc_per_node=2 distributed/test_ca_buffer_sharing.py
- TARGET_TEST_SUITE=A100 pytest basic_correctness/ -v -s -m distributed_2_gpus
- TARGET_TEST_SUITE=A100 pytest basic_correctness/ -v -s -m 'distributed(num_gpus=2)'
- pytest -v -s -x lora/test_mixtral.py

- label: LM Eval Large Models # optional
3 changes: 2 additions & 1 deletion CMakeLists.txt
@@ -230,6 +230,7 @@ set(VLLM_EXT_SRC
"csrc/quantization/gptq/q_gemm.cu"
"csrc/quantization/compressed_tensors/int8_quant_kernels.cu"
"csrc/quantization/fp8/common.cu"
"csrc/quantization/fused_kernels/fused_layernorm_dynamic_per_token_quant.cu"
"csrc/quantization/gguf/gguf_kernel.cu"
"csrc/cuda_utils_kernels.cu"
"csrc/prepare_inputs/advance_step.cu"
@@ -334,7 +335,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
#
# For the cutlass_scaled_mm kernels we want to build the c2x (CUTLASS 2.x)
# kernels for the remaining archs that are not already built for 3x.
cuda_archs_loose_intersection(SCALED_MM_2X_ARCHS
cuda_archs_loose_intersection(SCALED_MM_2X_ARCHS
"7.5;8.0;8.6;8.7;8.9;9.0" "${CUDA_ARCHS}")
# subtract out the archs that are already built for 3x
list(REMOVE_ITEM SCALED_MM_2X_ARCHS ${SCALED_MM_3X_ARCHS})
40 changes: 32 additions & 8 deletions Dockerfile
@@ -11,6 +11,7 @@ ARG CUDA_VERSION=12.4.1
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04 AS base
ARG CUDA_VERSION=12.4.1
ARG PYTHON_VERSION=3.12
ARG TARGETPLATFORM
ENV DEBIAN_FRONTEND=noninteractive

# Install Python and other dependencies
@@ -46,9 +47,14 @@ WORKDIR /workspace
# install build and runtime dependencies
COPY requirements-common.txt requirements-common.txt
COPY requirements-cuda.txt requirements-cuda.txt
COPY requirements-cuda-arm64.txt requirements-cuda-arm64.txt
RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip install -r requirements-cuda.txt

RUN --mount=type=cache,target=/root/.cache/pip \
if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
python3 -m pip install -r requirements-cuda-arm64.txt; \
fi

# cuda arch list used by torch
# can be useful for both `dev` and `test`
@@ -63,13 +69,19 @@ ENV VLLM_FA_CMAKE_GPU_ARCHES=${vllm_fa_cmake_gpu_arches}

#################### WHEEL BUILD IMAGE ####################
FROM base AS build
ARG TARGETPLATFORM

# install build dependencies
COPY requirements-build.txt requirements-build.txt

RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip install -r requirements-build.txt

RUN --mount=type=cache,target=/root/.cache/pip \
if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
python3 -m pip install -r requirements-cuda-arm64.txt; \
fi

COPY . .
ARG GIT_REPO_CHECK=0
RUN --mount=type=bind,source=.git,target=.git \
@@ -134,15 +146,18 @@ COPY requirements-test.txt requirements-test.txt
COPY requirements-dev.txt requirements-dev.txt
RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip install -r requirements-dev.txt

#################### DEV IMAGE ####################

#################### vLLM installation IMAGE ####################
# image with vLLM installed
FROM nvidia/cuda:${CUDA_VERSION}-base-ubuntu22.04 AS vllm-base
ARG CUDA_VERSION=12.4.1
ARG PYTHON_VERSION=3.12
WORKDIR /vllm-workspace
ENV DEBIAN_FRONTEND=noninteractive
ARG TARGETPLATFORM

COPY requirements-cuda-arm64.txt requirements-cuda-arm64.txt

RUN PYTHON_VERSION_STR=$(echo ${PYTHON_VERSION} | sed 's/\.//g') && \
echo "export PYTHON_VERSION_STR=${PYTHON_VERSION_STR}" >> /etc/environment
@@ -168,18 +183,25 @@ RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \
# or future versions of triton.
RUN ldconfig /usr/local/cuda-$(echo $CUDA_VERSION | cut -d. -f1,2)/compat/

# install vllm wheel first, so that torch etc will be installed
# Install vllm wheel first, so that torch etc will be installed.
RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist \
--mount=type=cache,target=/root/.cache/pip \
python3 -m pip install dist/*.whl --verbose

RUN --mount=type=cache,target=/root/.cache/pip \
. /etc/environment && \
python3 -m pip install https://github.com/flashinfer-ai/flashinfer/releases/download/v0.1.6/flashinfer-0.1.6+cu121torch2.4-cp${PYTHON_VERSION_STR}-cp${PYTHON_VERSION_STR}-linux_x86_64.whl
if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
pip uninstall -y torch && \
python3 -m pip install -r requirements-cuda-arm64.txt; \
fi

RUN --mount=type=cache,target=/root/.cache/pip \
. /etc/environment && \
if [ "$TARGETPLATFORM" != "linux/arm64" ]; then \
python3 -m pip install https://github.com/flashinfer-ai/flashinfer/releases/download/v0.1.6/flashinfer-0.1.6+cu121torch2.4-cp${PYTHON_VERSION_STR}-cp${PYTHON_VERSION_STR}-linux_x86_64.whl; \
fi
COPY examples examples
#################### vLLM installation IMAGE ####################


#################### TEST IMAGE ####################
# image to run unit testing suite
# note that this uses vllm installed by `pip`
@@ -209,7 +231,6 @@ COPY vllm/v1 /usr/local/lib/python3.12/dist-packages/vllm/v1
RUN mkdir test_docs
RUN mv docs test_docs/
RUN mv vllm test_docs/

#################### TEST IMAGE ####################

#################### OPENAI API SERVER ####################
@@ -218,8 +239,11 @@ FROM vllm-base AS vllm-openai

# install additional dependencies for openai api server
RUN --mount=type=cache,target=/root/.cache/pip \
pip install accelerate hf_transfer 'modelscope!=1.15.0' 'bitsandbytes>=0.44.0' timm==0.9.10

if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
pip install accelerate hf_transfer 'modelscope!=1.15.0' 'bitsandbytes>=0.42.0' 'timm==0.9.10'; \
else \
pip install accelerate hf_transfer 'modelscope!=1.15.0' 'bitsandbytes>=0.45.0' 'timm==0.9.10'; \
fi
ENV VLLM_USAGE_SOURCE production-docker-image

ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
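The TARGETPLATFORM conditionals added above are what the simplified ARM64 / GH200 build relies on. A minimal sketch of driving them with docker buildx is shown below; the builder name, image tag, and build-arg value are assumptions, while --target vllm-openai and --build-arg max_jobs mirror the release pipeline earlier in this diff. Cross-building arm64 on an x86 host additionally requires QEMU binfmt emulation.

```bash
# Assumed cross-build of the ARM64 / GH200 image; buildx sets TARGETPLATFORM
# automatically from --platform, which triggers the arm64 branches above.
docker buildx create --use --name vllm-arm64 2>/dev/null || true
docker buildx build \
    --platform linux/arm64 \
    --build-arg max_jobs=16 \
    --target vllm-openai \
    --tag vllm-openai:arm64-local \
    --load .
```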
5 changes: 5 additions & 0 deletions README.md
@@ -16,6 +16,7 @@ Easy, fast, and cheap LLM serving for everyone
---

*Latest News* 🔥
- [2024/12] vLLM joins [pytorch ecosystem](https://pytorch.org/blog/vllm-joins-pytorch)! Easy, Fast, and Cheap LLM Serving for Everyone!
- [2024/11] We hosted [the seventh vLLM meetup](https://lu.ma/h0qvrajz) with Snowflake! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1e3CxQBV3JsfGp30SwyvS3eM_tW-ghOhJ9PAJGK6KR54/edit?usp=sharing), and Snowflake team [here](https://docs.google.com/presentation/d/1qF3RkDAbOULwz9WK5TOltt2fE9t6uIc_hVNLFAaQX6A/edit?usp=sharing).
- [2024/10] We have just created a developer slack ([slack.vllm.ai](https://slack.vllm.ai)) focusing on coordinating contributions and discussing features. Please feel free to join us there!
- [2024/10] Ray Summit 2024 held a special track for vLLM! Please find the opening talk slides from the vLLM team [here](https://docs.google.com/presentation/d/1B_KQxpHBTRa_mDF-tR6i8rWdOU5QoTZNcEg2MKZxEHM/edit?usp=sharing). Learn more from the [talks](https://www.youtube.com/playlist?list=PLzTswPQNepXl6AQwifuwUImLPFRVpksjR) from other vLLM contributors and users!
@@ -133,3 +134,7 @@ If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs
* For coordinating contributions and development, please use Slack.
* For security disclosures, please use Github's security advisory feature.
* For collaborations and partnerships, please contact us at vllm-questions AT lists.berkeley.edu.

## Media Kit

* If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit).
12 changes: 12 additions & 0 deletions benchmarks/benchmark_serving.py
@@ -781,6 +781,7 @@ def main(args: argparse.Namespace):
backend = args.backend
model_id = args.model
tokenizer_id = args.tokenizer if args.tokenizer is not None else args.model
tokenizer_mode = args.tokenizer_mode

if args.base_url is not None:
api_url = f"{args.base_url}{args.endpoint}"
@@ -790,6 +791,7 @@ def main(args: argparse.Namespace):
base_url = f"http://{args.host}:{args.port}"

tokenizer = get_tokenizer(tokenizer_id,
tokenizer_mode=tokenizer_mode,
trust_remote_code=args.trust_remote_code)

if args.dataset is not None:
@@ -1210,5 +1212,15 @@ def main(args: argparse.Namespace):
"from the sampled HF dataset.",
)

parser.add_argument(
'--tokenizer-mode',
type=str,
default="auto",
choices=['auto', 'slow', 'mistral'],
help='The tokenizer mode.\n\n* "auto" will use the '
'fast tokenizer if available.\n* "slow" will '
'always use the slow tokenizer. \n* '
'"mistral" will always use the `mistral_common` tokenizer.')

args = parser.parse_args()
main(args)
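The new --tokenizer-mode flag is passed straight through to get_tokenizer, so a benchmark against a Mistral-family endpoint can now request the mistral_common tokenizer from the command line. A hedged usage sketch follows; only --tokenizer-mode comes from this diff, while the model name, server address, and dataset flags are illustrative and must match however the target server was launched.

```bash
# Assumed benchmark invocation using the new --tokenizer-mode flag.
# Model, host/port, and dataset settings are illustrative, not from the PR.
python3 benchmarks/benchmark_serving.py \
    --backend vllm \
    --host 127.0.0.1 --port 8000 \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --tokenizer-mode mistral \
    --dataset-name random \
    --num-prompts 200
```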