Upstream merge 24 10 28 #248

Merged: 105 commits, Oct 29, 2024

Commits
9d9186b
[Frontend] Reduce frequency of client cancellation checking (#7959)
njhill Oct 21, 2024
d621c43
[doc] fix format (#9562)
youkaichao Oct 21, 2024
15713e3
[BugFix] Update draft model TP size check to allow matching target TP…
njhill Oct 21, 2024
711f3a7
[Frontend] Don't log duplicate error stacktrace for every request in …
wallashss Oct 21, 2024
575dceb
[CI] Make format checker error message more user-friendly by using em…
KuntaiDu Oct 21, 2024
ef7faad
:bug: Fixup more test failures from memory profiling (#9563)
joerunde Oct 22, 2024
76a5e13
[core] move parallel sampling out from vllm core (#9302)
youkaichao Oct 22, 2024
b729901
[Bugfix]: serialize config by value for --trust-remote-code (#6751)
tjohnson31415 Oct 22, 2024
f085995
[CI/Build] Remove unnecessary `fork_new_process` (#9484)
DarkLight1337 Oct 22, 2024
29acd2c
[Bugfix][OpenVINO] fix_dockerfile_openvino (#9552)
ngrozae Oct 22, 2024
7469242
[Bugfix]: phi.py get rope_theta from config file (#9503)
Falko1 Oct 22, 2024
c029221
[CI/Build] Replaced some models on tests for smaller ones (#9570)
wallashss Oct 22, 2024
ca30c3c
[Core] Remove evictor_v1 (#9572)
KuntaiDu Oct 22, 2024
f7db5f0
[Doc] Use shell code-blocks and fix section headers (#9508)
rafvasq Oct 22, 2024
0d02747
support TP in qwen2 bnb (#9574)
chenqianfzh Oct 22, 2024
3ddbe25
[Hardware][CPU] using current_platform.is_cpu (#9536)
wangshuai09 Oct 22, 2024
6c5af09
[V1] Implement vLLM V1 [1/N] (#9289)
WoosukKwon Oct 22, 2024
a48e3ec
[CI/Build][LoRA] Temporarily fix long context failure issue (#9579)
jeejeelee Oct 22, 2024
9dbcce8
[Neuron] [Bugfix] Fix neuron startup (#9374)
xendo Oct 22, 2024
bb392ea
[Model][VLM] Initialize support for Mono-InternVL model (#9528)
Isotr0py Oct 22, 2024
08075c3
[Bugfix] Eagle: change config name for fc bias (#9580)
gopalsarda Oct 22, 2024
32a1ee7
[Hardware][Intel CPU][DOC] Update docs for CPU backend (#6212)
zhouyuan Oct 22, 2024
434984e
[Frontend] Support custom request_id from request (#9550)
guoyuhong Oct 22, 2024
cd5601a
[BugFix] Prevent exporting duplicate OpenTelemetry spans (#9017)
ronensc Oct 22, 2024
17c79f3
[torch.compile] auto infer dynamic_arg_dims from type annotation (#9589)
youkaichao Oct 22, 2024
23b899a
[Bugfix] fix detokenizer shallow copy (#5919)
aurickq Oct 22, 2024
cb6fdaa
[Misc] Make benchmarks use EngineArgs (#9529)
JArnoldAMD Oct 22, 2024
d1e8240
[Bugfix] Fix spurious "No compiled cutlass_scaled_mm ..." for W8A8 on…
LucasWilkinson Oct 22, 2024
b17046e
[BugFix] Fix metrics error for --num-scheduler-steps > 1 (#8234)
yuleil Oct 22, 2024
208cb34
[Doc]: Update tensorizer docs to include vllm[tensorizer] (#7889)
sethkimmel3 Oct 22, 2024
65050a4
[Bugfix] Generate exactly input_len tokens in benchmark_throughput (#…
heheda12345 Oct 23, 2024
29061ed
[Misc] Add an env var VLLM_LOGGING_PREFIX, if set, it will be prepend…
sfc-gh-zhwang Oct 23, 2024
831540c
[Model] Support E5-V (#9576)
DarkLight1337 Oct 23, 2024
51c24c9
[Build] Fix `FetchContent` multiple build issue (#9596)
ProExpertProg Oct 23, 2024
2394962
[Hardware][XPU] using current_platform.is_xpu (#9605)
MengqingCao Oct 23, 2024
3ff57eb
[Model] Initialize Florence-2 language backbone support (#9555)
Isotr0py Oct 23, 2024
c18e1a3
[VLM] Enable overriding whether post layernorm is used in vision enco…
DarkLight1337 Oct 23, 2024
31a08f5
[Model] Add min_pixels / max_pixels to Qwen2VL as mm_processor_kwargs…
alex-jw-brooks Oct 23, 2024
e7116c0
[Bugfix] Fix `_init_vision_model` in NVLM_D model (#9611)
DarkLight1337 Oct 23, 2024
dbdd3b5
[misc] comment to avoid future confusion about baichuan (#9620)
youkaichao Oct 23, 2024
e5ac6a4
[Bugfix] Fix divide by zero when serving Mamba models (#9617)
tlrmchlsmth Oct 23, 2024
fd0e2cf
[Misc] Separate total and output tokens in benchmark_throughput.py (#…
mgoin Oct 23, 2024
9013e24
[torch.compile] Adding torch compile annotations to some models (#9614)
CRZbulabula Oct 23, 2024
150b779
[Frontend] Enable Online Multi-image Support for MLlama (#9393)
alex-jw-brooks Oct 23, 2024
fc6c274
[Model] Add Qwen2-Audio model support (#9248)
faychu Oct 23, 2024
b548d7a
[CI/Build] Add bot to close stale issues and PRs (#9436)
russellb Oct 23, 2024
bb01f29
[Bugfix][Model] Fix Mllama SDPA illegal memory access for batched mul…
mgoin Oct 24, 2024
b7df53c
[Bugfix] Use "vision_model" prefix for MllamaVisionModel (#9628)
mgoin Oct 24, 2024
33bab41
[Bugfix]: Make chat content text allow type content (#9358)
vrdn-23 Oct 24, 2024
056a68c
[XPU] avoid triton import for xpu (#9440)
yma11 Oct 24, 2024
836e8ef
[Bugfix] Fix PP for ChatGLM and Molmo (#9422)
DarkLight1337 Oct 24, 2024
3770071
[V1][Bugfix] Clean up requests when aborted (#9629)
WoosukKwon Oct 24, 2024
4fdc581
[core] simplify seq group code (#9569)
youkaichao Oct 24, 2024
8a02cd0
[torch.compile] Adding torch compile annotations to some models (#9639)
CRZbulabula Oct 24, 2024
295a061
[Kernel] add kernel for FATReLU (#9610)
jeejeelee Oct 24, 2024
ad6f780
[torch.compile] expanding support and fix allgather compilation (#9637)
CRZbulabula Oct 24, 2024
b979143
[Doc] Move additional tips/notes to the top (#9647)
DarkLight1337 Oct 24, 2024
f584549
[Bugfix]Disable the post_norm layer of the vision encoder for LLaVA m…
litianjian Oct 24, 2024
de662d3
Increase operation per run limit for "Close inactive issues and PRs" …
hmellor Oct 24, 2024
d27cfbf
[torch.compile] Adding torch compile annotations to some models (#9641)
CRZbulabula Oct 24, 2024
c866e00
[CI/Build] Fix VLM test failures when using transformers v4.46 (#9666)
DarkLight1337 Oct 24, 2024
722d46e
[Model] Compute Llava Next Max Tokens / Dummy Data From Gridpoints (#…
alex-jw-brooks Oct 24, 2024
e26d37a
[Log][Bugfix] Fix default value check for `image_url.detail` (#9663)
mgoin Oct 24, 2024
5944909
[Performance][Kernel] Fused_moe Performance Improvement (#9384)
charlifu Oct 24, 2024
c91ed47
[Bugfix] Remove xformers requirement for Pixtral (#9597)
mgoin Oct 24, 2024
9f7b4ba
[ci/Build] Skip Chameleon for transformers 4.46.0 on broadcast test #…
khluu Oct 25, 2024
a6f3721
[Model] add a lora module for granite 3.0 MoE models (#9673)
willmj Oct 25, 2024
9645b9f
[V1] Support sliding window attention (#9679)
WoosukKwon Oct 25, 2024
ca0d922
[Bugfix] Fix compressed_tensors_moe bad config.strategy (#9677)
mgoin Oct 25, 2024
228cfbd
[Doc] Improve quickstart documentation (#9256)
rafvasq Oct 25, 2024
6567e13
[Bugfix] Fix crash with llama 3.2 vision models and guided decoding (…
tjohnson31415 Oct 25, 2024
067e77f
[Bugfix] Steaming continuous_usage_stats default to False (#9709)
samos123 Oct 26, 2024
5cbdccd
[Hardware][openvino] is_openvino --> current_platform.is_openvino (#9…
MengqingCao Oct 26, 2024
55137e8
Fix: MI100 Support By Bypassing Custom Paged Attention (#9560)
MErkinSag Oct 26, 2024
07e981f
[Frontend] Bad words sampling parameter (#9717)
Alvant Oct 26, 2024
6650e6a
[Model] Add classification Task with Qwen2ForSequenceClassification …
kakao-kevin-us Oct 26, 2024
67a6882
[Misc] SpecDecodeWorker supports profiling (#9719)
Abatom Oct 27, 2024
8549c82
[core] cudagraph output with tensor weak reference (#9724)
youkaichao Oct 27, 2024
3cb07a3
[Misc] Upgrade to pytorch 2.5 (#9588)
bnellnm Oct 27, 2024
e130c40
Fix cache management in "Close inactive issues and PRs" actions workf…
hmellor Oct 27, 2024
34a9941
[Bugfix] Fix load config when using bools (#9533)
madt2709 Oct 27, 2024
4e2d95e
[Hardware][ROCM] using current_platform.is_rocm (#9642)
wangshuai09 Oct 28, 2024
32176fe
[torch.compile] support moe models (#9632)
youkaichao Oct 28, 2024
feb92fb
Fix beam search eos (#9627)
robertgshaw2-neuralmagic Oct 28, 2024
2adb440
[Bugfix] Fix ray instance detect issue (#9439)
yma11 Oct 28, 2024
8b0e4f2
[CI/Build] Adopt Mergify for auto-labeling PRs (#9259)
russellb Oct 28, 2024
5f8d807
[Model][VLM] Add multi-video support for LLaVA-Onevision (#8905)
litianjian Oct 28, 2024
aa0addb
Adding "torch compile" annotations to moe models (#9758)
CRZbulabula Oct 28, 2024
97b61bf
[misc] avoid circular import (#9765)
youkaichao Oct 28, 2024
76ed534
[torch.compile] add deepseek v2 compile (#9775)
youkaichao Oct 28, 2024
b20aa29
Merge remote-tracking branch 'upstream/main'
gshtras Oct 28, 2024
c0eb092
Fix for dynamic quantization of the vision part of llama 3.2; Fix for…
gshtras Oct 28, 2024
c5d7fb9
[Doc] fix third-party model example (#9771)
russellb Oct 29, 2024
7a4df5f
[Model][LoRA]LoRA support added for Qwen (#9622)
jeejeelee Oct 29, 2024
e74f2d4
[Doc] Specify async engine args in docs (#9726)
DarkLight1337 Oct 29, 2024
eae3d48
[Bugfix] Use temporary directory in registry (#9721)
DarkLight1337 Oct 29, 2024
ef7865b
[Frontend] re-enable multi-modality input in the new beam search impl…
FerdinandZhong Oct 29, 2024
09500f7
[Model] Add BNB quantization support for Mllama (#9720)
Isotr0py Oct 29, 2024
2454f4a
Fix support for non quantized visual layers in otherwise quantized ml…
gshtras Oct 29, 2024
622b7ab
[Hardware] using current_platform.seed_everything (#9785)
wangshuai09 Oct 29, 2024
a23a23c
Reorganize imports; Restrict additional supported tensors in _scaled_…
gshtras Oct 29, 2024
b0a8c5d
Merge remote-tracking branch 'upstream/main' into upstream_merge_24_1…
gshtras Oct 29, 2024
ab3f100
Merge remote-tracking branch 'origin/partially_quantized_mllama_fix' …
gshtras Oct 29, 2024
64f51a5
Merge remote-tracking branch 'origin/main' into upstream_merge_24_10_28
gshtras Oct 29, 2024
cfd7388
fix is_hip
gshtras Oct 29, 2024
Files changed
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8 -b "auto" -l 1000 -f 5 -t 1
model_name: "neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.356
  - name: "exact_match,flexible-extract"
    value: 0.358
limit: 1000
num_fewshot: 5
2 changes: 1 addition & 1 deletion .buildkite/lm-eval-harness/configs/models-small.txt
@@ -1,6 +1,6 @@
 Meta-Llama-3-8B-Instruct.yaml
 Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
-Meta-Llama-3-8B-Instruct-INT8-compressed-tensors.yaml
+Meta-Llama-3.2-1B-Instruct-INT8-compressed-tensors.yaml
 Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml
 Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
 Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
57 changes: 57 additions & 0 deletions .github/mergify.yml
@@ -0,0 +1,57 @@
pull_request_rules:
- name: label-documentation
  description: Automatically apply documentation label
  conditions:
    - or:
      - files~=^[^/]+\.md$
      - files~=^docs/
  actions:
    label:
      add:
        - documentation

- name: label-ci-build
  description: Automatically apply ci/build label
  conditions:
    - files~=^\.github/
    - files~=\.buildkite/
    - files~=^cmake/
    - files=CMakeLists.txt
    - files~=^Dockerfile
    - files~=^requirements.*\.txt
    - files=setup.py
  actions:
    label:
      add:
        - ci/build

- name: label-frontend
  description: Automatically apply frontend label
  conditions:
    - files~=^vllm/entrypoints/
  actions:
    label:
      add:
        - frontend

- name: ping author on conflicts and add 'needs-rebase' label
  conditions:
    - conflict
    - -closed
  actions:
    label:
      add:
        - needs-rebase
    comment:
      message: |
        This pull request has merge conflicts that must be resolved before it can be
        merged. @{{author}} please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

- name: remove 'needs-rebase' label when conflict is resolved
  conditions:
    - -conflict
    - -closed
  actions:
    label:
      remove:
        - needs-rebase
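The needs-rebase rule above asks PR authors to rebase when conflicts appear. As a rough sketch only, not repository policy, the usual flow looks like the following (the remote name upstream and the branch name my-feature are assumptions for illustration):

# Illustrative rebase flow for a conflicted PR branch
git fetch upstream                      # assumes an 'upstream' remote is configured
git checkout my-feature                 # hypothetical PR branch name
git rebase upstream/main
# resolve conflicts, git add the fixed files, git rebase --continue, then:
git push --force-with-lease origin my-feature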
52 changes: 52 additions & 0 deletions .github/workflows/stale.yml
@@ -0,0 +1,52 @@
name: 'Close inactive issues and PRs'

on:
  schedule:
    # Daily at 1:30 AM UTC
    - cron: '30 1 * * *'

jobs:
  close-issues-and-pull-requests:
    permissions:
      issues: write
      pull-requests: write
      actions: write
    runs-on: ubuntu-latest
    steps:
      - uses: actions/stale@28ca1036281a5e5922ead5184a1bbf96e5fc984e # v9.0.0
        with:
          # Increasing this value ensures that changes to this workflow
          # propagate to all issues and PRs in days rather than months
          operations-per-run: 1000

          exempt-draft-pr: true
          exempt-issue-labels: 'keep-open'
          exempt-pr-labels: 'keep-open'

          labels-to-add-when-unstale: 'unstale'
          labels-to-remove-when-stale: 'unstale'

          days-before-issue-stale: 90
          days-before-issue-close: 30
          stale-issue-label: 'stale'
          stale-issue-message: >
            This issue has been automatically marked as stale because it has not
            had any activity within 90 days. It will be automatically closed if no
            further activity occurs within 30 days. Leave a comment if
            you feel this issue should remain open. Thank you!
          close-issue-message: >
            This issue has been automatically closed due to inactivity. Please
            feel free to reopen if you feel it is still relevant. Thank you!

          days-before-pr-stale: 90
          days-before-pr-close: 30
          stale-pr-label: 'stale'
          stale-pr-message: >
            This pull request has been automatically marked as stale because it
            has not had any activity within 90 days. It will be automatically
            closed if no further activity occurs within 30 days. Leave a comment
            if you feel this pull request should remain open. Thank you!
          close-pr-message: >
            This pull request has been automatically closed due to inactivity.
            Please feel free to reopen if you intend to continue working on it.
            Thank you!
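Once this workflow runs, the 'stale' and 'unstale' labels it manages can be inspected from the command line. A purely illustrative check, assuming the GitHub CLI is installed and authenticated against this repository:

# List open issues and PRs currently labeled as stale
gh issue list --label stale --state open
gh pr list --label stale --state open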
20 changes: 11 additions & 9 deletions CMakeLists.txt
@@ -49,7 +49,7 @@ set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx940;gfx941;gfx942;gfx1030;gfx11
 # requirements.txt files and should be kept consistent. The ROCm torch
 # versions are derived from Dockerfile.rocm
 #
-set(TORCH_SUPPORTED_VERSION_CUDA "2.4.0")
+set(TORCH_SUPPORTED_VERSION_CUDA "2.5.0")
 set(TORCH_SUPPORTED_VERSION_ROCM "2.5.0")

 #
@@ -196,12 +196,12 @@ endif()

 #
 # Use FetchContent for C++ dependencies that are compiled as part of vLLM's build process.
-# Configure it to place files in vllm/.deps, in order to play nicely with sccache.
+# setup.py will override FETCHCONTENT_BASE_DIR to play nicely with sccache.
+# Each dependency that produces build artifacts should override its BINARY_DIR to avoid
+# conflicts between build types. It should instead be set to ${CMAKE_BINARY_DIR}/<dependency>.
 #
 include(FetchContent)
-get_filename_component(PROJECT_ROOT_DIR "${CMAKE_CURRENT_SOURCE_DIR}" ABSOLUTE)
-set(FETCHCONTENT_BASE_DIR "${PROJECT_ROOT_DIR}/.deps")
-file(MAKE_DIRECTORY ${FETCHCONTENT_BASE_DIR}) # Ensure the directory exists
+file(MAKE_DIRECTORY "${FETCHCONTENT_BASE_DIR}")
 message(STATUS "FetchContent base directory: ${FETCHCONTENT_BASE_DIR}")

 #
@@ -229,7 +229,6 @@ set(VLLM_EXT_SRC
 "csrc/quantization/compressed_tensors/int8_quant_kernels.cu"
 "csrc/quantization/fp8/common.cu"
 "csrc/cuda_utils_kernels.cu"
-"csrc/moe_align_block_size_kernels.cu"
 "csrc/prepare_inputs/advance_step.cu"
 "csrc/torch_bindings.cpp")

@@ -286,7 +285,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
 message(STATUS "Building Marlin kernels for archs: ${MARLIN_ARCHS}")
 else()
 message(STATUS "Not building Marlin kernels as no compatible archs found"
-"in CUDA target architectures")
+" in CUDA target architectures")
 endif()

 #
@@ -444,6 +443,7 @@ target_compile_definitions(_C PRIVATE CUTLASS_ENABLE_DIRECT_CUDA_DRIVER_CALL=1)

 set(VLLM_MOE_EXT_SRC
 "csrc/moe/torch_bindings.cpp"
+"csrc/moe/moe_align_sum_kernels.cu"
 "csrc/moe/topk_softmax_kernels.cu")

 set_gencode_flags_for_srcs(
@@ -471,7 +471,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
 message(STATUS "Building Marlin MOE kernels for archs: ${MARLIN_MOE_ARCHS}")
 else()
 message(STATUS "Not building Marlin MOE kernels as no compatible archs found"
-"in CUDA target architectures")
+" in CUDA target architectures")
 endif()
 endif()

@@ -549,8 +549,10 @@ else()
 FetchContent_Declare(
 vllm-flash-attn
 GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
-GIT_TAG 013f0c4fc47e6574060879d9734c1df8c5c273bd
+GIT_TAG 5259c586c403a4e4d8bf69973c159b40cc346fb9
 GIT_PROGRESS TRUE
+# Don't share the vllm-flash-attn build between build types
+BINARY_DIR ${CMAKE_BINARY_DIR}/vllm-flash-attn
 )
 endif()
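For context on the FetchContent changes above: the new comments say setup.py overrides FETCHCONTENT_BASE_DIR so sccache stays effective, while each dependency pins its own BINARY_DIR to keep build types separate. A minimal sketch of such an override when invoking CMake by hand (the paths and build directory are illustrative assumptions, not the values setup.py actually passes):

# FETCHCONTENT_BASE_DIR is a standard CMake cache variable, so a caller can redirect
# where FetchContent checks out and builds third-party sources.
cmake -S . -B build/release \
  -DCMAKE_BUILD_TYPE=Release \
  -DFETCHCONTENT_BASE_DIR=/path/to/vllm/.deps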
8 changes: 4 additions & 4 deletions Dockerfile.openvino
@@ -15,11 +15,11 @@ RUN --mount=type=bind,source=.git,target=.git \
 if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi

 # install build requirements
-RUN PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" python3 -m pip install -r /workspace/vllm/requirements-build.txt
+RUN PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" python3 -m pip install -r /workspace/requirements-build.txt
 # build vLLM with OpenVINO backend
-RUN PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" VLLM_TARGET_DEVICE="openvino" python3 -m pip install /workspace/vllm/
+RUN PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" VLLM_TARGET_DEVICE="openvino" python3 -m pip install /workspace

-COPY examples/ /workspace/vllm/examples
-COPY benchmarks/ /workspace/vllm/benchmarks
+COPY examples/ /workspace/examples
+COPY benchmarks/ /workspace/benchmarks

 CMD ["/bin/bash"]