sync with 0.7.1 #308

Merged: 464 commits (Feb 7, 2025)
Commits
a3a3ee4
[Misc] Merge bitsandbytes_stacked_params_mapping and packed_modules_…
jeejeelee Jan 14, 2025
42f5e7c
[Kernel] Support MulAndSilu (#11624)
jeejeelee Jan 15, 2025
1a51b9f
[HPU][Bugfix] Don't use /dev/accel/accel0 for HPU autodetection in se…
kzawora-intel Jan 15, 2025
9ddac56
[Platform] move current_memory_usage() into platform (#11369)
shen-shanshan Jan 15, 2025
b7ee940
[V1][BugFix] Fix edge case in VLM scheduling (#12065)
WoosukKwon Jan 15, 2025
0794e74
[Misc] Add multipstep chunked-prefill support for FlashInfer (#10467)
elfiegg Jan 15, 2025
f218f9c
[core] Turn off GPU communication overlap for Ray executor (#12051)
ruisearch42 Jan 15, 2025
ad34c0d
[core] platform agnostic executor via collective_rpc (#11256)
youkaichao Jan 15, 2025
3f9b7ab
[Doc] Update examples to remove SparseAutoModelForCausalLM (#12062)
kylesayrs Jan 15, 2025
994fc65
[V1][Prefix Cache] Move the logic of num_computed_tokens into KVCache…
heheda12345 Jan 15, 2025
cbe9439
Fix: cases with empty sparsity config (#12057)
rahul-tuli Jan 15, 2025
ad388d2
Type-fix: make execute_model output type optional (#12020)
youngkent Jan 15, 2025
3adf0ff
[Platform] Do not raise error if _Backend is not found (#12023)
wangxiyuan Jan 15, 2025
97eb97b
[Model]: Support internlm3 (#12037)
RunningLeon Jan 15, 2025
5ecf3e0
Misc: allow to use proxy in `HTTPConnection` (#12042)
zhouyuan Jan 15, 2025
de0526f
[Misc][Quark] Upstream Quark format to VLLM (#10765)
kewang-xlnx Jan 15, 2025
57e729e
[Doc]: Update `OpenAI-Compatible Server` documents (#12082)
maang-h Jan 15, 2025
edce722
[Bugfix] use right truncation for non-generative tasks (#12050)
joerunde Jan 15, 2025
70755e8
[V1][Core] Autotune encoder cache budget (#11895)
ywang96 Jan 15, 2025
ebd8c66
[Bugfix] Fix _get_lora_device for HQQ marlin (#12090)
varun-sundar-rabindranath Jan 15, 2025
cd9d06f
Allow hip sources to be directly included when compiling for rocm. (#…
tvirolai-amd Jan 15, 2025
fa0050d
[Core] Default to using per_token quantization for fp8 when cutlass i…
elfiegg Jan 16, 2025
f8ef146
[Doc] Add documentation for specifying model architecture (#12105)
DarkLight1337 Jan 16, 2025
9aa1519
Various cosmetic/comment fixes (#12089)
mgoin Jan 16, 2025
dd7c9ad
[Bugfix] Remove hardcoded `head_size=256` for Deepseek v2 and v3 (#12…
Isotr0py Jan 16, 2025
bf53e0c
Support torchrun and SPMD-style offline inference (#12071)
youkaichao Jan 16, 2025
92e793d
[core] LLM.collective_rpc interface and RLHF example (#12084)
youkaichao Jan 16, 2025
874f7c2
[Bugfix] Fix max image feature size for Llava-one-vision (#12104)
ywang96 Jan 16, 2025
5fd24ec
[misc] Add LoRA kernel micro benchmarks (#11579)
varun-sundar-rabindranath Jan 16, 2025
62b06ba
[Model] Add support for deepseek-vl2-tiny model (#12068)
Isotr0py Jan 16, 2025
d06e824
[Bugfix] Set enforce_eager automatically for mllama (#12127)
heheda12345 Jan 16, 2025
ebc73f2
[Bugfix] Fix a path bug in disaggregated prefill example script. (#12…
KuntaiDu Jan 17, 2025
fead53b
[CI]add genai-perf benchmark in nightly benchmark (#10704)
jikunshang Jan 17, 2025
1475847
[Doc] Add instructions on using Podman when SELinux is active (#12136)
terrytangyuan Jan 17, 2025
b8bfa46
[Bugfix] Fix issues in CPU build Dockerfile (#12135)
terrytangyuan Jan 17, 2025
d1adb9b
[BugFix] add more `is not None` check in VllmConfig.__post_init__ (#1…
heheda12345 Jan 17, 2025
d75ab55
[Misc] Add deepseek_vl2 chat template (#12143)
Isotr0py Jan 17, 2025
8027a72
[ROCm][MoE] moe tuning support for rocm (#12049)
divakar-amd Jan 17, 2025
69d765f
[V1] Move more control of kv cache initialization from model_executor…
heheda12345 Jan 17, 2025
07934cc
[Misc][LoRA] Improve the readability of LoRA error messages (#12102)
jeejeelee Jan 17, 2025
d4e6194
[CI/Build][CPU][Bugfix] Fix CPU CI (#12150)
bigPYJ1151 Jan 17, 2025
87a0c07
[core] allow callable in collective_rpc (#12151)
youkaichao Jan 17, 2025
58fd57f
[Bugfix] Fix score api for missing max_model_len validation (#12119)
wallashss Jan 17, 2025
54cacf0
[Bugfix] Mistral tokenizer encode accept list of str (#12149)
jikunshang Jan 17, 2025
b5b57e3
[AMD][FP8] Using MI300 FP8 format on ROCm for block_quant (#12134)
gshtras Jan 17, 2025
7b98a65
[torch.compile] disable logging when cache is disabled (#12043)
youkaichao Jan 17, 2025
2b83503
[misc] fix cross-node TP (#12166)
youkaichao Jan 18, 2025
c09503d
[AMD][CI/Build][Bugfix] use pytorch stale wheel (#12172)
hongxiayang Jan 18, 2025
da02cb4
[core] further polish memory profiling (#12126)
youkaichao Jan 18, 2025
813f249
[Docs] Fix broken link in SECURITY.md (#12175)
russellb Jan 18, 2025
02798ec
[Model] Port deepseek-vl2 processor, remove dependency (#12169)
Isotr0py Jan 18, 2025
6d0e3d3
[core] clean up executor class hierarchy between v1 and v0 (#12171)
youkaichao Jan 18, 2025
32eb0da
[Misc] Support register quantization method out-of-tree (#11969)
ice-tong Jan 19, 2025
7a8a48d
[V1] Collect env var for usage stats (#12115)
simon-mo Jan 19, 2025
4e94951
[BUGFIX] Move scores to float32 in case of running xgrammar on cpu (#…
madamczykhabana Jan 19, 2025
630eb5b
[Bugfix] Fix multi-modal processors for transformers 4.48 (#12187)
DarkLight1337 Jan 19, 2025
e66faf4
[torch.compile] store inductor compiled Python file (#12182)
youkaichao Jan 19, 2025
936db11
benchmark_serving support --served-model-name param (#12109)
gujingit Jan 19, 2025
edaae19
[Misc] Add BNB support to GLM4-V model (#12184)
Isotr0py Jan 19, 2025
81763c5
[V1] Add V1 support of Qwen2-VL (#12128)
ywang96 Jan 19, 2025
bbe5f9d
[Model] Support for fairseq2 Llama (#11442)
MartinGleize Jan 19, 2025
df450aa
[Bugfix] Fix num_heads value for simple connector when tp enabled (#1…
ShangmingCai Jan 20, 2025
51ef828
[torch.compile] fix sym_tensor_indices (#12191)
youkaichao Jan 20, 2025
3ea7b94
Move linting to `pre-commit` (#11975)
hmellor Jan 20, 2025
c5c0620
[DOC] Fix typo in docstring and assert message (#12194)
terrytangyuan Jan 20, 2025
d264312
[DOC] Add missing docstring in LLMEngine.add_request() (#12195)
terrytangyuan Jan 20, 2025
0974c9b
[Bugfix] Fix incorrect types in LayerwiseProfileResults (#12196)
terrytangyuan Jan 20, 2025
8360979
[Model] Add Qwen2 PRM model support (#12202)
Isotr0py Jan 20, 2025
59a0192
[Core] Interface for accessing model from `VllmRunner` (#10353)
DarkLight1337 Jan 20, 2025
5c89a29
[misc] add placeholder format.sh (#12206)
youkaichao Jan 20, 2025
4001ea1
[CI/Build] Remove dummy CI steps (#12208)
DarkLight1337 Jan 20, 2025
3127e97
[CI/Build] Make pre-commit faster (#12212)
DarkLight1337 Jan 20, 2025
b37d827
[Model] Upgrade Aria to transformers 4.48 (#12203)
DarkLight1337 Jan 20, 2025
170eb35
[misc] print a message to suggest how to bypass commit hooks (#12217)
youkaichao Jan 20, 2025
c222f47
[core][bugfix] configure env var during import vllm (#12209)
youkaichao Jan 20, 2025
5f0ec39
[V1] Remove `_get_cache_block_size` (#12214)
heheda12345 Jan 20, 2025
86bfb6d
[Misc] Pass `attention` to impl backend (#12218)
wangxiyuan Jan 20, 2025
18572e3
[Bugfix] Fix `HfExampleModels.find_hf_info` (#12223)
DarkLight1337 Jan 20, 2025
9666369
[CI] Pass local python version explicitly to pre-commit mypy.sh (#12224)
heheda12345 Jan 20, 2025
7bd3630
[Misc] Update CODEOWNERS (#12229)
ywang96 Jan 20, 2025
af69a6a
fix: update platform detection for M-series arm based MacBook process…
isikhi Jan 20, 2025
da75122
[misc] add cuda runtime version to usage data (#12190)
youkaichao Jan 21, 2025
06a760d
[bugfix] catch xgrammar unsupported array constraints (#12210)
Jason-CKY Jan 21, 2025
750f4ca
[Kernel] optimize moe_align_block_size for cuda graph and large num_e…
jinzhen-lin Jan 21, 2025
ecf6781
Add quantization and guided decoding CODEOWNERS (#12228)
mgoin Jan 21, 2025
d4b62d4
[AMD][Build] Porting dockerfiles from the ROCm/vllm fork (#11777)
gshtras Jan 21, 2025
5fe6bf2
[BugFix] Fix GGUF tp>1 when vocab_size is not divisible by 64 (#12230)
NickLucche Jan 21, 2025
2fc6944
[ci/build] disable failed and flaky tests (#12240)
youkaichao Jan 21, 2025
9691255
[Misc] Rename `MultiModalInputsV2 -> MultiModalInputs` (#12244)
DarkLight1337 Jan 21, 2025
1f1542a
[Misc]Add BNB quantization for PaliGemmaForConditionalGeneration (#1…
jeejeelee Jan 21, 2025
f2e9f2a
[Misc] Remove redundant TypeVar from base model (#12248)
DarkLight1337 Jan 21, 2025
a94eee4
[Bugfix] Fix mm_limits access for merged multi-modal processor (#12252)
DarkLight1337 Jan 21, 2025
c81081f
[torch.compile] transparent compilation with more logging (#12246)
youkaichao Jan 21, 2025
b197a5c
[V1][Bugfix] Fix data item ordering in mixed-modality inference (#12259)
ywang96 Jan 21, 2025
9a7c3a0
Remove pytorch comments for outlines + compressed-tensors (#12260)
tdoublep Jan 21, 2025
c646128
[Platform] improve platforms getattr (#12264)
MengqingCao Jan 21, 2025
3aec49e
[ci/build] update nightly torch for gh200 test (#12270)
youkaichao Jan 21, 2025
9705b90
[Bugfix] fix race condition that leads to wrong order of token return…
joennlae Jan 21, 2025
1e60f87
[Kernel] fix moe_align_block_size error condition (#12239)
jinzhen-lin Jan 21, 2025
132a132
[v1][stats][1/n] Add RequestStatsUpdate and RequestStats types (#10907)
rickyyx Jan 21, 2025
18fd4a8
[Bugfix] Multi-sequence broken (#11898)
andylolu2 Jan 21, 2025
347eeeb
[Misc] Remove experimental dep from tracing.py (#12007)
codefromthecrypt Jan 21, 2025
fa9ee08
[Misc] Set default backend to SDPA for get_vit_attn_backend (#12235)
wangxiyuan Jan 21, 2025
9c485d9
[Core] Free CPU pinned memory on environment cleanup (#10477)
janimo Jan 21, 2025
2acba47
[bugfix] moe tuning. rm is_navi() (#12273)
divakar-amd Jan 21, 2025
69196a9
[BUGFIX] When skip_tokenize_init and multistep are set, execution cra…
maleksan85 Jan 21, 2025
09ccc9c
[Documentation][AMD] Add information about prebuilt ROCm vLLM docker …
hongxiayang Jan 21, 2025
df76e5a
[VLM] Simplify post-processing of replacement info (#12269)
DarkLight1337 Jan 22, 2025
64ea24d
[ci/lint] Add back default arg for pre-commit (#12279)
khluu Jan 22, 2025
016e367
[CI] add docker volume prune to neuron CI (#12291)
liangfu Jan 22, 2025
cbdc4ad
[Ci/Build] Fix mypy errors on main (#12296)
DarkLight1337 Jan 22, 2025
222a9dc
[Benchmark] More accurate TPOT calc in `benchmark_serving.py` (#12288)
njhill Jan 22, 2025
66818e5
[core] separate builder init and builder prepare for each batch (#12253)
youkaichao Jan 22, 2025
4004f14
[Build] update requirements of no-device (#12299)
MengqingCao Jan 22, 2025
68ad4e3
[Core] Support fully transparent sleep mode (#11743)
youkaichao Jan 22, 2025
cd7b6f0
[VLM] Avoid unnecessary tokenization (#12310)
DarkLight1337 Jan 22, 2025
528dbca
[Model][Bugfix]: correct Aria model output (#12309)
xffxff Jan 22, 2025
16366ee
[Bugfix][VLM] Fix mixed-modality inference backward compatibility for…
ywang96 Jan 22, 2025
6609cdf
[Doc] Add docs for prompt replacement (#12318)
DarkLight1337 Jan 22, 2025
fc66dee
[Misc] Fix the error in the tip for the --lora-modules parameter (#12…
WangErXiao Jan 22, 2025
84bee4b
[Misc] Improve the readability of BNB error messages (#12320)
jeejeelee Jan 22, 2025
96f6a75
[Bugfix] Fix HPU multiprocessing executor (#12167)
kzawora-intel Jan 22, 2025
7206ce4
[Core] Support `reset_prefix_cache` (#12284)
comaniac Jan 22, 2025
aea9436
[Frontend][V1] Online serving performance improvements (#12287)
njhill Jan 22, 2025
68c4421
[AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is brok…
rasmith Jan 23, 2025
8d7aa9d
[Bugfix] Fixing AMD LoRA CI test. (#12329)
Alexei-V-Ivanov-AMD Jan 23, 2025
01a5594
[Docs] Update FP8 KV Cache documentation (#12238)
mgoin Jan 23, 2025
7551a34
[Docs] Document vulnerability disclosure process (#12326)
russellb Jan 23, 2025
f0ef372
[V1] Add `uncache_blocks` (#12333)
comaniac Jan 23, 2025
5116274
[doc] explain common errors around torch.compile (#12340)
youkaichao Jan 23, 2025
8ae5ff2
[Hardware][Gaudi][BugFix] Fix dataclass error due to triton package u…
zhenwei-intel Jan 23, 2025
c5b4b11
[Bugfix] Fix k_proj's bias for whisper self attention (#12342)
Isotr0py Jan 23, 2025
978b45f
[Kernel] Flash Attention 3 Support (#12093)
LucasWilkinson Jan 23, 2025
d07efb3
[Doc] Troubleshooting errors during model inspection (#12351)
DarkLight1337 Jan 23, 2025
99d01a5
[V1] Simplify M-RoPE (#12352)
ywang96 Jan 23, 2025
8c01b80
[Bugfix] Fix broken internvl2 inference with v1 (#12360)
Isotr0py Jan 23, 2025
3f50c14
[core] add wake_up doc and some sanity check (#12361)
youkaichao Jan 23, 2025
6e650f5
[torch.compile] decouple compile sizes and cudagraph sizes (#12243)
youkaichao Jan 23, 2025
e97f802
[FP8][Kernel] Dynamic kv cache scaling factors computation (#11906)
gshtras Jan 23, 2025
2c85529
[TPU] Update TPU CI to use torchxla nightly on 20250122 (#12334)
lsy323 Jan 23, 2025
2cbeeda
[Docs] Document Phi-4 support (#12362)
Isotr0py Jan 23, 2025
eb5cb5e
[BugFix] Fix parameter names and `process_after_weight_loading` for W…
dsikka Jan 23, 2025
9726ad6
[Misc] Fix OpenAI API Compatibility Issues in Benchmark Script (#12357)
jsato8094 Jan 23, 2025
682b55b
[Docs] Add meetup slides (#12345)
WoosukKwon Jan 23, 2025
c5cffcd
[Docs] Update spec decode + structured output in compat matrix (#12373)
russellb Jan 24, 2025
24b0205
[V1][Frontend] Coalesce bunched `RequestOutput`s (#12298)
njhill Jan 24, 2025
d3d6bb1
Set weights_only=True when using torch.load() (#12366)
russellb Jan 24, 2025
5e5630a
[Bugfix] Path join when building local path for S3 clone (#12353)
omer-dayan Jan 24, 2025
55ef66e
Update compressed-tensors version (#12367)
dsikka Jan 24, 2025
0e74d79
[V1] Increase default batch size for H100/H200 (#12369)
WoosukKwon Jan 24, 2025
6dd94db
[perf] fix perf regression from #12253 (#12380)
youkaichao Jan 24, 2025
3c818bd
[Misc] Use VisionArena Dataset for VLM Benchmarking (#12389)
ywang96 Jan 24, 2025
c7c9851
[ci/build] fix wheel size check (#12396)
youkaichao Jan 24, 2025
9a0f3bd
[Hardware][Gaudi][Doc] Add missing step in setup instructions (#12382)
MohitIntel Jan 24, 2025
e784c6b
[ci/build] sync default value for wheel size (#12398)
youkaichao Jan 24, 2025
3bb8e2c
[Misc] Enable proxy support in benchmark script (#12356)
jsato8094 Jan 24, 2025
ab5bbf5
[Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build (#12375)
LucasWilkinson Jan 24, 2025
df5dafa
[Misc] Remove deprecated code (#12383)
DarkLight1337 Jan 24, 2025
3132a93
[Bugfix][Kernel] FA3 Fix - RuntimeError: This flash attention build o…
LucasWilkinson Jan 24, 2025
221d388
[Bugfix][Kernel] Fix moe align block issue for mixtral (#12413)
ElizaWszola Jan 25, 2025
fb30ee9
[Bugfix] Fix BLIP-2 processing (#12412)
DarkLight1337 Jan 25, 2025
bf21481
[ROCm][MoE] MI300 tuned configs Mixtral-8x(7B,22B) | fp16, fp8 (#12408)
divakar-amd Jan 25, 2025
f1fc051
[Misc] Add FA2 support to ViT MHA layer (#12355)
Isotr0py Jan 25, 2025
324960a
[TPU][CI] Update torchxla version in requirement-tpu.txt (#12422)
lsy323 Jan 25, 2025
2a0309a
[Misc][Bugfix] FA3 support to ViT MHA layer (#12435)
ywang96 Jan 26, 2025
fa63e71
[V1][Perf] Reduce scheduling overhead in model runner after cuda sync…
youngkent Jan 26, 2025
0ee349b
[V1][Bugfix] Fix assertion when mm hashing is turned off (#12439)
ywang96 Jan 26, 2025
a525527
[Misc] Revert FA on ViT #12355 and #12435 (#12445)
ywang96 Jan 26, 2025
9ddc352
[Frontend] generation_config.json for maximum tokens(#12242)
mhendrey Jan 26, 2025
aa2cd2c
[Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 (#12417)
tlrmchlsmth Jan 26, 2025
72f4880
[Bugfix/CI] Fix broken kernels/test_mha.py (#12450)
tlrmchlsmth Jan 26, 2025
68f1114
[Bugfix][Kernel] Fix perf regression caused by PR #12405 (#12434)
LucasWilkinson Jan 26, 2025
72bac73
[Build/CI] Fix libcuda.so linkage (#12424)
tlrmchlsmth Jan 26, 2025
0034b09
[Frontend] Rerank API (Jina- and Cohere-compatible API) (#12376)
K-Mistele Jan 27, 2025
582cf78
[DOC] Add link to vLLM blog (#12460)
terrytangyuan Jan 27, 2025
28e0750
[V1] Avoid list creation in input preparation (#12457)
WoosukKwon Jan 27, 2025
0cc6b38
[Frontend] Support scores endpoint in run_batch (#12430)
pooyadavoodi Jan 27, 2025
5204ff5
[Bugfix] Fix Granite 3.0 MoE model loading (#12446)
DarkLight1337 Jan 27, 2025
372bf08
[Bugfix] Fix missing seq_start_loc in xformers prefill metadata (#12464)
Isotr0py Jan 27, 2025
624a1e4
[V1][Minor] Minor optimizations for update_from_output (#12454)
WoosukKwon Jan 27, 2025
ce69f7f
[Bugfix] Fix gpt2 GGUF inference (#12467)
Isotr0py Jan 27, 2025
103bd17
[Build] Only build 9.0a for scaled_mm and sparse kernels (#12339)
LucasWilkinson Jan 27, 2025
01ba927
[V1][Metrics] Add initial Prometheus logger (#12416)
markmc Jan 27, 2025
3f1fc74
[V1][CI/Test] Do basic test for top-p & top-k sampling (#12469)
WoosukKwon Jan 27, 2025
2bc3fbb
[FlashInfer] Upgrade to 0.2.0 (#11194)
abmfy Jan 27, 2025
6116ca8
[Feature] [Spec decode]: Enable MLPSpeculator/Medusa and `prompt_logp…
NickLucche Jan 27, 2025
823ab79
Update `pre-commit` hooks (#12475)
hmellor Jan 28, 2025
ddee88d
[Neuron][Kernel] NKI-based flash-attention kernel with paged KV cache…
liangfu Jan 28, 2025
426a5c3
Fix bad path in prometheus example (#12481)
mgoin Jan 28, 2025
23a7cbc
[CI/Build] Fixed the xla nightly issue report in #12451 (#12453)
hosseinsarshar Jan 28, 2025
0f465ab
[FEATURE] Enables offline /score for embedding models (#12021)
gmarinho2 Jan 28, 2025
dd66fd2
[CI] fix pre-commit error (#12494)
MengqingCao Jan 28, 2025
8cbc424
Update README.md with V1 alpha release (#12495)
ywang96 Jan 28, 2025
e29d435
[V1] Include Engine Version in Logs (#12496)
robertgshaw2-redhat Jan 28, 2025
2079e43
[Core] Make raw_request optional in ServingCompletion (#12503)
schoennenbeck Jan 28, 2025
8be5fb0
Sync with upstream @ v0.7.0
dtrifiro Jan 27, 2025
335a434
Dockerfile.ubi: fix build
dtrifiro Jan 28, 2025
8f58a51
[VLM] Merged multi-modal processor and V1 support for Qwen-VL (#12504)
DarkLight1337 Jan 28, 2025
925d2f1
[Doc] Fix typo for x86 CPU installation (#12514)
waltforme Jan 28, 2025
3fd1fb6
[V1][Metrics] Hook up IterationStats for Prometheus metrics (#12478)
markmc Jan 28, 2025
0f657bd
Replace missed warning_once for rerank API (#12472)
mgoin Jan 28, 2025
f26d790
Do not run `suggestion` `pre-commit` hook multiple times (#12521)
hmellor Jan 28, 2025
c386c43
[V1][Metrics] Add per-request prompt/generation_tokens histograms (#1…
markmc Jan 28, 2025
80fcc3e
[Kernel] Pipe attn_logits_soft_cap through paged attention TPU kernel…
fenghuizhang Jan 28, 2025
fbb5bd4
[TPU] Add example for profiling TPU inference (#12531)
mgoin Jan 29, 2025
a7e3eba
[Frontend] Support reasoning content for deepseek r1 (#12473)
gaocegege Jan 29, 2025
dd6a3a0
[Doc] Convert docs to use colon fences (#12471)
hmellor Jan 29, 2025
46fb056
[V1][Metrics] Add TTFT and TPOT histograms (#12530)
markmc Jan 29, 2025
bd02164
Bugfix for whisper quantization due to fake k_proj bias (#12524)
mgoin Jan 29, 2025
5f671cb
[V1] Improve Error Message for Unsupported Config (#12535)
robertgshaw2-redhat Jan 29, 2025
ef001d9
Fix the pydantic logging validator (#12420)
maxdebayser Jan 29, 2025
036ca94
[Bugfix] handle alignment of arguments in convert_sparse_cross_attent…
tjohnson31415 Jan 29, 2025
d93bf4d
[Model] Refactoring of MiniCPM-V and add MiniCPM-o-2.6 support for vL…
HwwwwwwwH Jan 29, 2025
ff7424f
[Frontend] Support override generation config in args (#12409)
liuyanyi Jan 29, 2025
b02fd28
[Hardware][NV] Fix Modelopt model loading for k-v-scales for Llama mo…
pavanimajety Jan 29, 2025
27b78c7
[Kernel] add triton fused moe kernel for gptq/awq (#12185)
jinzhen-lin Jan 29, 2025
73aa6cf
Revert "[Build/CI] Fix libcuda.so linkage" (#12552)
tlrmchlsmth Jan 29, 2025
e0cc5f2
[V1][BugFix] Free encoder cache for aborted requests (#12545)
WoosukKwon Jan 29, 2025
1c1bb0b
[Misc][MoE] add Deepseek-V3 moe tuning support (#12558)
divakar-amd Jan 30, 2025
f17f1d4
[V1][Metrics] Add GPU cache usage % gauge (#12561)
markmc Jan 30, 2025
a276903
Set `?device={device}` when changing tab in installation guides (#12560)
hmellor Jan 30, 2025
41bf561
[Misc] fix typo: add missing space in lora adapter error message (#12…
Beim Jan 30, 2025
9b0c4ba
[Kernel] Triton Configs for Fp8 Block Quantization (#11589)
robertgshaw2-redhat Jan 30, 2025
bd2107e
[CPU][PPC] Updated torch, torchvision, torchaudio dependencies (#12555)
npanpaliya Jan 30, 2025
4078052
[V1][Log] Add max request concurrency log to V1 (#12569)
mgoin Jan 30, 2025
9798b2f
[Kernel] Update `cutlass_scaled_mm` to support 2d group (blockwise) s…
LucasWilkinson Jan 31, 2025
a1fc18c
[ROCm][AMD][Model] llama 3.2 support upstreaming (#12421)
maleksan85 Jan 31, 2025
cabaf4e
[Attention] MLA decode optimizations (#12528)
LucasWilkinson Jan 31, 2025
7a8987d
[Bugfix] Gracefully handle huggingface hub http error (#12571)
ywang96 Jan 31, 2025
e3f7ff6
Add favicon to docs (#12611)
hmellor Jan 31, 2025
325f679
[BugFix] Fix Torch.Compile For DeepSeek (#12594)
robertgshaw2-redhat Jan 31, 2025
847f883
[Git] Automatically sign-off commits (#12595)
comaniac Jan 31, 2025
60bcef0
[Docs][V1] Prefix caching design (#12598)
comaniac Jan 31, 2025
89003c4
[v1][Bugfix] Add extra_keys to block_hash for prefix caching (#12603)
heheda12345 Jan 31, 2025
415f194
[release] Add input step to ask for Release version (#12631)
khluu Jan 31, 2025
145c2ff
[Bugfix] Revert MoE Triton Config Default (#12629)
robertgshaw2-redhat Jan 31, 2025
eb5741a
[Kernel][Quantization] Integrate block-quantized CUTLASS kernels for …
tlrmchlsmth Jan 31, 2025
fc54214
[Feature] Fix guided decoding blocking bitmask memcpy (#12563)
xpbowler Jan 31, 2025
60808bd
[Doc] Improve installation signposting (#12575)
hmellor Jan 31, 2025
44bbca7
[Doc] int4 w4a16 example (#12585)
brian-dellabetta Jan 31, 2025
b1340f9
[V1] Bugfix: Validate Model Input Length (#12600)
robertgshaw2-redhat Feb 1, 2025
cb3e73e
[BugFix] fix wrong output when using lora and num_scheduler_steps=8 (…
sleepwalker2017 Feb 1, 2025
1867c25
Fix target matching for fused layers with compressed-tensors (#12617)
eldarkurtic Feb 1, 2025
35b7a05
[ci] Upgrade transformers to 4.48.2 in CI dependencies (#12599)
khluu Feb 1, 2025
cfa134d
[Bugfix/CI] Fixup benchmark_moe.py (#12562)
tlrmchlsmth Feb 1, 2025
3e1c76c
Fix: Respect `sparsity_config.ignore` in Cutlass Integration (#12517)
rahul-tuli Feb 1, 2025
baeded2
[Attention] Deepseek v3 MLA support with FP8 compute (#12601)
LucasWilkinson Feb 1, 2025
1e36983
[CI/Build] Add label automation for structured-output, speculative-de…
russellb Feb 1, 2025
4f4d427
Disable chunked prefill and/or prefix caching when MLA is enabled (#…
simon-mo Feb 1, 2025
0a63af8
Sync with upstream @ v0.7.1
dtrifiro Feb 4, 2025

Files changed

7 changes: 5 additions & 2 deletions .buildkite/check-wheel-size.py
@@ -2,8 +2,11 @@
import sys
import zipfile

# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 250 MB
VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 250))
# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 300 MiB
# Note that we have 400 MiB quota, please use it wisely.
# See https://github.com/pypi/support/issues/3792 .
# Please also sync the value with the one in Dockerfile.
VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 300))


def print_top_10_largest_files(zip_file):
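
The limit is read from the environment, so it can be raised or lowered without editing the script. A minimal usage sketch, assuming the script takes the path to the built wheel (or the directory holding it) as its command-line argument, which is outside the lines shown in this hunk:

    # Hypothetical local invocation: temporarily raise the limit for an
    # oversized debug build; 300 is the default set above.
    VLLM_MAX_SIZE_MB=350 python3 .buildkite/check-wheel-size.py dist/
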
7 changes: 4 additions & 3 deletions .buildkite/nightly-benchmarks/benchmark-pipeline.yaml
@@ -1,5 +1,6 @@
steps:
- label: "Wait for container to be ready"
key: wait-for-container-image
agents:
queue: A100
plugins:
@@ -10,12 +11,11 @@ steps:
command:
- sh .buildkite/nightly-benchmarks/scripts/wait-for-image.sh

- wait

- label: "A100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: A100
depends_on: wait-for-container-image
plugins:
- kubernetes:
podSpec:
@@ -49,6 +49,7 @@ steps:
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H200
depends_on: wait-for-container-image
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
@@ -73,7 +74,7 @@ steps:
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H100
depends_on: block-h100
depends_on: wait-for-container-image
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
2 changes: 1 addition & 1 deletion .buildkite/nightly-benchmarks/scripts/nightly-annotate.sh
@@ -43,7 +43,7 @@ main() {



# The figures should be genereated by a separate process outside the CI/CD pipeline
# The figures should be generated by a separate process outside the CI/CD pipeline

# # generate figures
# python3 -m pip install tabulate pandas matplotlib
107 changes: 107 additions & 0 deletions .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
@@ -301,6 +301,104 @@ run_serving_tests() {
kill_gpu_processes
}

run_genai_perf_tests() {
# run genai-perf tests

# $1: a json file specifying genai-perf test cases
local genai_perf_test_file
genai_perf_test_file=$1

# Iterate over genai-perf tests
jq -c '.[]' "$genai_perf_test_file" | while read -r params; do
# get the test name, and append the GPU type back to it.
test_name=$(echo "$params" | jq -r '.test_name')

# if TEST_SELECTOR is set, only run the test cases that match the selector
if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
echo "Skip test case $test_name."
continue
fi

# prepend the current serving engine to the test name
test_name=${CURRENT_LLM_SERVING_ENGINE}_${test_name}

# get common parameters
common_params=$(echo "$params" | jq -r '.common_parameters')
model=$(echo "$common_params" | jq -r '.model')
tp=$(echo "$common_params" | jq -r '.tp')
dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
port=$(echo "$common_params" | jq -r '.port')
num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
reuse_server=$(echo "$common_params" | jq -r '.reuse_server')

# get client and server arguments
server_params=$(echo "$params" | jq -r ".${CURRENT_LLM_SERVING_ENGINE}_server_parameters")
qps_list=$(echo "$params" | jq -r '.qps_list')
qps_list=$(echo "$qps_list" | jq -r '.[] | @sh')
echo "Running over qps list $qps_list"

# check if there is enough GPU to run the test
if [[ $gpu_count -lt $tp ]]; then
echo "Required num-shard $tp but only $gpu_count GPU found. Skip testcase $test_name."
continue
fi

if [[ $reuse_server == "true" ]]; then
echo "Reuse previous server for test case $test_name"
else
kill_gpu_processes
bash "$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/scripts/launch-server.sh" \
"$server_params" "$common_params"
fi

if wait_for_server; then
echo ""
echo "$CURRENT_LLM_SERVING_ENGINE server is up and running."
else
echo ""
echo "$CURRENT_LLM_SERVING_ENGINE failed to start within the timeout period."
break
fi

# iterate over different QPS
for qps in $qps_list; do
# remove the surrounding single quote from qps
if [[ "$qps" == *"inf"* ]]; then
echo "qps was $qps"
qps=$num_prompts
echo "now qps is $qps"
fi

new_test_name=$test_name"_qps_"$qps
backend=$CURRENT_LLM_SERVING_ENGINE

if [[ "$backend" == *"vllm"* ]]; then
backend="vllm"
fi
#TODO: add output dir.
client_command="genai-perf profile \
-m $model \
--service-kind openai \
--backend vllm \
--endpoint-type chat \
--streaming \
--url localhost:$port \
--request-rate $qps \
--num-prompts $num_prompts \
"

echo "Client command: $client_command"

eval "$client_command"

#TODO: process/record outputs
done
done

kill_gpu_processes

}

prepare_dataset() {

@@ -328,12 +426,17 @@ main() {

pip install -U transformers

pip install -r requirements-dev.txt
which genai-perf

# check storage
df -h

ensure_installed wget
ensure_installed curl
ensure_installed jq
# genai-perf dependency
ensure_installed libb64-0d

prepare_dataset

@@ -345,6 +448,10 @@
# run the test
run_serving_tests "$BENCHMARK_ROOT/tests/nightly-tests.json"

# run genai-perf tests
run_genai_perf_tests "$BENCHMARK_ROOT/tests/genai-perf-tests.json"
mv artifacts/ $RESULTS_FOLDER/

# upload benchmark results to buildkite
python3 -m pip install tabulate pandas
python3 "$BENCHMARK_ROOT/scripts/summary-nightly-results.py"
23 changes: 23 additions & 0 deletions .buildkite/nightly-benchmarks/tests/genai-perf-tests.json
@@ -0,0 +1,23 @@
[
{
"test_name": "llama8B_tp1_genai_perf",
"qps_list": [4,8,16,32],
"common_parameters": {
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"tp": 1,
"port": 8000,
"num_prompts": 500,
"reuse_server": false
},
"vllm_server_parameters": {
"disable_log_stats": "",
"disable_log_requests": "",
"gpu_memory_utilization": 0.9,
"num_scheduler_steps": 10,
"max_num_seqs": 512,
"dtype": "bfloat16"
},
"genai_perf_input_parameters": {
}
}
]
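
For reference, this is roughly how the fields above are consumed by the jq calls in run_genai_perf_tests earlier in this diff; the snippet is illustrative only and not part of the change:

    # Sketch: enumerate test cases and pull out the fields the runner uses.
    jq -c '.[]' .buildkite/nightly-benchmarks/tests/genai-perf-tests.json | while read -r params; do
        test_name=$(echo "$params" | jq -r '.test_name')
        model=$(echo "$params" | jq -r '.common_parameters.model')
        qps_list=$(echo "$params" | jq -c '.qps_list')
        echo "$test_name: model=$model qps_list=$qps_list"
    done
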
9 changes: 7 additions & 2 deletions .buildkite/release-pipeline.yaml
@@ -56,6 +56,11 @@ steps:
env:
DOCKER_BUILDKIT: "1"

- input: "Provide Release version here"
fields:
- text: "What is the release version?"
key: "release-version"

- block: "Build CPU release image"
key: block-cpu-release-image-build
depends_on: ~
@@ -66,7 +71,7 @@
queue: cpu_queue_postmerge
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$RELEASE_VERSION --progress plain -f Dockerfile.cpu ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$RELEASE_VERSION"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version) --progress plain -f Dockerfile.cpu ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version)"
env:
DOCKER_BUILDKIT: "1"
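
The new input step and the tag change work together: Buildkite stores the value typed into the input form as build meta-data under the key release-version, and later command steps read it back with buildkite-agent instead of relying on a pre-populated $RELEASE_VERSION. A condensed sketch of that pattern, with the image name taken from the command above:

    # Read the release version captured by the input step, then use it as the image tag.
    RELEASE_VERSION="$(buildkite-agent meta-data get release-version)"
    docker build --tag "public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:${RELEASE_VERSION}" -f Dockerfile.cpu .
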
41 changes: 22 additions & 19 deletions .buildkite/run-cpu-test.sh
@@ -9,63 +9,60 @@ CORE_RANGE=${CORE_RANGE:-48-95}
NUMA_NODE=${NUMA_NODE:-1}

# Try building the docker image
numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build -t cpu-test -f Dockerfile.cpu .
numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" -t cpu-test-avx2 -f Dockerfile.cpu .
numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build -t cpu-test-"$BUILDKITE_BUILD_NUMBER" -f Dockerfile.cpu .
numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" -t cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2 -f Dockerfile.cpu .

# Setup cleanup
remove_docker_container() { docker rm -f cpu-test-"$NUMA_NODE" cpu-test-avx2-"$NUMA_NODE" || true; }
remove_docker_container() { set -e; docker rm -f cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2-"$NUMA_NODE" || true; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image, setting --shm-size=4g for tensor parallel.
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus="$CORE_RANGE" \
--cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-"$NUMA_NODE" cpu-test
--cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" cpu-test-"$BUILDKITE_BUILD_NUMBER"
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus="$CORE_RANGE" \
--cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-avx2-"$NUMA_NODE" cpu-test-avx2
--cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2-"$NUMA_NODE" cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2

function cpu_tests() {
set -e
export NUMA_NODE=$2

# offline inference
docker exec cpu-test-avx2-"$NUMA_NODE" bash -c "
docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2-"$NUMA_NODE" bash -c "
set -e
python3 examples/offline_inference.py"
python3 examples/offline_inference/basic.py"

# Run basic model test
docker exec cpu-test-"$NUMA_NODE" bash -c "
docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
set -e
pip install pytest pytest-asyncio \
decord einops librosa peft Pillow sentence-transformers soundfile \
transformers_stream_generator matplotlib datamodel_code_generator
pip install torchvision --index-url https://download.pytorch.org/whl/cpu
pip install -r vllm/requirements-test.txt
pytest -v -s tests/models/decoder_only/language -m cpu_model
pytest -v -s tests/models/embedding/language -m cpu_model
pytest -v -s tests/models/encoder_decoder/language -m cpu_model
pytest -v -s tests/models/decoder_only/audio_language -m cpu_model
pytest -v -s tests/models/decoder_only/vision_language -m cpu_model"

# Run compressed-tensor test
docker exec cpu-test-"$NUMA_NODE" bash -c "
docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
set -e
pytest -s -v \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_static_setup \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_dynamic_per_token"

# Run AWQ test
docker exec cpu-test-"$NUMA_NODE" bash -c "
docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
set -e
pytest -s -v \
tests/quantization/test_ipex_quant.py"

# Run chunked-prefill and prefix-cache test
docker exec cpu-test-"$NUMA_NODE" bash -c "
docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
set -e
pytest -s -v -k cpu_model \
tests/basic_correctness/test_chunked_prefill.py"

# online inference
docker exec cpu-test-"$NUMA_NODE" bash -c "
# online serving
docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
set -e
export VLLM_CPU_KVCACHE_SPACE=10
export VLLM_CPU_OMP_THREADS_BIND=$1
@@ -78,8 +75,14 @@ function cpu_tests() {
--num-prompts 20 \
--endpoint /v1/completions \
--tokenizer facebook/opt-125m"

# Run multi-lora tests
docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
set -e
pytest -s -v \
tests/lora/test_qwen2vl.py"
}

# All of CPU tests are expected to be finished less than 25 mins.
# All of CPU tests are expected to be finished less than 40 mins.
export -f cpu_tests
timeout 30m bash -c "cpu_tests $CORE_RANGE $NUMA_NODE"
timeout 40m bash -c "cpu_tests $CORE_RANGE $NUMA_NODE"
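
The recurring rename in this file tags images and containers with $BUILDKITE_BUILD_NUMBER, presumably so that concurrent builds sharing a CI host cannot collide on names. Reduced to its core, with names kept illustrative:

    # Per-build image tag plus per-NUMA-node container name.
    docker build -t "cpu-test-${BUILDKITE_BUILD_NUMBER}" -f Dockerfile.cpu .
    docker run -itd --entrypoint /bin/bash \
        --name "cpu-test-${BUILDKITE_BUILD_NUMBER}-${NUMA_NODE}" \
        "cpu-test-${BUILDKITE_BUILD_NUMBER}"
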
2 changes: 1 addition & 1 deletion .buildkite/run-gh200-test.sh
@@ -24,5 +24,5 @@ remove_docker_container

# Run the image and test offline inference
docker run --name gh200-test --gpus=all --entrypoint="" gh200-test bash -c '
python3 examples/offline_inference.py
python3 examples/offline_inference/basic.py
'
12 changes: 10 additions & 2 deletions .buildkite/run-hpu-test.sh
@@ -8,9 +8,17 @@ set -ex
docker build -t hpu-test-env -f Dockerfile.hpu .

# Setup cleanup
# certain versions of HPU software stack have a bug that can
# override the exit code of the script, so we need to use
# separate remove_docker_container and remove_docker_container_and_exit
# functions, while other platforms only need one remove_docker_container
# function.
EXITCODE=1
remove_docker_container() { docker rm -f hpu-test || true; }
trap remove_docker_container EXIT
remove_docker_container_and_exit() { remove_docker_container; exit $EXITCODE; }
trap remove_docker_container_and_exit EXIT
remove_docker_container

# Run the image and launch offline inference
docker run --runtime=habana --name=hpu-test --network=host -e HABANA_VISIBLE_DEVICES=all -e VLLM_SKIP_WARMUP=true --entrypoint="" hpu-test-env python3 examples/offline_inference.py
docker run --runtime=habana --name=hpu-test --network=host -e HABANA_VISIBLE_DEVICES=all -e VLLM_SKIP_WARMUP=true --entrypoint="" hpu-test-env python3 examples/offline_inference/basic.py
EXITCODE=$?
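
The comment block above explains why this script cannot rely on a plain cleanup trap: on some HPU software stacks the cleanup itself can overwrite the script's exit status. Stripped of the HPU specifics, the workaround looks like this (container and image names are placeholders):

    # Capture the test's exit status before the EXIT trap runs, and have the
    # trap re-exit with that saved value so cleanup cannot clobber it.
    EXITCODE=1
    remove_container() { docker rm -f my-test || true; }
    remove_container_and_exit() { remove_container; exit $EXITCODE; }
    trap remove_container_and_exit EXIT
    remove_container
    docker run --name=my-test my-image python3 examples/offline_inference/basic.py
    EXITCODE=$?
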