Add real bs to profiling #600

Closed
wants to merge 917 commits
Changes from all commits
917 commits
fd9f124
[Doc] fix link for page that was renamed (#10455)
russellb Nov 19, 2024
803f37e
[6/N] torch.compile rollout to users (#10437)
youkaichao Nov 19, 2024
efa9084
[Core] Avoid metrics log noise when idle (#8868)
russellb Nov 19, 2024
b00b33d
[Model][Quantization] HQQ support through Marlin kernel expansion (#9…
ElizaWszola Nov 19, 2024
a324d3a
Change granite chat template to keep json list formatting for tool ca…
maxdebayser Nov 20, 2024
d5b68ab
[CI/Build] Update Dockerfile.rocm (#10434)
Alexei-V-Ivanov-AMD Nov 20, 2024
d200972
[Bugfix] Marlin 2:4 temp fix for large M dim (>256) (#10464)
LucasWilkinson Nov 20, 2024
9e05252
[Misc] Add __setitem__ for LazyDict (#10469)
liuyanyi Nov 20, 2024
ad44437
[Bugfix] Fix Mamba model initialization and MLP Speculator weights lo…
Isotr0py Nov 20, 2024
b4be5a8
[Bugfix] Enforce no chunked prefill for embedding models (#10470)
DarkLight1337 Nov 20, 2024
709c9f1
[CI/Build] Add sphinx/rst linter for docs (#10366)
rafvasq Nov 20, 2024
7629a9c
[CI/Build] Support compilation with local cutlass path (#10423) (#10424)
wchen61 Nov 20, 2024
ed701ca
[ci/build] Combine nightly and optional (#10465)
khluu Nov 20, 2024
343041c
[model] Reduce medusa weight (#10454)
skylee-01 Nov 20, 2024
09dbf9f
[Bugfix] Handle conflicts between modern and legacy fields (#10471)
DarkLight1337 Nov 20, 2024
d5b2844
[Platforms] Refactor xpu code (#10468)
MengqingCao Nov 20, 2024
63f1fde
[Hardware][CPU] Support chunked-prefill and prefix-caching on CPU (#1…
bigPYJ1151 Nov 20, 2024
8c3f56a
Update ray_hpu_executor.py (#522)
michalkuligowski Nov 20, 2024
6338608
Random sampler warmup (#506)
mfylcek Nov 20, 2024
efe0268
Skip empty steps in multi step sheduling (#526)
jkaniecki Nov 20, 2024
772a667
[platforms] restore xpu check for parallel config (#10479)
youkaichao Nov 20, 2024
5f1d6af
[perf bench] H200 development (#9768)
simon-mo Nov 20, 2024
0cd3d97
[7/N] torch.compile, reduce compilation time (#10460)
youkaichao Nov 20, 2024
c68f7ed
[Bugfix]: allow extra fields in requests to openai compatible server …
gcalmettes Nov 20, 2024
2f77b6c
[TPU] Implement prefix caching for TPUs (#10307)
WoosukKwon Nov 20, 2024
388ee3d
[torch.compile] limit inductor threads and lazy import quant (#10482)
youkaichao Nov 21, 2024
6c1208d
[Core] Add Sliding Window Support with Flashinfer (#10462)
pavanimajety Nov 21, 2024
9d82717
[Platforms] Add `device_type` in `Platform` (#10508)
MengqingCao Nov 21, 2024
8b0fe06
[torch.compile] Inductor code caching fix (#10273)
ProExpertProg Nov 21, 2024
3430857
[Misc] Increase default video fetch timeout (#10495)
DarkLight1337 Nov 21, 2024
aaddce5
[platforms] improve error message for unspecified platforms (#10520)
youkaichao Nov 21, 2024
f0e0238
[Doc] fix a small typo in docstring of llama_tool_parser (#10513)
FerdinandZhong Nov 21, 2024
f481707
[bucketing overhaul 2/n] Delegate bucket management to HPUBucketingCo…
kdamaszk Nov 21, 2024
1cfde82
[Model] Add Support for Multimodal Granite Models (#10291)
alex-jw-brooks Nov 21, 2024
8a93a59
fix the issue that len(tokenizer(prompt)["input_ids"]) > prompt_len (…
sywangyi Nov 21, 2024
d5ec121
[Model] Expose `dynamic_image_size` as mm_processor_kwargs for Intern…
Isotr0py Nov 21, 2024
4d676f0
[Bugfix] Embedding model pooling_type equals ALL and multi input's bu…
BBuf Nov 21, 2024
425d0be
[SW-201504] Adding Test Trigger (#533)
RonBenMosheHabana Nov 21, 2024
da7e702
[Bug]: When apply continue_final_message for OpenAI server, the "echo…
chaunceyjiang Nov 21, 2024
2385b60
[Kernel] Register punica ops directly (#10522)
jeejeelee Nov 21, 2024
c51e397
[Misc] Suppress duplicated logging regarding multimodal input pipelin…
ywang96 Nov 21, 2024
e7a8341
[Bugfix] Allow token ID-only inputs in Qwen2-Audio (#10536)
DarkLight1337 Nov 21, 2024
7560ae5
[8/N] enable cli flag without a space (#10529)
youkaichao Nov 21, 2024
f9310cb
[V1] Fix Compilation config & Enable CUDA graph by default (#10528)
WoosukKwon Nov 21, 2024
edec338
[CI][Installation] Avoid uploading CUDA 11.8 wheel (#10535)
cermeng Nov 21, 2024
cf656f5
[misc] improve error message (#10553)
youkaichao Nov 21, 2024
46fe9b4
[Minor] Revert change in offline inference example (#10545)
WoosukKwon Nov 21, 2024
9afa014
Add small example to metrics.rst (#10550)
mgoin Nov 21, 2024
aed0748
[Benchmark] Add new H100 machine (#10547)
simon-mo Nov 22, 2024
33e0a25
[9/N] torch.compile LLM usage (#10552)
youkaichao Nov 22, 2024
446c780
[Minor] Fix line-too-long (#10563)
WoosukKwon Nov 22, 2024
a111d01
[platforms] absorb worker cls difference into platforms folder (#10555)
youkaichao Nov 22, 2024
b6374e0
[Bugfix] Fix Phi-3 BNB quantization with tensor parallel (#9948)
Isotr0py Nov 22, 2024
11fcf0e
Remove token-adding chat embedding params (#10551)
noamgat Nov 22, 2024
0d153cf
[SW-201504] Add Jenkins Tests Trigger (#537)
RonBenMosheHabana Nov 22, 2024
dbde4b8
[bucketing overhaul 3/n] Move HPUBucketingContext to vllm-hpu-extensi…
kdamaszk Nov 22, 2024
db100c5
[bugfix] fix full graph tests (#10581)
youkaichao Nov 22, 2024
eebad39
[torch.compile] support all attention backends (#10558)
youkaichao Nov 22, 2024
97814fb
[v1] Refactor KVCacheManager for more hash input than token ids (#10507)
rickyyx Nov 22, 2024
948c859
support bitsandbytes quantization with qwen model (#10549)
zixuanzhang226 Nov 23, 2024
28598f3
[Core] remove temporary local variables in LLMEngine.__init__ (#10577)
russellb Nov 23, 2024
d345f40
[V1] EngineCore supports profiling (#10564)
Abatom Nov 23, 2024
d559979
[bugfix] fix cpu tests (#10585)
youkaichao Nov 23, 2024
9195dbd
[Bugfix][Frontend] Update Llama Chat Templates to also support Non-To…
tjohnson31415 Nov 23, 2024
ebda519
[Core] Fix broken log configuration (#10458)
russellb Nov 23, 2024
978b397
[Misc] Add pynccl wrappers for all_gather and reduce_scatter (#9432)
tlrmchlsmth Nov 23, 2024
4aba6e3
[core] gemma2 full context length support (#10584)
youkaichao Nov 23, 2024
7d8ffb3
[Bugfix] Internal Server Error when tool_choice is incorrect. (#10567)
shenoyvvarun Nov 23, 2024
cfea9c0
[Model] Fix Baichuan BNB online quantization (#10572)
CNTRYROA Nov 23, 2024
02a43f8
Update default max_num_batch_tokens for chunked prefill to 2048 (#10544)
mgoin Nov 23, 2024
7c25fe4
[AMD] Add support for GGUF quantization on ROCm (#10254)
kliuae Nov 23, 2024
4634a89
Prefix Cache Aware Scheduling [1/n] (#10128)
rickyyx Nov 23, 2024
c8acd80
[2/N] handling placeholders in merged multi-modal processor (#10485)
DarkLight1337 Nov 23, 2024
4cfe5d2
[Bugfix] `multi_modal_kwargs` broadcast for CPU tensor parallel (#10541)
Isotr0py Nov 23, 2024
86a44fb
[Platforms] Refactor openvino code (#10573)
statelesshz Nov 23, 2024
651f6c3
For ppc64le, disabled tests for now and addressed space issues (#10538)
npanpaliya Nov 23, 2024
04668eb
[Bugfix] Avoid import AttentionMetadata explicitly in Mllama (#10593)
Isotr0py Nov 23, 2024
17d8fc1
[bugfix] Fix example/tensorize_vllm_model tests (#10595)
jeejeelee Nov 24, 2024
1700c54
[Bugfix] Fix LoRA weight sharding (#10450)
jeejeelee Nov 24, 2024
1c445dc
[CI/Build] Print running script to enhance CI log readability (#10594)
jeejeelee Nov 24, 2024
eda2b35
Revert "Print running script to enhance CI log readability" (#10601)
youkaichao Nov 24, 2024
c055747
[model][utils] add extract_layer_index utility function (#10599)
youkaichao Nov 24, 2024
e4fbb14
[doc] update the code to add models (#10603)
youkaichao Nov 24, 2024
49628fe
[Doc] Update README.md with Ray Summit talk links (#10610)
zhuohan123 Nov 25, 2024
214efc2
Support Cross encoder models (#10400)
maxdebayser Nov 25, 2024
7ea3cd7
[Refactor][MISC] del redundant code in ParallelConfig.postinit (#10614)
MengqingCao Nov 25, 2024
571841b
[torch.compile] support encoder based models (#10613)
youkaichao Nov 25, 2024
a30a605
[Doc] Add encoder-based models to Supported Models page (#10616)
DarkLight1337 Nov 25, 2024
7c2134b
[torch.compile] force inductor threads (#10620)
jeejeelee Nov 25, 2024
6581378
[torch.compile] add warning for unsupported models (#10622)
youkaichao Nov 25, 2024
25d806e
[misc] add torch.compile compatibility check (#10618)
youkaichao Nov 25, 2024
39c6b6c
Limit decode block size (#532)
mfylcek Nov 25, 2024
5eb8b1f
fix marlin flag set on hpu (#540)
nirda7 Nov 25, 2024
05d1f8c
[misc] move functions to config.py (#10624)
youkaichao Nov 25, 2024
ed46f14
[Model] Support `is_causal` HF config field for Qwen2 model (#10621)
DarkLight1337 Nov 25, 2024
2b0879b
Super tiny little typo fix (#10633)
fzyzcjy Nov 25, 2024
d04b13a
[Bug]: Authorization ignored when root_path is set (#10606)
chaunceyjiang Nov 25, 2024
c27df94
[Bugfix] Fix chunked prefill with model dtype float32 on Turing Devic…
wallashss Nov 25, 2024
452a4e8
[Docs] Add Snowflake Slides (#10641)
simon-mo Nov 25, 2024
b1d9205
[Model]: Add support for Aria model (#10514)
xffxff Nov 25, 2024
cf73f0c
[Model] Enable optional prefix when loading embedding models (#10639)
DarkLight1337 Nov 25, 2024
1b583cf
[Doc] Fix typos in docs (#10636)
DarkLight1337 Nov 25, 2024
9db713a
[Model] Add OLMo November 2024 model (#10503)
2015aroras Nov 25, 2024
6e9ff05
[misc] do not read HOST_IP (#10644)
youkaichao Nov 26, 2024
45ac4ff
[bugfix] fix aria model and add torch.compile (#10645)
youkaichao Nov 26, 2024
a6760f6
[Feature] vLLM ARM Enablement for AARCH64 CPUs (#9228)
sanketkaleoss Nov 26, 2024
519e8e4
[v1] EngineArgs for better config handling for v1 (#10382)
rickyyx Nov 26, 2024
9a88f89
custom allreduce + torch.compile (#10121)
SageMoore Nov 26, 2024
9406353
[Misc] Remove outdated init protocols (#10655)
DarkLight1337 Nov 26, 2024
334d64d
[ci] add vllm_test_utils (#10659)
youkaichao Nov 26, 2024
0f513bd
Fix profile run for multi LoRA (#549)
kdamaszk Nov 26, 2024
7133502
fix cutlass_fp8_supported flag set on hpu
nirda7 Nov 26, 2024
38c2d10
Fix cutlass_fp8_supported flag set on HPU (#550)
nirda7 Nov 26, 2024
b62f1b2
[HPU] Add mark_step configurable for the decoder layer. (#525)
jiminha Nov 26, 2024
633df59
Update cpu-test.yml (#544)
michalkuligowski Nov 26, 2024
4d8185f
Update *.sh (#545)
michalkuligowski Nov 26, 2024
1f6584e
[V1] Enable profile for LLMEngine (#10665)
jikunshang Nov 26, 2024
3f0b0e4
Update run-lm-eval-gsm-vllm-baseline.sh (#552)
michalkuligowski Nov 26, 2024
b099337
Add HPU information to collect_env script (#430)
michalkuligowski Nov 26, 2024
b7d75b8
Intern2 habana (#489)
skirdey-inflection Nov 26, 2024
db66e01
[Bugfix] Fix for Spec model TP + Chunked Prefill (#10232)
andoorve Nov 26, 2024
677741e
Added hpu as device argument
rsshaik1 Nov 26, 2024
f5792c7
[Hardware][NVIDIA] Add non-NVML CUDA mode for Jetson (#9735)
conroy-cheers Nov 26, 2024
9a99273
[Bugfix] Fix using `-O[0,3]` with LLM entrypoint (#10677)
mgoin Nov 26, 2024
7576cd3
[Bugfix] Check bnb_4bit_quant_storage for bitsandbytes (#10642)
mgoin Nov 26, 2024
2f0a0a1
[V1] Refactor model executable interface for multimodal models (#10570)
ywang96 Nov 26, 2024
0a71900
Remove hard-dependencies of Speculative decode to CUDA workers (#10587)
xuechendi Nov 27, 2024
0a4d968
[V1] Update interface for idefics3 (#10680)
ywang96 Nov 27, 2024
1bf905d
[Bugfix][SpecDecode] apply sampling parameters to target probabilitie…
jeongin601 Nov 27, 2024
cfb3bf2
[bugfix] fix the default value of llm_int8_threshold in BitsAndBytesC…
yansh97 Nov 27, 2024
0c62b0b
Added "hpu" as configurable device argument in test_lora_manager_hpu …
vivekgoe Nov 27, 2024
e85250b
[Hardware][Gaudi]add get_name method for HPUAttentionBackend (#10667)
jikunshang Nov 27, 2024
15cc2a9
[Misc]Further reduce BNB static variable (#10597)
jeejeelee Nov 27, 2024
e225110
[Kernel] Remove if-else with identical branches in marlin 2:4 (#10687)
tlrmchlsmth Nov 27, 2024
1209261
[Model] Support telechat2 (#10311)
shunxing12345 Nov 27, 2024
418cb3b
[Bugfix][Hardware][CPU] Fix intel-omp version to avoid segfault (#10700)
bigPYJ1151 Nov 27, 2024
9e0a147
[V1] Update interface for mistral-format Pixtral (#10703)
ywang96 Nov 27, 2024
308cc5e
[ci] fix slow tests (#10698)
youkaichao Nov 27, 2024
c411def
[torch.compile] fix shape specialization (#10722)
youkaichao Nov 27, 2024
b98c62b
[Bugfix] Fix GGUF inference with FP16 unquantized checkpoint (#10675)
Isotr0py Nov 27, 2024
197b448
[Bugfix][Mamba] Fix Multistep on Mamba-like models (#10705)
mzusman Nov 27, 2024
9b4b150
[Bugfix] Ignore `lm_head` when loading embedding models (#10719)
DarkLight1337 Nov 27, 2024
395b1c7
[Frontend] don't block event loop in tokenization (preprocess) in Ope…
tomeras91 Nov 27, 2024
cb4e1c3
[misc] upgrade filelock version (#10731)
youkaichao Nov 28, 2024
70dc14f
[Model] support bitsandbytes quantization with minicpm3 model (#10682)
zixuanzhang226 Nov 28, 2024
278be67
[Doc] Update model in arch_overview.rst to match comment (#10701)
spacewander Nov 28, 2024
d9b4b3f
[Bug][CLI] Allow users to disable prefix caching explicitly (#10724)
rickyyx Nov 28, 2024
a79b122
[V1] Do not allocate beyond the max_model_len (#10730)
WoosukKwon Nov 28, 2024
756485f
[BUG FIX] [SPEC DECODE] 0.6.4 rebase cause incorrectness in spec deco…
xuechendi Nov 28, 2024
d83b62f
CI fix (#563)
tzielinski-habana Nov 28, 2024
9a8bff0
[Kernel] Update vllm-flash-attn version (#10736)
WoosukKwon Nov 28, 2024
3ed5e73
[TPU] Update requirements-tpu (#10726)
richardsliu Nov 28, 2024
5fc5ce0
[Model] Added GLM-4 series hf format model support vllm==0.6.4 (#10561)
sixsixcoder Nov 28, 2024
637bb57
Set vllm-hpu-extension to 50e10ea (#565)
mswiniarsk Nov 28, 2024
8c1e77f
[Kernel] Update vllm-flash-attn version to reduce CPU overheads (#10742)
WoosukKwon Nov 28, 2024
98f47f2
[V1] Optimize the CPU overheads in FlashAttention custom op (#10733)
WoosukKwon Nov 28, 2024
c83919c
[Model] Add Internlm2 LoRA support (#5064)
Isotr0py Nov 28, 2024
fa6ecb9
[Model] Clean up MiniCPMV (#10751)
DarkLight1337 Nov 29, 2024
c82b432
[Misc] typo find in sampling_metadata.py (#10740)
noooop Nov 29, 2024
cff5c7f
Refactor FP8 Inc config and flow (#564)
nirda7 Nov 29, 2024
f295f07
Set vllm-hpu-extension to bc01901
iboiko-habana Nov 29, 2024
2aeea0b
Set vllm-hpu-extension to bc01901 (#567)
iboiko-habana Nov 29, 2024
3132aac
[Bugfix] Fix Idefics3 bug (#10778)
jeejeelee Nov 29, 2024
cef2df0
to make repetition penalty faster (#442)
ccrhx4 Nov 29, 2024
661175b
[platform] Add verify_quantization in platform. (#10757)
wangxiyuan Nov 29, 2024
49c9efa
Enable alibi fusedsdpa (#561)
itaraban Nov 29, 2024
40bc242
[Bugfix] Fix OpenVino/Neuron `driver_worker` init (#10779)
NickLucche Nov 30, 2024
16ee07f
[Model] Refactor Molmo weights loading to use AutoWeightsLoader (#10771)
Isotr0py Nov 30, 2024
e7cfc4e
[Interleaved ATTN] Support for Mistral-8B (#10591)
patrickvonplaten Nov 30, 2024
7e4bbda
[doc] format fix (#10789)
wangxiyuan Nov 30, 2024
1337071
[Model] Replace embedding models with pooling adapter (#10769)
DarkLight1337 Dec 1, 2024
f877a7d
[Misc] Improve type annotations for `support_torch_compile` (#10763)
DarkLight1337 Dec 1, 2024
d2f058e
[Misc] Rename embedding classes to pooling (#10801)
DarkLight1337 Dec 1, 2024
169a0ff
[doc] add warning about comparing hf and vllm outputs (#10805)
youkaichao Dec 1, 2024
c11f172
[Misc] Adding `MMMU-Pro` vision dataset to serving benchmark (#10804)
ywang96 Dec 1, 2024
0590ec3
[Core] Implement disagg prefill by StatelessProcessGroup (#10502)
KuntaiDu Dec 2, 2024
b18c9bb
[Model] Add BNB support to Llava and Pixtral-HF (#10795)
Isotr0py Dec 2, 2024
b795477
[core] Avoid metrics log noise when idle - include speculative decodi…
cduk Dec 2, 2024
073a4bd
[Kernel] Use `out` arg in flash_attn_varlen_func (#10811)
WoosukKwon Dec 2, 2024
e25810a
Fill TorchSDPAAttentionMetadata seq_lens_field for prefill (#10799)
maxdebayser Dec 2, 2024
63a1641
[misc] remove xverse modeling file (#10814)
youkaichao Dec 2, 2024
995a148
[doc]Update config docstring (#10732)
wangxiyuan Dec 2, 2024
ef31eab
[Model]: add some tests for aria model (#10770)
xffxff Dec 2, 2024
e95f275
[CI/Build] Update `mistral_common` version for tests and docs (#10825)
DarkLight1337 Dec 2, 2024
a4c4daf
[misc] use out argument for flash attention (#10822)
youkaichao Dec 2, 2024
56da9fc
Merge remote-tracking branch 'upstream/main' into HEAD
kzawora-intel Dec 2, 2024
e438503
fix syntax error
kzawora-intel Dec 2, 2024
4b502a6
Set vllm-hpu-extension to fb36408 (#572)
mswiniarsk Dec 2, 2024
b45f0d7
[Misc][LoRA] Move the implementation of lora bias to punica.py (#10829)
jeejeelee Dec 2, 2024
519cc6c
[Misc][XPU] Avoid torch compile for XPU platform (#10747)
yma11 Dec 2, 2024
9b14d97
Fix openvino on GPU (#10793)
janimo Dec 2, 2024
4c05edb
[Model] Add TP and BNB quantization support to LlavaMultiModalProject…
Isotr0py Dec 2, 2024
4433195
[Bugfix] Prevent benchmark_throughput.py from using duplicated random…
mgoin Dec 3, 2024
d746268
[Model] support bitsandbytes quantization with minicpm model (#10842)
zixuanzhang226 Dec 3, 2024
a4cf256
[Bugfix] Fix QKVParallelLinearWithShardedLora bias bug (#10844)
jeejeelee Dec 3, 2024
21fe7b4
[core][distributed] add pynccl broadcast (#10843)
youkaichao Dec 3, 2024
dc5ce86
[torch.compile] remove compilation_context and simplify code (#10838)
youkaichao Dec 3, 2024
3cb5420
Set vllm-hpu-extension to cd520df (#574)
mswiniarsk Dec 3, 2024
ef51831
[Doc] Add github links for source code references (#10672)
russellb Dec 3, 2024
3257d44
[Misc] Remove deprecated names (#10817)
DarkLight1337 Dec 3, 2024
9323a31
[Core][Performance] Add XGrammar support for guided decoding and set …
aarnphm Dec 3, 2024
1440f45
Revert "to make repetition penalty faster" (#570)
michalkuligowski Dec 3, 2024
f6084f6
[Speculative Decoding] Move indices to device before filtering output…
zhengy001 Dec 3, 2024
3bc94ca
[V1] VLM - Run the mm_mapper preprocessor in the frontend process (#1…
alexm-neuralmagic Dec 3, 2024
2f2cdc7
[MISC][XPU] quick fix for XPU CI (#10859)
yma11 Dec 3, 2024
7090c27
[Bugfix] Only require XGrammar on x86 (#10865)
mgoin Dec 3, 2024
7c32b68
[Frontend] correctly record prefill and decode time metrics (#10853)
tomeras91 Dec 3, 2024
a061fe6
[Build][Bugfix] Using the correct type hint (#10866)
gshtras Dec 3, 2024
381ac93
[Benchmark] Benchmark structured output with datasets (#10557)
xuechendi Dec 4, 2024
d2bd88b
[CI/Build] Replace mean with torch.all in test_pynccl.py (#10876)
tlrmchlsmth Dec 4, 2024
b5b647b
Drop ROCm load format check (#10767)
wangxiyuan Dec 4, 2024
fa2dea6
[ci/build] Change queue name for Release jobs (#10875)
khluu Dec 4, 2024
c9ca4fc
[ci/build] Job to build and push release image (#10877)
khluu Dec 4, 2024
b9d6f69
Regional compilation support (#576)
Kacper-Pietkun Dec 4, 2024
8db957e
[bugfix] fixed parameter “n” when set parameter “bestof” > 1 (#10854)
o2363286 Dec 4, 2024
c92acb9
[ci/build] Update vLLM postmerge ECR repo (#10887)
khluu Dec 4, 2024
4796d16
Revert "Enable alibi fusedsdpa" (#585)
madamczykhabana Dec 4, 2024
8c76728
Prepare sin/cos buffers for rope outside model forward (#566)
tzielinski-habana Dec 4, 2024
f6865f4
Enable DeepseekV2 Lite/Chat models (#516)
hlin99 Dec 4, 2024
8754e17
Set vllm-hpu-extension to 070591a (#591)
mswiniarsk Dec 4, 2024
01d079f
[LoRA] Change lora_tokenizers capacity (#10796)
xyang16 Dec 4, 2024
10398b4
[Model] Consolidate ViTs attention implementation without mask (#10893)
Isotr0py Dec 4, 2024
82eb5ea
Benchmark serving structured output (#10880)
xuechendi Dec 4, 2024
e4c34c2
[CI/Build] improve python-only dev setup (#9621)
dtrifiro Dec 4, 2024
2a56e12
[V1] Fix when max_model_len is not divisible by block_size (#10903)
WoosukKwon Dec 5, 2024
7883c2b
[benchmark] Make H100 benchmark optional (#10908)
khluu Dec 5, 2024
8d370e9
[Bugfix] Fallback to outlines for complex json schemas (#10899)
mgoin Dec 5, 2024
aa39a8e
[Doc] Create a new "Usage" section (#10827)
DarkLight1337 Dec 5, 2024
1f958a7
[Bugfix] Fix BNB loader target_modules (#10720)
jeejeelee Dec 5, 2024
39c89e7
[Misc] Update llama 3.2 template to support system prompt with images…
tjohnson31415 Dec 5, 2024
ad29332
[CI/BUILD] Spec decode ci (#524)
xuechendi Dec 5, 2024
571da8f
[Misc][LoRA] Clean up the function interface of Punica (#10917)
jeejeelee Dec 5, 2024
998eeaf
[CI/Build] Bump test transformers version (#10106)
Isotr0py Dec 5, 2024
a430652
[Misc][Gaudi] Avoid torch.compile and enable lazy collectives (#10897)
kzawora-intel Dec 5, 2024
9743d64
[ci][build] add tests for python only compilation (#10915)
youkaichao Dec 5, 2024
db87eb6
[torch.compile] use size tuning for specific sizes (#10933)
youkaichao Dec 6, 2024
b031a45
[torch.compile] add logging for compilation time (#10941)
youkaichao Dec 6, 2024
a805205
Add host traces to high-level profilings (#577)
szutenberg Dec 6, 2024
222f5b0
[CI/Build] Fix broken multimodal test (#10950)
DarkLight1337 Dec 6, 2024
a1887f2
[torch.compile] fix deprecated code (#10948)
youkaichao Dec 6, 2024
e349f70
Enable patching Fused SDPA (#569)
nirda7 Dec 6, 2024
6a4f673
revert INC fixed version installation in requirements-hpu.txt for 1.1…
xuechendi Dec 6, 2024
e0e47ed
Add multiprocessing HPU executor (#559)
kzawora-intel Dec 6, 2024
858e0a0
fix WorkerWrapperBase and spec_decode rebase (#582)
xuechendi Dec 6, 2024
21323ed
Merge remote-tracking branch 'origin/habana_main' into HEAD
kzawora-intel Dec 6, 2024
d8f395e
Merge remote-tracking branch 'upstream/main' into HEAD
kzawora-intel Dec 6, 2024
48ab12b
fix mypy errors
kzawora-intel Dec 6, 2024
9204975
fix (hopefully) all linter errors
kzawora-intel Dec 6, 2024
ad8d5b7
Dec 06 rebase (#571)
kzawora-intel Dec 9, 2024
bcf66c5
Add real BS and seq_len to profiling
kamil-kaczor Dec 6, 2024
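
For context, the head commit ("Add real BS and seq_len to profiling") records the actual batch size and sequence length in the profiling output rather than only the padded bucket shapes. A minimal sketch of that general idea follows; this is hypothetical illustration code, not taken from this PR, and the helper name profiled_step and its field names are invented for the example.

# Hypothetical sketch only -- not code from this PR. Illustrates tagging a
# profiled step with the real (unpadded) batch size and sequence length so
# traces can be correlated with the shapes that were actually executed.
from contextlib import contextmanager
import time

@contextmanager
def profiled_step(profile_log, real_batch_size, real_seq_len):
    """Time one step and record its real shapes alongside the duration."""
    start = time.perf_counter()
    try:
        yield
    finally:
        profile_log.append({
            "real_batch_size": real_batch_size,  # requests actually in the batch
            "real_seq_len": real_seq_len,        # longest unpadded sequence length
            "duration_s": time.perf_counter() - start,
        })

# Example usage: wrap the forward pass of one scheduling step.
log = []
with profiled_step(log, real_batch_size=3, real_seq_len=517):
    pass  # model forward would run here
print(log)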
6 changes: 3 additions & 3 deletions .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh
@@ -41,6 +41,6 @@ while getopts "m:b:l:f:" OPT; do
done

lm_eval --model hf \
--model_args pretrained=$MODEL,parallelize=True \
--tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
--batch_size $BATCH_SIZE
--model_args "pretrained=$MODEL,parallelize=True" \
--tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
--batch_size "$BATCH_SIZE"
6 changes: 3 additions & 3 deletions .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh
@@ -46,6 +46,6 @@ while getopts "m:b:l:f:t:" OPT; do
done

lm_eval --model vllm \
--model_args pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend="ray",trust_remote_code=true,max_model_len=4096 \
--tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
--batch_size $BATCH_SIZE
--model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend=ray,trust_remote_code=true,max_model_len=4096" \
--tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
--batch_size "$BATCH_SIZE"
2 changes: 1 addition & 1 deletion .buildkite/lm-eval-harness/run-tests.sh
@@ -30,7 +30,7 @@ while getopts "c:t:" OPT; do
done

# Parse list of configs.
IFS=$'\n' read -d '' -r -a MODEL_CONFIGS < $CONFIG
IFS=$'\n' read -d '' -r -a MODEL_CONFIGS < "$CONFIG"

for MODEL_CONFIG in "${MODEL_CONFIGS[@]}"
do
65 changes: 48 additions & 17 deletions .buildkite/nightly-benchmarks/benchmark-pipeline.yaml
@@ -9,16 +9,19 @@ steps:
- image: badouralix/curl-jq
command:
- sh .buildkite/nightly-benchmarks/scripts/wait-for-image.sh

- wait

- label: "A100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: A100
plugins:
- kubernetes:
podSpec:
priorityClassName: perf-benchmark
containers:
- image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
- image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
resources:
@@ -41,20 +44,48 @@ steps:
- name: devshm
emptyDir:
medium: Memory
# - label: "H100"
# agents:
# queue: H100
# plugins:
# - docker#v5.11.0:
# image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
# command:
# - bash
# - .buildkite/nightly-benchmarks/run-benchmarks-suite.sh
# mount-buildkite-agent: true
# propagate-environment: true
# ipc: host
# gpus: all
# environment:
# - VLLM_USAGE_SOURCE
# - HF_TOKEN

- label: "H200"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H200
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
mount-buildkite-agent: true
propagate-environment: true
ipc: host
gpus: 4,5,6,7
volumes:
- /data/benchmark-hf-cache:/root/.cache/huggingface
environment:
- VLLM_USAGE_SOURCE
- HF_TOKEN

- block: "Run H100 Benchmark"
key: block-h100
depends_on: ~

- label: "H100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H100
depends_on: block-h100
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
mount-buildkite-agent: true
propagate-environment: true
ipc: host
gpus: all # see CUDA_VISIBLE_DEVICES for actual GPUs used
volumes:
- /data/benchmark-hf-cache:/root/.cache/huggingface
environment:
- VLLM_USAGE_SOURCE
- HF_TOKEN
@@ -157,6 +157,18 @@ def results_to_json(latency, throughput, serving):
throughput_results,
serving_results)

for df in [latency_results, serving_results, throughput_results]:
if df.empty:
continue

# Sort all dataframes by their respective "Test name" columns
df.sort_values(by="Test name", inplace=True)

# The GPUs sometimes come in format of "GPUTYPE\nGPUTYPE\n...",
# we want to turn it into "8xGPUTYPE"
df["GPU"] = df["GPU"].apply(
lambda x: f"{len(x.split('\n'))}x{x.split('\n')[0]}")

# get markdown tables
latency_md_table = tabulate(latency_results,
headers='keys',
63 changes: 25 additions & 38 deletions .buildkite/nightly-benchmarks/scripts/launch-server.sh
@@ -50,58 +50,54 @@ launch_trt_server() {
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
git lfs install
cd tensorrtllm_backend
git checkout $trt_llm_version
tensorrtllm_backend_dir=$(pwd)
git checkout "$trt_llm_version"
git submodule update --init --recursive

# build trtllm engine
cd /tensorrtllm_backend
cd ./tensorrt_llm/examples/${model_type}
cd "./tensorrt_llm/examples/${model_type}"
python3 convert_checkpoint.py \
--model_dir ${model_path} \
--dtype ${model_dtype} \
--tp_size ${model_tp_size} \
--output_dir ${trt_model_path}
--model_dir "${model_path}" \
--dtype "${model_dtype}" \
--tp_size "${model_tp_size}" \
--output_dir "${trt_model_path}"
trtllm-build \
--checkpoint_dir ${trt_model_path} \
--checkpoint_dir "${trt_model_path}" \
--use_fused_mlp \
--reduce_fusion disable \
--workers 8 \
--gpt_attention_plugin ${model_dtype} \
--gemm_plugin ${model_dtype} \
--tp_size ${model_tp_size} \
--max_batch_size ${max_batch_size} \
--max_input_len ${max_input_len} \
--max_seq_len ${max_seq_len} \
--max_num_tokens ${max_num_tokens} \
--output_dir ${trt_engine_path}
--gpt_attention_plugin "${model_dtype}" \
--gemm_plugin "${model_dtype}" \
--tp_size "${model_tp_size}" \
--max_batch_size "${max_batch_size}" \
--max_input_len "${max_input_len}" \
--max_seq_len "${max_seq_len}" \
--max_num_tokens "${max_num_tokens}" \
--output_dir "${trt_engine_path}"

# handle triton protobuf files and launch triton server
cd /tensorrtllm_backend
mkdir triton_model_repo
cp -r all_models/inflight_batcher_llm/* triton_model_repo/
cd triton_model_repo
rm -rf ./tensorrt_llm/1/*
cp -r ${trt_engine_path}/* ./tensorrt_llm/1
cp -r "${trt_engine_path}"/* ./tensorrt_llm/1
python3 ../tools/fill_template.py -i tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,engine_dir:/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1,decoupled_mode:true,batching_strategy:inflight_fused_batching,batch_scheduler_policy:guaranteed_no_evict,exclude_input_in_output:true,triton_max_batch_size:2048,max_queue_delay_microseconds:0,max_beam_width:1,max_queue_size:2048,enable_kv_cache_reuse:false
python3 ../tools/fill_template.py -i preprocessing/config.pbtxt triton_max_batch_size:2048,tokenizer_dir:$model_path,preprocessing_instance_count:5
python3 ../tools/fill_template.py -i postprocessing/config.pbtxt triton_max_batch_size:2048,tokenizer_dir:$model_path,postprocessing_instance_count:5,skip_special_tokens:false
python3 ../tools/fill_template.py -i ensemble/config.pbtxt triton_max_batch_size:$max_batch_size
python3 ../tools/fill_template.py -i tensorrt_llm_bls/config.pbtxt triton_max_batch_size:$max_batch_size,decoupled_mode:true,accumulate_tokens:"False",bls_instance_count:1
python3 ../tools/fill_template.py -i preprocessing/config.pbtxt "triton_max_batch_size:2048,tokenizer_dir:$model_path,preprocessing_instance_count:5"
python3 ../tools/fill_template.py -i postprocessing/config.pbtxt "triton_max_batch_size:2048,tokenizer_dir:$model_path,postprocessing_instance_count:5,skip_special_tokens:false"
python3 ../tools/fill_template.py -i ensemble/config.pbtxt triton_max_batch_size:"$max_batch_size"
python3 ../tools/fill_template.py -i tensorrt_llm_bls/config.pbtxt "triton_max_batch_size:$max_batch_size,decoupled_mode:true,accumulate_tokens:False,bls_instance_count:1"
cd /tensorrtllm_backend
python3 scripts/launch_triton_server.py \
--world_size=${model_tp_size} \
--world_size="${model_tp_size}" \
--model_repo=/tensorrtllm_backend/triton_model_repo &

}

launch_tgi_server() {
model=$(echo "$common_params" | jq -r '.model')
tp=$(echo "$common_params" | jq -r '.tp')
dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
port=$(echo "$common_params" | jq -r '.port')
num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
server_args=$(json2args "$server_params")

if echo "$common_params" | jq -e 'has("fp8")' >/dev/null; then
@@ -129,10 +125,7 @@ launch_tgi_server() {
launch_lmdeploy_server() {
model=$(echo "$common_params" | jq -r '.model')
tp=$(echo "$common_params" | jq -r '.tp')
dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
port=$(echo "$common_params" | jq -r '.port')
num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
server_args=$(json2args "$server_params")

server_command="lmdeploy serve api_server $model \
@@ -149,10 +142,7 @@ launch_sglang_server() {

model=$(echo "$common_params" | jq -r '.model')
tp=$(echo "$common_params" | jq -r '.tp')
dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
port=$(echo "$common_params" | jq -r '.port')
num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
server_args=$(json2args "$server_params")

if echo "$common_params" | jq -e 'has("fp8")' >/dev/null; then
@@ -185,10 +175,7 @@ launch_vllm_server() {

model=$(echo "$common_params" | jq -r '.model')
tp=$(echo "$common_params" | jq -r '.tp')
dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
port=$(echo "$common_params" | jq -r '.port')
num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
server_args=$(json2args "$server_params")

if echo "$common_params" | jq -e 'has("fp8")' >/dev/null; then
@@ -217,19 +204,19 @@

main() {

if [[ $CURRENT_LLM_SERVING_ENGINE == "trt" ]]; then
if [[ "$CURRENT_LLM_SERVING_ENGINE" == "trt" ]]; then
launch_trt_server
fi

if [[ $CURRENT_LLM_SERVING_ENGINE == "tgi" ]]; then
if [[ "$CURRENT_LLM_SERVING_ENGINE" == "tgi" ]]; then
launch_tgi_server
fi

if [[ $CURRENT_LLM_SERVING_ENGINE == "lmdeploy" ]]; then
if [[ "$CURRENT_LLM_SERVING_ENGINE" == "lmdeploy" ]]; then
launch_lmdeploy_server
fi

if [[ $CURRENT_LLM_SERVING_ENGINE == "sglang" ]]; then
if [[ "$CURRENT_LLM_SERVING_ENGINE" == "sglang" ]]; then
launch_sglang_server
fi

12 changes: 6 additions & 6 deletions .buildkite/nightly-benchmarks/scripts/nightly-annotate.sh
@@ -16,10 +16,10 @@ main() {
fi

# initial annotation
description="$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/nightly-descriptions.md"
#description="$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/nightly-descriptions.md"

# download results
cd $VLLM_SOURCE_CODE_LOC/benchmarks
cd "$VLLM_SOURCE_CODE_LOC/benchmarks"
mkdir -p results/
/workspace/buildkite-agent artifact download 'results/*nightly_results.json' results/
ls
@@ -30,15 +30,15 @@ main() {
/workspace/buildkite-agent artifact upload "results.zip"

# upload benchmarking scripts
cd $VLLM_SOURCE_CODE_LOC/
cd "$VLLM_SOURCE_CODE_LOC/"
zip -r nightly-benchmarks.zip .buildkite/ benchmarks/
/workspace/buildkite-agent artifact upload "nightly-benchmarks.zip"

cd $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/
cd "$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/"
# upload benchmarking pipeline
/workspace/buildkite-agent artifact upload "nightly-pipeline.yaml"

cd $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/
cd "$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/"
/workspace/buildkite-agent annotate --style "success" --context "nightly-benchmarks-results" --append < nightly-annotation.md


@@ -75,4 +75,4 @@ main() {
# /workspace/buildkite-agent annotate --style "success" --context "nightly-benchmarks-results" --append < nightly_results.md
}

main "$@"
main "$@"