Update ROCm vLLM to 0.4.3 #40

Merged

merged 393 commits on Jun 6, 2024

Changes from 1 commit

393 commits
ba4be44
[BugFix] Fix return type of executor execute_model methods (#4402)
njhill Apr 27, 2024
4ea1f96
[BugFix] Resolved Issues For LinearMethod --> QuantConfig (#4418)
robertgshaw2-redhat Apr 27, 2024
9c7306a
[Misc] fix typo in llm_engine init logging (#4428)
DefTruth Apr 28, 2024
bf480c5
Add more Prometheus metrics (#2764)
ronensc Apr 28, 2024
03dd7d5
[CI] clean docker cache for neuron (#4441)
simon-mo Apr 28, 2024
df29793
[mypy][5/N] Support all typing on model executor (#4427)
rkooo567 Apr 29, 2024
73c8d67
[Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin (#3922)
robertgshaw2-redhat Apr 29, 2024
ac5ccf0
[CI] hotfix: soft fail neuron test (#4458)
simon-mo Apr 29, 2024
f4f921b
[Core][Distributed] use cpu group to broadcast metadata in cpu (#4444)
youkaichao Apr 29, 2024
d627a3d
[Misc] Upgrade to `torch==2.3.0` (#4454)
mgoin Apr 30, 2024
fa32207
[Bugfix][Kernel] Fix compute_type for MoE kernel (#4463)
WoosukKwon Apr 30, 2024
26f2fb5
[Core]Refactor gptq_marlin ops (#4466)
jikunshang Apr 30, 2024
4bb53e2
[BugFix] fix num_lookahead_slots missing in async executor (#4165)
leiwen83 Apr 30, 2024
b31a1fb
[Doc] add visualization for multi-stage dockerfile (#4456)
prashantgupta24 Apr 30, 2024
111815d
[Kernel] Support Fp8 Checkpoints (Dynamic + Static) (#4332)
robertgshaw2-redhat Apr 30, 2024
a494140
[Frontend] Support complex message content for chat completions endpo…
fgreinacher Apr 30, 2024
715c2d8
[Frontend] [Core] Tensorizer: support dynamic `num_readers`, update v…
alpayariyak Apr 30, 2024
dd1a50a
[Bugfix][Minor] Make ignore_eos effective (#4468)
bigPYJ1151 Apr 30, 2024
6ad58f4
fix_tokenizer_snapshot_download_bug (#4493)
kingljl Apr 30, 2024
ee37328
Unable to find Punica extension issue during source code installation…
kingljl May 1, 2024
2e240c6
[Core] Centralize GPU Worker construction (#4419)
njhill May 1, 2024
f458112
[Misc][Typo] type annotation fix (#4495)
HarryWu99 May 1, 2024
a822eb3
[Misc] fix typo in block manager (#4453)
Juelianqvq May 1, 2024
c3845d8
Allow user to define whitespace pattern for outlines (#4305)
robcaulk May 1, 2024
d6f4bd7
[Misc]Add customized information for models (#4132)
jeejeelee May 1, 2024
6f1df80
[Test] Add ignore_eos test (#4519)
rkooo567 May 1, 2024
a88bb9b
[Bugfix] Fix the fp8 kv_cache check error that occurs when failing to…
AnyISalIn May 1, 2024
4dc8026
[Bugfix] Fix 307 Redirect for `/metrics` (#4523)
robertgshaw2-redhat May 1, 2024
e491c7e
[Doc] update(example model): for OpenAI compatible serving (#4503)
fpaupier May 1, 2024
6990912
[Bugfix] Use random seed if seed is -1 (#4531)
sasha0552 May 1, 2024
8b798ee
[CI/Build][Bugfix] VLLM_USE_PRECOMPILED should skip compilation (#4534)
tjohnson31415 May 1, 2024
b38e42f
[Speculative decoding] Add ngram prompt lookup decoding (#4237)
leiwen83 May 1, 2024
24750f4
[Core] Enable prefix caching with block manager v2 enabled (#4142)
leiwen83 May 1, 2024
a657bfc
[Core] Add `multiproc_worker_utils` for multiprocessing-based workers…
njhill May 1, 2024
24bb4fe
[Kernel] Update fused_moe tuning script for FP8 (#4457)
pcmoritz May 1, 2024
c47ba4a
[Bugfix] Add validation for seed (#4529)
sasha0552 May 1, 2024
3a922c1
[Bugfix][Core] Fix and refactor logging stats (#4336)
esmeetu May 1, 2024
6ef09b0
[Core][Distributed] fix pynccl del error (#4508)
youkaichao May 1, 2024
c9d852d
[Misc] Remove Mixtral device="cuda" declarations (#4543)
pcmoritz May 1, 2024
826b82a
[Misc] Fix expert_ids shape in MoE (#4517)
WoosukKwon May 1, 2024
b8afa8b
[MISC] Rework logger to enable pythonic custom logging configuration …
May 2, 2024
0d62fe5
[Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.…
rkooo567 May 2, 2024
5e401bc
[CI]Add regression tests to ensure the async engine generates metrics…
ronensc May 2, 2024
cf8cac8
[mypy][6/N] Fix all the core subdirectory typing (#4450)
rkooo567 May 2, 2024
2a85f93
[Core][Distributed] enable multiple tp group (#4512)
youkaichao May 2, 2024
7038e8b
[Kernel] Support running GPTQ 8-bit models in Marlin (#4533)
alexm-redhat May 2, 2024
fb087af
[mypy][7/N] Cover all directories (#4555)
rkooo567 May 2, 2024
5ad60b0
[Misc] Exclude the `tests` directory from being packaged (#4552)
itechbear May 2, 2024
1ff0c73
[BugFix] Include target-device specific requirements.txt in sdist (#4…
markmc May 2, 2024
5b8a7c1
[Misc] centralize all usage of environment variables (#4548)
youkaichao May 2, 2024
32881f3
[kernel] fix sliding window in prefix prefill Triton kernel (#4405)
mmoskal May 2, 2024
9b5c9f9
[CI/Build] AMD CI pipeline with extended set of tests. (#4267)
Alexei-V-Ivanov-AMD May 2, 2024
0f8a914
[Core] Ignore infeasible swap requests. (#4557)
rkooo567 May 2, 2024
344a5d0
[Core][Distributed] enable allreduce for multiple tp groups (#4566)
youkaichao May 3, 2024
808632d
[BugFix] Prevent the task of `_force_log` from being garbage collecte…
Atry May 3, 2024
ce3f1ee
[Misc] remove chunk detected debug logs (#4571)
DefTruth May 3, 2024
2d7bce9
[Doc] add env vars to the doc (#4572)
youkaichao May 3, 2024
3521ba4
[Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518)
rkooo567 May 3, 2024
7e65477
[Bugfix] Allow "None" or "" to be passed to CLI for string args that …
mgoin May 3, 2024
f8e7add
Fix/async chat serving (#2727)
schoennenbeck May 3, 2024
43c413e
[Kernel] Use flashinfer for decoding (#4353)
LiuXiaoxuanPKU May 3, 2024
ab50275
[Speculative decoding] Support target-model logprobs (#4378)
cadedaniel May 3, 2024
344bf7c
[Misc] add installation time env vars (#4574)
youkaichao May 3, 2024
bc8ad68
[Misc][Refactor] Introduce ExecuteModelData (#4540)
comaniac May 4, 2024
36fb68f
[Doc] Chunked Prefill Documentation (#4580)
rkooo567 May 4, 2024
2a05201
[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with…
mgoin May 4, 2024
021b1a2
[CI] check size of the wheels (#4319)
simon-mo May 4, 2024
4302987
[Bugfix] Fix inappropriate content of model_name tag in Prometheus me…
DearPlanet May 4, 2024
8d8357c
bump version to v0.4.2 (#4600)
simon-mo May 5, 2024
c7f2cf2
[CI] Reduce wheel size by not shipping debug symbols (#4602)
simon-mo May 5, 2024
0650e59
Disable cuda version check in vllm-openai image (#4530)
zhaoyang-star May 5, 2024
323f27b
[Bugfix] Fix `asyncio.Task` not being subscriptable (#4623)
DarkLight1337 May 6, 2024
e186d37
[CI] use ccache actions properly in release workflow (#4629)
simon-mo May 6, 2024
19cb471
[CI] Add retry for agent lost (#4633)
cadedaniel May 6, 2024
bd99d22
Update lm-format-enforcer to 0.10.1 (#4631)
noamgat May 6, 2024
a98187c
[Kernel] Make static FP8 scaling more robust (#4570)
pcmoritz May 7, 2024
63575bc
[Core][Optimization] change python dict to pytorch tensor (#4607)
youkaichao May 7, 2024
478aed5
[Build/CI] Fixing 'docker run' to re-enable AMD CI tests. (#4642)
Alexei-V-Ivanov-AMD May 7, 2024
10760da
[Bugfix] Fixed error in slice_lora_b for MergedQKVParallelLinearWithL…
FurtherAI May 7, 2024
469f85c
[Core][Optimization] change copy-on-write from dict[int, list] to lis…
youkaichao May 7, 2024
8344f77
[Bug fix][Core] fixup ngram not setup correctly (#4551)
leiwen83 May 7, 2024
cc466a3
[Core][Distributed] support cpu&device in broadcast tensor dict (#4660)
youkaichao May 8, 2024
d7740ea
[Core] Optimize sampler get_logprobs (#4594)
rkooo567 May 8, 2024
f6a5930
[CI] Make mistral tests pass (#4596)
rkooo567 May 8, 2024
0f9a6e3
[Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi …
DefTruth May 8, 2024
5510cf0
[Misc] Add `get_name` method to attention backends (#4685)
WoosukKwon May 8, 2024
ad932a2
[Core] Faster startup for LoRA enabled models (#4634)
Yard1 May 8, 2024
20cfcde
[Core][Optimization] change python dict to pytorch tensor for blocks …
youkaichao May 8, 2024
230c4b3
[CI/Test] fix swap test for multi gpu (#4689)
youkaichao May 8, 2024
89579a2
[Misc] Use vllm-flash-attn instead of flash-attn (#4686)
WoosukKwon May 8, 2024
f942efb
[Dynamic Spec Decoding] Auto-disable by the running queue size (#4592)
comaniac May 8, 2024
8b9241b
[Speculative decoding] [Bugfix] Fix overallocation in ngram + spec lo…
cadedaniel May 8, 2024
e288df0
[Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin (…
alexm-redhat May 9, 2024
16bc0a0
[Frontend] add tok/s speed metric to llm class when using tqdm (#4400)
MahmoudAshraf97 May 9, 2024
f12b20d
[Frontend] Move async logic outside of constructor (#4674)
DarkLight1337 May 9, 2024
190bc83
[Misc] Remove unnecessary ModelRunner imports (#4703)
WoosukKwon May 9, 2024
0ee535b
[Misc] Set block size at initialization & Fix test_model_runner (#4705)
WoosukKwon May 9, 2024
ff5abcd
[ROCm] Add support for Punica kernels on AMD GPUs (#3140)
kliuae May 9, 2024
a3c1245
[Bugfix] Fix CLI arguments in OpenAI server docs (#4709)
DarkLight1337 May 9, 2024
cea6443
[Bugfix] Update grafana.json (#4711)
robertgshaw2-redhat May 9, 2024
be0c518
[Bugfix] Add logs for all model dtype casting (#4717)
mgoin May 9, 2024
ebce310
[Model] Snowflake arctic model implementation (#4652)
sfc-gh-hazhang May 9, 2024
379da6d
[Kernel] [FP8] Improve FP8 linear layer performance (#4691)
pcmoritz May 9, 2024
c833101
[Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535)
comaniac May 10, 2024
208b71b
[Core][Distributed] refactor pynccl (#4591)
youkaichao May 10, 2024
e965d46
[Misc] Keep only one implementation of the create_dummy_prompt functi…
AllenDou May 10, 2024
51d4094
chunked-prefill-doc-syntax (#4603)
simon-mo May 10, 2024
64b77df
[Core]fix type annotation for `swap_blocks` (#4726)
jikunshang May 10, 2024
dac6a3f
[Misc] Apply a couple g++ cleanups (#4719)
stevegrubb May 10, 2024
6a0f617
[Core] Fix circular reference which leaked llm instance in local dev …
rkooo567 May 10, 2024
706588a
[Bugfix] Fix CLI arguments in OpenAI server docs (#4729)
AllenDou May 10, 2024
2e7796f
[Speculative decoding] CUDA graph support (#4295)
heeju-kim2 May 10, 2024
fcc2994
[CI] Nits for bad initialization of SeqGroup in testing (#4748)
robertgshaw2-redhat May 10, 2024
4e12131
[Core][Test] fix function name typo in custom allreduce (#4750)
youkaichao May 10, 2024
e254497
[Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734)
CatherineSue May 11, 2024
6eaccb7
[Model] Add support for IBM Granite Code models (#4636)
yikangshen May 12, 2024
a709e87
[CI/Build] Tweak Marlin Nondeterminism Issues (#4713)
robertgshaw2-redhat May 13, 2024
a7be4d0
[CORE] Improvement in ranks code (#4718)
SwapnilDreams100 May 13, 2024
702bee4
[Core][Distributed] refactor custom allreduce to support multiple tp …
youkaichao May 13, 2024
350f9e1
[CI/Build] Move `test_utils.py` to `tests/utils.py` (#4425)
DarkLight1337 May 13, 2024
e7c46b9
[Scheduler] Warning upon preemption and Swapping (#4647)
rkooo567 May 13, 2024
0fca3cd
[Misc] Enhance attention selector (#4751)
WoosukKwon May 13, 2024
8bc68e1
[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, u…
sangstar May 13, 2024
ce532ff
[Speculative decoding] Improve n-gram efficiency (#4724)
comaniac May 13, 2024
1356df5
[Kernel] Use flash-attn for decoding (#3648)
skrider May 13, 2024
33d3914
[Bugfix] Fix dynamic FP8 quantization for Mixtral (#4793)
pcmoritz May 13, 2024
ac1fbf7
[Doc] Shorten README by removing supported model list (#4796)
zhuohan123 May 13, 2024
4bfa7e7
[Doc] Add API reference for offline inference (#4710)
DarkLight1337 May 14, 2024
c579b75
[Doc] Add meetups to the doc (#4798)
zhuohan123 May 14, 2024
ccb63a8
[Core][Hash][Automatic Prefix caching] Accelerating the hashing funct…
KuntaiDu May 14, 2024
dc72402
[Bugfix][Doc] Fix CI failure in docs (#4804)
DarkLight1337 May 14, 2024
676a999
[Core] Add MultiprocessingGPUExecutor (#4539)
njhill May 14, 2024
29bc01b
Add 4th meetup announcement to readme (#4817)
simon-mo May 14, 2024
8a7cc25
Revert "[Kernel] Use flash-attn for decoding (#3648)" (#4820)
rkooo567 May 15, 2024
65bf2ac
[Core][2/N] Model runner refactoring part 2. Combine prepare prefill …
rkooo567 May 15, 2024
e9cdd2b
[CI/Build] Further decouple HuggingFace implementation from ours duri…
DarkLight1337 May 15, 2024
a5675d3
[Bugfix] Properly set distributed_executor_backend in ParallelConfig …
zifeitong May 15, 2024
361c461
[Doc] Highlight the fourth meetup in the README (#4842)
zhuohan123 May 15, 2024
fc0d9df
[Frontend] Re-enable custom roles in Chat Completions API (#4758)
DarkLight1337 May 15, 2024
52f8107
[Frontend] Support OpenAI batch file format (#4794)
wuisawesome May 15, 2024
30e7543
[Core] Implement sharded state loader (#4690)
aurickq May 16, 2024
973617a
[Speculative decoding][Re-take] Enable TP>1 speculative decoding (#4840)
comaniac May 16, 2024
5c34257
Add marlin unit tests and marlin benchmark script (#4815)
alexm-redhat May 16, 2024
99caa49
[Kernel] add bfloat16 support for gptq marlin kernel (#4788)
jinzhen-lin May 16, 2024
dbc0754
[docs] Fix typo in examples filename openi -> openai (#4864)
wuisawesome May 16, 2024
5e0391c
[Frontend] Separate OpenAI Batch Runner usage from API Server (#4851)
wuisawesome May 16, 2024
9216b9c
[Bugfix] Bypass authorization API token for preflight requests (#4862)
dulacp May 16, 2024
6979ade
Add GPTQ Marlin 2:4 sparse structured support (#4790)
alexm-redhat May 16, 2024
f09edd8
Add JSON output support for benchmark_latency and benchmark_throughpu…
simon-mo May 16, 2024
b5853f9
[ROCm][AMD][Bugfix] adding a missing triton autotune config (#4845)
hongxiayang May 16, 2024
e081880
[Core][Distributed] remove graph mode function (#4818)
youkaichao May 16, 2024
10fa9ee
[Misc] remove old comments (#4866)
youkaichao May 16, 2024
8435b20
[Kernel] Add punica dimension for Qwen1.5-32B LoRA (#4850)
Silencioo May 16, 2024
2060e93
[Kernel] Add w8a8 CUTLASS kernels (#4749)
tlrmchlsmth May 16, 2024
9a31a81
[Bugfix] Fix FP8 KV cache support (#4869)
WoosukKwon May 16, 2024
8e7fb5d
Support to serve vLLM on Kubernetes with LWS (#4829)
kerthcet May 16, 2024
0150a10
[Frontend] OpenAI API server: Do not add bos token by default when en…
bofenghuang May 17, 2024
2614812
[Build/CI] Extending the set of AMD tests with Regression, Basic Corr…
Alexei-V-Ivanov-AMD May 17, 2024
33e0823
[Bugfix] fix rope error when load models with different dtypes (#4835)
jinzhen-lin May 17, 2024
48d5985
Sync huggingface modifications of qwen Moe model (#4774)
eigen2017 May 17, 2024
c5711ef
[Doc] Update Ray Data distributed offline inference example (#4871)
Yard1 May 17, 2024
86b45ae
[Bugfix] Relax tiktoken to >= 0.6.0 (#4890)
mgoin May 17, 2024
c0724fc
[ROCm][Hardware][AMD] Adding Navi21 to fallback to naive attention if…
alexeykondrat May 18, 2024
2e9a222
[Lora] Support long context lora (#4787)
rkooo567 May 18, 2024
f68470e
[Bugfix][Model] Add base class for vision-language models (#4809)
DarkLight1337 May 19, 2024
27ce854
[Kernel] Add marlin_24 unit tests (#4901)
alexm-redhat May 19, 2024
b57e6c5
[Kernel] Add flash-attn back (#4907)
WoosukKwon May 20, 2024
6287537
[Model] LLaVA model refactor (#4910)
DarkLight1337 May 20, 2024
da5a0b5
Remove marlin warning (#4918)
alexm-redhat May 20, 2024
546a97e
[Misc]: allow user to specify port in distributed setting (#4914)
ZwwWayne May 20, 2024
943e72c
[Build/CI] Enabling AMD Entrypoints Test (#4834)
Alexei-V-Ivanov-AMD May 20, 2024
f0eecee
[Bugfix] Fix dummy weight for fp8 (#4916)
mzusman May 20, 2024
1937e29
[Core] Sharded State Loader download from HF (#4889)
aurickq May 20, 2024
c3af447
[Doc]Add documentation to benchmarking script when running TGI (#4920)
KuntaiDu May 20, 2024
65ae8c2
[Core] Fix scheduler considering "no LoRA" as "LoRA" (#4897)
Yard1 May 21, 2024
d130b57
[Model] add rope_scaling support for qwen2 (#4930)
hzhwcmhf May 21, 2024
f12c3b5
[Model] Add Phi-2 LoRA support (#4886)
Isotr0py May 21, 2024
e941f88
[Docs] Add acknowledgment for sponsors (#4925)
simon-mo May 21, 2024
757b62c
[CI/Build] Codespell ignore `build/` directory (#4945)
mgoin May 21, 2024
14772ee
[Bugfix] Fix flag name for `max_seq_len_to_capture` (#4935)
kerthcet May 21, 2024
99eff67
[Bugfix][Kernel] Add head size check for attention backend selection …
Isotr0py May 21, 2024
9b9a10d
[Frontend] Dynamic RoPE scaling (#4638)
sasha0552 May 22, 2024
5f6d10c
[CI/Build] Enforce style for C++ and CUDA code with `clang-format` (#…
mgoin May 22, 2024
c74c913
[misc] remove comments that were supposed to be removed (#4977)
rkooo567 May 22, 2024
8674f98
[Kernel] Fixup for CUTLASS kernels in CUDA graphs (#4954)
tlrmchlsmth May 22, 2024
a3a73ab
[Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893)
comaniac May 22, 2024
97b0300
[Model] LoRA gptbigcode implementation (#3949)
raywanb May 22, 2024
eb6d3c2
[Core] Eliminate parallel worker per-step task scheduling overhead (#…
njhill May 22, 2024
a36de68
[Minor] Fix small typo in llama.py: QKVParallelLinear -> Quantization…
pcmoritz May 22, 2024
ee3eea0
[Misc] Take user preference in attention selector (#4960)
comaniac May 22, 2024
6066253
Marlin 24 prefill performance improvement (about 25% better on averag…
alexm-redhat May 23, 2024
2ba80be
[Bugfix] Update Dockerfile.cpu to fix NameError: name 'vllm_ops' is n…
LetianLee May 23, 2024
5eda2ea
[Core][1/N] Support send/recv in PyNCCL Groups (#4988)
andoorve May 23, 2024
a124232
[Kernel] Initial Activation Quantization Support (#4525)
dsikka May 23, 2024
e3470f8
[Core]: Option To Use Prompt Token Ids Inside Logits Processor (#4985)
kezouke May 23, 2024
6a50f4c
[Doc] add ccache guide in doc (#5012)
youkaichao May 23, 2024
9197709
[Bugfix] Fix Mistral v0.3 Weight Loading (#5005)
robertgshaw2-redhat May 24, 2024
e64fde4
[Core][Bugfix]: fix prefix caching for blockv2 (#4764)
leiwen83 May 24, 2024
8e192ff
[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3…
linxihui May 25, 2024
325c119
[Misc] add logging level env var (#5045)
youkaichao May 25, 2024
d5a1697
[Dynamic Spec Decoding] Minor fix for disabling speculative decoding …
LiuXiaoxuanPKU May 25, 2024
f17a1a8
[Misc] Make Serving Benchmark More User-friendly (#5044)
ywang96 May 25, 2024
1102bef
[Bugfix / Core] Prefix Caching Guards (merged with main) (#4846)
zhuohan123 May 27, 2024
fbdb7b3
[Core] Allow AQLM on Pascal (#5058)
sasha0552 May 27, 2024
890aa93
[Model] Add support for falcon-11B (#5069)
Isotr0py May 27, 2024
d4f3985
[Core] Sliding window for block manager v2 (#4545)
mmoskal May 28, 2024
9ba4155
[BugFix] Fix Embedding Models with TP>1 (#5075)
robertgshaw2-redhat May 28, 2024
dd8de11
[Kernel][ROCm][AMD] Add fused_moe Triton configs for MI300X (#4951)
divakar-amd May 28, 2024
290f4ad
[Docs] Add Dropbox as sponsors (#5089)
simon-mo May 28, 2024
5ae5ed1
[Core] Consolidate prompt arguments to LLM engines (#4328)
DarkLight1337 May 28, 2024
dfba529
[Bugfix] Remove the last EOS token unless explicitly specified (#5077)
jsato8094 May 29, 2024
616e600
[Misc] add gpu_memory_utilization arg (#5079)
pandyamarut May 29, 2024
5bd3c65
[Core][Optimization] remove vllm-nccl (#5091)
youkaichao May 29, 2024
18c1f16
[Bugfix] Fix arguments passed to `Sequence` in stop checker test (#5092)
DarkLight1337 May 29, 2024
594392d
[Core][Distributed] improve p2p access check (#4992)
youkaichao May 29, 2024
4238bc8
[Core] Cross-attention KV caching and memory-management (towards even…
afeldman-nm May 29, 2024
ae495c7
[Doc]Replace deprecated flag in readme (#4526)
ronensc May 29, 2024
eecd864
[Bugfix][CI/Build] Fix test and improve code for `merge_async_iterato…
DarkLight1337 May 29, 2024
eb6c50c
[Bugfix][CI/Build] Fix codespell failing to skip files in `git diff` …
DarkLight1337 May 29, 2024
b1c2556
[Core] Avoid the need to pass `None` values to `Sequence.inputs` (#5099)
DarkLight1337 May 29, 2024
7c3604f
[Bugfix] logprobs is not compatible with the OpenAI spec #4795 (#5031)
Etelis May 29, 2024
4fbcb0f
[Doc][Build] update after removing vllm-nccl (#5103)
youkaichao May 29, 2024
5bf185a
[Bugfix] gptq_marlin: Ensure g_idx_sort_indices is not a Parameter (#…
alexm-redhat May 30, 2024
e07aff9
[CI/Build] Docker cleanup functionality for amd servers (#5112)
okakarpa May 30, 2024
87d41c8
[BUGFIX] [FRONTEND] Correct chat logprobs (#5029)
br3no May 30, 2024
d910816
[Bugfix] Automatically Detect SparseML models (#5119)
robertgshaw2-redhat May 30, 2024
f758505
[CI/Build] increase wheel size limit to 200 MB (#5130)
youkaichao May 30, 2024
d79d9ea
[Misc] remove duplicate definition of `seq_lens_tensor` in model_runn…
ita9naiwa May 30, 2024
a9bcc7a
[Doc] Use intersphinx and update entrypoints docs (#5125)
DarkLight1337 May 30, 2024
429d897
add doc about serving option on dstack (#3074)
deep-diver May 30, 2024
87a658c
Bump version to v0.4.3 (#5046)
simon-mo May 30, 2024
45a1a69
[Build] Disable sm_90a in cu11 (#5141)
simon-mo May 30, 2024
b35be54
[Bugfix] Avoid Warnings in SparseML Activation Quantization (#5120)
robertgshaw2-redhat May 31, 2024
6d21fa1
[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::orde…
alexm-redhat May 31, 2024
533c217
Fix cutlass sm_90a vesrion in CMakeList
simon-mo May 31, 2024
a22dea5
[Model] Support MAP-NEO model (#5081)
xingweiqu May 31, 2024
e9d3aa0
Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using th…
simon-mo May 31, 2024
a377f0b
[Misc]: optimize eager mode host time (#4196)
FuncSherl May 31, 2024
e9899fb
[Model] Enable FP8 QKV in MoE and refine kernel tuning script (#5039)
comaniac May 31, 2024
6575791
[Doc] Add checkmark for GPTBigCodeForCausalLM LoRA support (#5171)
njhill Jun 1, 2024
1197e02
[Build] Guard against older CUDA versions when building CUTLASS 3.x k…
tlrmchlsmth Jun 1, 2024
a360ff8
[CI/Build] CMakeLists: build all extensions' cmake targets at the sam…
dtrifiro Jun 1, 2024
4019807
Update Dockerfile.rocm
shajrawi May 30, 2024
1f7c555
Merge commit 'a360ff80bb34f9dfcd21cf880c2030daa2d6b3a3' of https://gi…
mawong-amd Jun 4, 2024
324cc8b
Use world group to broadcast metadata on ROCm
mawong-amd Jun 4, 2024
b373a0e
Custom PagedAttn optimizations for ROCm
lcskrishna May 23, 2024
c893d70
Update linear.py
gshtras Jun 3, 2024
b0ba2db
adding rocm fp8
charlifu May 14, 2024
86bbfef
Fixes from main:
mawong-amd Jun 6, 2024
9e4e680
Merge branch 'main' of github.com:ROCm/vllm into main_upstream_candid…
mawong-amd Jun 6, 2024
[Core][Optimization] change copy-on-write from dict[int, list] to list
youkaichao authored May 7, 2024

commit 469f85c7829c301b6dec48725951b5501c18d611
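
At a high level, this commit flattens copy-on-write (CoW) bookkeeping from a one-to-many dict into a flat list of (src, dst) pairs. A minimal sketch of the before/after shapes, with sample block IDs for illustration (not vLLM's actual API):

# Sketch of the data-structure change; values are illustrative.
from collections import defaultdict
from typing import Dict, List, Tuple

# Before: each source block ID mapped to a list of destination block IDs.
cows_dict: Dict[int, List[int]] = defaultdict(list)
cows_dict[2].append(3)
cows_dict[2].append(5)

# After: a flat list of (src, dst) pairs carrying the same information,
# built with O(1) appends and consumed without nested loops.
cows_list: List[Tuple[int, int]] = [(2, 3), (2, 5)]

# Both encode the same set of copies.
assert [(s, d) for s, ds in cows_dict.items() for d in ds] == cows_list
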
6 changes: 2 additions & 4 deletions tests/core/block/test_block_table.py
@@ -410,8 +410,7 @@ def test_cow(block_size: int, sequence_len: int, append_len: int,
         expected_src = static_block_table.physical_block_ids[cow_block_id]
         expected_dst = appender_block_table.physical_block_ids[cow_block_id]
 
-        assert expected_src in cows
-        assert expected_dst in cows[expected_src]
+        assert (expected_src, expected_dst) in cows
     else:
         # Otherwise, there should be no copy-on-write.
         assert not cows
@@ -490,8 +489,7 @@ def test_cow_lookahead_simple(block_size: int, sequence_len: int,
         expected_src = static_block_table.physical_block_ids[cow_block_id]
         expected_dst = appender_block_table.physical_block_ids[cow_block_id]
 
-        assert expected_src in cows
-        assert expected_dst in cows[expected_src]
+        assert (expected_src, expected_dst) in cows
 
     static_block_table.free()
     appender_block_table.free()
6 changes: 5 additions & 1 deletion tests/core/test_block_manager.py
@@ -1,4 +1,5 @@
 import time
+from collections import defaultdict
 from typing import List
 
 import pytest
@@ -155,7 +156,10 @@ def test_append_slot_cow():
 
     cows = block_manager.append_slots(child)
     assert cows
-    for src_block, dst_blocks in cows.items():
+    dict_cows = defaultdict(list)
+    for src_block, dst_block in cows:
+        dict_cows[src_block].append(dst_block)
+    for src_block, dst_blocks in dict_cows.items():
         assert src_block not in dst_blocks
 
     after_blocks = block_manager.get_num_free_gpu_blocks()
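
The regrouping above rebuilds the old dict shape from the new flat list so the existing assertions still apply. As a standalone sketch with sample values:

from collections import defaultdict

# Regroup flat (src, dst) CoW pairs into the old src -> [dst, ...] shape.
cows = [(2, 3), (2, 5), (7, 8)]
dict_cows = defaultdict(list)
for src_block, dst_block in cows:
    dict_cows[src_block].append(dst_block)

assert dict_cows == {2: [3, 5], 7: [8]}
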
4 changes: 2 additions & 2 deletions tests/core/test_scheduler.py
@@ -636,7 +636,7 @@ def test_schedule_decode_blocks_to_copy_update():
 
     # The last request should be swapped out.
     scheduler.block_manager.append_slots = MagicMock()
-    scheduler.block_manager.append_slots.return_value = {2: [3]}
+    scheduler.block_manager.append_slots.return_value = [(2, 3)]
 
     budget = create_token_budget()
     remaining_running, output = scheduler._schedule_running(
@@ -845,7 +845,7 @@ def test_schedule_swapped_blocks_to_copy():
 
     # The last request should be swapped out.
     scheduler.block_manager.append_slots = MagicMock()
-    scheduler.block_manager.append_slots.return_value = {2: [3]}
+    scheduler.block_manager.append_slots.return_value = [(2, 3)]
 
     budget = create_token_budget()
     remaining_swapped, output = scheduler._schedule_swapped(
21 changes: 10 additions & 11 deletions vllm/core/block/common.py
@@ -1,5 +1,4 @@
-from collections import defaultdict
-from typing import Dict, Iterable, List, Optional, Protocol
+from typing import Dict, Iterable, List, Optional, Protocol, Tuple
 
 from vllm.core.block.interfaces import Block, BlockAllocator
 
@@ -111,7 +110,7 @@ def __init__(
         refcounter: RefCounterProtocol,
         allocator: BlockAllocator,
     ):
-        self._copy_on_writes: Dict[BlockId, List[BlockId]] = defaultdict(list)
+        self._copy_on_writes: List[Tuple[BlockId, BlockId]] = []
         self._refcounter = refcounter
         self._allocator = allocator
 
@@ -152,25 +151,25 @@ def cow_block_if_not_appendable(self, block: Block) -> Optional[BlockId]:
         # Track src/dst copy.
         assert src_block_id is not None
         assert block_id is not None
-        self._copy_on_writes[src_block_id].append(block_id)
+        self._copy_on_writes.append((src_block_id, block_id))
 
         return block_id
 
-    def clear_cows(self) -> Dict[BlockId, List[BlockId]]:
+    def clear_cows(self) -> List[Tuple[BlockId, BlockId]]:
         """Clears the copy-on-write tracking information and returns the current
         state.
 
-        This method returns a dictionary mapping source block indices to lists
-        of destination block indices for the current copy-on-write operations.
+        This method returns a list mapping source block indices to
+        destination block indices for the current copy-on-write operations.
         It then clears the internal tracking information.
 
         Returns:
-            Dict[BlockId, List[BlockId]]: A dictionary mapping source
-                block indices to lists of destination block indices for the
+            List[Tuple[BlockId, BlockId]]: A list mapping source
+                block indices to destination block indices for the
                 current copy-on-write operations.
         """
-        cows = dict(self._copy_on_writes)
-        self._copy_on_writes.clear()
+        cows = self._copy_on_writes
+        self._copy_on_writes = []
         return cows
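
The clear-and-swap in clear_cows returns the accumulated list and replaces it with a fresh one, where the old code had to build dict(self._copy_on_writes) and then clear in place. A self-contained sketch of that pattern; the class name and record method are stand-ins for illustration, not the vLLM API:

from typing import List, Tuple

BlockId = int

class CowTrackerSketch:
    # Simplified stand-in for CopyOnWriteTracker, for illustration only.
    def __init__(self) -> None:
        self._copy_on_writes: List[Tuple[BlockId, BlockId]] = []

    def record(self, src: BlockId, dst: BlockId) -> None:
        # One O(1) append per CoW event; no per-source list needed.
        self._copy_on_writes.append((src, dst))

    def clear_cows(self) -> List[Tuple[BlockId, BlockId]]:
        # Hand back the accumulated pairs and start fresh; no copy made.
        cows = self._copy_on_writes
        self._copy_on_writes = []
        return cows

tracker = CowTrackerSketch()
tracker.record(2, 3)
assert tracker.clear_cows() == [(2, 3)]
assert tracker.clear_cows() == []
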


8 changes: 4 additions & 4 deletions vllm/core/block/cpu_gpu_block_allocator.py
@@ -1,4 +1,4 @@
-from typing import Dict, FrozenSet, List, Optional
+from typing import Dict, FrozenSet, List, Optional, Tuple
 
 from vllm.core.block.interfaces import (Block, BlockAllocator, BlockId,
                                         DeviceAwareBlockAllocator)
@@ -185,13 +185,13 @@ def get_num_free_blocks(self, device: Device) -> int:
     def get_num_total_blocks(self, device: Device) -> int:
         return self._allocators[device].get_num_total_blocks()
 
-    def clear_copy_on_writes(self) -> Dict[int, List[int]]:
+    def clear_copy_on_writes(self) -> List[Tuple[int, int]]:
         """Clears the copy-on-write (CoW) state and returns the mapping of
         source to destination block IDs.
 
         Returns:
-            Dict[int, List[int]]: A dictionary mapping source block IDs to lists
-                of destination block IDs.
+            List[Tuple[int, int]]: A list mapping source block IDs to
+                destination block IDs.
         """
         # CoW only supported on GPU
         device = Device.GPU
6 changes: 3 additions & 3 deletions vllm/core/block/interfaces.py
@@ -1,5 +1,5 @@
 from abc import ABC, abstractmethod
-from typing import Dict, FrozenSet, List, Optional, Protocol
+from typing import FrozenSet, List, Optional, Protocol, Tuple
 
 from vllm.utils import Device
 
@@ -122,7 +122,7 @@ def all_block_ids(self) -> FrozenSet[int]:
         pass
 
     @abstractmethod
-    def clear_copy_on_writes(self) -> Dict[int, List[int]]:
+    def clear_copy_on_writes(self) -> List[Tuple[int, int]]:
         pass
 
     @abstractmethod
@@ -187,7 +187,7 @@ def all_block_ids(self) -> FrozenSet[int]:
         pass
 
     @abstractmethod
-    def clear_copy_on_writes(self) -> Dict[int, List[int]]:
+    def clear_copy_on_writes(self) -> List[Tuple[int, int]]:
         pass
 
     @abstractmethod
8 changes: 4 additions & 4 deletions vllm/core/block/naive_block.py
@@ -1,4 +1,4 @@
-from typing import Dict, FrozenSet, Iterable, List, Optional, Set
+from typing import FrozenSet, Iterable, List, Optional, Set, Tuple
 
 from vllm.core.block.common import (CopyOnWriteTracker, RefCounter,
                                     get_all_blocks_recursively)
@@ -175,12 +175,12 @@ def cow_block_if_not_appendable(self, block: Block) -> Optional[BlockId]:
         """
         return self._cow_tracker.cow_block_if_not_appendable(block)
 
-    def clear_copy_on_writes(self) -> Dict[BlockId, List[BlockId]]:
+    def clear_copy_on_writes(self) -> List[Tuple[BlockId, BlockId]]:
         """Returns the copy-on-write source->destination mapping and clears it.
 
         Returns:
-            Dict[BlockId, List[BlockId]]: A dictionary mapping source
-                block indices to lists of destination block indices.
+            List[Tuple[BlockId, BlockId]]: A list mapping source
+                block indices to destination block indices.
         """
         return self._cow_tracker.clear_cows()

8 changes: 4 additions & 4 deletions vllm/core/block/prefix_caching_block.py
@@ -1,7 +1,7 @@
 """Token blocks."""
 from itertools import takewhile
 from os.path import commonprefix
-from typing import Dict, FrozenSet, Iterable, List, Optional
+from typing import Dict, FrozenSet, Iterable, List, Optional, Tuple
 
 from vllm.core.block.common import (CopyOnWriteTracker,
                                     get_all_blocks_recursively)
@@ -337,12 +337,12 @@ def cow_block_if_not_appendable(self, block: Block) -> Optional[BlockId]:
         """
         return self._cow_tracker.cow_block_if_not_appendable(block)
 
-    def clear_copy_on_writes(self) -> Dict[BlockId, List[BlockId]]:
+    def clear_copy_on_writes(self) -> List[Tuple[BlockId, BlockId]]:
         """Returns the copy-on-write source->destination mapping and clears it.
 
         Returns:
-            Dict[BlockId, List[BlockId]]: A dictionary mapping source
-                block indices to lists of destination block indices.
+            List[Tuple[BlockId, BlockId]]: A list mapping source
+                block indices to destination block indices.
         """
         return self._cow_tracker.clear_cows()

10 changes: 5 additions & 5 deletions vllm/core/block_manager_v1.py
@@ -5,7 +5,7 @@
 from os.path import commonprefix
 from typing import Dict, List, Optional
 from typing import Sequence as GenericSequence
-from typing import Set
+from typing import Set, Tuple
 
 from vllm.block import BlockTable, PhysicalTokenBlock
 from vllm.core.evictor_v1 import EvictionPolicy, Evictor, make_evictor
@@ -386,7 +386,7 @@ def append_slots(
         self,
         seq: Sequence,
         num_lookahead_slots: int = 0,
-    ) -> Dict[int, List[int]]:
+    ) -> List[Tuple[int, int]]:
         """Allocate a physical slot for a new token."""
         logical_blocks = seq.logical_token_blocks
         block_table = self.block_tables[seq.seq_id]
@@ -405,7 +405,7 @@ def append_slots(
             # Allocate a new physical block.
             new_block = self._allocate_last_physical_block(seq)
             block_table.append(new_block)
-            return {}
+            return []
 
         # We want to append the token to the last physical block.
         last_block = block_table[-1]
@@ -418,15 +418,15 @@ def append_slots(
                 maybe_new_block = self._maybe_promote_last_block(
                     seq, last_block)
                 block_table[-1] = maybe_new_block
-            return {}
+            return []
         else:
             # The last block is shared with other sequences.
             # Copy on Write: Allocate a new block and copy the tokens.
             new_block = self._allocate_last_physical_block(seq)
 
             block_table[-1] = new_block
             self.gpu_allocator.free(last_block)
-            return {last_block.block_number: [new_block.block_number]}
+            return [(last_block.block_number, new_block.block_number)]
 
     def fork(self, parent_seq: Sequence, child_seq: Sequence) -> None:
         # NOTE: fork does not allocate a new physical block.
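
With the new signature, append_slots either returns an empty list (the table grew or the token was appended in place) or a single (src, dst) pair (the last block was shared, so a fresh block was allocated and a copy scheduled). A toy model of that decision, using hypothetical names in place of the block-manager internals:

from typing import List, Tuple

def toy_append_slot(ref_count: int, last_block: int,
                    new_block: int) -> List[Tuple[int, int]]:
    # Toy model of the CoW branch in append_slots; names are hypothetical.
    if ref_count == 1:
        # Exclusively owned: append in place, nothing to copy.
        return []
    # Shared with other sequences: copy-on-write into the new block.
    return [(last_block, new_block)]

assert toy_append_slot(ref_count=1, last_block=4, new_block=9) == []
assert toy_append_slot(ref_count=2, last_block=4, new_block=9) == [(4, 9)]
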
3 changes: 2 additions & 1 deletion vllm/core/block_manager_v2.py
@@ -1,6 +1,7 @@
 """A block manager that manages token blocks."""
 from typing import Dict, List, Optional
 from typing import Sequence as GenericSequence
+from typing import Tuple
 
 from vllm.core.block.block_table import BlockTable
 from vllm.core.block.cpu_gpu_block_allocator import CpuGpuBlockAllocator
@@ -166,7 +167,7 @@ def append_slots(
         self,
         seq: Sequence,
         num_lookahead_slots: int,
-    ) -> Dict[int, List[int]]:
+    ) -> List[Tuple[int, int]]:
 
         block_table = self.block_tables[seq.seq_id]

3 changes: 2 additions & 1 deletion vllm/core/interfaces.py
@@ -2,6 +2,7 @@
 from abc import ABC, abstractmethod
 from typing import Dict, List
 from typing import Sequence as GenericSequence
+from typing import Tuple
 
 from vllm.sequence import Sequence, SequenceGroup
 
@@ -54,7 +55,7 @@ def append_slots(
         self,
         seq: Sequence,
         num_lookahead_slots: int,
-    ) -> Dict[int, List[int]]:
+    ) -> List[Tuple[int, int]]:
         pass
 
     @abstractmethod
5 changes: 1 addition & 4 deletions vllm/core/scheduler.py
@@ -1027,10 +1027,7 @@ def _append_slots(
 
         for seq in seq_group.get_seqs(status=SequenceStatus.RUNNING):
             cows = self.block_manager.append_slots(seq, num_lookahead_slots)
-
-            for src, dests in cows.items():
-                for dest in dests:
-                    blocks_to_copy.append((src, dest))
+            blocks_to_copy.extend(cows)
 
     def _preempt(
         self,
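
Because append_slots now returns pairs directly, the scheduler's nested dict traversal collapses into a single extend. Both forms below produce the same blocks_to_copy; the values are illustrative:

from typing import Dict, List, Tuple

# Old shape: a src -> [dst, ...] dict needed a nested loop to flatten.
old_cows: Dict[int, List[int]] = {2: [3, 5]}
blocks_to_copy_old: List[Tuple[int, int]] = []
for src, dests in old_cows.items():
    for dest in dests:
        blocks_to_copy_old.append((src, dest))

# New shape: the pairs arrive pre-flattened, so extend() suffices.
new_cows: List[Tuple[int, int]] = [(2, 3), (2, 5)]
blocks_to_copy_new: List[Tuple[int, int]] = []
blocks_to_copy_new.extend(new_cows)

assert blocks_to_copy_old == blocks_to_copy_new
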