Upstream merge 24/09/16 #187

Merged — 83 commits, Sep 16, 2024

Changes from all commits (83 commits)
c7cb5c3
[Misc] GPTQ Activation Ordering (#8135)
kylesayrs Sep 9, 2024
6cd5e5b
[Misc] Fused MoE Marlin support for GPTQ (#8217)
dsikka Sep 10, 2024
a1d8742
Add NVIDIA Meetup slides, announce AMD meetup, and add contact info (…
simon-mo Sep 10, 2024
da1a844
[Bugfix] Fix missing `post_layernorm` in CLIP (#8155)
DarkLight1337 Sep 10, 2024
6234385
[CI/Build] enable ccache/scccache for HIP builds (#8327)
dtrifiro Sep 10, 2024
8c054b7
[Frontend] Clean up type annotations for mistral tokenizer (#8314)
DarkLight1337 Sep 10, 2024
f421f3c
[CI/Build] Enabling kernels tests for AMD, ignoring some of them that…
alexeykondrat Sep 10, 2024
02751a7
Fix ppc64le buildkite job (#8309)
sumitd2 Sep 10, 2024
5faedf1
[Spec Decode] Move ops.advance_step to flash attn advance_step (#8224)
kevin314 Sep 10, 2024
04e7c4e
[Misc] remove peft as dependency for prompt models (#8162)
prashantgupta24 Sep 10, 2024
b1f3e18
[MISC] Keep chunked prefill enabled by default with long context when…
comaniac Sep 10, 2024
22f3a4b
[Bugfix] lookahead block table with cuda graph max capture (#8340)
alexm-neuralmagic Sep 10, 2024
1d5e397
[Core/Bugfix] pass VLLM_ATTENTION_BACKEND to ray workers (#8172)
SolitaryThinker Sep 10, 2024
94144e7
[CI/Build][Kernel] Update CUTLASS to 3.5.1 tag (#8043)
tlrmchlsmth Sep 10, 2024
e497b8a
[Misc] Skip loading extra bias for Qwen2-MOE GPTQ models (#8329)
jeejeelee Sep 11, 2024
1230263
[Bugfix] Fix InternVL2 vision embeddings process with pipeline parall…
Isotr0py Sep 11, 2024
efcf946
[Hardware][NV] Add support for ModelOpt static scaling checkpoints. (…
pavanimajety Sep 11, 2024
6a512a0
[model] Support for Llava-Next-Video model (#7559)
TKONIY Sep 11, 2024
cea95df
[Frontend] Create ErrorResponse instead of raising exceptions in run_…
pooyadavoodi Sep 11, 2024
3b7fea7
[Model][VLM] Add Qwen2-VL model support (#7905)
fyabc Sep 11, 2024
0b952af
[Hardware][Intel] Support compressed-tensor W8A8 for CPU backend (#7257)
bigPYJ1151 Sep 11, 2024
aea02f3
[CI/Build] Excluding test_moe.py from AMD Kernels tests for investiga…
alexeykondrat Sep 11, 2024
7015417
[Bugfix] Add missing attributes in mistral tokenizer (#8364)
DarkLight1337 Sep 11, 2024
73202db
[Kernel][Misc] register ops to prevent graph breaks (#6917)
bnellnm Sep 11, 2024
8baa454
[Misc] Move device options to a single place (#8322)
akx Sep 11, 2024
775f00f
[Speculative Decoding] Test refactor (#8317)
LiuXiaoxuanPKU Sep 11, 2024
d394787
Pixtral (#8377)
patrickvonplaten Sep 11, 2024
3fd2b0d
Bump version to v0.6.1 (#8379)
simon-mo Sep 11, 2024
a65cb16
[MISC] Dump model runner inputs when crashing (#8305)
comaniac Sep 12, 2024
f842a7a
[misc] remove engine_use_ray (#8126)
youkaichao Sep 12, 2024
b71c956
[TPU] Use Ray for default distributed backend (#8389)
WoosukKwon Sep 12, 2024
b6c75e1
Fix the AMD weight loading tests (#8390)
mgoin Sep 12, 2024
5a60699
[Bugfix]: Fix the logic for deciding if tool parsing is used (#8366)
tomeras91 Sep 12, 2024
1bf2dd9
[Gemma2] add bitsandbytes support for Gemma2 (#8338)
blueyo0 Sep 12, 2024
295c473
[Misc] Raise error when using encoder/decoder model with cpu backend …
kevin314 Sep 12, 2024
42ffba1
[Misc] Use RoPE cache for MRoPE (#8396)
WoosukKwon Sep 12, 2024
7de49aa
[torch.compile] hide slicing under custom op for inductor (#8384)
youkaichao Sep 12, 2024
520ca38
[Hotfix][VLM] Fixing max position embeddings for Pixtral (#8399)
ywang96 Sep 12, 2024
e56bf27
[Bugfix] Fix InternVL2 inference with various num_patches (#8375)
Isotr0py Sep 12, 2024
c6202da
[Model] Support multiple images for qwen-vl (#8247)
alex-jw-brooks Sep 12, 2024
8a23e93
[BugFix] lazy init _copy_stream to avoid torch init wrong gpu instanc…
lnykww Sep 12, 2024
1f0c75a
[BugFix] Fix Duplicate Assignment in Hermes2ProToolParser (#8423)
vegaluisjose Sep 12, 2024
f2e263b
[Bugfix] Offline mode fix (#8376)
joerunde Sep 12, 2024
a6c0f36
[multi-step] add flashinfer backend (#7928)
SolitaryThinker Sep 12, 2024
551ce01
[Core] Add engine option to return only deltas or final output (#7381)
njhill Sep 12, 2024
0198772
[Bugfix] multi-step + flashinfer: ensure cuda graph compatible (#8427)
alexm-neuralmagic Sep 12, 2024
c163694
[Hotfix][Core][VLM] Disable chunked prefill by default and prefix cac…
ywang96 Sep 12, 2024
b61bd98
[CI/Build] Disable multi-node test for InternVL2 (#8428)
ywang96 Sep 12, 2024
d31174a
[Hotfix][Pixtral] Fix multiple images bugs (#8415)
patrickvonplaten Sep 12, 2024
a480939
[Bugfix] Fix weight loading issue by rename variable. (#8293)
wenxcs Sep 12, 2024
360ddbd
[Misc] Update Pixtral example (#8431)
ywang96 Sep 13, 2024
8f44a92
[BugFix] fix group_topk (#8430)
dsikka Sep 13, 2024
5ec9c0f
[Core] Factor out input preprocessing to a separate class (#7329)
DarkLight1337 Sep 13, 2024
40c3965
[Bugfix] Mapping physical device indices for e2e test utils (#8290)
ShangmingCai Sep 13, 2024
3f79bc3
[Bugfix] Bump fastapi and pydantic version (#8435)
DarkLight1337 Sep 13, 2024
8427550
[CI/Build] Update pixtral tests to use JSON (#8436)
DarkLight1337 Sep 13, 2024
6821020
[Bugfix] Fix async log stats (#8417)
alexm-neuralmagic Sep 13, 2024
ba77527
[bugfix] torch profiler bug for single gpu with GPUExecutor (#8354)
SolitaryThinker Sep 13, 2024
acda0b3
bump version to v0.6.1.post1 (#8440)
simon-mo Sep 13, 2024
9b4a3b2
[CI/Build] Enable InternVL2 PP test only on single node (#8437)
Isotr0py Sep 13, 2024
cab69a1
[doc] recommend pip instead of conda (#8446)
youkaichao Sep 13, 2024
06311e2
[Misc] Skip loading extra bias for Qwen2-VL GPTQ-Int8 (#8442)
jeejeelee Sep 13, 2024
a246912
[misc][ci] fix quant test (#8449)
youkaichao Sep 13, 2024
ecd7a1d
[Installation] Gate FastAPI version for Python 3.8 (#8456)
DarkLight1337 Sep 13, 2024
0a4806f
[plugin][torch.compile] allow to add custom compile backend (#8445)
youkaichao Sep 13, 2024
a84e598
[CI/Build] Reorganize models tests (#7820)
DarkLight1337 Sep 13, 2024
f57092c
[Doc] Add oneDNN installation to CPU backend documentation (#8467)
Isotr0py Sep 13, 2024
18e9e1f
[HotFix] Fix final output truncation with stop string + streaming (#8…
njhill Sep 13, 2024
9ba0817
bump version to v0.6.1.post2 (#8473)
simon-mo Sep 13, 2024
daddc14
[bugfix] add multi-step advance_step to ROCmFlashAttentionMetadata
SolitaryThinker Sep 13, 2024
306f21f
add rocm to MULTI_STEP_ATTENTION_BACKENDS
SolitaryThinker Sep 13, 2024
8517252
[Hardware][intel GPU] bump up ipex version to 2.3 (#8365)
jikunshang Sep 13, 2024
1ef0d2e
[Kernel][Hardware][Amd]Custom paged attention kernel for rocm (#8310)
charlifu Sep 14, 2024
8a0cf1d
[Model] support minicpm3 (#8297)
SUDA-HLT-ywfang Sep 14, 2024
a36e070
[torch.compile] fix functionalization (#8480)
youkaichao Sep 14, 2024
47790f3
[torch.compile] add a flag to disable custom op (#8488)
youkaichao Sep 14, 2024
50e9ec4
[TPU] Implement multi-step scheduling (#8489)
WoosukKwon Sep 14, 2024
3724d5f
[Bugfix][Model] Fix Python 3.8 compatibility in Pixtral model by upda…
chrisociepa Sep 15, 2024
fc990f9
[Bugfix][Kernel] Add `IQ1_M` quantization implementation to GGUF kern…
Isotr0py Sep 15, 2024
0f397c3
Merge remote-tracking branch 'upstream/main'
gshtras Sep 16, 2024
b0a39a4
New llm_engine output format
gshtras Sep 16, 2024
30a9875
Merge remote-tracking branch 'st/ms-rocm-advance-step' into upstream_…
gshtras Sep 16, 2024
c27753d
Fix tests - disable marlin_fiest_moe; fix rocm_paged attention
gshtras Sep 16, 2024
25 changes: 24 additions & 1 deletion .buildkite/run-amd-test.sh
@@ -71,13 +71,36 @@ mkdir -p ${HF_CACHE}
HF_MOUNT="/root/.cache/huggingface"

commands=$@
echo "Commands:$commands"
#ignore certain kernels tests
if [[ $commands == *" kernels "* ]]; then
commands="${commands} \
--ignore=kernels/test_attention.py \
--ignore=kernels/test_attention_selector.py \
--ignore=kernels/test_blocksparse_attention.py \
--ignore=kernels/test_causal_conv1d.py \
--ignore=kernels/test_cutlass.py \
--ignore=kernels/test_encoder_decoder_attn.py \
--ignore=kernels/test_flash_attn.py \
--ignore=kernels/test_flashinfer.py \
--ignore=kernels/test_int8_quant.py \
--ignore=kernels/test_machete_gemm.py \
--ignore=kernels/test_mamba_ssm.py \
--ignore=kernels/test_marlin_gemm.py \
--ignore=kernels/test_moe.py \
--ignore=kernels/test_prefix_prefill.py \
--ignore=kernels/test_rand.py \
--ignore=kernels/test_sampler.py"
fi

PARALLEL_JOB_COUNT=8
# check if the command contains shard flag, we will run all shards in parallel because the host have 8 GPUs.
if [[ $commands == *"--shard-id="* ]]; then
for GPU in $(seq 0 $(($PARALLEL_JOB_COUNT-1))); do
#replace shard arguments
commands=${@//"--shard-id= "/"--shard-id=${GPU} "}
commands=${commands//"--shard-id= "/"--shard-id=${GPU} "}
commands=${commands//"--num-shards= "/"--num-shards=${PARALLEL_JOB_COUNT} "}
echo "Shard ${GPU} commands:$commands"
docker run \
--device /dev/kfd --device /dev/dri \
--network host \
3 changes: 2 additions & 1 deletion .buildkite/run-cpu-test-ppc64le.sh
@@ -11,8 +11,9 @@ trap remove_docker_container EXIT
remove_docker_container

# Run the image, setting --shm-size=4g for tensor parallel.
source /etc/environment
#docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test cpu-test
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true --network host -e HF_TOKEN --name cpu-test cpu-test
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true --network host -e HF_TOKEN=$HF_TOKEN --name cpu-test cpu-test

# Run basic model test
docker exec cpu-test bash -c "
18 changes: 11 additions & 7 deletions .buildkite/run-cpu-test.sh
@@ -22,13 +22,17 @@ docker exec cpu-test-avx2 bash -c "python3 examples/offline_inference.py"

# Run basic model test
docker exec cpu-test bash -c "
pip install pytest matplotlib einops transformers_stream_generator
pytest -v -s tests/models -m \"not vlm\" --ignore=tests/models/test_embedding.py \
--ignore=tests/models/test_oot_registration.py \
--ignore=tests/models/test_registry.py \
--ignore=tests/models/test_fp8.py \
--ignore=tests/models/test_jamba.py \
--ignore=tests/models/test_danube3_4b.py" # Mamba and Danube3-4B on CPU is not supported
pip install pytest matplotlib einops transformers_stream_generator datamodel_code_generator
pytest -v -s tests/models/decoder_only/language \
--ignore=tests/models/test_fp8.py \
--ignore=tests/models/decoder_only/language/test_jamba.py \
--ignore=tests/models/decoder_only/language/test_danube3_4b.py" # Mamba and Danube3-4B on CPU is not supported

# Run compressed-tensor test
docker exec cpu-test bash -c "
pytest -s -v \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_static_setup \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_dynanmic_per_token"

# online inference
docker exec cpu-test bash -c "
89 changes: 62 additions & 27 deletions .buildkite/test-pipeline.yaml
@@ -50,6 +50,7 @@ steps:
- tests/worker
commands:
- pytest -v -s async_engine # Async Engine
- NUM_SCHEDULER_STEPS=4 pytest -v -s async_engine/test_async_llm_engine.py
- pytest -v -s test_inputs.py
- pytest -v -s multimodal
- pytest -v -s test_utils.py # Utils
@@ -91,7 +92,7 @@ steps:
- pytest -v -s entrypoints/llm/test_lazy_outlines.py # it needs a clean process
- pytest -v -s entrypoints/openai
- pytest -v -s entrypoints/test_chat_utils.py

- pytest -v -s entrypoints/offline_mode # Needs to avoid interference with other tests

- label: Distributed Tests (4 GPUs) # 10min
working_dir: "/vllm-workspace/tests"
@@ -162,30 +163,13 @@ steps:
- python3 tensorize_vllm_model.py --model facebook/opt-125m serialize --serialized-directory /tmp/ --suffix v1 && python3 tensorize_vllm_model.py --model facebook/opt-125m deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors
- python3 offline_inference_encoder_decoder.py

- label: Models Test # 1hr10min
source_file_dependencies:
- vllm/
- tests/models
commands:
- pip install -e ./plugins/vllm_add_dummy_model
- pytest -v -s models/test_oot_registration.py # it needs a clean process
- pytest -v -s models -m \"not vlm\" --ignore=models/test_oot_registration.py

- label: torch compile integration test
source_file_dependencies:
- vllm/
commands:
- pytest -v -s ./compile/test_full_graph.py
- pytest -v -s ./compile/test_wrapper.py


- label: Vision Language Models Test # 42min
#mirror_hardwares: [amd]
source_file_dependencies:
- vllm/
commands:
- pytest -v -s models -m vlm

- label: Prefix Caching Test # 7min
#mirror_hardwares: [amd]
source_file_dependencies:
@@ -217,7 +201,8 @@ steps:
commands:
# See https://github.com/vllm-project/vllm/issues/5152
- export VLLM_ATTENTION_BACKEND=XFORMERS
- pytest -v -s spec_decode
- pytest -v -s spec_decode/e2e/test_multistep_correctness.py
- pytest -v -s spec_decode --ignore=spec_decode/e2e/test_multistep_correctness.py

- label: LoRA Test %N # 30min each
mirror_hardwares: [amd]
@@ -228,6 +213,7 @@
parallelism: 4

- label: Kernels Test %N # 30min each
mirror_hardwares: [amd]
source_file_dependencies:
- csrc/
- vllm/attention
@@ -282,6 +268,45 @@ steps:
commands:
- pytest -v -s tool_use

##### models test #####

- label: Basic Models Test # 3min
source_file_dependencies:
- vllm/
- tests/models
commands:
- pip install -e ./plugins/vllm_add_dummy_model
- pytest -v -s models/test_oot_registration.py # it needs a clean process
- pytest -v -s models/*.py --ignore=models/test_oot_registration.py

- label: Decoder-only Language Models Test # 1h3min
#mirror_hardwares: [amd]
source_file_dependencies:
- vllm/
- tests/models/decoder_only/language
commands:
- pytest -v -s models/decoder_only/language

- label: Decoder-only Multi-Modal Models Test # 56min
#mirror_hardwares: [amd]
source_file_dependencies:
- vllm/
- tests/models/decoder_only/audio_language
- tests/models/decoder_only/vision_language
commands:
- pytest -v -s models/decoder_only/audio_language
- pytest -v -s models/decoder_only/vision_language

- label: Other Models Test # 5min
#mirror_hardwares: [amd]
source_file_dependencies:
- vllm/
- tests/models/embedding/language
- tests/models/encoder_decoder/language
commands:
- pytest -v -s models/embedding/language
- pytest -v -s models/encoder_decoder/language

##### 1 GPU test #####
##### multi gpus test #####

@@ -307,11 +332,11 @@
- tests/distributed/
commands:
- # the following commands are for the first node, with ip 192.168.10.10 (ray environment already set up)
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep -q 'Same node test passed'
- VLLM_MULTI_NODE=1 pytest -v -s distributed/test_multi_node_assignment.py
- VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py
- # the following commands are for the second node, with ip 192.168.10.11 (ray environment already set up)
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep -q 'Same node test passed'

- label: Distributed Tests (2 GPUs) # 28min
#mirror_hardwares: [amd]
@@ -324,11 +349,10 @@
- vllm/model_executor/models/
- tests/distributed/
commands:
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py
- TARGET_TEST_SUITE=L4 pytest -v -s distributed/test_basic_distributed_correctness.py
- pytest -v -s distributed/test_basic_distributed_correctness_enc_dec.py
- pytest -v -s distributed/test_chunked_prefill_distributed.py
- pytest -v -s distributed/test_multimodal_broadcast.py
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep -q 'Same node test passed'
- TARGET_TEST_SUITE=L4 pytest basic_correctness/ -v -s -m distributed_2_gpus
# Avoid importing model tests that cause CUDA reinitialization error
- pytest models/encoder_decoder/language/test_bart.py models/decoder_only/vision_language/test_broadcast.py -v -s -m distributed_2_gpus
- pytest -v -s spec_decode/e2e/test_integration_dist_tp2.py
- pip install -e ./plugins/vllm_add_dummy_model
- pytest -v -s distributed/test_distributed_oot.py
@@ -386,7 +410,18 @@ steps:
- vllm/
- tests/weight_loading
commands:
- bash weight_loading/run_model_weight_loading_test.sh
- bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models.txt

- label: Weight Loading Multiple GPU Test - Large Models # optional
working_dir: "/vllm-workspace/tests"
num_gpus: 2
gpu: a100
optional: true
source_file_dependencies:
- vllm/
- tests/weight_loading
commands:
- bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models-large.txt


##### multi gpus test #####
9 changes: 9 additions & 0 deletions .github/ISSUE_TEMPLATE/400-bug report.yml
@@ -30,6 +30,15 @@ body:
</details>
validations:
required: true
- type: textarea
attributes:
label: Model Input Dumps
description: |
If you are facing crashing due to illegal memory access or other issues with model execution, vLLM may dump the problematic input of the model. In this case, you will see the message `Error in model execution (input dumped to /tmp/err_xxx.pkl)`. If you see this message, please zip the file (because GitHub doesn't support .pkl file format) and upload it here. This will help us to reproduce the issue and facilitate the debugging process.
placeholder: |
Upload the dumped input file.
validations:
required: false
- type: textarea
attributes:
label: 🐛 Describe the bug
10 changes: 10 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -39,6 +39,16 @@ FIX #xxxx (*link existing issues this PR will resolve*)
<li>Please add documentation to <code>docs/source/</code> if the PR modifies the user-facing behaviors of vLLM. It helps vLLM user understand and utilize the new features or changes.</li>
</ul>

<h3>Adding or changing kernels</h3>
<p>Each custom kernel needs a schema and one or more implementations to be registered with PyTorch.</p>
<ul>
<li>Make sure custom ops are registered following PyTorch guidelines: <a href="https://pytorch.org/tutorials/advanced/cpp_custom_ops.html#cpp-custom-ops-tutorial">Custom C++ and CUDA Operators</a> and <a href="https://docs.google.com/document/d/1_W62p8WJOQQUzPsJYa7s701JXt0qf2OfLub2sbkHOaU">The Custom Operators Manual</a></li>
<li>Custom operations that return <code>Tensors</code> require meta-functions. Meta-functions should be implemented and registered in python so that dynamic dims can be handled automatically. See above documents for a description of meta-functions.</li>
<li>Use <a href="https://pytorch.org/docs/stable/library.html#torch.library.opcheck"><code>torch.libary.opcheck()</code></a> to test the function registration and meta-function for any registered ops. See <code>tests/kernels</code> for examples.</li>
<li>When changing the C++ signature of an existing op, the schema must be updated to reflect the changes.</li>
<li>If a new custom type is needed, see the following document: <a href="https://docs.google.com/document/d/18fBMPuOJ0fY5ZQ6YyrHUppw9FA332CpNtgB6SOIgyuA">Custom Class Support in PT2</a>.
</ul>

<h3>Notes for Large Changes</h3>
<p>Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with <code>rfc-required</code> and might not go through the PR.</p>

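Aside (editor's illustration, not part of this diff): the kernel checklist above assumes torch.library-style registration. A minimal sketch of a custom op with a Python meta-function and an opcheck test, using a hypothetical mylib::scaled_add op, could look like this:

import torch
from torch.library import custom_op, opcheck

@custom_op("mylib::scaled_add", mutates_args=())
def scaled_add(x: torch.Tensor, y: torch.Tensor, alpha: float) -> torch.Tensor:
    # Eager reference implementation; a real op would dispatch to a CUDA/HIP kernel.
    return x + alpha * y

@scaled_add.register_fake
def _(x: torch.Tensor, y: torch.Tensor, alpha: float) -> torch.Tensor:
    # Meta-function: describes only output shape/dtype, so dynamic dims can be handled automatically.
    return torch.empty_like(x)

# opcheck exercises the schema, the fake (meta) registration, and autograd plumbing.
opcheck(torch.ops.mylib.scaled_add.default, (torch.randn(8), torch.randn(8), 0.5))

Registering the meta-function in Python, as the checklist recommends, is what lets dynamic dimensions be traced through the op without extra work.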
62 changes: 36 additions & 26 deletions CMakeLists.txt
@@ -208,9 +208,13 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
FetchContent_Declare(
cutlass
GIT_REPOSITORY https://github.com/nvidia/cutlass.git
# CUTLASS 3.5.1
GIT_TAG 06b21349bcf6ddf6a1686a47a137ad1446579db9
GIT_TAG v3.5.1
GIT_PROGRESS TRUE

# Speed up CUTLASS download by retrieving only the specified GIT_TAG instead of the history.
# Important: If GIT_SHALLOW is enabled then GIT_TAG works only with branch names and tags.
# So if the GIT_TAG above is updated to a commit hash, GIT_SHALLOW must be set to FALSE
GIT_SHALLOW TRUE
)
FetchContent_MakeAvailable(cutlass)

@@ -244,6 +248,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
"-gencode arch=compute_90a,code=sm_90a")
endif()


#
# Machete kernels

@@ -307,28 +312,11 @@ define_gpu_extension_target(
USE_SABI 3
WITH_SOABI)

if(VLLM_GPU_LANG STREQUAL "HIP")
#
# custom extension
#
set(CUSTOM_SRC
"csrc/custom/torch_bindings.cpp"
"csrc/custom/custom_kernels.cu"
"csrc/custom/fused_kernels.cu"
"csrc/custom/custom.cu"
"csrc/custom/paged_attention/attention_ll4mi.cu"
)

define_gpu_extension_target(
_custom_C
DESTINATION vllm
LANGUAGE ${VLLM_GPU_LANG}
SOURCES ${CUSTOM_SRC}
COMPILE_FLAGS ${VLLM_GPU_FLAGS}
ARCHITECTURES ${VLLM_GPU_ARCHES}
USE_SABI 3
WITH_SOABI)
endif()
# If CUTLASS is compiled on NVCC >= 12.5, it by default uses
# cudaGetDriverEntryPointByVersion as a wrapper to avoid directly calling the
# driver API. This causes problems when linking with earlier versions of CUDA.
# Setting this variable sidesteps the issue by calling the driver directly.
target_compile_definitions(_C PRIVATE CUTLASS_ENABLE_DIRECT_CUDA_DRIVER_CALL=1)

#
# _moe_C extension
@@ -354,6 +342,28 @@ define_gpu_extension_target(
WITH_SOABI)


if(VLLM_GPU_LANG STREQUAL "HIP")
#
# _rocm_C extension
#
set(VLLM_ROCM_EXT_SRC
"csrc/rocm/torch_bindings.cpp"
"csrc/rocm/attention.cu"
"csrc/rocm/custom_kernels.cu"
"csrc/rocm/fused_kernels.cu"
"csrc/rocm/custom.cu")

define_gpu_extension_target(
_rocm_C
DESTINATION vllm
LANGUAGE ${VLLM_GPU_LANG}
SOURCES ${VLLM_ROCM_EXT_SRC}
COMPILE_FLAGS ${VLLM_GPU_FLAGS}
ARCHITECTURES ${VLLM_GPU_ARCHES}
USE_SABI 3
WITH_SOABI)
endif()


if(VLLM_GPU_LANG STREQUAL "CUDA" OR VLLM_GPU_LANG STREQUAL "HIP")
message(STATUS "Enabling C extension.")
@@ -364,6 +374,6 @@ if(VLLM_GPU_LANG STREQUAL "CUDA" OR VLLM_GPU_LANG STREQUAL "HIP")
endif()

if(VLLM_GPU_LANG STREQUAL "HIP")
message(STATUS "Enabling custom extension.")
add_dependencies(default _custom_C)
message(STATUS "Enabling rocm extension.")
add_dependencies(default _rocm_C)
endif()
1 change: 1 addition & 0 deletions Dockerfile
@@ -145,6 +145,7 @@ RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \
&& echo 'tzdata tzdata/Zones/America select Los_Angeles' | debconf-set-selections \
&& apt-get update -y \
&& apt-get install -y ccache software-properties-common git curl sudo vim python3-pip \
&& apt-get install -y ffmpeg libsm6 libxext6 libgl1 \
&& add-apt-repository ppa:deadsnakes/ppa \
&& apt-get update -y \
&& apt-get install -y python${PYTHON_VERSION} python${PYTHON_VERSION}-dev python${PYTHON_VERSION}-venv libibverbs-dev \