* [Kernel] Enable 8-bit weights in Fused Marlin MoE (vllm-project#8032) Co-authored-by: Dipika <[email protected]>
* [Frontend] Expose revision arg in OpenAI server (vllm-project#8501)
* [BugFix] Fix clean shutdown issues (vllm-project#8492)
* [Bugfix][Kernel] Fix build for sm_60 in GGUF kernel (vllm-project#8506)
* [Kernel] AQ AZP 3/4: Asymmetric quantization kernels (vllm-project#7270)
* [doc] update doc on testing and debugging (vllm-project#8514)
* [Bugfix] Bind api server port before starting engine (vllm-project#8491)
* [perf bench] set timeout to debug hanging (vllm-project#8516)
* [misc] small qol fixes for release process (vllm-project#8517)
* [Bugfix] Fix 3.12 builds on main (vllm-project#8510) Signed-off-by: Joe Runde <[email protected]>
* [refactor] remove triton based sampler (vllm-project#8524)
* [Frontend] Improve Nullable kv Arg Parsing (vllm-project#8525) Signed-off-by: Alex-Brooks <[email protected]>
* [Misc][Bugfix] Disable guided decoding for mistral tokenizer (vllm-project#8521)
* [torch.compile] register allreduce operations as custom ops (vllm-project#8526)
* [Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change (vllm-project#8509) Signed-off-by: Rui Qiao <[email protected]>
* [Benchmark] Support sample from HF datasets and image input for benchmark_serving (vllm-project#8495)
* [Encoder decoder] Add cuda graph support during decoding for encoder-decoder models (vllm-project#7631)
* [Feature][kernel] tensor parallelism with bitsandbytes quantization (vllm-project#8434)
* [Model] Add mistral function calling format to all models loaded with "mistral" format (vllm-project#8515) Co-authored-by: Cyrus Leung <[email protected]>
* [Misc] Don't dump contents of kvcache tensors on errors (vllm-project#8527)
* [Bugfix] Fix TP > 1 for new granite (vllm-project#8544) Signed-off-by: Joe Runde <[email protected]>
* [doc] improve installation doc (vllm-project#8550) Co-authored-by: Andy Dai <[email protected]>
* [CI/Build] Excluding kernels/test_gguf.py from ROCm (vllm-project#8520)
* [Kernel] Change interface to Mamba causal_conv1d_update for continuous batching (vllm-project#8012)
* [CI/Build] fix Dockerfile.cpu on podman (vllm-project#8540)
* [Misc] Add argument to disable FastAPI docs (vllm-project#8554)
* [CI/Build] Avoid CUDA initialization (vllm-project#8534)
* [CI/Build] Update Ruff version (vllm-project#8469) Signed-off-by: Aaron Pham <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>
* [Core][Bugfix][Perf] Introduce `MQLLMEngine` to avoid `asyncio` OH (vllm-project#8157) Co-authored-by: Nick Hill <[email protected]> Co-authored-by: [email protected] <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Simon Mo <[email protected]>
* [Core] *Prompt* logprobs support in Multi-step (vllm-project#8199)
* [Core] zmq: bind only to 127.0.0.1 for local-only usage (vllm-project#8543) Signed-off-by: Russell Bryant <[email protected]>
* [Model] Support Solar Model (vllm-project#8386) Co-authored-by: Michael Goin <[email protected]>
* [AMD][ROCm] Quantization methods on ROCm; Fix _scaled_mm call (vllm-project#8380) Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]> Co-authored-by: Michael Goin <[email protected]>
* [Kernel] Change interface to Mamba selective_state_update for continuous batching (vllm-project#8039)
* [BugFix] Nonzero exit code if MQLLMEngine startup fails (vllm-project#8572)
* [Bugfix] add `dead_error` property to engine client (vllm-project#8574) Signed-off-by: Joe Runde <[email protected]>
* [Kernel] Remove marlin moe templating on thread_m_blocks (vllm-project#8573) Co-authored-by: [email protected]
* [Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models. (vllm-project#8545)
* Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer" (vllm-project#8593)
* [Bugfix] fixing sonnet benchmark bug in benchmark_serving.py (vllm-project#8616)
* [MISC] remove engine_use_ray in benchmark_throughput.py (vllm-project#8615)
* [Frontend] Use MQLLMEngine for embeddings models too (vllm-project#8584)
* [Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention (vllm-project#8577)
* [Core] simplify logits resort in _apply_top_k_top_p (vllm-project#8619)
* [Doc] Add documentation for GGUF quantization (vllm-project#8618)
* Create SECURITY.md (vllm-project#8642)
* [CI/Build] Re-enabling Entrypoints tests on ROCm, excluding ones that fail (vllm-project#8551)
* [Misc] guard against change in cuda library name (vllm-project#8609)
* [Bugfix] Fix Phi3.5 mini and MoE LoRA inference (vllm-project#8571)
* [bugfix] [AMD] add multi-step advance_step to ROCmFlashAttentionMetadata (vllm-project#8474)
* [Core] Support Lora lineage and base model metadata management (vllm-project#6315)
* [Model] Add OLMoE (vllm-project#7922)
* [CI/Build] Removing entrypoints/openai/test_embedding.py test from ROCm build (vllm-project#8670)
* [Bugfix] Validate SamplingParam n is an int (vllm-project#8548)
* [Misc] Show AMD GPU topology in `collect_env.py` (vllm-project#8649)
* [Bugfix] Config got an unexpected keyword argument 'engine' (vllm-project#8556)
* [Bugfix][Core] Fix tekken edge case for mistral tokenizer (vllm-project#8640)
* [Doc] neuron documentation update (vllm-project#8671) Signed-off-by: omrishiv <[email protected]>
* [Hardware][AWS] update neuron to 2.20 (vllm-project#8676) Signed-off-by: omrishiv <[email protected]>
* [Bugfix] Fix incorrect llava next feature size calculation (vllm-project#8496)
* [Core] Rename `PromptInputs` and `inputs` (vllm-project#8673)
* [MISC] add support custom_op check (vllm-project#8557) Co-authored-by: youkaichao <[email protected]>
* [Core] Factor out common code in `SequenceData` and `Sequence` (vllm-project#8675)
* [beam search] add output for manually checking the correctness (vllm-project#8684)
* [Kernel] Build flash-attn from source (vllm-project#8245)
* [VLM] Use `SequenceData.from_token_counts` to create dummy data (vllm-project#8687)
* [Doc] Fix typo in AMD installation guide (vllm-project#8689)
* [Kernel][Triton][AMD] Remove tl.atomic_add from awq_gemm_kernel, 2-5x speedup MI300, minor improvement for MI250 (vllm-project#8646)
* [dbrx] refactor dbrx experts to extend FusedMoe class (vllm-project#8518)
* [Kernel][Bugfix] Delete some more useless code in marlin_moe_ops.cu (vllm-project#8643)
* [Bugfix] Refactor composite weight loading logic (vllm-project#8656)
* [ci][build] fix vllm-flash-attn (vllm-project#8699)
* [Model] Refactor BLIP/BLIP-2 to support composite model loading (vllm-project#8407)
* [Misc] Use NamedTuple in Multi-image example (vllm-project#8705) Signed-off-by: Alex-Brooks <[email protected]>
* [MISC] rename CudaMemoryProfiler to DeviceMemoryProfiler (vllm-project#8703)
* [Model][VLM] Add LLaVA-Onevision model support (vllm-project#8486) Co-authored-by: litianjian <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: DarkLight1337 <[email protected]>
* [SpecDec][Misc] Cleanup, remove bonus token logic. (vllm-project#8701)
* [build] enable existing pytorch (for GH200, aarch64, nightly) (vllm-project#8713)
* [misc] upgrade mistral-common (vllm-project#8715)
* [Bugfix] Avoid some bogus messages RE CUTLASS's revision when building (vllm-project#8702)
* [Bugfix] Fix CPU CMake build (vllm-project#8723) Co-authored-by: Yuan <[email protected]>
* [Bugfix] fix docker build for xpu (vllm-project#8652)
* [Core][Frontend] Support Passing Multimodal Processor Kwargs (vllm-project#8657) Signed-off-by: Alex-Brooks <[email protected]>
* [Hardware][CPU] Refactor CPU model runner (vllm-project#8729)
* [Bugfix][CPU] fix missing input intermediate_tensors in the cpu_model_runner (vllm-project#8733)
* [Model] Support pp for qwen2-vl (vllm-project#8696)
* [VLM] Fix paligemma, fuyu and persimmon with transformers 4.45: use config.text_config.vocab_size (vllm-project#8707)
* [CI/Build] use setuptools-scm to set __version__ (vllm-project#4738) Co-authored-by: youkaichao <[email protected]>
* [Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (vllm-project#7701) Co-authored-by: mgoin <[email protected]> Co-authored-by: Divakar Verma <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]>
* [Kernel][LoRA] Add assertion for punica sgmv kernels (vllm-project#7585)
* [Core] Allow IPv6 in VLLM_HOST_IP with zmq (vllm-project#8575) Signed-off-by: Russell Bryant <[email protected]>
* Fix typical acceptance sampler with correct recovered token ids (vllm-project#8562)
* Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse (vllm-project#8335)
* [Hardware][AMD] ROCm6.2 upgrade (vllm-project#8674)
* Fix tests in test_scheduler.py that fail with BlockManager V2 (vllm-project#8728)
* re-implement beam search on top of vllm core (vllm-project#8726) Co-authored-by: Brendan Wong <[email protected]>
* Revert "[Core] Rename `PromptInputs` to `PromptType`, and `inputs` to `prompt`" (vllm-project#8750)
* [MISC] Skip dumping inputs when unpicklable (vllm-project#8744)
* [Core][Model] Support loading weights by ID within models (vllm-project#7931)
* [Model] Expose Phi3v num_crops as a mm_processor_kwarg (vllm-project#8658) Signed-off-by: Alex-Brooks <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: DarkLight1337 <[email protected]>
* [Bugfix] Fix potentially unsafe custom allreduce synchronization (vllm-project#8558)
* [Kernel] Split Marlin MoE kernels into multiple files (vllm-project#8661) Co-authored-by: mgoin <[email protected]>
* [Frontend] Batch inference for llm.chat() API (vllm-project#8648) Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Roger Wang <[email protected]>
* [Bugfix] Fix torch dynamo fixes caused by `replace_parameters` (vllm-project#8748)
* [CI/Build] fix setuptools-scm usage (vllm-project#8771)
* [misc] soft drop beam search (vllm-project#8763)
* [Misc] Upgrade bitsandbytes to the latest version 0.44.0 (vllm-project#8768)
* [Core][Bugfix] Support prompt_logprobs returned with speculative decoding (vllm-project#8047) Signed-off-by: Travis Johnson <[email protected]>
* [Core] Adding Priority Scheduling (vllm-project#5958)
* [Bugfix] Use heartbeats instead of health checks (vllm-project#8583)
* Fix test_schedule_swapped_simple in test_scheduler.py (vllm-project#8780)
* [Bugfix][Kernel] Implement acquire/release polyfill for Pascal (vllm-project#8776)
* Fix tests in test_chunked_prefill_scheduler which fail with BlockManager V2 (vllm-project#8752)
* [BugFix] Propagate 'trust_remote_code' setting in internvl and minicpmv (vllm-project#8250)
* [Hardware][CPU] Enable mrope and support Qwen2-VL on CPU backend (vllm-project#8770)
* [Bugfix] load fc bias from config for eagle (vllm-project#8790)

---------

Signed-off-by: Joe Runde <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Aaron Pham <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: omrishiv <[email protected]>
Signed-off-by: Travis Johnson <[email protected]>
Co-authored-by: ElizaWszola <[email protected]>
Co-authored-by: Dipika <[email protected]>
Co-authored-by: lewtun <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Co-authored-by: sasha0552 <[email protected]>
Co-authored-by: Luka Govedič <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: Kevin Lin <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
Co-authored-by: Alex Brooks <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Rui Qiao <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: sroy745 <[email protected]>
Co-authored-by: chenqianfzh <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Andy Dai <[email protected]>
Co-authored-by: Alexey Kondratiev(AMD) <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Daniele <[email protected]>
Co-authored-by: Jiaxin Shan <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>
Co-authored-by: Alexander Matveev <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: afeldman-nm <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Co-authored-by: Geun, Lim <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>
Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]>
Co-authored-by: Kuntai Du <[email protected]>
Co-authored-by: Kunshang Ji <[email protected]>
Co-authored-by: Charlie Fu <[email protected]>
Co-authored-by: 盏一 <[email protected]>
Co-authored-by: bnellnm <[email protected]>
Co-authored-by: Amit Garg <[email protected]>
Co-authored-by: William Lin <[email protected]>
Co-authored-by: Niklas Muennighoff <[email protected]>
Co-authored-by: saumya-saran <[email protected]>
Co-authored-by: Pastel! <[email protected]>
Co-authored-by: omrishiv <[email protected]>
Co-authored-by: zyddnys <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: rasmith <[email protected]>
Co-authored-by: Divakar Verma <[email protected]>
Co-authored-by: Huazhong Ji <[email protected]>
Co-authored-by: litianjian <[email protected]>
Co-authored-by: litianjian <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Lily Liu <[email protected]>
Co-authored-by: Yuan <[email protected]>
Co-authored-by: Yan Ma <[email protected]>
Co-authored-by: Li, Jiang <[email protected]>
Co-authored-by: Yanyi Liu <[email protected]>
Co-authored-by: Jani Monoses <[email protected]>
Co-authored-by: Lucas Wilkinson <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Co-authored-by: jiqing-feng <[email protected]>
Co-authored-by: Hongxia Yang <[email protected]>
Co-authored-by: Brendan Wong <[email protected]>
Co-authored-by: Cody Yu <[email protected]>
Co-authored-by: Peter Salas <[email protected]>
Co-authored-by: Hanzhi Zhou <[email protected]>
Co-authored-by: Andy <[email protected]>
Co-authored-by: Travis Johnson <[email protected]>
Co-authored-by: Archit Patke <[email protected]>
Co-authored-by: zifeitong <[email protected]>
Co-authored-by: sohamparikh <[email protected]>