Merging main (#4)
* Fixed single GPU issue without setting up mp. Added toggles for server request batching parameters (vllm-project#114)

* Fixed single GPU issue without setting up mp. Added toggles for server request batching parameters

* Adding HTTP headers

* Add distributed executor backend to benchmark scripts (vllm-project#118)

* Add weight padding for moe (vllm-project#119)

* add weight padding for moe

* enable padding by default

* fix linter

* fix linter

* fix linter

* using envs.py

* fix linter

* [BugFix] Fix navi build after many custom for MI kernels added (vllm-project#116)

* fix navi build

* Created dummy kernels for ops unsupported on Navi to avoid function-not-found crashes at runtime

* replacing ifdefs on host code with those on kernels

* refactoring code to avoid unsupported call on Navi

* syntactic change

* import statements fix

* moving env variables to envs.py

* style fixes

* cosmetic changes for isort

* removed extra include

* moving use_skinny to be member

---------

Co-authored-by: lcskrishna <[email protected]>
Co-authored-by: maleksan85 <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>

* add empty_cache() after each padding (vllm-project#120)

* [FIX] Gradlib OOM on Navi and sometimes on MI (vllm-project#124)

* add memory cleanup after every shape and parameter to reduce cache-invalidation buffers

* small typo

* syntax change

---------

Co-authored-by: maleksan85 <[email protected]>

* save shape when fp8 solution not found (vllm-project#123)

Co-authored-by: Gregory Shtrasberg <[email protected]>

* Fix unit test for moe by adding padding (vllm-project#128)

* fix test_moe

* fix linter

* Llama3.1 (vllm-project#129)

* Add support for a rope extension method (vllm-project#6553)

* [BugFix] Fix RoPE error in Llama 3.1 (vllm-project#6693)

---------

Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>

* chat/completions endpoint (vllm-project#121)

* Initial implementation of chat/completions endpoint and its streaming variant

* Reusing datatypes from the openai entrypoints

* Response role from arg

* Added models endpoint and model validation from the request

* Optimize custom all reduce (vllm-project#130)

* First version

* Revert error.

While there, add missing finalize.

* Use the correct defaults for ROCm.

Increase sampling area to capture crossover.

* Scope end_sync as well.

* Guard only volatile keyword for ifndef USE_ROCM

* Document crossover

* Add BF16 support to custom PA (vllm-project#133)

* tightened atol for custom PA; enable supported head size, block sizes in testing

* update num_blocks and num_iters in benchmark PA to realistic settings

* move to generic b16 type

* bf16 first port

* enabled all bf16 tests, set atol for bf16

* enable custom PA for bf16 as well as block size 32 and head size 64

* fix cast to zero in custom PA reduce

* py linter fixes

* clang format fixes

* div round up clang-format

---------

Co-authored-by: Charlie Fu <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>

* Making the output-match check use the original types; it saves some memory. (vllm-project#135)

Co-authored-by: maleksan85 <[email protected]>

* Make CAR ROCm 6.1 compatible. (vllm-project#137)

* remove scoping
* while there fix a typo
* while there remove unused variable

* Car revert (vllm-project#140)

* Per @iotamudelta's suggestion, until the deadlock issue is better understood:
Revert "Make CAR ROCm 6.1 compatible. (vllm-project#137)"

This reverts commit 4d2dda6.

* Per @iotamudelta's suggestion, until the deadlock issue is better understood:
Revert "Optimize custom all reduce (vllm-project#130)"

This reverts commit 636ff01.

* Using the correct datatypes for streaming non-chat completions (vllm-project#134)

* Adding UNREACHABLE_CODE macro for cards other than MI300 and MI250 (vllm-project#138)

* Adding UNREACHABLE_CODE macro

* clang format fixes

* clang formatting fix

* minor updates in syntax

* clang format update

* clang format fix one more try

* clang format one more try

* clang format fix one more try

---------

Co-authored-by: Aleksandr Malyshev <[email protected]>

* gfx90a typo fix (vllm-project#142)

Co-authored-by: maleksan85 <[email protected]>

* wvsplitk templatized and better tuned for MI300 (vllm-project#132)

* improvements to wvSpltK

* wvsplt gemm; better handle MI300 and large A[] sizes

* lint fix

* Adjustments to better handle small weights in TP8.

* early-out bug fix

* better wave load balancing in wvSplt

* add missing skip for wvsplt_big

* Bug fix for wvSplt_big in load balancing at M4, lint fix.

* [Bugfix] Dockerfile.rocm (vllm-project#141)

* Dockerfile.rocm bug fix

* naming preference

---------

Co-authored-by: Gregory Shtrasberg <[email protected]>

* Update test-template.j2 (vllm-project#145)

* Adding Triton implementations of awq_dequantize and awq_gemm to ROCm (vllm-project#136)

* basic support for AWQ added
* awq_dequantize implementation in Triton
* awq_gemm implementation in Triton
* unit tests in tests/kernels/test_awq_triton.py

---------

Co-authored-by: Gregory Shtrasberg <[email protected]>
Co-authored-by: Matt Wong <[email protected]>
Co-authored-by: Charlie Fu <[email protected]>
Co-authored-by: Aleksandr Malyshev <[email protected]>
Co-authored-by: lcskrishna <[email protected]>
Co-authored-by: maleksan85 <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>
Co-authored-by: iotamudelta <[email protected]>
Co-authored-by: sanyalington <[email protected]>
Co-authored-by: Hashem Hashemi <[email protected]>
Co-authored-by: Zachary Streeter <[email protected]>
Co-authored-by: omkar kakarparthi <[email protected]>
Co-authored-by: rasmith <[email protected]>
15 people authored Aug 21, 2024
1 parent a6414b8 commit cec14e0
Showing 26 changed files with 2,019 additions and 1,506 deletions.
4 changes: 2 additions & 2 deletions .buildkite/test-template.j2
@@ -11,7 +11,7 @@ steps:
- "docker push {{ docker_image_amd }}"
plugins:
- docker-login#v3.0.0:
username: rocmshared
username: rocm
key: "amd-build"
env:
DOCKER_BUILDKIT: "1"
@@ -38,4 +38,4 @@ steps:
priority: 100
soft_fail: true
{% endif %}
{% endfor %}
{% endfor %}
2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -19,7 +19,7 @@ set(PYTHON_SUPPORTED_VERSIONS "3.8" "3.9" "3.10" "3.11")
set(CUDA_SUPPORTED_ARCHS "7.0;7.5;8.0;8.6;8.9;9.0")

# Supported AMD GPU architectures.
set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx940;gfx941;gfx942;gfx1030;gfx1100")
set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx940;gfx941;gfx942;gfx1030;gfx1100;gfx1101")

#
# Supported/expected torch versions for CUDA/ROCm.
4 changes: 2 additions & 2 deletions Dockerfile.rocm
@@ -22,8 +22,8 @@ USER root
ARG BASE_IMAGE
ARG COMMON_WORKDIR
# Used as ARCHes for all components
ARG PYTORCH_ROCM_ARCH="gfx90a;gfx942"
ENV PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH}
ARG ARG_PYTORCH_ROCM_ARCH="gfx90a;gfx942"
ENV PYTORCH_ROCM_ARCH=${ARG_PYTORCH_ROCM_ARCH}

# Install some basic utilities
RUN apt-get update && apt-get install python3 python3-pip -
53 changes: 32 additions & 21 deletions benchmarks/benchmark_latency.py
@@ -19,27 +19,30 @@ def main(args: argparse.Namespace):

# NOTE(woosuk): If the request cannot be processed in a single batch,
# the engine will automatically process the request in multiple batches.
llm = LLM(model=args.model,
speculative_model=args.speculative_model,
num_speculative_tokens=args.num_speculative_tokens,
tokenizer=args.tokenizer,
quantization=args.quantization,
quantized_weights_path=args.quantized_weights_path,
tensor_parallel_size=args.tensor_parallel_size,
trust_remote_code=args.trust_remote_code,
dtype=args.dtype,
enforce_eager=args.enforce_eager,
kv_cache_dtype=args.kv_cache_dtype,
quantization_param_path=args.quantization_param_path,
device=args.device,
ray_workers_use_nsight=args.ray_workers_use_nsight,
worker_use_ray=args.worker_use_ray,
use_v2_block_manager=args.use_v2_block_manager,
enable_chunked_prefill=args.enable_chunked_prefill,
download_dir=args.download_dir,
block_size=args.block_size,
disable_custom_all_reduce=args.disable_custom_all_reduce,
gpu_memory_utilization=args.gpu_memory_utilization)
llm = LLM(
model=args.model,
speculative_model=args.speculative_model,
num_speculative_tokens=args.num_speculative_tokens,
tokenizer=args.tokenizer,
quantization=args.quantization,
quantized_weights_path=args.quantized_weights_path,
tensor_parallel_size=args.tensor_parallel_size,
trust_remote_code=args.trust_remote_code,
dtype=args.dtype,
enforce_eager=args.enforce_eager,
kv_cache_dtype=args.kv_cache_dtype,
quantization_param_path=args.quantization_param_path,
device=args.device,
ray_workers_use_nsight=args.ray_workers_use_nsight,
worker_use_ray=args.worker_use_ray,
use_v2_block_manager=args.use_v2_block_manager,
enable_chunked_prefill=args.enable_chunked_prefill,
download_dir=args.download_dir,
block_size=args.block_size,
disable_custom_all_reduce=args.disable_custom_all_reduce,
gpu_memory_utilization=args.gpu_memory_utilization,
distributed_executor_backend=args.distributed_executor_backend,
)

sampling_params = SamplingParams(
n=args.n,
@@ -237,5 +240,13 @@ def run_to_completion(profile_dir: Optional[str] = None):
help='the fraction of GPU memory to be used for '
'the model executor, which can range from 0 to 1.'
'If unspecified, will use the default value of 0.9.')
parser.add_argument(
'--distributed-executor-backend',
choices=['ray', 'mp', 'torchrun'],
default=None,
help='Backend to use for distributed serving. When more than 1 GPU '
'is used, on CUDA this will be automatically set to "ray" if '
'installed or "mp" (multiprocessing) otherwise. On ROCm, this is '
'instead set to torchrun by default.')
args = parser.parse_args()
main(args)
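
Illustration (not part of this diff): the help text above says that with more than one GPU the default backend is "ray" if installed, otherwise "mp", on CUDA, and "torchrun" on ROCm. A minimal sketch of that selection logic, with a hypothetical helper name, could look like:

from typing import Optional
import importlib.util

import torch


def pick_distributed_executor_backend(requested: Optional[str],
                                      world_size: int) -> Optional[str]:
    # Hypothetical sketch of the documented default; not vLLM's actual code.
    if requested is not None or world_size <= 1:
        # An explicit choice wins; a single GPU needs no distributed backend.
        return requested
    if torch.version.hip is not None:
        # ROCm builds of PyTorch report a HIP version; default to torchrun.
        return "torchrun"
    if importlib.util.find_spec("ray") is not None:
        # On CUDA, prefer Ray whenever it is installed.
        return "ray"
    # Otherwise fall back to Python multiprocessing.
    return "mp"

Both benchmark scripts simply forward args.distributed_executor_backend to the LLM constructor, as the hunks above and below show.
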
15 changes: 13 additions & 2 deletions benchmarks/benchmark_throughput.py
@@ -79,6 +79,7 @@ def run_vllm(
enable_prefix_caching: bool,
enable_chunked_prefill: bool,
max_num_batched_tokens: int,
distributed_executor_backend: Optional[str],
gpu_memory_utilization: float = 0.9,
worker_use_ray: bool = False,
download_dir: Optional[str] = None,
@@ -104,6 +105,7 @@ def run_vllm(
download_dir=download_dir,
enable_chunked_prefill=enable_chunked_prefill,
max_num_batched_tokens=max_num_batched_tokens,
distributed_executor_backend=distributed_executor_backend,
)

# Add the requests to the engine.
@@ -229,8 +231,9 @@ def main(args: argparse.Namespace):
args.max_model_len, args.enforce_eager, args.kv_cache_dtype,
args.quantization_param_path, args.device,
args.enable_prefix_caching, args.enable_chunked_prefill,
args.max_num_batched_tokens, args.gpu_memory_utilization,
args.worker_use_ray, args.download_dir)
args.max_num_batched_tokens, args.distributed_executor_backend,
args.gpu_memory_utilization, args.worker_use_ray,
args.download_dir)
elif args.backend == "hf":
assert args.tensor_parallel_size == 1
elapsed_time = run_hf(requests, args.model, tokenizer, args.n,
@@ -384,6 +387,14 @@ def main(args: argparse.Namespace):
type=str,
default=None,
help='Path to save the throughput results in JSON format.')
parser.add_argument(
'--distributed-executor-backend',
choices=['ray', 'mp', 'torchrun'],
default=None,
help='Backend to use for distributed serving. When more than 1 GPU '
'is used, on CUDA this will be automatically set to "ray" if '
'installed or "mp" (multiprocessing) otherwise. On ROCm, this is '
'instead set to torchrun by default.')
args = parser.parse_args()
if args.tokenizer is None:
args.tokenizer = args.model
4 changes: 2 additions & 2 deletions benchmarks/kernels/benchmark_paged_attention.py
@@ -9,7 +9,7 @@
from vllm._custom_C import paged_attention_custom
from vllm.utils import STR_DTYPE_TO_TORCH_DTYPE, create_kv_caches_with_random

NUM_BLOCKS = 1024
NUM_BLOCKS = 1024 * 1024
PARTITION_SIZE = 256


@@ -176,7 +176,7 @@ def run_cuda_benchmark(num_iters: int, profile: bool = False) -> float:
if do_profile:
latency = run_benchmark(num_iters=1, profile=True)
else:
latency = run_benchmark(num_iters=100, profile=False)
latency = run_benchmark(num_iters=1000, profile=False)
print(f"Kernel running time: {latency * 1000000:.3f} us")
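
For context on the constants changed above: a larger NUM_BLOCKS spreads accesses over a bigger KV-cache block pool (less cache reuse between iterations), and a higher num_iters averages out launch jitter. A rough, hypothetical sketch of the timing pattern these settings feed into (not this file's actual code):

import time

import torch


def time_kernel(kernel, num_iters: int = 1000, num_warmup: int = 3) -> float:
    # Returns average per-launch latency in seconds; assumes a CUDA/ROCm device.
    for _ in range(num_warmup):
        kernel()                      # warm-up launches (JIT, caches, clocks)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(num_iters):
        kernel()
    torch.cuda.synchronize()          # wait for all queued launches to finish
    return (time.perf_counter() - start) / num_iters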

