habana_main rebase (#71)

* [Hardware][Intel] Optimize CPU backend and add more performance tips (vllm-project#4971) Co-authored-by: Jianan Gu <[email protected]> * [Docs] Add 4th meetup slides (vllm-project#5509) * [Misc] Add vLLM version getter to utils (vllm-project#5098) * [CI/Build] Simplify OpenAI server setup in tests (vllm-project#5100) * [Doc] Update LLaVA docs (vllm-project#5437) Co-authored-by: Roger Wang <[email protected]> * [Kernel] Factor out epilogues from cutlass kernels (vllm-project#5391) Co-authored-by: Michael Goin <[email protected]> Co-authored-by: youkaichao <[email protected]> Co-authored-by: zifeitong <[email protected]> Co-authored-by: Robert Shaw <[email protected]> * [MISC] Remove FP8 warning (vllm-project#5472) Co-authored-by: Philipp Moritz <[email protected]> * Seperate dev requirements into lint and test (vllm-project#5474) * Revert "[Core] Remove unnecessary copies in flash attn backend" (vllm-project#5478) * [misc] fix format.sh (vllm-project#5511) * [CI/Build] Disable test_fp8.py (vllm-project#5508) * [Kernel] Disable CUTLASS kernels for fp8 (vllm-project#5505) * Add `cuda_device_count_stateless` (vllm-project#5473) * [Hardware][Intel] Support CPU inference with AVX2 ISA (vllm-project#5452) * [Misc] Fix arg names in quantizer script (vllm-project#5507) * bump version to v0.5.0.post1 (vllm-project#5522) * [CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with `perf-benchmarks` label (vllm-project#5073) Co-authored-by: simon-mo <[email protected]> * [CI/Build] Disable LLaVA-NeXT CPU test (vllm-project#5529) * [Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (vllm-project#5516) * [Misc] Fix arg names (vllm-project#5524) * [ Misc ] Rs/compressed tensors cleanup (vllm-project#5432) Co-authored-by: mgoin <[email protected]> Co-authored-by: Dipika Sikka <[email protected]> * [Kernel] Suppress mma.sp warning on CUDA 12.5 and later (vllm-project#5401) * [mis] fix flaky test of test_cuda_device_count_stateless (vllm-project#5546) * [Core] 
Remove duplicate processing in async engine (vllm-project#5525) * [misc][distributed] fix benign error in `is_in_the_same_node` (vllm-project#5512) * [Docs] Add ZhenFund as a Sponsor (vllm-project#5548) * [Doc] Update documentation on Tensorizer (vllm-project#5471) * [Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models (vllm-project#5460) Signed-off-by: Thomas Parnell <[email protected]> * [Bugfix] Fix typo in Pallas backend (vllm-project#5558) * [Core][Distributed] improve p2p cache generation (vllm-project#5528) * Add ccache to amd (vllm-project#5555) * [Core][Bugfix]: fix prefix caching for blockv2 (vllm-project#5364) Signed-off-by: Lei Wen <[email protected]> Co-authored-by: Lei Wen <[email protected]> * [mypy] Enable type checking for test directory (vllm-project#5017) * [CI/Build] Test both text and token IDs in batched OpenAI Completions API (vllm-project#5568) * [misc] Do not allow to use lora with chunked prefill. (vllm-project#5538) Co-authored-by: Cyrus Leung <[email protected]> * add gptq_marlin test for bug report vllm-project#5088 (vllm-project#5145) * [BugFix] Don't start a Ray cluster when not using Ray (vllm-project#5570) * [Fix] Correct OpenAI batch response format (vllm-project#5554) * Add basic correctness 2 GPU tests to 4 GPU pipeline (vllm-project#5518) * [CI][BugFix] Flip is_quant_method_supported condition (vllm-project#5577) * [build][misc] limit numpy version (vllm-project#5582) * [Doc] add debugging tips for crash and multi-node debugging (vllm-project#5581) * Fix w8a8 benchmark and add Llama-3-8B (vllm-project#5562) * [Model] Rename Phi3 rope scaling type (vllm-project#5595) * Correct alignment in the seq_len diagram. 
(vllm-project#5592) Co-authored-by: Liqian Chen <[email protected]> * [Kernel] `compressed-tensors` marlin 24 support (vllm-project#5435) * [Misc] use AutoTokenizer for benchmark serving when vLLM not installed (vllm-project#5588) * [Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (vllm-project#3814) Co-authored-by: Jiang Li <[email protected]> Co-authored-by: Abhilash Majumder <[email protected]> Co-authored-by: Abhilash Majumder <[email protected]> * [CI/BUILD] Support non-AVX512 vLLM building and testing (vllm-project#5574) * [CI] the readability of benchmarking and prepare for dashboard (vllm-project#5571) [CI] Improve the readability of performance benchmarking results and prepare for upcoming performance dashboard (vllm-project#5571) * [bugfix][distributed] fix 16 gpus local rank arrangement (vllm-project#5604) * [Optimization] use a pool to reuse LogicalTokenBlock.token_ids (vllm-project#5584) * [Bugfix] Fix KV head calculation for MPT models when using GQA (vllm-project#5142) * [Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py (vllm-project#5606) * [Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier (vllm-project#5131) * [Model] Initialize Phi-3-vision support (vllm-project#4986) * [Kernel] Add punica dimensions for Granite 13b (vllm-project#5559) Signed-off-by: Joe Runde <[email protected]> * [misc][typo] fix typo (vllm-project#5620) * [Misc] Fix typo (vllm-project#5618) * [CI] Avoid naming different metrics with the same name in performance benchmark (vllm-project#5615) * [bugfix][distributed] improve p2p capability test (vllm-project#5612) [bugfix][distributed] do not error if two processes do not agree on p2p capability (vllm-project#5612) * [Misc] Remove import from transformers logging (vllm-project#5625) * [CI/Build][Misc] Update Pytest Marker for VLMs (vllm-project#5623) * [ci] Deprecate original CI template (vllm-project#5624) Signed-off-by: kevin <[email protected]> * 
[Misc] Add OpenTelemetry support (vllm-project#4687) This PR adds basic support for OpenTelemetry distributed tracing. It includes changes to enable tracing functionality and improve monitoring capabilities. I've also added a markdown with print-screens to guide users how to use this feature. You can find it here * [Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization (vllm-project#5542) * [ci] Setup Release pipeline and build release wheels with cache (vllm-project#5610) Signed-off-by: kevin <[email protected]> * [Model] LoRA support added for command-r (vllm-project#5178) * [Bugfix] Fix for inconsistent behaviour related to sampling and repetition penalties (vllm-project#5639) Signed-off-by: Thomas Parnell <[email protected]> * [Doc] Added cerebrium as Integration option (vllm-project#5553) * [Bugfix] Fix CUDA version check for mma warning suppression (vllm-project#5642) * [Bugfix] Fix w8a8 benchmarks for int8 case (vllm-project#5643) * [Bugfix] Fix Phi-3 Long RoPE scaling implementation (vllm-project#5628) * [Bugfix] Added test for sampling repetition penalty bug. 
(vllm-project#5659) Signed-off-by: Thomas Parnell <[email protected]> * [Bugfix][CI/Build][AMD][ROCm]Fixed the cmake build bug which generate garbage on certain devices (vllm-project#5641) * [misc][distributed] use 127.0.0.1 for single-node (vllm-project#5619) * [Model] Add FP8 kv cache for Qwen2 (vllm-project#5656) * [Bugfix] Fix sampling_params passed incorrectly in Phi3v example (vllm-project#5684) * [Misc]Add param max-model-len in benchmark_latency.py (vllm-project#5629) * [CI/Build] Add tqdm to dependencies (vllm-project#5680) * [ci] Add A100 queue into AWS CI template (vllm-project#5648) Signed-off-by: kevin <[email protected]> * [Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg in arg_utils.py (vllm-project#5688) * [ci][distributed] add tests for custom allreduce (vllm-project#5689) * [Bugfix] AsyncLLMEngine hangs with asyncio.run (vllm-project#5654) * [Doc] Update docker references (vllm-project#5614) Signed-off-by: Rafael Vasquez <[email protected]> * [Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes (vllm-project#5650) * [ci] Limit num gpus if specified for A100 (vllm-project#5694) Signed-off-by: kevin <[email protected]> * [Misc] Improve conftest (vllm-project#5681) * [Bugfix][Doc] FIx Duplicate Explicit Target Name Errors (vllm-project#5703) * [Kernel] Update Cutlass int8 kernel configs for SM90 (vllm-project#5514) Co-authored-by: Varun Sundar Rabindranath <[email protected]> * [Model] Port over CLIPVisionModel for VLMs (vllm-project#5591) * [Kernel] Update Cutlass int8 kernel configs for SM80 (vllm-project#5275) Co-authored-by: Varun Sundar Rabindranath <[email protected]> * [Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels (vllm-project#5715) * [Frontend] Add FlexibleArgumentParser to support both underscore and dash in names (vllm-project#5718) * [distributed][misc] use fork by default for mp (vllm-project#5669) * [Model] MLPSpeculator speculative 
decoding support (vllm-project#4947) Signed-off-by: Thomas Parnell <[email protected]> Co-authored-by: Thomas Parnell <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: Davis Wertheimer <[email protected]> * [Kernel] Add punica dimension for Qwen2 LoRA (vllm-project#5441) * [BugFix] Fix test_phi3v.py (vllm-project#5725) * [Bugfix] Add fully sharded layer for QKVParallelLinearWithLora (vllm-project#5665) Co-authored-by: Antoni Baum <[email protected]> * [Core][Distributed] add shm broadcast (vllm-project#5399) Co-authored-by: Cody Yu <[email protected]> * [Kernel][CPU] Add Quick `gelu` to CPU (vllm-project#5717) * [Doc] Documentation on supported hardware for quantization methods (vllm-project#5745) * [BugFix] exclude version 1.15.0 for modelscope (vllm-project#5668) * [ci][test] fix ca test in main (vllm-project#5746) * [LoRA] Add support for pinning lora adapters in the LRU cache (vllm-project#5603) * [CI][Hardware][Intel GPU] add Intel GPU(XPU) ci pipeline (vllm-project#5616) * [Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs (vllm-project#5710) Co-authored-by: Roger Wang <[email protected]> * [Misc] Remove vllm-project#4789 workaround left in vllm/entrypoints/openai/run_batch.py (vllm-project#5756) * [Bugfix] Fix pin_lora error in TPU executor (vllm-project#5760) * [Docs][TPU] Add installation tip for TPU (vllm-project#5761) * [core][distributed] improve shared memory broadcast (vllm-project#5754) * [BugFix] [Kernel] Add Cutlass2x fallback kernels (vllm-project#5744) Co-authored-by: Varun Sundar Rabindranath <[email protected]> * [Distributed] Add send and recv helpers (vllm-project#5719) * [Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement (vllm-project#5772) * [doc][faq] add warning to download models for every nodes (vllm-project#5783) * post-rebase api adjustments * [Doc] Add "Suggest edit" button to doc pages (vllm-project#5789) * [Doc] Add Phi-3-medium to list of supported models 
(vllm-project#5788) * [Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args (vllm-project#5795) * [ci] Remove aws template (vllm-project#5757) Signed-off-by: kevin <[email protected]> * [Doc] Add notice about breaking changes to VLMs (vllm-project#5818) * [Speculative Decoding] Support draft model on different tensor-parallel size than target model (vllm-project#5414) * add pin_lora to habana components * add WA for model loader * fix api mismatches with ray * tensor parallel fixes * workers cpu alignment fix * [Misc] Remove useless code in cpu_worker (vllm-project#5824) * prefill/decode metadata fixes * [Core] Add fault tolerance for `RayTokenizerGroupPool` (vllm-project#5748) * re-enable attn metadata trimming * worker_use_ray fix * [doc][distributed] add both gloo and nccl tests (vllm-project#5834) * [CI/Build] Add unit testing for FlexibleArgumentParser (vllm-project#5798) * [Misc] Update `w4a16` `compressed-tensors` support to include `w8a16` (vllm-project#5794) * [Hardware][TPU] Refactor TPU backend (vllm-project#5831) * [Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes (vllm-project#5422) * [Hardware][TPU] Raise errors for unsupported sampling params (vllm-project#5850) * [CI/Build] Add E2E tests for MLPSpeculator (vllm-project#5791) Signed-off-by: Thomas Parnell <[email protected]> * [Bugfix] Fix assertion in NeuronExecutor (vllm-project#5841) * [Core] Refactor Worker and ModelRunner to consolidate control plane communication (vllm-project#5408) Signed-off-by: Stephanie Wang <[email protected]> Signed-off-by: Stephanie <[email protected]> Co-authored-by: Stephanie <[email protected]> * [Misc][Doc] Add Example of using OpenAI Server with VLM (vllm-project#5832) * [bugfix][distributed] fix shm broadcast when the queue size is full (vllm-project#5801) * [Bugfix] Fix embedding to support 2D inputs (vllm-project#5829) * [Bugfix][TPU] Fix KV cache size calculation (vllm-project#5860) * [CI/Build] Refactor 
image test assets (vllm-project#5821) * [Kernel] Adding bias epilogue support for `cutlass_scaled_mm` (vllm-project#5560) Co-authored-by: Chih-Chieh-Yang <[email protected]> Co-authored-by: Lucas Wilkinson <[email protected]> * [Frontend] Add tokenize/detokenize endpoints (vllm-project#5054) * [Hardware][TPU] Support parallel sampling & Swapping (vllm-project#5855) * [Bugfix][TPU] Fix CPU cache allocation (vllm-project#5869) * Support CPU inference with VSX PowerPC ISA (vllm-project#5652) * [doc] update usage of env var to avoid conflict (vllm-project#5873) * [Misc] Add example for LLaVA-NeXT (vllm-project#5879) * [BugFix] Fix cuda graph for MLPSpeculator (vllm-project#5875) Co-authored-by: Abhinav Goyal <[email protected]> * [Doc] Add note about context length in Phi-3-Vision example (vllm-project#5887) * [VLM][Bugfix] Make sure that `multi_modal_kwargs` is broadcasted properly (vllm-project#5880) Signed-off-by: Xiaowei Jiang <[email protected]> * [Model] Add base class for LoRA-supported models (vllm-project#5018) * [Bugfix] Fix img_sizes Parsing in Phi3-Vision (vllm-project#5888) * [CI/Build] [1/3] Reorganize entrypoints tests (vllm-project#5526) * add collective crash WA * add comment to the weird mark_step * [Model][Bugfix] Implicit model flags and reenable Phi-3-Vision (vllm-project#5896) * [doc][misc] add note for Kubernetes users (vllm-project#5916) * [BugFix] Fix `MLPSpeculator` handling of `num_speculative_tokens` (vllm-project#5876) * [BugFix] Fix `min_tokens` behaviour for multiple eos tokens (vllm-project#5849) * [CI/Build] Fix Args for `_get_logits_warper` in Sampler Test (vllm-project#5922) * [Model] Add Gemma 2 (vllm-project#5908) * [core][misc] remove logical block (vllm-project#5882) * [Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X (vllm-project#5932) * [Hardware][TPU] Optimize KV cache swapping (vllm-project#5878) * [VLM][BugFix] Make sure that `multi_modal_kwargs` can broadcast properly with ring buffer. 
(vllm-project#5905) Signed-off-by: Xiaowei Jiang <[email protected]> Co-authored-by: Roger Wang <[email protected]> * [Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU runner (vllm-project#5956) * [Core] Registry for processing model inputs (vllm-project#5214) Co-authored-by: ywang96 <[email protected]> * Unmark fused_moe config json file as executable (vllm-project#5960) * [Hardware][Intel] OpenVINO vLLM backend (vllm-project#5379) * [Bugfix] Better error message for MLPSpeculator when `num_speculative_tokens` is set too high (vllm-project#5894) Signed-off-by: Thomas Parnell <[email protected]> * [CI/Build] [2/3] Reorganize entrypoints tests (vllm-project#5904) * [Distributed] Make it clear that % should not be in tensor dict keys. (vllm-project#5927) Signed-off-by: Xiaowei Jiang <[email protected]> * [Spec Decode] Introduce DraftModelRunner (vllm-project#5799) * [Bugfix] Fix compute datatype for cutlass 3.x epilogues (vllm-project#5931) * [ Misc ] Remove `fp8_shard_indexer` from Col/Row Parallel Linear (Simplify Weight Loading) (vllm-project#5928) Co-authored-by: Robert Shaw <rshaw@neuralmagic> * [ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 (vllm-project#5921) Co-authored-by: Robert Shaw <rshaw@neuralmagic> * Support Deepseek-V2 (vllm-project#4650) Co-authored-by: Philipp Moritz <[email protected]> * [Bugfix] Only add `Attention.kv_scale` if kv cache quantization is enabled (vllm-project#5936) * Unmark more files as executable (vllm-project#5962) * [Bugfix] Fix Engine Failing After Invalid Request - AsyncEngineDeadError (vllm-project#5963) Co-authored-by: Robert Shaw <rshaw@neuralmagic> * [Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode (vllm-project#4628) Co-authored-by: LiuXiaoxuanPKU <[email protected]>, bong-furiosa <[email protected]> * [Bugfix][TPU] Fix TPU sampler output (vllm-project#5978) * [Bugfix][TPU] Fix pad slot id (vllm-project#5977) * [Bugfix] fix missing last itl in openai 
completions benchmark (vllm-project#5926) * [Misc] Extend vLLM Metrics logging API (vllm-project#5925) Co-authored-by: Antoni Baum <[email protected]> * [Kernel] Add punica dimensions for Granite 3b and 8b (vllm-project#5930) Signed-off-by: Joe Runde <[email protected]> * [Bugfix] Fix precisions in Gemma 1 (vllm-project#5913) * [Misc] Update Phi-3-Vision Example (vllm-project#5981) Co-authored-by: Cyrus Leung <[email protected]> * [Bugfix] Support `eos_token_id` from `config.json` (vllm-project#5954) * [Core] Optimize `SequenceStatus.is_finished` by switching to IntEnum (vllm-project#5974) * [Kernel] Raise an exception in MoE kernel if the batch size is larger then 65k (vllm-project#5939) * [ CI/Build ] Added E2E Test For Compressed Tensors (vllm-project#5839) Co-authored-by: Michael Goin <[email protected]> Co-authored-by: Robert Shaw <rshaw@neuralmagic> * [CI/Build] Add TP test for vision models (vllm-project#5892) * [ CI/Build ] LM Eval Harness Based CI Testing (vllm-project#5838) Co-authored-by: Robert Shaw <rshaw@neuralmagic> * [Bugfix][CI/Build][Hardware][AMD] Install matching torchvision to fix AMD tests (vllm-project#5949) * [CI/Build] Temporarily Remove Phi3-Vision from TP Test (vllm-project#5989) * [CI/Build] Reuse code for checking output consistency (vllm-project#5988) * [CI/Build] [3/3] Reorganize entrypoints tests (vllm-project#5966) * [ci][distributed] fix device count call [ci][distributed] fix some cuda init that makes it necessary to use spawn (vllm-project#5991) * [Frontend]: Support base64 embedding (vllm-project#5935) Co-authored-by: Cyrus Leung <[email protected]> * [Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules. 
(vllm-project#5909) Co-authored-by: sang <[email protected]> * [ CI ] Temporarily Disable Large LM-Eval Tests (vllm-project#6005) Co-authored-by: [email protected] <rshaw@neuralmagic> * [Misc] Fix `get_min_capability` (vllm-project#5971) * [ Misc ] Refactor w8a8 to use `process_weights_after_load` (Simplify Weight Loading) (vllm-project#5940) Co-authored-by: Robert Shaw <rshaw@neuralmagic> * [misc][cuda] use nvml to avoid accidentally cuda initialization (vllm-project#6007) * [Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker (vllm-project#5348) * Revert test changes * cleanup * llm engine cleanup * utils.py cleanup * custom ops refactor * move xops to ops * remove vllm/hpu/attn_bias.py * whitespace fix * revert accidental changes in rmsnorm * Fix hpugraph hashing * add trim_attn_metadata comment * fix prompt bucketing: * [ CI ] Re-enable Large Model LM Eval (vllm-project#6031) * [doc][misc] remove deprecated api server in doc (vllm-project#6037) * [Misc] update benchmark backend for scalellm (vllm-project#6018) * [doc][misc] further lower visibility of simple api server (vllm-project#6041) Co-authored-by: Simon Mo <[email protected]> * [Bugfix] Use RayActorError for older versions of Ray in RayTokenizerGroupPool (vllm-project#6039) * [Bugfix] adding chunking mechanism to fused_moe to handle large inputs (vllm-project#6029) * add FAQ doc under 'serving' (vllm-project#5946) * [Bugfix][Doc] Fix Doc Formatting (vllm-project#6048) * [Bugfix] Add explicit `end_forward` calls to flashinfer (vllm-project#6044) * [BugFix] Ensure worker model loop is always stopped at the right time (vllm-project#5987) * [Frontend] Relax api url assertion for openai benchmarking (vllm-project#6046) * [Model] Changes to MLPSpeculator to support tie_weights and input_scale (vllm-project#5965) Signed-off-by: Thomas Parnell <[email protected]> Co-authored-by: Joshua Rosenkranz <[email protected]> * [Core] Optimize block_manager_v2 vs block_manager_v1 (to 
make V2 default) (vllm-project#5602) * [Frontend] Add template related params to request (vllm-project#5709) * [VLM] Remove `image_input_type` from VLM config (vllm-project#5852) Signed-off-by: Xiaowei Jiang <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Roger Wang <[email protected]> * [Doc] Reinstate doc dependencies (vllm-project#6061) * guard model loader wa for hpu --------- Signed-off-by: Thomas Parnell <[email protected]> Signed-off-by: Lei Wen <[email protected]> Signed-off-by: Joe Runde <[email protected]> Signed-off-by: kevin <[email protected]> Signed-off-by: Rafael Vasquez <[email protected]> Signed-off-by: Stephanie Wang <[email protected]> Signed-off-by: Stephanie <[email protected]> Signed-off-by: Xiaowei Jiang <[email protected]> Signed-off-by: Joe Runde <[email protected]> Co-authored-by: Li, Jiang <[email protected]> Co-authored-by: Jianan Gu <[email protected]> Co-authored-by: Woosuk Kwon <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: youkaichao <[email protected]> Co-authored-by: zifeitong <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Cody Yu <[email protected]> Co-authored-by: Philipp Moritz <[email protected]> Co-authored-by: Antoni Baum <[email protected]> Co-authored-by: Jie Fu (傅杰) <[email protected]> Co-authored-by: Allen.Dou <[email protected]> Co-authored-by: Simon Mo <[email protected]> Co-authored-by: Kuntai Du <[email protected]> Co-authored-by: Dipika Sikka <[email protected]> Co-authored-by: Sanger Steel <[email protected]> Co-authored-by: Thomas Parnell <[email protected]> Co-authored-by: leiwen83 <[email protected]> Co-authored-by: Lei Wen <[email protected]> Co-authored-by: SangBin Cho <[email protected]> Co-authored-by: Alexander Matveev <[email protected]> 
Co-authored-by: Nick Hill <[email protected]> Co-authored-by: Amit Garg <[email protected]> Co-authored-by: Charles Riggins <[email protected]> Co-authored-by: Liqian Chen <[email protected]> Co-authored-by: zhyncs <[email protected]> Co-authored-by: Kunshang Ji <[email protected]> Co-authored-by: Abhilash Majumder <[email protected]> Co-authored-by: Abhilash Majumder <[email protected]> Co-authored-by: Bruce Fontaine <[email protected]> Co-authored-by: zifeitong <[email protected]> Co-authored-by: sroy745 <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Joe Runde <[email protected]> Co-authored-by: Chang Su <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Kevin H. Luu <[email protected]> Co-authored-by: Ronen Schaffer <[email protected]> Co-authored-by: sergey-tinkoff <[email protected]> Co-authored-by: milo157 <[email protected]> Co-authored-by: Shukant Pal <[email protected]> Co-authored-by: Hongxia Yang <[email protected]> Co-authored-by: DearPlanet <[email protected]> Co-authored-by: Rafael Vasquez <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Joshua Rosenkranz <[email protected]> Co-authored-by: Davis Wertheimer <[email protected]> Co-authored-by: Jinzhen Lin <[email protected]> Co-authored-by: Jee Li <[email protected]> Co-authored-by: rohithkrn <[email protected]> Co-authored-by: Murali Andoorveedu <[email protected]> Co-authored-by: Woo-Yeon Lee <[email protected]> Co-authored-by: Matt Wong <[email protected]> Co-authored-by: aws-patlange <[email protected]> Co-authored-by: Stephanie Wang <[email protected]> Co-authored-by: Stephanie <[email protected]> Co-authored-by: Luka Govedič <[email protected]> Co-authored-by: Chih-Chieh-Yang <[email protected]> Co-authored-by: Lucas Wilkinson <[email protected]> Co-authored-by: sasha0552 <[email protected]> Co-authored-by: Chip Kerchner 
<[email protected]> Co-authored-by: Abhinav Goyal <[email protected]> Co-authored-by: xwjiang2010 <[email protected]> Co-authored-by: Divakar Verma <[email protected]> Co-authored-by: Ilya Lavrenov <[email protected]> Co-authored-by: Robert Shaw <rshaw@neuralmagic> Co-authored-by: wangding zeng <[email protected]> Co-authored-by: Lily Liu <[email protected]> Co-authored-by: LiuXiaoxuanPKU <[email protected]>, bong-furiosa <[email protected]> Co-authored-by: mcalman <[email protected]> Co-authored-by: William Lin <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: llmpros <[email protected]> Co-authored-by: sang <[email protected]> Co-authored-by: Avshalom Manevich <[email protected]> Co-authored-by: James Whedbee <[email protected]> Co-authored-by: Joshua Rosenkranz <[email protected]> Co-authored-by: danieljannai21 <[email protected]>
HabanaAI · Jul 2, 2024 · 5e1a565
1 parent 90f900c
commit 5e1a565
Showing 669 changed files with 64,037 additions and 19,646 deletions.
diff --git a/.buildkite/check-wheel-size.py b/.buildkite/check-wheel-size.py
@@ -1,7 +1,7 @@
 import os
 import zipfile
 
-MAX_SIZE_MB = 100
+MAX_SIZE_MB = 200
 
 
 def print_top_10_largest_files(zip_file):

diff --git a/.buildkite/download-images.sh b/.buildkite/download-images.sh
@@ -8,10 +8,6 @@ set -o pipefail
 # aws s3 sync s3://air-example-data-2/vllm_opensource_llava/ images/
 mkdir -p images
 cd images
-wget https://air-example-data-2.s3.us-west-2.amazonaws.com/vllm_opensource_llava/stop_sign_pixel_values.pt
-wget https://air-example-data-2.s3.us-west-2.amazonaws.com/vllm_opensource_llava/stop_sign_image_features.pt
-wget https://air-example-data-2.s3.us-west-2.amazonaws.com/vllm_opensource_llava/cherry_blossom_pixel_values.pt
-wget https://air-example-data-2.s3.us-west-2.amazonaws.com/vllm_opensource_llava/cherry_blossom_image_features.pt
 wget https://air-example-data-2.s3.us-west-2.amazonaws.com/vllm_opensource_llava/stop_sign.jpg
 wget https://air-example-data-2.s3.us-west-2.amazonaws.com/vllm_opensource_llava/cherry_blossom.jpg
 

diff --git a/.buildkite/lm-eval-harness/configs/Meta-Llama-3-70B-Instruct.yaml b/.buildkite/lm-eval-harness/configs/Meta-Llama-3-70B-Instruct.yaml
@@ -0,0 +1,11 @@
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-70B-Instruct -b 32 -l 250 -f 5
+model_name: "meta-llama/Meta-Llama-3-70B-Instruct"
+tasks:
+- name: "gsm8k"
+  metrics:
+  - name: "exact_match,strict-match"
+    value: 0.892
+  - name: "exact_match,flexible-extract"
+    value: 0.892
+limit: 250
+num_fewshot: 5
diff --git a/.buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct-FP8.yaml b/.buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct-FP8.yaml
@@ -0,0 +1,11 @@
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Meta-Llama-3-8B-Instruct-FP8 -b 32 -l 250 -f 5 -t 1
+model_name: "neuralmagic/Meta-Llama-3-8B-Instruct-FP8"
+tasks:
+- name: "gsm8k"
+  metrics:
+  - name: "exact_match,strict-match"
+    value: 0.756
+  - name: "exact_match,flexible-extract"
+    value: 0.752
+limit: 250
+num_fewshot: 5
diff --git a/.buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct.yaml b/.buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct.yaml
@@ -0,0 +1,11 @@
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-8B-Instruct -b 32 -l 250 -f 5 -t 1
+model_name: "meta-llama/Meta-Llama-3-8B-Instruct"
+tasks:
+- name: "gsm8k"
+  metrics:
+  - name: "exact_match,strict-match"
+    value: 0.756
+  - name: "exact_match,flexible-extract"
+    value: 0.752
+limit: 250
+num_fewshot: 5
diff --git a/.buildkite/lm-eval-harness/configs/Mixtral-8x7B-Instruct-v0.1.yaml b/.buildkite/lm-eval-harness/configs/Mixtral-8x7B-Instruct-v0.1.yaml
@@ -0,0 +1,11 @@
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1 -b 32 -l 250 -f 5 -t 4
+model_name: "mistralai/Mixtral-8x7B-Instruct-v0.1"
+tasks:
+- name: "gsm8k"
+  metrics:
+  - name: "exact_match,strict-match"
+    value: 0.616
+  - name: "exact_match,flexible-extract"
+    value: 0.632
+limit: 250
+num_fewshot: 5
diff --git a/.buildkite/lm-eval-harness/configs/models-large.txt b/.buildkite/lm-eval-harness/configs/models-large.txt
@@ -0,0 +1,2 @@
+Meta-Llama-3-70B-Instruct.yaml
+Mixtral-8x7B-Instruct-v0.1.yaml
diff --git a/.buildkite/lm-eval-harness/configs/models-small.txt b/.buildkite/lm-eval-harness/configs/models-small.txt
@@ -0,0 +1,2 @@
+Meta-Llama-3-8B-Instruct.yaml
+Meta-Llama-3-8B-Instruct-FP8.yaml
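Reviewer note: the YAML configs above record baseline GSM8k scores, and test_lm_eval_correctness.py later in this diff defines RTOL = 0.02. The comparison itself is not visible in this part of the diff, so the following is only a minimal sketch of how measured scores might be checked against a baseline within a relative tolerance. The names BASELINE, within_rtol, and failing_metrics are hypothetical, and the asymmetric tolerance (relative to the expected value) is an assumption, not necessarily what the commit does.

```python
# Hypothetical sketch: these names are illustrative and not taken from the commit.
RTOL = 0.02  # same value the test module in this diff defines

# Baseline values copied from Meta-Llama-3-8B-Instruct.yaml above.
BASELINE = {
    "exact_match,strict-match": 0.756,
    "exact_match,flexible-extract": 0.752,
}


def within_rtol(expected: float, measured: float, rtol: float = RTOL) -> bool:
    """Return True if measured deviates from expected by at most rtol * |expected|."""
    return abs(measured - expected) <= rtol * abs(expected)


def failing_metrics(baseline: dict, measured: dict, rtol: float = RTOL) -> list:
    """Return the metric names that are missing or outside the tolerance."""
    return [
        name
        for name, expected in baseline.items()
        if name not in measured or not within_rtol(expected, measured[name], rtol)
    ]
```

With RTOL = 0.02, a strict-match baseline of 0.756 would accept measured values roughly between 0.741 and 0.771 under this assumed definition.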
diff --git a/.buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh b/.buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh
@@ -0,0 +1,46 @@
+#!/bin/bash
+# We can use this script to compute baseline accuracy on GSM for transformers.
+#
+# Make sure you have lm-eval-harness installed:
+#   pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@9516087b81a61d0e220b22cc1b75be76de23bc10
+
+usage() {
+    echo
+    echo "Runs lm eval harness on GSM8k using huggingface transformers."
+    echo "This pathway is intended to be used to create baselines for "
+    echo "our automated nm-test-accuracy workflow"
+    echo
+    echo "usage: ${0} <options>"
+    echo
+    echo "  -m    - huggingface stub or local directory of the model"
+    echo "  -b    - batch size to run the evaluation at"
+    echo "  -l    - limit number of samples to run"
+    echo "  -f    - number of fewshot samples to use"
+    echo
+}
+
+while getopts "m:b:l:f:" OPT; do
+  case ${OPT} in
+    m ) 
+        MODEL="$OPTARG"
+        ;;
+    b ) 
+        BATCH_SIZE="$OPTARG"
+        ;;
+    l ) 
+        LIMIT="$OPTARG"
+        ;;
+    f ) 
+        FEWSHOT="$OPTARG"
+        ;;
+    \? ) 
+        usage
+        exit 1
+        ;;
+  esac
+done
+
+lm_eval --model hf \
+  --model_args pretrained=$MODEL,parallelize=True \
+  --tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
+  --batch_size $BATCH_SIZE
diff --git a/.buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh b/.buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh
@@ -0,0 +1,51 @@
+#!/bin/bash
+# We can use this script to compute baseline accuracy on GSM for vllm.
+# We use this for fp8, which HF does not support.
+#
+# Make sure you have lm-eval-harness installed:
+#   pip install lm-eval==0.4.2
+
+usage() {
+    echo
+    echo "Runs lm eval harness on GSM8k using vllm."
+    echo "This pathway is intended to be used to create baselines for "
+    echo "our automated nm-test-accuracy workflow"
+    echo
+    echo "usage: ${0} <options>"
+    echo
+    echo "  -m    - huggingface stub or local directory of the model"
+    echo "  -b    - batch size to run the evaluation at"
+    echo "  -l    - limit number of samples to run"
+    echo "  -f    - number of fewshot samples to use"
+    echo "  -t    - tensor parallel size to run at"
+    echo
+}
+
+while getopts "m:b:l:f:t:" OPT; do
+  case ${OPT} in
+    m ) 
+        MODEL="$OPTARG"
+        ;;
+    b ) 
+        BATCH_SIZE="$OPTARG"
+        ;;
+    l ) 
+        LIMIT="$OPTARG"
+        ;;
+    f ) 
+        FEWSHOT="$OPTARG"
+        ;;
+    t )
+        TP_SIZE="$OPTARG"
+        ;;
+    \? ) 
+        usage
+        exit 1
+        ;;
+  esac
+done
+
+lm_eval --model vllm \
+  --model_args pretrained=$MODEL,tensor_parallel_size=$TP_SIZE \
+  --tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
+  --batch_size $BATCH_SIZE
diff --git a/.buildkite/lm-eval-harness/run-tests.sh b/.buildkite/lm-eval-harness/run-tests.sh
@@ -0,0 +1,59 @@
+#!/bin/bash
+
+usage() {
+    echo
+    echo "Runs lm eval harness on GSM8k using vllm and compares to "
+    echo "precomputed baseline (measured by HF transformers.)"
+    echo
+    echo "usage: ${0} <options>"
+    echo
+    echo "  -c    - path to the test data config (e.g. configs/small-models.txt)"
+    echo "  -t    - tensor parallel size"
+    echo
+}
+
+SUCCESS=0
+
+while getopts "c:t:" OPT; do
+  case ${OPT} in
+    c ) 
+        CONFIG="$OPTARG"
+        ;;
+    t )
+        TP_SIZE="$OPTARG"
+        ;;
+    \? )
+        usage
+        exit 1
+        ;;
+  esac
+done
+
+# Parse list of configs.
+IFS=$'\n' read -d '' -r -a MODEL_CONFIGS < "$CONFIG"
+
+for MODEL_CONFIG in "${MODEL_CONFIGS[@]}"
+do
+    LOCAL_SUCCESS=0
+
+    echo "=== RUNNING MODEL: $MODEL_CONFIG WITH TP SIZE: $TP_SIZE ==="
+
+    export LM_EVAL_TEST_DATA_FILE=$PWD/configs/${MODEL_CONFIG}
+    export LM_EVAL_TP_SIZE=$TP_SIZE
+    pytest -s test_lm_eval_correctness.py || LOCAL_SUCCESS=$?
+
+    if [[ $LOCAL_SUCCESS == 0 ]]; then
+        echo "=== PASSED MODEL: ${MODEL_CONFIG} ==="
+    else
+        echo "=== FAILED MODEL: ${MODEL_CONFIG} ==="
+    fi
+
+    SUCCESS=$((SUCCESS + LOCAL_SUCCESS))
+
+done
+
+if [ "${SUCCESS}" -eq "0" ]; then
+    exit 0
+else
+    exit 1
+fi
diff --git a/.buildkite/lm-eval-harness/test_lm_eval_correctness.py b/.buildkite/lm-eval-harness/test_lm_eval_correctness.py
@@ -0,0 +1,54 @@
+"""
+LM eval harness on model to compare vs HF baseline computed offline.
+Configs are found in configs/$MODEL.yaml
+
+* export LM_EVAL_TEST_DATA_FILE=configs/Meta-Llama-3-70B-Instruct.yaml
+* export LM_EVAL_TP_SIZE=4 
+* pytest -s test_lm_eval_correctness.py
+"""
+
+import os
+from pathlib import Path
+
+import lm_eval
+import numpy
+import yaml
+
+RTOL = 0.02
+TEST_DATA_FILE = os.environ.get(
+    "LM_EVAL_TEST_DATA_FILE",
+    ".buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct.yaml")
+
+TP_SIZE = os.environ.get("LM_EVAL_TP_SIZE", 1)
+
+
+def launch_lm_eval(eval_config):
+    model_args = f"pretrained={eval_config['model_name']}," \
+                 f"tensor_parallel_size={TP_SIZE}"
+
+    results = lm_eval.simple_evaluate(
+        model="vllm",
+        model_args=model_args,
+        tasks=[task["name"] for task in eval_config["tasks"]],
+        num_fewshot=eval_config["num_fewshot"],
+        limit=eval_config["limit"],
+        batch_size="auto")
+
+    return results
+
+
+def test_lm_eval_correctness():
+    eval_config = yaml.safe_load(
+        Path(TEST_DATA_FILE).read_text(encoding="utf-8"))
+
+    # Launch eval requests.
+    results = launch_lm_eval(eval_config)
+
+    # Confirm scores match ground truth.
+    for task in eval_config["tasks"]:
+        for metric in task["metrics"]:
+            ground_truth = metric["value"]
+            measured_value = results["results"][task["name"]][metric["name"]]
+            print(f'{task["name"]} | {metric["name"]}: '
+                  f'ground_truth={ground_truth} | measured={measured_value}')
+            assert numpy.isclose(ground_truth, measured_value, rtol=RTOL)
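Review note on the tolerance check above: `numpy.isclose(a, b, rtol=...)` tests `|a - b| <= atol + rtol * |b|`, so with the argument order used in the test the 2% band scales with the measured value rather than the ground truth. A minimal sketch of that behavior follows; the helper name `within_tolerance` and the scores are hypothetical and not part of this PR.

```python
import numpy

RTOL = 0.02  # same relative tolerance as the test above


def within_tolerance(ground_truth, measured, rtol=RTOL):
    """Hypothetical helper mirroring the assertion in the test.

    numpy.isclose(a, b) checks |a - b| <= atol + rtol * |b|, so the
    allowed band scales with the second argument (the measured value).
    """
    return bool(numpy.isclose(ground_truth, measured, rtol=rtol))


# Made-up scores, for illustration only.
close = within_tolerance(0.750, 0.760)  # diff 0.010 vs band 0.02 * 0.760
far = within_tolerance(0.750, 0.700)    # diff 0.050 vs band 0.02 * 0.700
print(close, far)
```

For accuracies in the usual 0 to 1 range the difference between scaling by the measured value and scaling by the ground truth is small, but it is worth keeping in mind when a baseline is close to zero.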
diff --git a/.buildkite/nightly-benchmarks/README.md b/.buildkite/nightly-benchmarks/README.md
@@ -0,0 +1,103 @@
+# vLLM benchmark suite
+
+## Introduction
+
+This directory contains the performance benchmarking CI for vllm.
+The goal is to help developers know the impact of their PRs on the performance of vllm.
+
+This benchmark is *triggered* when:
+- A PR is merged into vllm.
+- A commit is pushed to a PR with the `perf-benchmarks` label.
+
+**Benchmarking Coverage**: latency, throughput and fixed-QPS serving on A100 (support for more GPUs is coming later), with different models.
+
+**Benchmarking Duration**: about 1hr.
+
+**For benchmarking developers**: please try your best to constrain the duration of benchmarking to less than 1.5 hr so that it won't take forever to run.
+
+
+## Configuring the workload
+
+The benchmarking workload contains three parts:
+- Latency tests in `latency-tests.json`.
+- Throughput tests in `throughput-tests.json`.
+- Serving tests in `serving-tests.json`.
+
+See [descriptions.md](tests/descriptions.md) for detailed descriptions. 
+
+### Latency test
+
+Here is an example of one test inside `latency-tests.json`:
+
+```json
+[
+    {
+        "test_name": "latency_llama8B_tp1",
+        "parameters": {
+            "model": "meta-llama/Meta-Llama-3-8B",
+            "tensor_parallel_size": 1,
+            "load_format": "dummy",
+            "num_iters_warmup": 5,
+            "num_iters": 15
+        }
+    }
+]
+```
+
+In this example:
+-  The `test_name` attribute is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
+-  The `parameters` attribute controls the command line arguments passed to `benchmark_latency.py`. Use underscores `_` instead of dashes `-` when specifying the arguments; `run-benchmarks-suite.sh` converts the underscores to dashes when passing them to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`.
+
+Note that the performance numbers are highly sensitive to the value of the parameters. Please make sure the parameters are set correctly.
+
+WARNING: The benchmarking script saves the json results by itself, so please do not configure the `--output-json` parameter in the json file.
+
+
+### Throughput test
+The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except that the parameters are passed to `benchmark_throughput.py`.
+
+The numbers from this test are also stable, so even a slight change in them can indicate a large difference in performance.
+
+### Serving test
+We test the throughput by using `benchmark_serving.py` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:
+
+```json
+[
+    {
+        "test_name": "serving_llama8B_tp1_sharegpt",
+        "qps_list": [1, 4, 16, "inf"],
+        "server_parameters": {
+            "model": "meta-llama/Meta-Llama-3-8B",
+            "tensor_parallel_size": 1,
+            "swap_space": 16,
+            "disable_log_stats": "",
+            "disable_log_requests": "",
+            "load_format": "dummy"
+        },
+        "client_parameters": {
+            "model": "meta-llama/Meta-Llama-3-8B",
+            "backend": "vllm",
+            "dataset_name": "sharegpt",
+            "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
+            "num_prompts": 200
+        }
+    }
+]
+```
+
+In this example:
+- The `test_name` attribute is also a unique identifier for the test. It must start with `serving_`.
+- The `server_parameters` attribute contains the command line arguments for the vLLM server.
+- The `client_parameters` attribute contains the command line arguments for `benchmark_serving.py`.
+- The `qps_list` attribute controls the list of QPS values to test. Each value is used to configure the `--request-rate` parameter in `benchmark_serving.py`.
+
+The numbers from this test are less stable than those of the latency and throughput benchmarks (due to randomized sharegpt dataset sampling inside `benchmark_serving.py`), but a large change in them (e.g. a 5% change) still indicates a significant difference in performance.
+
+WARNING: The benchmarking script will save json results by itself, so please do not configure `--save-results` or other results-saving-related parameters in `serving-tests.json`.
+
+## Visualizing the results
+The `convert-results-json-to-markdown.py` script puts the benchmarking results into a markdown table by formatting [descriptions.md](tests/descriptions.md) with real benchmarking results.
+You can find the results presented as a table on the `buildkite/performance-benchmark` job page.
+If you do not see the table, please wait until the benchmark finishes running.
+The json version of the table (together with the json version of the benchmark) is also attached to the markdown file.
+The raw benchmarking results (as json files) are in the `Artifacts` tab of the benchmarking job.
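Review note on the README's "Configuring the workload" section: the underscore-to-dash conversion it describes can be sketched in Python as below. This is only an illustration of the described behavior, not the logic in `run-benchmarks-suite.sh`; the function name `params_to_cli_args` is hypothetical, and treating an empty-string value (such as `"disable_log_stats": ""`) as a bare flag is an assumption.

```python
def params_to_cli_args(parameters):
    """Turn a `parameters` dict from a tests json file into CLI arguments.

    Hypothetical helper: underscores in keys become dashes, as the README
    describes. Emitting keys with empty-string values as bare flags is an
    assumption, not confirmed by the README.
    """
    args = []
    for key, value in parameters.items():
        args.append("--" + key.replace("_", "-"))
        if value != "":
            args.append(str(value))
    return args


# The latency example from the README.
latency_params = {
    "model": "meta-llama/Meta-Llama-3-8B",
    "tensor_parallel_size": 1,
    "load_format": "dummy",
    "num_iters_warmup": 5,
    "num_iters": 15,
}
cli = " ".join(params_to_cli_args(latency_params))
print(cli)
```

With the latency example this yields the same argument string the README quotes for `benchmark_latency.py`.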
		@@ -0,0 +1,2 @@
		Meta-Llama-3-70B-Instruct.yaml
		Mixtral-8x7B-Instruct-v0.1.yaml
		@@ -0,0 +1,2 @@
		Meta-Llama-3-8B-Instruct.yaml
		Meta-Llama-3-8B-Instruct-FP8.yaml