* [Kernel] Enable 8-bit weights in Fused Marlin MoE (vllm-project#8032) Co-authored-by: Dipika <[email protected]>
* [Frontend] Expose revision arg in OpenAI server (vllm-project#8501)
* [BugFix] Fix clean shutdown issues (vllm-project#8492)
* [Bugfix][Kernel] Fix build for sm_60 in GGUF kernel (vllm-project#8506)
* [Kernel] AQ AZP 3/4: Asymmetric quantization kernels (vllm-project#7270)
* [doc] update doc on testing and debugging (vllm-project#8514)
* [Bugfix] Bind api server port before starting engine (vllm-project#8491)
* [perf bench] set timeout to debug hanging (vllm-project#8516)
* [misc] small qol fixes for release process (vllm-project#8517)
* [Bugfix] Fix 3.12 builds on main (vllm-project#8510) Signed-off-by: Joe Runde <[email protected]>
* [refactor] remove triton based sampler (vllm-project#8524)
* [Frontend] Improve Nullable kv Arg Parsing (vllm-project#8525) Signed-off-by: Alex-Brooks <[email protected]>
* [Misc][Bugfix] Disable guided decoding for mistral tokenizer (vllm-project#8521)
* [torch.compile] register allreduce operations as custom ops (vllm-project#8526)
* [Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change (vllm-project#8509) Signed-off-by: Rui Qiao <[email protected]>
* [Benchmark] Support sample from HF datasets and image input for benchmark_serving (vllm-project#8495)
* [Encoder decoder] Add cuda graph support during decoding for encoder-decoder models (vllm-project#7631)
* [Feature][kernel] tensor parallelism with bitsandbytes quantization (vllm-project#8434)
* [Model] Add mistral function calling format to all models loaded with "mistral" format (vllm-project#8515) Co-authored-by: Cyrus Leung <[email protected]>
* [Misc] Don't dump contents of kvcache tensors on errors (vllm-project#8527)
* [Bugfix] Fix TP > 1 for new granite (vllm-project#8544) Signed-off-by: Joe Runde <[email protected]>
* [doc] improve installation doc (vllm-project#8550) Co-authored-by: Andy Dai <[email protected]>
* [CI/Build] Excluding kernels/test_gguf.py from ROCm (vllm-project#8520)
* [Kernel] Change interface to Mamba causal_conv1d_update for continuous batching (vllm-project#8012)
* [CI/Build] fix Dockerfile.cpu on podman (vllm-project#8540)
* [Misc] Add argument to disable FastAPI docs (vllm-project#8554)
* [CI/Build] Avoid CUDA initialization (vllm-project#8534)
* [CI/Build] Update Ruff version (vllm-project#8469) Signed-off-by: Aaron Pham <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>
* [Core][Bugfix][Perf] Introduce `MQLLMEngine` to avoid `asyncio` OH (vllm-project#8157) Co-authored-by: Nick Hill <[email protected]> Co-authored-by: [email protected] <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Simon Mo <[email protected]>
* [Core] *Prompt* logprobs support in Multi-step (vllm-project#8199)
* [Core] zmq: bind only to 127.0.0.1 for local-only usage (vllm-project#8543) Signed-off-by: Russell Bryant <[email protected]>
* [Model] Support Solar Model (vllm-project#8386) Co-authored-by: Michael Goin <[email protected]>
* [AMD][ROCm] Quantization methods on ROCm; Fix _scaled_mm call (vllm-project#8380) Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]> Co-authored-by: Michael Goin <[email protected]>
* [Kernel] Change interface to Mamba selective_state_update for continuous batching (vllm-project#8039)
* [BugFix] Nonzero exit code if MQLLMEngine startup fails (vllm-project#8572)
* [Bugfix] add `dead_error` property to engine client (vllm-project#8574) Signed-off-by: Joe Runde <[email protected]>
* [Kernel] Remove marlin moe templating on thread_m_blocks (vllm-project#8573) Co-authored-by: [email protected]
* [Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models. (vllm-project#8545)
* Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer" (vllm-project#8593)
* [Bugfix] fixing sonnet benchmark bug in benchmark_serving.py (vllm-project#8616)
* [MISC] remove engine_use_ray in benchmark_throughput.py (vllm-project#8615)
* [Frontend] Use MQLLMEngine for embeddings models too (vllm-project#8584)
* [Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention (vllm-project#8577)
* [Core] simplify logits resort in _apply_top_k_top_p (vllm-project#8619)
* [Doc] Add documentation for GGUF quantization (vllm-project#8618)
* Create SECURITY.md (vllm-project#8642)
* [CI/Build] Re-enabling Entrypoints tests on ROCm, excluding ones that fail (vllm-project#8551)
* [Misc] guard against change in cuda library name (vllm-project#8609)
* [Bugfix] Fix Phi3.5 mini and MoE LoRA inference (vllm-project#8571)
* [bugfix] [AMD] add multi-step advance_step to ROCmFlashAttentionMetadata (vllm-project#8474)
* [Core] Support Lora lineage and base model metadata management (vllm-project#6315)
* [Model] Add OLMoE (vllm-project#7922)
* [CI/Build] Removing entrypoints/openai/test_embedding.py test from ROCm build (vllm-project#8670)
* [Bugfix] Validate SamplingParam n is an int (vllm-project#8548)
* [Misc] Show AMD GPU topology in `collect_env.py` (vllm-project#8649)
* [Bugfix] Config got an unexpected keyword argument 'engine' (vllm-project#8556)
* [Bugfix][Core] Fix tekken edge case for mistral tokenizer (vllm-project#8640)
* [Doc] neuron documentation update (vllm-project#8671) Signed-off-by: omrishiv <[email protected]>
* [Hardware][AWS] update neuron to 2.20 (vllm-project#8676) Signed-off-by: omrishiv <[email protected]>
* [Bugfix] Fix incorrect llava next feature size calculation (vllm-project#8496)
* [Core] Rename `PromptInputs` and `inputs` (vllm-project#8673)
* [MISC] add support custom_op check (vllm-project#8557) Co-authored-by: youkaichao <[email protected]>
* [Core] Factor out common code in `SequenceData` and `Sequence` (vllm-project#8675)
* [beam search] add output for manually checking the correctness (vllm-project#8684)
* [Kernel] Build flash-attn from source (vllm-project#8245)
* [VLM] Use `SequenceData.from_token_counts` to create dummy data (vllm-project#8687)
* [Doc] Fix typo in AMD installation guide (vllm-project#8689)
* [Kernel][Triton][AMD] Remove tl.atomic_add from awq_gemm_kernel, 2-5x speedup MI300, minor improvement for MI250 (vllm-project#8646)
* [dbrx] refactor dbrx experts to extend FusedMoe class (vllm-project#8518)
* [Kernel][Bugfix] Delete some more useless code in marlin_moe_ops.cu (vllm-project#8643)
* [Bugfix] Refactor composite weight loading logic (vllm-project#8656)
* [ci][build] fix vllm-flash-attn (vllm-project#8699)
* [Model] Refactor BLIP/BLIP-2 to support composite model loading (vllm-project#8407)
* [Misc] Use NamedTuple in Multi-image example (vllm-project#8705) Signed-off-by: Alex-Brooks <[email protected]>
* [MISC] rename CudaMemoryProfiler to DeviceMemoryProfiler (vllm-project#8703)
* [Model][VLM] Add LLaVA-Onevision model support (vllm-project#8486) Co-authored-by: litianjian <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: DarkLight1337 <[email protected]>
* [SpecDec][Misc] Cleanup, remove bonus token logic. (vllm-project#8701)
* [build] enable existing pytorch (for GH200, aarch64, nightly) (vllm-project#8713)
* [misc] upgrade mistral-common (vllm-project#8715)
* [Bugfix] Avoid some bogus messages RE CUTLASS's revision when building (vllm-project#8702)
* [Bugfix] Fix CPU CMake build (vllm-project#8723) Co-authored-by: Yuan <[email protected]>
* [Bugfix] fix docker build for xpu (vllm-project#8652)
* [Core][Frontend] Support Passing Multimodal Processor Kwargs (vllm-project#8657) Signed-off-by: Alex-Brooks <[email protected]>
* [Hardware][CPU] Refactor CPU model runner (vllm-project#8729)
* [Bugfix][CPU] fix missing input intermediate_tensors in the cpu_model_runner (vllm-project#8733)
* [Model] Support pp for qwen2-vl (vllm-project#8696)
* [VLM] Fix paligemma, fuyu and persimmon with transformers 4.45: use config.text_config.vocab_size (vllm-project#8707)
* [CI/Build] use setuptools-scm to set __version__ (vllm-project#4738) Co-authored-by: youkaichao <[email protected]>
* [Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (vllm-project#7701) Co-authored-by: mgoin <[email protected]> Co-authored-by: Divakar Verma <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]>
* [Kernel][LoRA] Add assertion for punica sgmv kernels (vllm-project#7585)
* [Core] Allow IPv6 in VLLM_HOST_IP with zmq (vllm-project#8575) Signed-off-by: Russell Bryant <[email protected]>
* Fix typical acceptance sampler with correct recovered token ids (vllm-project#8562)
* Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse (vllm-project#8335)
* [Hardware][AMD] ROCm6.2 upgrade (vllm-project#8674)
* Fix tests in test_scheduler.py that fail with BlockManager V2 (vllm-project#8728)
* re-implement beam search on top of vllm core (vllm-project#8726) Co-authored-by: Brendan Wong <[email protected]>
* Revert "[Core] Rename `PromptInputs` to `PromptType`, and `inputs` to `prompt`" (vllm-project#8750)
* [MISC] Skip dumping inputs when unpicklable (vllm-project#8744)
* [Core][Model] Support loading weights by ID within models (vllm-project#7931)
* [Model] Expose Phi3v num_crops as a mm_processor_kwarg (vllm-project#8658) Signed-off-by: Alex-Brooks <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: DarkLight1337 <[email protected]>
* [Bugfix] Fix potentially unsafe custom allreduce synchronization (vllm-project#8558)
* [Kernel] Split Marlin MoE kernels into multiple files (vllm-project#8661) Co-authored-by: mgoin <[email protected]>
* [Frontend] Batch inference for llm.chat() API (vllm-project#8648) Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Roger Wang <[email protected]>
* [Bugfix] Fix torch dynamo fixes caused by `replace_parameters` (vllm-project#8748)
* [CI/Build] fix setuptools-scm usage (vllm-project#8771)
* [misc] soft drop beam search (vllm-project#8763)
* [Misc] Upgrade bitsandbytes to the latest version 0.44.0 (vllm-project#8768)
* [Core][Bugfix] Support prompt_logprobs returned with speculative decoding (vllm-project#8047) Signed-off-by: Travis Johnson <[email protected]>
* [Core] Adding Priority Scheduling (vllm-project#5958)
* [Bugfix] Use heartbeats instead of health checks (vllm-project#8583)
* Fix test_schedule_swapped_simple in test_scheduler.py (vllm-project#8780)
* [Bugfix][Kernel] Implement acquire/release polyfill for Pascal (vllm-project#8776)
* Fix tests in test_chunked_prefill_scheduler which fail with BlockManager V2 (vllm-project#8752)
* [BugFix] Propagate 'trust_remote_code' setting in internvl and minicpmv (vllm-project#8250)
* [Hardware][CPU] Enable mrope and support Qwen2-VL on CPU backend (vllm-project#8770)
* [Bugfix] load fc bias from config for eagle (vllm-project#8790)

---------

Signed-off-by: Joe Runde <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Aaron Pham <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: omrishiv <[email protected]>
Signed-off-by: Travis Johnson <[email protected]>
Co-authored-by: ElizaWszola <[email protected]>
Co-authored-by: Dipika <[email protected]>
Co-authored-by: lewtun <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Co-authored-by: sasha0552 <[email protected]>
Co-authored-by: Luka Govedič <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: Kevin Lin <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
Co-authored-by: Alex Brooks <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Rui Qiao <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: sroy745 <[email protected]>
Co-authored-by: chenqianfzh <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Andy Dai <[email protected]>
Co-authored-by: Alexey Kondratiev(AMD) <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Daniele <[email protected]>
Co-authored-by: Jiaxin Shan <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>
Co-authored-by: Alexander Matveev <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: afeldman-nm <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Co-authored-by: Geun, Lim <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>
Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]>
Co-authored-by: Kuntai Du <[email protected]>
Co-authored-by: Kunshang Ji <[email protected]>
Co-authored-by: Charlie Fu <[email protected]>
Co-authored-by: 盏一 <[email protected]>
Co-authored-by: bnellnm <[email protected]>
Co-authored-by: Amit Garg <[email protected]>
Co-authored-by: William Lin <[email protected]>
Co-authored-by: Niklas Muennighoff <[email protected]>
Co-authored-by: saumya-saran <[email protected]>
Co-authored-by: Pastel! <[email protected]>
Co-authored-by: omrishiv <[email protected]>
Co-authored-by: zyddnys <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: rasmith <[email protected]>
Co-authored-by: Divakar Verma <[email protected]>
Co-authored-by: Huazhong Ji <[email protected]>
Co-authored-by: litianjian <[email protected]>
Co-authored-by: litianjian <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Lily Liu <[email protected]>
Co-authored-by: Yuan <[email protected]>
Co-authored-by: Yan Ma <[email protected]>
Co-authored-by: Li, Jiang <[email protected]>
Co-authored-by: Yanyi Liu <[email protected]>
Co-authored-by: Jani Monoses <[email protected]>
Co-authored-by: Lucas Wilkinson <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Co-authored-by: jiqing-feng <[email protected]>
Co-authored-by: Hongxia Yang <[email protected]>
Co-authored-by: Brendan Wong <[email protected]>
Co-authored-by: Cody Yu <[email protected]>
Co-authored-by: Peter Salas <[email protected]>
Co-authored-by: Hanzhi Zhou <[email protected]>
Co-authored-by: Andy <[email protected]>
Co-authored-by: Travis Johnson <[email protected]>
Co-authored-by: Archit Patke <[email protected]>
Co-authored-by: zifeitong <[email protected]>
Co-authored-by: sohamparikh <[email protected]>