[Roadmap] vLLM Roadmap Q2 2024 #3861

simon-mo · 2024-04-04T22:38:01Z

This document includes the features in vLLM's roadmap for Q2 2024. Please feel free to discuss and contribute to the specific features at related RFC/Issues/PRs and add anything else you'd like to talk about in this issue.

You can see our historical roadmaps at #2681 and #244. This roadmap contains work committed by the vLLM team from UC Berkeley, as well as the broader vLLM contributor groups including but not limited to Anyscale, IBM, NeuralMagic, Roblox, Oracle Cloud. You can also find help wanted items in this roadmap! Additionally, this roadmap is shaped by you, our user community!

Themes.

We categorized our roadmap into 6 broad themes:

Broad model support: vLLM should support a wide range of transformer based models. It should be kept up to date as much as possible. This includes new auto-regressive decoder models, encoder-decoder models, hybrid architectures, and models supporting multi-modal inputs.
Excellent hardware coverage: vLLM should run on a wide range of accelerators for production AI workloads. This includes GPUs, tensor accelerators, and CPUs. We will work closely with hardware vendors to ensure vLLM gets the best performance out of each chip.
Performance optimization: vLLM should be kept up to date with the latest performance optimization techniques. Users of vLLM can trust its performance to be competitive and strong.
Production level engine: vLLM should be the go-to choice for a production level serving engine, with a suite of features bridging the gap from a single forward pass to 24/7 service.
Strong OSS product: vLLM is and will be a true community project. We want it to be a healthy project with a regular release cadence, good documentation, and a growing pool of reviewers for the codebase.
Extensible architectures: For vLLM to grow at an even faster pace, it needs good abstractions to support a wide range of scheduling policies, hardware backends, and inference optimizations. We will work on refactoring the codebase to support that.

Broad Model Support

Encoder Decoder Models
- T5 (Add Encoder-decoder model support and T5 Model support #3117)
- Whisper
- Embedding (Supporting embedding models #3187)
Hybrid Architecture (Jamba) ([New Model]: Jamba (MoE Mamba from AI21) #3690)
Decoder Only Embedding Models ([Model][Misc] Add e5-mistral-7b-instruct and Embedding API #3734)
Prefix tuning support

Help Wanted:

More vision transformers beyond llava
Support private model registration (How to serve a private HF model? #172)
Control vector support ([Feature]: Control vectors #3451)
Fallback support for arbitrary transformers text generation model
Long context investigation of LongRoPE
RWKV

Excellent Hardware Coverage

Performance Optimization

Speculative decoding
- Speculative decoding framework for top-1 proposals w/draft model
- Proposer improvement: Prompt-lookup n-gram speculations
- Scoring improvement: Make batch expansion optional
- Scoring improvement: dynamic scoring length policy
Kernels:
- FlashInfer integration (Import FlashInfer: 3x faster PagedAttention than vLLM #2767)
- Sampler optimizations leveraging triton compiler
Quantization:
- FP8 format support for NVIDIA Ammo and AMD Quantizer
- Weight only quantization (Marlin) improvements: act_order, int8, Exllama2 compatibility, fused MoE, AWQ kernels.
- Activation quantization (W8A8, FP8, etc)
- Quantized LoRA support (Add Support for QLORA/QA-QLORA weights which are not merged #3225)
- AQLM quantization

Constrained decoding performance (batch, async, acceleration) and extensibility: Outlines ([Feature]: Update Outlines Integration from FSM to Guide #3715), LMFormatEnforcer ([Feature]: Integrate with lm-format-enforcer #3713), AICI (AI Controller Interface (AICI) integration #2888)

Help Wanted:

Sparse kv cache (H2O, compression, FastDecode)
Speculative decoding
- Proposer/scoring/verifier improvement: Top-k “tree attention” proposals for Eagle/Medusa/Draft model
- Proposer improvement: RAG n-gram speculations
- Proposer improvement: Eagle/Medusa top-1 proposals
- Proposer improvement: Quantized draft models
- Verifier improvement: Typical acceptance

Production Level Engine

Scheduling
- Prototype Disaggregated prefill (How to use Splitwise(from microsoft) in vllm? #2370)
- Speculative decoding fully merged in ([WIP] Speculative decoding using a draft model #2188)
- Turn chunked prefill/sarathi/splitfuse on by default ([2/N] Chunked prefill data update #3538)
Memory management
- Automatic prefix caching enhancement

TGI feature parity (stop string handling, logging and metrics, test improvements)
Provide non-ray option for single node inference
Optimize api server performance
OpenAI server feature completeness (function calling) (OpenAI Tools / function calling v2 #3237)
Model Loading
- Optimize model weights loading by directly loading from hub/S3 ([Feature]: Add model loading using CoreWeave's tensorizer #3533)
- Fully offline mode

Help Wanted:

Logging serving FLOPs for performance analysis
Dynamic LoRA adapter downloads from hub/S3

Strong OSS Product

Continuous benchmarks (resource needed!)
Commit to 2wk release cadence
Growing reviewer and committer base

Better docs
- doc: memory and performance tuning guide
- doc: apc documentation
- doc: hardware support levels, feature matrix, and policies
- doc: guide to horizontally scale up vLLM service
- doc: developer guide for adding new draft based models or draft-less optimizations

Automatic CD of nightly wheels and docker images

Help Wanted:

ARM aarch64 support for AWS Graviton based instances and GH200
Full correctness test with HuggingFace transformers. Resources needed.
Well tested support for lm-eval-harness (logprobs, get tokenizers)
Local development workflow without CUDA

Extensible Architecture

Prototype pipeline parallelism
Extensible memory manager
Extensible scheduler
torch.compile investigations
- use compile for quantization kernel fusion
- use compile for future proofing graph mode
- use compile for xpu or other accelerators
Architecture for queue management and request prioritization
Streaming LLM, prototype it on new block manager
Investigate Tensor + Pipeline parallelism (LIGER)

Jeffwan · 2024-04-05T00:48:17Z

@simon-mo On prefill disaggregation: the Splitwise and DistServe papers both built their solutions on top of vLLM for evaluation. Are there any contributions from these teams? Is the vLLM community open to public contributions for this feature?

simon-mo · 2024-04-05T00:50:27Z

@Jeffwan yes! We are actively working with the authors of both papers to integrate the work properly. We are also working with Sarathi's authors on chunked prefill.

kanseaveg · 2024-04-05T02:50:41Z

Any update for PEFT?

Please consider supporting Hugging Face PEFT, thank you. #1129

simon-mo · 2024-04-05T05:25:04Z

Hi @kanseaveg, we do support LoRA and are planning to add prefix tuning support, which should allow the Hugging Face PEFT model format. Which PEFT methods are you interested in?

kanseaveg · 2024-04-05T05:29:45Z

@simon-mo Thank you very much for your reply. There are three common types of tuning methods that I currently care about:

- prefix-tuning / p-tuning v2
- adapter-tuning
- lora-tuning (currently supported)

I hope the vLLM framework can support these, which is what I asked for in Q3 last year and Q1 this year.
Thank you very much for your reply.

accupham · 2024-04-05T13:05:30Z

Maybe consider supporting QuaRot quantization scheme?

QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs
We introduce QuaRot, a new Quantization scheme based on Rotations, which is able to quantize LLMs end-to-end, including all weights, activations, and KV cache in 4 bits. QuaRot rotates LLMs in a way that removes outliers from the hidden state without changing the output, making quantization easier. This computational invariance is applied to the hidden state (residual) of the LLM, as well as to the activations of the feed-forward components, aspects of the attention mechanism and to the KV cache. The result is a quantized model where all matrix multiplications are performed in 4-bits, without any channels identified for retention in higher precision. Our quantized LLaMa2-70B model has losses of at most 0.29 WikiText-2 perplexity and retains 99% of the zero-shot performance. Code is available at: this https URL.

I think this would be huge for larger models like Command-R+ (104B) being able to fit into a single 80G A100 with negligible performance losses.
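
For a rough sense of the arithmetic behind this claim, here is a minimal sketch that counts weight storage only. It assumes the 104B parameter count quoted above, treats "80G" as 80 decimal GB, and ignores KV cache, activations, and quantization metadata, so real headroom would be smaller.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in decimal GB (weights only, no KV cache)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 104e9        # parameter count quoted in the comment above
GPU_MEMORY_GB = 80    # assumption: "80G A100" read as 80 decimal GB

for bits in (16, 8, 4):
    size = weight_memory_gb(PARAMS, bits)
    verdict = "fits" if size < GPU_MEMORY_GB else "does not fit"
    print(f"{bits}-bit weights: ~{size:.0f} GB ({verdict} in {GPU_MEMORY_GB} GB)")
```

Under these assumptions only the 4-bit case leaves room on a single card, and the remaining memory still has to hold the KV cache.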

zbloss · 2024-04-05T16:35:41Z

Very excited to see both Embedding models and CPU support on the roadmap!

These being implemented would make vLLM my default model serving engine.

sangstar · 2024-04-05T17:35:23Z

Very excited to see that the tensorizer PR is in this roadmap! Sorry about all the pings, I'm just passionate about getting this to vLLM users :D More than happy to be of any assistance in getting that feature implemented :)

PenutChen · 2024-04-08T01:28:34Z

Will larger vocabulary size for multi-lora be supported in Q2 2024? Related: #3000

yukavio · 2024-04-09T07:56:37Z

I'm very interested in implementing tree attention for speculative decoding. @simon-mo

jeejeelee · 2024-04-12T10:50:49Z

Will larger vocabulary size for multi-lora be supported in Q2 2024? Related: #3000

#4015 has done this

qZhang88 · 2024-04-19T14:14:03Z

Will larger vocabulary size for multi-lora be supported in Q2 2024? Related: #3000

#4015 has done this

This is strange: serving a LoRA fine-tune of Llama-3 (vocab size 12800) has the same problem, "When using LoRA, vocab size must be 32000 >= vocab_size <= 33024". However, the same code fine-tuning Qwen1.5-7B-Chat, with vocab size 151643, has no such serving problem. Why?

jeejeelee · 2024-04-19T15:24:57Z

The function create_lora_weights from LogitsProcessorWithLoRA throws this error.

Models using the llama architecture designate lm_head as a target module for LoRA and need to instantiate LogitsProcessorWithLoRA; refer to: https://github.com/vllm-project/vllm/blob/main/vllm/lora/models.py#438

Models such as qwen-2 don't designate lm_head as a target module for LoRA, so they don't instantiate LogitsProcessorWithLoRA.

qZhang88 · 2024-04-20T01:47:32Z

I see, but lm_head is not fine-tuned during LoRA, so there is no need to replace logits_processor. In my adapter_config.json, target_modules does not contain lm_head:

  "target_modules": [
    "gate_proj",
    "v_proj",
    "q_proj",
    "o_proj",
    "up_proj",
    "k_proj",
    "down_proj"
  ],

jeejeelee · 2024-04-20T02:15:45Z

vLLM supports multi-LoRA, so whether to replace logits_processor is determined by the model's support_modules, not by adapter_config.json.
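
The behaviour described in this exchange can be illustrated with a small sketch. This is not vLLM's code: `ModelSpec`, `check_lora_serving`, and the per-model module lists are hypothetical, and only the 32000/33024 bounds and the vocab sizes come from the messages above. It shows why an adapter that never touches lm_head can still hit the vocab-size error: the check depends on what the model class declares, not on the adapter's target_modules.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Bounds taken from the error message quoted in this thread.
MIN_LORA_VOCAB = 32000
MAX_LORA_VOCAB = 33024


@dataclass
class ModelSpec:
    """Hypothetical stand-in for a model class and its LoRA-capable modules."""
    name: str
    vocab_size: int
    support_modules: Tuple[str, ...]


def check_lora_serving(model: ModelSpec, adapter_target_modules: List[str]) -> None:
    """Raise ValueError if the logits-processor wrapper would reject the vocab size.

    adapter_target_modules is accepted only to make the point that it is
    ignored: the decision is driven by the model's declared modules.
    """
    if "lm_head" not in model.support_modules:
        return  # no logits-processor wrapper, so no vocab-size check
    if not (MIN_LORA_VOCAB <= model.vocab_size <= MAX_LORA_VOCAB):
        raise ValueError(
            "When using LoRA, vocab size must be 32000 >= vocab_size <= 33024"
        )


# Module lists are assumptions that mirror the explanation above.
llama_like = ModelSpec("llama-like", 12800, ("q_proj", "v_proj", "lm_head"))
qwen_like = ModelSpec("qwen-like", 151643, ("q_proj", "v_proj"))
adapter_targets = ["gate_proj", "v_proj", "q_proj", "o_proj",
                   "up_proj", "k_proj", "down_proj"]  # no lm_head

for spec in (llama_like, qwen_like):
    try:
        check_lora_serving(spec, adapter_targets)
        print(f"{spec.name}: ok")
    except ValueError as err:
        print(f"{spec.name}: {err}")
```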

Vermeille · 2024-04-25T10:46:21Z

would like to help with #620

WangErXiao · 2024-05-04T03:10:00Z

@Jeffwan yes! We are actively working with the authors of both papers to integrate the work properly. We are also working with Sarathi's authors on chunked prefill.

Looking forward to the release of vLLM support for the Prefill-Decode Disaggregation feature.

colourful-tree · 2024-05-08T06:21:08Z

@simon-mo Hi, how about https://arxiv.org/abs/2404.18057? It seems to have a significant advantage on long sequences, and it does not conflict with PagedAttention.

kanseaveg · 2024-05-10T00:41:07Z

@simon-mo Is there any update on #3117? This issue was raised in February, and it has been nearly three months. We sincerely look forward to an update on this, thank you.

simon-mo · 2024-05-10T00:57:50Z

@simon-mo Is there any update on #3117? This issue was raised in February, and it has been nearly three months. We sincerely look forward to an update on this, thank you.

Still in progress. @robertgshaw2-neuralmagic can help comment more.

zxy-zzz · 2024-05-11T09:11:33Z

Do you have plans to incorporate RISC-V or ARM CPU backends into the vLLM project? Thank you.

robertgshaw2-neuralmagic · 2024-05-17T15:29:42Z

We should consider long-context optimizations for Q3.

e.g. things like https://github.com/feifeibear/long-context-attention

sumukshashidhar · 2024-05-19T01:37:30Z

Hi - with smaller models being popular these days, I'm wondering whether there are any plans for Q3 to support data parallelism (loading copies of the same model onto multiple GPUs).

If not - I can help with this

johnsonwag03 · 2024-05-19T13:43:32Z

Do you have plans to support NVIDIA Jetson devices with aarch64?

robertgshaw2-neuralmagic · 2024-05-19T14:51:08Z

Hi - with smaller models being popular these days, I'm wondering whether there are any plans for Q3 to support data parallelism (loading copies of the same model onto multiple GPUs).

If not - I can help with this

Are you thinking this would be something handled internally by LLMEngine or a new front end that stands in front?

If handled internally, this will require significant changes to the core logic.

Also, if this is targeted at offline batch mode, perhaps we will see some gains, though I suspect not too much since we can saturate the GPU via batching even with TP.

If this is targeted at online serving, I do not think we should be implementing a load balancer in vLLM. This should be handled by higher level orchestrators like Kubernetes or Ray.

sumukshashidhar · 2024-05-19T16:49:44Z

My particular use case is automated large offline batches, for which I have a hotfix: I spin up multiple OpenAI servers and distribute the prompts among them. Curiously, I see large speedups when I do this, as opposed to TP.

Also, if this is targeted at offline batch mode, perhaps we will see some gains, though I suspect not too much since we can saturate the GPU via batching even with TP.

I'm not sure if this is a bug or something else, because I did indeed see large speedups with this when I completely removed Ray worker communication (some digging suggested that the overhead is not worth it). If this is not expected, I can run some experiments and post them here. (This may be an artifact of my having a PCIe GPU cluster without NVLink.)
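
The hotfix described here (several independent servers, with prompts split among them) can be sketched roughly as follows. The endpoint URLs and the `send_batch` callable are hypothetical placeholders; in a real setup `send_batch` would call each server's OpenAI-compatible completions API and return one completion per prompt. The demo uses a fake sender so the sketch runs without any server.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence


def shard_round_robin(prompts: Sequence[str], num_shards: int) -> List[List[str]]:
    """Split prompts into num_shards lists, dealing them out round-robin."""
    return [list(prompts[i::num_shards]) for i in range(num_shards)]


def run_data_parallel(
    prompts: Sequence[str],
    endpoints: Sequence[str],
    send_batch: Callable[[str, List[str]], List[str]],
) -> List[Optional[str]]:
    """Send one shard to each endpoint concurrently and restore prompt order.

    Requires at least one endpoint; send_batch must return exactly one
    result per prompt in the shard it receives.
    """
    n = len(endpoints)
    shards = shard_round_robin(prompts, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        shard_results = list(pool.map(send_batch, endpoints, shards))
    ordered: List[Optional[str]] = [None] * len(prompts)
    for i, results in enumerate(shard_results):
        ordered[i::n] = results  # undo the round-robin split
    return ordered


def fake_send(endpoint: str, batch: List[str]) -> List[str]:
    """Stand-in for an HTTP call to one server; just upper-cases each prompt."""
    return [prompt.upper() for prompt in batch]


endpoints = ["http://localhost:8000", "http://localhost:8001"]  # hypothetical
print(run_data_parallel(["a", "b", "c", "d", "e"], endpoints, fake_send))
```

Each server holds a full copy of the model, so no inter-GPU communication happens during generation, which is one plausible reason for the speedup reported on a PCIe-only cluster.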

robertgshaw2-neuralmagic · 2024-05-19T16:53:53Z

Okay great. We would welcome a contribution focused on the offline batch processing case.

Could you make an RFC issue to discuss a potential design? I think we should try hard to not modify LLMEngine and see if we can handle things in the LLM class

fenggwsx · 2024-05-31T13:31:08Z

Very excited to see function calling support in OpenAI-Compatible server is in this roadmap!
This is quite helpful when using LangChain.

irasin · 2024-06-03T07:57:22Z

@Jeffwan yes! We are actively working with the authors of both papers to integrate the work properly. We are also working with Sarathi's authors on chunked prefill.

Hi @simon-mo. Is there any update about splitwise? It seems that the development of #2809 has stopped.

K-Mistele · 2024-06-16T21:13:48Z

Would love to see updates to the docs on how to use supported vision models, embedding models, and the new support for tools with forced tool choice (auto tool choice is still WIP as I understand)

cason0126 · 2024-06-19T08:50:47Z

Hi @simon-mo, is there any plan to support Huawei's NPU hardware?

CSEEduanyu · 2024-06-23T06:52:19Z

Hi @simon-mo, is there any plan to support Huawei's NPU hardware?
@simon-mo Some companies have no moral bottom line; don't have anything to do with them.

simon-mo · 2024-06-25T00:08:31Z

Q3 published here #5805

2-fly-4-ai · 2024-08-17T04:32:53Z

Is function calling available yet?

K-Mistele · 2024-08-17T19:46:26Z

Is function calling available yet?

Soon, for Hermes and mistral models in #5649

If there are other specific models you're interested in, let me know and I can add it in my follow up PR along with Llama 3.1

githebs · 2024-08-19T13:07:09Z

Is function calling available yet?

Soon, for Hermes and mistral models in #5649

If there are other specific models you're interested in, let me know and I can add it in my follow up PR along with Llama 3.1

So Llama is not included initially? Thanks

joshdevins · 2024-09-04T10:58:11Z

If there are other specific models you're interested in, let me know and I can add it in my follow up PR along with Llama 3.1

@K-Mistele Is there a PR or issue I can follow for function calling support with Llama 3.1 (70B specifically)?

K-Mistele · 2024-09-04T15:41:22Z

If there are other specific models you're interested in, let me know and I can add it in my follow up PR along with Llama 3.1

@K-Mistele Is there a PR or issue I can follow for function calling support with Llama 3.1 (70B specifically)?

There is a branch on my vLLM fork, but not a PR yet since #5649 needs to be merged before I open another PR based on it.

simon-mo added misc and removed misc labels Apr 4, 2024

simon-mo mentioned this issue Apr 4, 2024

[Roadmap] vLLM Roadmap Q1 2024 #2681

Closed

30 tasks

simon-mo pinned this issue Apr 4, 2024

yukavio mentioned this issue Apr 10, 2024

[Feature]: Tree attention about Speculative Decoding #3960

Open

DarkLight1337 mentioned this issue Apr 11, 2024

[Core][Frontend][Doc] Initial support for LLaVA-NeXT and GPT-4V Chat Completions API #3978

Closed

7 tasks

tlrmchlsmth mentioned this issue May 13, 2024

[Kernel] Add w8a8 CUTLASS kernels #4749

Merged

simon-mo closed this as completed Jun 25, 2024

simon-mo unpinned this issue Jun 25, 2024

agm-eratosth mentioned this issue Aug 21, 2024

[Roadmap] vLLM Roadmap Q3 2024 #5805

Closed

46 tasks

simon-mo mentioned this issue Oct 1, 2024

[Roadmap] vLLM Roadmap Q4 2024 #9006

Open

40 tasks
