[Roadmap] vLLM Roadmap Q2 2024 #3861
Comments
@simon-mo Regarding prefill disaggregation: the Splitwise and DistServe papers both build their solutions on top of vLLM for evaluation. Are there any contributions coming from those teams? Is the vLLM community open to public contributions for this feature?
@Jeffwan Yes! We are actively working with the authors of both papers to integrate the work properly. We are also working with Sarathi's authors on chunked prefill.
Any update on PEFT? Please consider supporting Hugging Face PEFT, thank you. #1129
Hi @kanseaveg, we do support LoRA and are planning to add prefix tuning support, which should allow the Hugging Face PEFT model format. Which PEFT methods are you interested in?
@simon-mo Thank you very much for your reply. There are three common types of tuning methods that I am currently concerned about:
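For context on the LoRA support mentioned above, here is a minimal sketch of vLLM's existing offline LoRA path, which consumes adapters saved in the Hugging Face PEFT format. The base model name and adapter path are placeholders, not something from this thread.

```python
# Minimal sketch of vLLM's existing LoRA support (offline API).
# The base model and adapter path below are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(
    ["Summarize: vLLM is a high-throughput LLM serving engine."],
    params,
    # LoRARequest(adapter name, unique integer id, path to a PEFT-format adapter)
    lora_request=LoRARequest("my_adapter", 1, "/path/to/peft_adapter"),
)
print(outputs[0].outputs[0].text)
```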
Maybe consider supporting the QuaRot quantization scheme?
I think this would be huge for larger models like Command-R+ (104B), which could then fit on a single 80GB A100 with negligible performance loss.
Very excited to see both embedding models and CPU support on the roadmap! Having these implemented would make vLLM my default model serving engine.
Will larger vocabulary sizes for multi-LoRA be supported in Q2 2024? Related: #3000
I'm very interested in implementing tree attention for speculative decoding. @simon-mo
This is strange; serving a LoRA fine-tune of Llama-3 (vocab size 12800) has the same problem.
Models using the Llama architecture designate lm_head as a target module for LoRA and need to instantiate it; models such as Qwen-2 don't designate lm_head as a target module for LoRA, so they don't instantiate it.
I see, but lm_head is not fine-tuned during LoRA, so there is no need to replace it.
vLLM supports multi-LoRA; the question is whether lm_head needs to be replaced.
I would like to help with #620.
Looking forward to the release of vLLM's support for the prefill-decode disaggregation feature.
@simon-mo Hi, how about https://arxiv.org/abs/2404.18057? It seems to have a significant advantage on long sequences, and it does not conflict with PagedAttention.
Still in progress. @robertgshaw2-neuralmagic can help comment more.
Do you have plans to incorporate RISC-V or ARM CPU backends into the vLLM project? Thank you.
We should consider long-context optimizations for Q3.
Hi, with smaller models being popular these days, I'm wondering if there are any plans for Q3 to support data parallelism (loading copies of the same model onto multiple GPUs). If not, I can help with this.
Do you have plans to support NVIDIA Jetson devices with aarch64?
Are you thinking this would be something handled internally? If handled internally, this would require significant changes to the core logic. Also, if this is targeted at offline batch mode, perhaps we would see some gains, though I suspect not too much, since we can already saturate the GPU via batching even with TP. If this is targeted at online serving, I do not think we should be implementing a load balancer in vLLM; this should be handled by higher-level orchestrators like Kubernetes or Ray.
My particular use case is large automatic offline batches, for which I have a workaround: I spin up multiple OpenAI-compatible servers and distribute the prompts among them. Curiously, I see large speedups when I do this, as opposed to using TP.
I'm not sure if this is a bug or something else, because I did indeed see large speedups when I completely removed Ray worker communication (some digging suggested the overhead is not worth it). If this is not expected, I can try some experiments and post them here. (This may be an artifact of my cluster using PCIe GPUs, without NVLink speedups.)
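A rough sketch of that workaround, assuming several single-GPU vLLM OpenAI-compatible servers (e.g. started with `python -m vllm.entrypoints.openai.api_server` on different ports) are already running; the ports, model name, and prompts below are illustrative placeholders:

```python
# Hedged sketch: round-robin prompts across N already-running vLLM
# OpenAI-compatible servers instead of using tensor parallelism.
# Endpoints, model name, and prompts are placeholders.
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

ENDPOINTS = [f"http://localhost:{8000 + i}/v1" for i in range(4)]
clients = [OpenAI(base_url=url, api_key="EMPTY") for url in ENDPOINTS]

def complete(task):
    idx, prompt = task
    client = clients[idx % len(clients)]  # simple round-robin data parallelism
    resp = client.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        prompt=prompt,
        max_tokens=64,
    )
    return resp.choices[0].text

prompts = [f"Write a one-line summary of document {i}." for i in range(256)]
with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(complete, enumerate(prompts)))
```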
Okay, great. We would welcome a contribution focused on the offline batch processing case. Could you make an RFC issue to discuss a potential design? I think we should try hard not to modify LLMEngine and see if we can handle things in the LLM class.
Very excited to see that function calling support in the OpenAI-compatible server is on this roadmap!
Would love to see updates to the docs on how to use supported vision models, embedding models, and the new support for tools with forced tool choice (auto tool choice is still a WIP, as I understand it).
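As a rough illustration of the forced (named) tool choice mentioned above, the snippet below calls a vLLM OpenAI-compatible server with a named tool via the standard OpenAI client; the endpoint, model, and get_weather tool are hypothetical placeholders, and auto tool choice is the part still in progress.

```python
# Hedged example of forced ("named") tool choice against a vLLM
# OpenAI-compatible server. The endpoint, model name, and get_weather
# tool are illustrative placeholders.
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
    # Forcing this specific tool; "auto" tool choice is the part still WIP.
    tool_choice={"type": "function", "function": {"name": "get_weather"}},
)
print(json.loads(resp.choices[0].message.tool_calls[0].function.arguments))
```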
Hi @simon-mo, is there any plan to support Huawei's NPU hardware?
The Q3 roadmap is published here: #5805
Is function calling available yet?
Soon, for Hermes and Mistral models in #5649. If there are other specific models you're interested in, let me know and I can add them in my follow-up PR along with Llama 3.1.
So Llama is not included initially? Thanks.
@K-Mistele Is there a PR or issue I can follow for function calling support with Llama 3.1 (70B specifically)?
There is a branch on my vLLM fork, but no PR yet, since #5649 needs to be merged before I open another PR based on it.
This document includes the features in vLLM's roadmap for Q2 2024. Please feel free to discuss and contribute to the specific features at the related RFCs/issues/PRs, and add anything else you'd like to talk about in this issue.
You can see our historical roadmaps at #2681 and #244. This roadmap contains work committed by the vLLM team from UC Berkeley, as well as the broader vLLM contributor groups, including but not limited to Anyscale, IBM, NeuralMagic, Roblox, and Oracle Cloud. You can also find help-wanted items in this roadmap. Additionally, this roadmap is shaped by you, our user community!
Themes

We categorized our roadmap into 6 broad themes:

Broad Model Support
- Help Wanted: transformers text generation model support

Excellent Hardware Coverage

Performance Optimization
- FSM to Guide (#3715, LMFormatEnforcer [Feature]: Integrate with lm-format-enforcer #3713, AICI AI Controller Interface (AICI) integration #2888); see the guided-decoding sketch after this list
- Help Wanted:

Production Level Engine
- tensorizer support #3533
- Help Wanted:

Strong OSS Product
- Help Wanted: lm-eval-harness integration (logprobs, get tokenizers)

Extensible Architecture
- torch.compile investigations
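To make the structured-output item above concrete, here is a rough sketch of guided decoding through the OpenAI-compatible server's extra_body parameters (e.g. guided_json), as exposed by vLLM's Outlines-backed integration; the endpoint, model name, and schema are placeholder assumptions, and exact parameter names may vary by version.

```python
# Hedged sketch of guided/structured decoding via the OpenAI-compatible
# server. Endpoint, model name, and schema are placeholders; the guided_json
# extra_body field assumes vLLM's Outlines-backed guided decoding support.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "year": {"type": "integer"},
    },
    "required": ["title", "year"],
}

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Name one systems paper and its year."}],
    # Constrain the output to match the JSON schema above.
    extra_body={"guided_json": schema},
)
print(resp.choices[0].message.content)
```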