[Core] Add New Run:ai Streamer Load format. #9941
Conversation
Signed-off-by: pandyamarut <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
Looks great!
However, there are a few things worth noting, which I have commented on inline.
Regarding S3: because supporting models stored in S3 requires a wider change, I suggest one of the following options:
- Support S3 weight files using extra_config (a rough sketch follows below):
  vllm serve meta/llama2 --load-format runai_streamer --loader-extra-config '{"model_weights": "s3://my/path"}'
  config.json / tokenizer.json / etc. would be downloaded from HF or locally; only the model weights would be loaded from S3.
- Separate it into 2 PRs:
  - One: load local files / HF-downloaded files using Run:ai
  - Two: support full model initialization from S3 (a bit more complicated; not sure the maintainers would want that)
IMHO I would do the second: keep this PR small and valuable, and right after that open another PR for full S3 support.
Disclaimer: I am one of the developers of the runai-model-streamer
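For concreteness, here is a rough sketch of what the extra_config option could look like inside the loader. The key name model_weights follows the example command above; everything else (the helper name, the parsing) is hypothetical:

```python
import json

def resolve_weights_location(loader_extra_config: str, default_path: str) -> str:
    # Hypothetical helper: config.json / tokenizer.json still come from HF or
    # local disk (default_path); only the weights may point at an s3:// location.
    extra = json.loads(loader_extra_config) if loader_extra_config else {}
    return extra.get("model_weights", default_path)
```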
@@ -789,6 +789,7 @@ class LoadFormat(str, enum.Enum):
     GGUF = "gguf"
     BITSANDBYTES = "bitsandbytes"
     MISTRAL = "mistral"
+    RUNAI_STREAMER = "runai_streamer"
You can add it to the help text for the --load-format flag as well:
https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py#L279
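For illustration only, mentioning the new value in the argparse help might look roughly like this; the real flag definition in vllm/engine/arg_utils.py is structured and worded differently, so treat this purely as a sketch:

```python
import argparse
from vllm.config import LoadFormat  # assumed import path for the enum shown in the diff above

parser = argparse.ArgumentParser()
parser.add_argument(
    "--load-format",
    type=str,
    default="auto",
    choices=[f.value for f in LoadFormat],
    help='The format of the model weights to load. '
         '"runai_streamer" streams the weights with the Run:ai Model Streamer.')
```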
And maybe add the package to requirements.txt?
https://github.com/vllm-project/vllm/blob/main/requirements-build.txt
    # Always set memory limit to unlimited for maximum performance
    os.environ["RUNAI_STREAMER_MEMORY_LIMIT"] = "-1"
In my opinion we should give the user the ability to decide.
We can make it unlimited by default, but still give the user the ability to tune it through model_loader_extra_config (see the sketch below).
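A minimal sketch of that idea, assuming the loader keeps the parsed extra config in a dict; the memory_limit key name is hypothetical:

```python
import os

def configure_streamer_memory_limit(extra_config: dict) -> None:
    # Unlimited (-1) by default for maximum performance, but overridable via
    # --model-loader-extra-config, e.g. '{"memory_limit": 4294967296}'.
    memory_limit = extra_config.get("memory_limit", -1)
    os.environ["RUNAI_STREAMER_MEMORY_LIMIT"] = str(memory_limit)
```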
    self.concurrency = extra_config.get("concurrency", self.concurrency)

    # S3 specific configurations
    if "s3_config" in extra_config:
S3 / compatible object stores will not work here, for the following reasons:
- We need to instruct the user to install the S3 dependency: pip install runai-model-streamer[s3]
- Starting vLLM like vllm serve s3://path/llama will require downloading files (config.json / tokenizer.json / etc.) from S3, which is not supported in vLLM today.
    streamer.stream_file(model_path)
    for name, tensor in streamer.get_tensors():
        # Stream directly to target device
        tensor = tensor.to(device, non_blocking=True)
No need for this line: the very next line (yield name, tensor) immediately yields the tensor to the model class, which in turn performs the same copy as here.
What I am saying is that what you have here is:
Storage -> CPU Memory -> GPU Memory (current line) -> GPU Memory (destination pre-allocated tensor)
which can be reduced to:
Storage -> CPU Memory -> GPU Memory (destination pre-allocated tensor)
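For reference, a sketch of the reduced loop described above, reusing the stream_file / get_tensors calls from the snippet (assumed to behave as shown there):

```python
def iterate_streamed_weights(streamer, model_path: str):
    # Yield CPU tensors straight to the model; the destination (pre-allocated
    # GPU parameter) performs the single host-to-device copy itself.
    streamer.stream_file(model_path)
    for name, tensor in streamer.get_tensors():
        yield name, tensor  # no intermediate tensor.to(device) hop
```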
Thanks for the review @omer-dayan, I will make these changes.
Is there a benchmark comparison between this and safetensors?
@coolkp Yes, there is.
You can find it here: https://pages.run.ai/hubfs/PDFs/White%20Papers/Model-Streamer-Performance-Benchmarks.pdf
BTW, we have a different PR (#10192) that adds this.
@omer-dayan, thanks for opening up another PR, #10192; I am closing this one. Thanks for your contribution, can't wait to try it with vLLM soon.
Closing as superseded by #10192.
Add RunAI Model Streamer Support for Efficient Model Loading
Description
This PR adds support for the Run:ai Model Streamer in vLLM, enabling more efficient model loading by streaming weights directly to GPU memory, which offers significant performance improvements.
Implementation Details
- RunAIStreamerLoader class inheriting from BaseModelLoader
- RUNAI_STREAMER added to the LoadFormat enum
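A possible usage sketch through the Python API; the model name is only an example, and the CLI equivalent would be vllm serve <model> --load-format runai_streamer:

```python
from vllm import LLM

# Assumes `pip install runai-model-streamer` has been run and that the LLM
# entrypoint forwards load_format to the engine arguments.
llm = LLM(model="meta-llama/Llama-2-7b-hf", load_format="runai_streamer")
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```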