[Core] Add New Run:ai Streamer Load format #9941

Closed
wants to merge 5 commits

Conversation

@pandyamarut (Contributor) commented Nov 1, 2024

Add RunAI Model Streamer Support for Efficient Model Loading

Description

This PR adds support for Run:ai Model Streamer in vLLM, enabling more efficient model loading by streaming weights directly to GPU memory. The RunAI Streamer offers significant performance improvements through:

  • Direct CPU to GPU streaming
  • Multi-threaded loading with configurable concurrency
  • Efficient memory management
  • Support for both local and S3 storage
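For illustration, a minimal usage sketch of how a user might enable this load format from the Python entrypoint once the PR is in place. This is an assumption-laden sketch: `load_format="runai_streamer"` only exists with this PR, and the `concurrency` key simply mirrors the extra_config option used later in this diff.

```python
# Sketch only: enabling the Run:ai Streamer load format from the Python API.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    load_format="runai_streamer",                   # LoadFormat value added by this PR
    model_loader_extra_config={"concurrency": 16},  # extra_config key read in this PR
)
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```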

Implementation Details

  • Added RunAIStreamerLoader class inheriting from BaseModelLoader
  • Added RUNAI_STREAMER to LoadFormat enum
  • Implements direct-to-GPU tensor streaming
  • Supports both single and sharded safetensors files
  • Handles HuggingFace model downloads
  • Configurable thread count for parallel loading
  • S3 integration with customizable settings
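A minimal sketch of the weight-iteration flow the bullets above describe. It assumes the runai-model-streamer package exposes a `SafetensorsStreamer` context manager with the `stream_file` / `get_tensors` calls that appear later in this diff; the function name and signature here are illustrative, not the PR's exact code.

```python
# Sketch only: stream single or sharded safetensors files and yield tensors.
from typing import Generator, List, Tuple

import torch
from runai_model_streamer import SafetensorsStreamer  # pip install runai-model-streamer

def runai_safetensors_weights_iterator(
    shard_files: List[str],
) -> Generator[Tuple[str, torch.Tensor], None, None]:
    """Yield (name, tensor) pairs from one or more safetensors shards."""
    with SafetensorsStreamer() as streamer:
        for shard in shard_files:
            streamer.stream_file(shard)        # multi-threaded read into CPU buffers
            yield from streamer.get_tensors()  # hand tensors to the model loader
```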

Signed-off-by: pandyamarut <[email protected]>

github-actions bot commented Nov 1, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@omer-dayan (Contributor) left a comment


Looks great!
However, there are a few things worth noting, which I have commented on.

Regarding S3: because supporting models stored in S3 requires a wider change, I suggest the following options:

  • Support S3 weight files using extra_config
    • vllm serve meta/llama2 --load-format runai_streamer --model-loader-extra-config '{"model_weights": "s3://my/path"}'
    • config.json / tokenizer.json / etc. will be downloaded from HF or loaded locally
    • Model weights will be loaded from S3
  • Separating it into two PRs
    • First - loading local files / HF-downloaded files using Run:ai
    • Second - support full model initialization from S3 (a bit more complicated; not sure if the maintainers would like to have that)

IMHO I would do the second: it keeps this PR small and valuable, and right after that we can have another PR with full support for S3.

Disclaimer: I am one of the developers of the runai-model-streamer
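For illustration, a rough sketch of the first option above: weight files come from extra_config's `model_weights`, while config.json / tokenizer.json still resolve from HF or a local path. The `model_weights` key comes from the suggested command; the helper name and signature here are hypothetical.

```python
# Sketch only: split "where the weights live" from "where configs/tokenizer live".
from typing import Optional

def resolve_weight_source(model: str, extra_config: Optional[dict]) -> str:
    """Return the location the streamer should read weight files from."""
    extra_config = extra_config or {}
    # Only the weights may be redirected to S3; configs still come from `model`.
    return extra_config.get("model_weights", model)  # e.g. "s3://my/path"

# resolve_weight_source("meta/llama2", {"model_weights": "s3://my/path"})
# -> "s3://my/path"
```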

@@ -789,6 +789,7 @@ class LoadFormat(str, enum.Enum):
GGUF = "gguf"
BITSANDBYTES = "bitsandbytes"
MISTRAL = "mistral"
RUNAI_STREAMER = "runai_streamer"
Contributor:

You can add it to the help text for the --load-format flag as well:
https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py#L279
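A standalone sketch of what that suggestion amounts to; the actual argument definition in vllm/engine/arg_utils.py is structured differently and documents every format, so only the runai_streamer mention is the point here.

```python
# Sketch only: make the new format discoverable from --help.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--load-format",
    type=str,
    default="auto",
    choices=["auto", "pt", "safetensors", "npcache", "dummy", "tensorizer",
             "bitsandbytes", "gguf", "mistral", "runai_streamer"],
    help='The format of the model weights to load. '
         '"runai_streamer" streams safetensors weights with the '
         'Run:ai Model Streamer.')

print(parser.parse_args(["--load-format", "runai_streamer"]).load_format)
```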

Contributor:

And maybe add the package to the requirements.txt?
https://github.com/vllm-project/vllm/blob/main/requirements-build.txt

)

# Always set memory limit to unlimited for maximum performance
os.environ["RUNAI_STREAMER_MEMORY_LIMIT"] = "-1"
Contributor:

In my opinion we should give the user the ability to decide.
We can make it unlimited by default, but still give the user the ability to tune it through model_loader_extra_config.
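A sketch of that suggestion, reusing the RUNAI_STREAMER_MEMORY_LIMIT environment variable from the diff above; the "memory_limit" extra_config key name is hypothetical.

```python
# Sketch only: unlimited by default, but user-tunable via model_loader_extra_config.
import os
from typing import Optional

def configure_streamer_memory(extra_config: Optional[dict]) -> None:
    extra_config = extra_config or {}
    # -1 means "unlimited" and stays the default for maximum performance.
    memory_limit = extra_config.get("memory_limit", -1)
    os.environ["RUNAI_STREAMER_MEMORY_LIMIT"] = str(memory_limit)
```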

self.concurrency = extra_config.get("concurrency", self.concurrency)

# S3 specific configurations
if "s3_config" in extra_config:
Contributor:

S3 / compatible object stores will not work, for the following reasons:

  1. We need to instruct the user to install the S3 dependency: pip install runai-model-streamer[s3]
  2. Starting vLLM like vllm serve s3://path/llama will require downloading files (config.json / tokenizer.json / etc.) from S3, which is not supported in vLLM today
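A small sketch of how the loader could fail fast on both points until full S3 support lands; the helper name and error messages are illustrative, not part of this PR.

```python
# Sketch only: guard rails for the two S3 limitations listed above.
def check_s3_support(model_or_weights_path: str) -> None:
    """Fail fast on s3:// model paths until full S3 support exists."""
    if not model_or_weights_path.startswith("s3://"):
        return
    # Reason 2: config.json / tokenizer.json cannot be fetched from S3 today,
    # so a full s3:// model path cannot be initialized.
    # Reason 1: even for weight files alone, the S3 backend needs the extra:
    #   pip install runai-model-streamer[s3]
    raise NotImplementedError(
        "Initializing a full model from s3:// is not supported yet; "
        "pass an HF model id or local path, and install "
        "runai-model-streamer[s3] if weight files will be read from S3.")
```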

streamer.stream_file(model_path)
for name, tensor in streamer.get_tensors():
# Stream directly to target device
tensor = tensor.to(device, non_blocking=True)
Contributor:

No need for this line: the very next line (yield name, tensor) will immediately yield the tensor to the model class, which in turn performs the same copy as here.

What I am saying is that what you have here is:
Storage -> CPU Memory -> GPU Memory (current line) -> GPU Memory (destination pre-allocated tensor)

Which can be reduced to:
Storage -> CPU Memory -> GPU Memory (destination pre-allocated tensor)
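A sketch of the reduced path being suggested: the iterator yields CPU tensors as-is, and the destination parameter's weight loader performs the single host-to-device copy into pre-allocated GPU memory. The function names here are illustrative.

```python
# Sketch only: one host-to-device copy instead of two.
import torch

def weights_iterator(streamer, model_path: str):
    streamer.stream_file(model_path)
    # No tensor.to(device) here: Storage -> CPU Memory -> (yield) ...
    yield from streamer.get_tensors()

def weight_loader(param: torch.Tensor, loaded_weight: torch.Tensor) -> None:
    # ... -> GPU Memory (destination pre-allocated tensor), one copy total.
    param.data.copy_(loaded_weight)
```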

Contributor (Author):

Thanks for the review, @omer-dayan. I will make these changes.

@coolkp (Contributor) commented Nov 17, 2024

Is there a benchmark comparison between this and safetensors?

@omer-dayan (Contributor) left a comment

@coolkp Yes, there is.
You can find it here: https://pages.run.ai/hubfs/PDFs/White%20Papers/Model-Streamer-Performance-Benchmarks.pdf

BTW, we have a different PR (#10192) that adds it.

@pandyamarut (Contributor, Author)
@omer-dayan, thanks for opening another PR, #10192; I am closing this one. Thanks for your contribution, can't wait to try it with vLLM soon.

@pandyamarut reopened this Nov 30, 2024
@DarkLight1337 (Member)

Closing as superseded by #10192
