[Core] Add New Run:ai Streamer Load format. #9941
Conversation
Signed-off-by: pandyamarut <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
Looks great!
However, there are a few things worth noting, which I have commented on inline.
Regarding S3: because supporting models stored in S3 requires a wider change, I suggest one of the following options:
- Support S3 weight files using extra_config (a rough sketch follows below):
  vllm serve meta/llama2 --load-format runai_streamer --loader-extra-config '{"model_weights": "s3://my/path"}'
  config.json / tokenizer.json / etc. would be downloaded from HF or locally; only the model weights would be loaded from S3.
- Separate it into 2 PRs:
  - One: load local files / HF-downloaded files using Run:ai
  - Two: support full model initialization from S3 (a bit more complicated; not sure the maintainers would want that)
IMHO I would do the second: keep this PR small and valuable, and right after that open another PR for full S3 support.
Disclaimer: I am one of the developers of the runai-model-streamer
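For concreteness, here is a rough sketch of what the extra_config option could look like inside the loader. The key name model_weights follows the example command above; everything else (the helper name, the parsing) is hypothetical:

```python
import json

def resolve_weights_location(loader_extra_config: str, default_path: str) -> str:
    # Hypothetical helper: config.json / tokenizer.json still come from HF or
    # local disk (default_path); only the weights may point at an s3:// location.
    extra = json.loads(loader_extra_config) if loader_extra_config else {}
    return extra.get("model_weights", default_path)
```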
@@ -789,6 +789,7 @@ class LoadFormat(str, enum.Enum):
     GGUF = "gguf"
     BITSANDBYTES = "bitsandbytes"
     MISTRAL = "mistral"
+    RUNAI_STREAMER = "runai_streamer"
You can add it to the help text for the --load-format flag as well:
https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py#L279
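For illustration only, mentioning the new value in the argparse help might look roughly like this; the real flag definition in vllm/engine/arg_utils.py is structured and worded differently, so treat this purely as a sketch:

```python
import argparse
from vllm.config import LoadFormat  # assumed import path for the enum shown in the diff above

parser = argparse.ArgumentParser()
parser.add_argument(
    "--load-format",
    type=str,
    default="auto",
    choices=[f.value for f in LoadFormat],
    help='The format of the model weights to load. '
         '"runai_streamer" streams the weights with the Run:ai Model Streamer.')
```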
And maybe add the package to requirements.txt?
https://github.com/vllm-project/vllm/blob/main/requirements-build.txt
    # Always set memory limit to unlimited for maximum performance
    os.environ["RUNAI_STREAMER_MEMORY_LIMIT"] = "-1"
In my opinion we should give the user the ability to decide.
We can make it unlimited by default, but still give the user the ability to tune it through model_loader_extra_config (see the sketch below).
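A minimal sketch of that idea, assuming the loader keeps the parsed extra config in a dict; the memory_limit key name is hypothetical:

```python
import os

def configure_streamer_memory_limit(extra_config: dict) -> None:
    # Unlimited (-1) by default for maximum performance, but overridable via
    # --model-loader-extra-config, e.g. '{"memory_limit": 4294967296}'.
    memory_limit = extra_config.get("memory_limit", -1)
    os.environ["RUNAI_STREAMER_MEMORY_LIMIT"] = str(memory_limit)
```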
    self.concurrency = extra_config.get("concurrency", self.concurrency)

    # S3 specific configurations
    if "s3_config" in extra_config:
S3 / compatible object stores will not work here, for the following reasons:
- We need to instruct the user to install the S3 dependency: pip install runai-model-streamer[s3]
- Starting vLLM like vllm serve s3://path/llama will require downloading files (config.json / tokenizer.json / etc.) from S3, which is not supported in vLLM today.
    streamer.stream_file(model_path)
    for name, tensor in streamer.get_tensors():
        # Stream directly to target device
        tensor = tensor.to(device, non_blocking=True)
No need for this line: the very next line (yield name, tensor) immediately yields the tensor to the model class, which in turn performs the same copy as here.
What I am saying is that what you have here is:
Storage -> CPU Memory -> GPU Memory (current line) -> GPU Memory (destination pre-allocated tensor)
which can be reduced to:
Storage -> CPU Memory -> GPU Memory (destination pre-allocated tensor)
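For reference, a sketch of the reduced loop described above, reusing the stream_file / get_tensors calls from the snippet (assumed to behave as shown there):

```python
def iterate_streamed_weights(streamer, model_path: str):
    # Yield CPU tensors straight to the model; the destination (pre-allocated
    # GPU parameter) performs the single host-to-device copy itself.
    streamer.stream_file(model_path)
    for name, tensor in streamer.get_tensors():
        yield name, tensor  # no intermediate tensor.to(device) hop
```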
Thanks for the review @omer-dayan, I will make these changes.
Is there a benchmark comparison between this and safetensors?
@coolkp Yes, there is.
You can find it here: https://pages.run.ai/hubfs/PDFs/White%20Papers/Model-Streamer-Performance-Benchmarks.pdf
BTW, we have a different PR (#10192) that adds this.
@omer-dayan, thanks for opening up another PR, #10192; I am closing this one. Thanks for your contribution, can't wait to try it with vLLM soon.
Closing as superseded by #10192.
Add RunAI Model Streamer Support for Efficient Model Loading
Description
This PR adds support for the Run:ai Model Streamer in vLLM, enabling more efficient model loading by streaming weights directly to GPU memory, which offers significant performance improvements.
Implementation Details
- RunAIStreamerLoader class inheriting from BaseModelLoader
- RUNAI_STREAMER added to the LoadFormat enum
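A possible usage sketch through the Python API; the model name is only an example, and the CLI equivalent would be vllm serve <model> --load-format runai_streamer:

```python
from vllm import LLM

# Assumes `pip install runai-model-streamer` has been run and that the LLM
# entrypoint forwards load_format to the engine arguments.
llm = LLM(model="meta-llama/Llama-2-7b-hf", load_format="runai_streamer")
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```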