[RFC]: Hardware pluggable #11162

wangxiyuan commented Dec 13, 2024

Motivation.

Currently, vLLM supports many hardware backends (cpu, cuda, hpu, neuron, openvino, rocm, tpu, xpu). Some other backends are also eager to be integrated into vLLM (ascend, IBM Spyre).

But as vLLM's list of backends keeps growing, we have encountered some problems:

  • Each backend has its own executor, worker, runner and attention implementation. This makes the code complex, and backend-specific code is scattered here and there.
  • It's not easy for the community to keep every backend working. For example, it needs full CI coverage, continuous contributions from maintainers, and so on.
  • New features are also hard to add to vLLM, since the backend matrix is complex.

To solve these problems, a good solution is to make hardware backends pluggable. There are several benefits:

  • Decoupling the backends makes the code cleaner and easier to maintain.
  • Developers can focus on generic features and are no longer troubled by the tedious backend-specific cases.
  • Each backend can evolve on its own to ensure availability and timely integration.

Proposed Change.

There are two related RFCs from before: #7131 and #9268.

#7131 (Done) added a generic plugin system to vLLM.
#9268 (In progress) tries to make the backend code modular and decoupled.

These two RFCs make hardware pluggability easier to implement.

Pluggable

From #7131, vLLM now supports an out-of-tree plugin mechanism that lets developers integrate their own code into vLLM easily.


However, which objects can be pluggable is not fully defined or supported. Currently, only models support this mechanism, based on the ModelRegistry feature. The out-of-tree code looks like:

from vllm import ModelRegistry

def register():
    from .my_opt import MyOPTForCausalLM

    if "MyOPTForCausalLM" not in ModelRegistry.get_supported_archs():
        ModelRegistry.register_model("MyOPTForCausalLM", MyOPTForCausalLM)

Back to this RFC: a hardware plugin can work in the same way.

  1. First, vLLM needs to manage a backend list and provide a register API for out-of-tree code.
  2. The out-of-tree backend plugin calls the register API to register the new backend with vLLM.
  3. Finally, users can use the new backend the same way as before (a rough sketch of the registration code follows).
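As a sketch only, and assuming a platform registry analogous to ModelRegistry (PlatformRegistry, get_supported_platforms and register_platform below are hypothetical names, not an existing vLLM API), the out-of-tree registration code could look like this:

def register():
    # Hypothetical registry mirroring ModelRegistry; the names are assumptions.
    from vllm.platforms import PlatformRegistry

    from .platform import AscendPlatform  # the plugin's Platform subclass

    if "ascend" not in PlatformRegistry.get_supported_platforms():
        PlatformRegistry.register_platform("ascend", AscendPlatform)

The register() function would be exposed through the plugin entry point added by #7131, so installing the plugin package is all that is needed for vLLM to discover the new backend.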


Usage (the same as before; the only change is installing a new plugin package):

pip install vllm
pip install vllm-ascend-plugin
# The inference will run on ascend npu automatically.
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = ["Hello, my name is",]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="facebook/opt-125m")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Refactor

So what should the Platform object look like? Currently, the backend-related objects are the executor, worker, model_runner, attention, custom ops and device_communicator. Take attention for example: the Platform class should provide an API like get_attention_cls to initialize the attention backend.
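For illustration only, here is a minimal sketch of such a Platform subclass; the base-class location and the exact method names (get_attention_cls, get_worker_cls, get_device_communicator_cls) are assumptions for this RFC, not a finalized interface:

from vllm.platforms.interface import Platform  # assumed base-class location


class AscendPlatform(Platform):
    """Hypothetical out-of-tree platform for Ascend NPUs."""

    device_name: str = "npu"

    @classmethod
    def get_attention_cls(cls) -> str:
        # Fully qualified path to the backend's attention implementation,
        # resolved lazily by vLLM core.
        return "vllm_ascend_plugin.attention.AscendAttentionBackend"

    @classmethod
    def get_worker_cls(cls) -> str:
        return "vllm_ascend_plugin.worker.AscendWorker"

    @classmethod
    def get_device_communicator_cls(cls) -> str:
        return "vllm_ascend_plugin.communicator.AscendCommunicator"

vLLM core would only hold the Platform handle and import these classes by path when needed, so no backend-specific imports remain in tree.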


Let's take a look at them one by one.

  1. Executor
    Now, both from the V1 engine perspective and as a community goal, we want the executor to be backend agnostic, so an out-of-tree backend doesn't need to implement XXXExecutor anymore. See: [core] platform agnostic executor via collective_rpc #11256

  2. Worker, ModelRunner, AttentionBackend
    All of these objects should be implemented in the out-of-tree backend. Once the XXXPlatform is registered, the XXXWorker, XXXModelRunner and XXXAttentionBackend should be registered as well.

  3. Communicator
    The communicator is in the same situation as the Worker, ModelRunner and AttentionBackend. The problem in vLLM now is that there is no base interface for the communicator, so we should implement the base class in vLLM first. See: [Distributed][refactor] Add base class for device-specific communicator #11324

  4. Custom OP
    This case is a little more complex. Currently users need to build vLLM from source with the VLLM_TARGET_DEVICE env variable to get custom ops for a different backend; the vLLM package from PyPI is CUDA based. IMO, there are a few ways to support out-of-tree custom ops:

    1. Support a _C.so replacement mechanism: once vllm-xx-backend-plugin is installed, _C.so is replaced with the target device's implementation.
    2. Similar to 1, but vLLM ships a minimal package without _C.so and loads it from the plugin package dynamically.
    3. Support dynamic loading, i.e. several .so files exist at the same time and vLLM loads the target one based on the chosen backend.

    Not sure which is better, or whether there is another way; more discussion is needed here. A rough sketch of option 3 is shown below.
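As an illustration of option 3 only (the per-device library names and the VLLM_TARGET_DEVICE lookup are assumptions, not an agreed design), a loader could pick the custom-op library at import time:

import importlib.util
import os

import torch


def load_custom_ops() -> None:
    # Several per-device libraries (e.g. vllm._C_cuda, vllm._C_npu) could be
    # shipped side by side; pick one based on the selected backend.
    device = os.environ.get("VLLM_TARGET_DEVICE", "cuda")
    spec = importlib.util.find_spec(f"vllm._C_{device}")
    if spec is None or spec.origin is None:
        raise RuntimeError(f"No custom-op library found for device {device!r}")
    # Register the ops with PyTorch's dispatcher.
    torch.ops.load_library(spec.origin)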

Overall, after the refactor, what an out-of-tree plugin needs to do is implement its own Worker, ModelRunner, AttentionBackend and Communicator, provide a Platform that ties these objects together, and then register it with vLLM.
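On the packaging side, a plugin could reuse the entry-point mechanism introduced by #7131. The sketch below assumes the vllm.general_plugins entry-point group from #7131 and a hypothetical vllm-ascend-plugin package; a dedicated entry-point group for platform plugins could also be introduced:

# setup.py of the hypothetical vllm-ascend-plugin package.
from setuptools import setup

setup(
    name="vllm-ascend-plugin",
    version="0.1.0",
    packages=["vllm_ascend_plugin"],
    entry_points={
        # Entry-point group added by #7131; vLLM calls the target at startup.
        "vllm.general_plugins": [
            "register_ascend_platform = vllm_ascend_plugin:register",
        ],
    },
)

With this in place, pip install vllm-ascend-plugin is the only user-visible step; vLLM discovers the entry point at startup, and the Worker, ModelRunner, AttentionBackend and Communicator all come from the plugin package through its Platform.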

Feedback Period.

Ongoing; feedback is welcome at any time.

CC List.

@simon-mo @youkaichao @DarkLight1337 @tlrmchlsmth and other maintainers who are interested.

Any Other Things.

Once backend plugins are supported, some other things need to be considered as well. For example, how do we make sure a backend runs well? How do we let users know the hardware support matrix? Is CI/CD a mandatory requirement? How do we coordinate with releases? And so on. Here I'd like to start with some topics.

CI/CD

vLLM now uses Buildkite to run unit tests and functional tests. I notice that Buildkite supports self-hosted agents, which makes it possible to integrate different hardware for testing. Hardware contributors can donate hardware resources to the community for CI testing.

V1 Engine

V1 now has its own Executor, Worker, Model Runner and Attention. The backend plugin feature needs to be compatible with V1. I'm not sure about the V1 roadmap. If V1 will become the default engine soon, the better way is to do the refactoring work on V1 directly. Otherwise, working on V0 and then migrating to V1 should be fine.

Plugin location

Once backend plugins are supported, the repo can be located anywhere; TBH, the vLLM community may not care. But for the long-term health of the vLLM ecosystem, it is best to have a specification for backend access and maintenance. A backend can be maintained in vllm-project, but there are some necessary requirements:

  • Hardware CI/CD is required.
  • Backend developers must ensure continuous contribution.
  • Keep the release cycle in sync with vLLM so that the backend can always be used.

We can call this kind of backend officially supported. If the community wants to move in-tree hardware code out of tree, or a new backend is added, it can follow this rule.
