[V1] TP Ray executor #11107
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can comment /ready on the PR. 🚀
Force-pushed from 5c9d11a to feba15e.
Overall LGTM
Force-pushed from f33563f to 3abcb67.
LGTM. cc @WoosukKwon @njhill @tlrmchlsmth
Force-pushed from 333aca4 to 7807eb1.
Please merge from main to resolve the CI failure.
@tlrmchlsmth @youkaichao Can you please take a look?
Force-pushed from 231233f to abcd185.
We should wait for #11256. I don't want to duplicate the code; large parts of it can be shared.
Force-pushed from abcd185 to c6b97d2.
I don't think we should be blocked by #11256 for the following reasons:
One question: why does execute_model use _compiled_ray_dag, while the other IPC calls use _run_workers? Why not use _compiled_ray_dag everywhere?
@@ -130,7 +130,7 @@ def test_models_distributed(
     # Import VLLM_USE_V1 dynamically to handle patching
     from vllm.envs import VLLM_USE_V1
     if VLLM_USE_V1 and distributed_executor_backend != "mp":
-        pytest.skip(f"Skip {distributed_executor_backend} for V1")
+        os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
Why disable this? I thought this should be able to work with the Ray executor
It does not work with the Ray executor. But I think it makes sense that this option is for MP only, since Ray is a different executor.
I don't think this should have anything to do with the executor that's used. #9826 is the PR that introduced it and has a nice diagram of what's going on.
What happens if we try to run the Ray executor with the AsyncLLM?
When using VLLM_ENABLE_V1_MULTIPROCESSING, it actually caused Ray worker initialization to hang, because of an uninitialized Ray environment in the new process.
We should fix this, since AsyncLLM allows detokenization to be overlapped with the forward pass, and is the default in V1 now.
We probably want to fix this by initializing the Ray environment in the entry point for process P1 that owns the executor:
Lines 239 to 276 in 48edab8:

@staticmethod
def run_engine_core(*args, **kwargs):
    """Launch EngineCore busy loop in background process."""

    # Signal handler used for graceful termination.
    # SystemExit exception is only raised once to allow this and worker
    # processes to terminate without error
    shutdown_requested = False

    # Ensure we can serialize transformer config after spawning
    maybe_register_config_serialize_by_value()

    def signal_handler(signum, frame):
        nonlocal shutdown_requested
        if not shutdown_requested:
            shutdown_requested = True
            raise SystemExit()

    # Either SIGTERM or SIGINT will terminate the engine_core
    signal.signal(signal.SIGTERM, signal_handler)
    signal.signal(signal.SIGINT, signal_handler)

    engine_core = None
    try:
        engine_core = EngineCoreProc(*args, **kwargs)
        engine_core.run_busy_loop()
    except SystemExit:
        logger.debug("EngineCore interrupted.")
    except BaseException as e:
        logger.exception(e)
        raise e
    finally:
        if engine_core is not None:
            engine_core.shutdown()
        engine_core = None
I don't think this should be a blocker for this PR but could you look into fixing this in a follow-up soon?
Thanks, sounds good. I will follow up soon.
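For illustration, here is a minimal sketch of what such a follow-up could look like, assuming the executor backend is reachable from the config passed into run_engine_core; the helper name and plumbing are hypothetical and not part of this PR:

```python
import ray


def _maybe_init_ray(vllm_config) -> None:
    # Hypothetical helper: give the spawned EngineCore process a Ray context
    # before the Ray executor constructs its workers, so worker initialization
    # does not hang under VLLM_ENABLE_V1_MULTIPROCESSING.
    backend = vllm_config.parallel_config.distributed_executor_backend
    if backend == "ray" and not ray.is_initialized():
        # Attach to the cluster started by the parent process rather than
        # creating a new, empty local cluster.
        ray.init(address="auto")
```

Calling something like this near the top of run_engine_core, before EngineCoreProc is constructed, is one way the Ray executor could work with AsyncLLM's background process.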
@ruisearch42 can you explain the difference between the
Good question.
That makes sense, thanks. To clarify: can the ray compiled graph only be used for
Hi, yeah, the V1 executor only supports SPMD mode, and the non-SPMD code path is cleaned up. Also, in the future, the PP implementation will not use a virtual engine, so the structuring/interface of the executor will be different.
If the data size is large or requires GPU-GPU data transfer, then perhaps we can optimize with Compiled Graphs. But AFAIK that is not the case, so using the normal Ray Core API should be good.
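For context, here is a toy sketch (not vLLM code) of the two call styles being discussed, using the public Ray APIs; the Worker actor and its execute_model method are illustrative stand-ins:

```python
import ray
from ray.dag import InputNode, MultiOutputNode


@ray.remote
class Worker:  # stand-in for a TP worker
    def execute_model(self, req):
        return f"output for {req!r}"


workers = [Worker.remote() for _ in range(2)]

# Plain Ray core API (the _run_workers style): one RPC per worker per call,
# fine for small or infrequent control-plane messages.
print(ray.get([w.execute_model.remote("req") for w in workers]))

# Ray Compiled Graph (the _compiled_ray_dag style): build and compile the DAG
# once, then reuse it for every step, which reduces per-call overhead on the
# hot execute_model path.
with InputNode() as inp:
    dag = MultiOutputNode([w.execute_model.bind(inp) for w in workers])
compiled_dag = dag.experimental_compile()
print(ray.get(compiled_dag.execute("req")))
```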
Nice, looks good to me!
Hey @youkaichao, we plan to merge the PR soon, could you please take a look?
Force-pushed from b37ab39 to 33e5599.
Support Ray Compiled Graphs based executor in V1
Perf results on L4 GPU:
(Prefix caching was turned off for both because there is an issue in the main branch that is being fixed right now.)
Perf is at parity; Ray is about 0.7% slower.
VLLM_USE_V1=1 python3 benchmarks/benchmark_latency.py --model meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 2 --num-iters-warmup 5 --num-iters 20 --batch-size 8 --input-len 128 --output-len 256 --max-model-len 2048 --no-enable-prefix-caching --distributed-executor-backend ray
Avg latency: 9.42691584636923 seconds
10% percentile latency: 9.391981961019336 seconds
25% percentile latency: 9.410091992700472 seconds
50% percentile latency: 9.425441902829334 seconds
75% percentile latency: 9.445964987040497 seconds
90% percentile latency: 9.456209814222529 seconds
99% percentile latency: 9.459444067198783 seconds
VLLM_USE_V1=1 python3 benchmarks/benchmark_latency.py --model meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 2 --num-iters-warmup 5 --num-iters 20 --batch-size 8 --input-len 128 --output-len 256 --max-model-len 2048 --no-enable-prefix-caching
Avg latency: 9.356813629437237 seconds
10% percentile latency: 9.326405790261925 seconds
25% percentile latency: 9.342471541371197 seconds
50% percentile latency: 9.360302247805521 seconds
75% percentile latency: 9.37355133111123 seconds
90% percentile latency: 9.380363538581879 seconds
99% percentile latency: 9.386704193898476 seconds
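For reference, here is a minimal offline-inference sketch of the same setup through the Python API, mirroring the benchmark flags above; treat it as an assumption-laden example rather than part of this PR:

```python
import os

os.environ["VLLM_USE_V1"] = "1"  # V1 engine, as in the commands above

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,
    max_model_len=2048,
    enable_prefix_caching=False,
    # "ray" selects the executor added in this PR instead of the
    # multiprocessing backend.
    distributed_executor_backend="ray",
)

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(max_tokens=32),
)
print(outputs[0].outputs[0].text)
```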