[V1] TP Ray executor #11107
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can comment /ready on the PR. 🚀
Force-pushed from 5c9d11a to feba15e.
Overall LGTM
Force-pushed from f33563f to 3abcb67.
LGTM. cc @WoosukKwon @njhill @tlrmchlsmth
Force-pushed from 333aca4 to 7807eb1.
Please merge from main to resolve the CI failure.
@tlrmchlsmth @youkaichao Can you please take a look?
Force-pushed from 231233f to abcd185.
We should wait for #11256. I don't want to duplicate the code; large parts of it can be shared.
Force-pushed from abcd185 to c6b97d2.
I don't think we should be blocked by #11256 for the following reasons:
One question: why does execute_model use _compiled_ray_dag, while the other IPC calls use _run_workers? Why not use _compiled_ray_dag everywhere?
@@ -130,7 +130,7 @@ def test_models_distributed(
     # Import VLLM_USE_V1 dynamically to handle patching
     from vllm.envs import VLLM_USE_V1
     if VLLM_USE_V1 and distributed_executor_backend != "mp":
-        pytest.skip(f"Skip {distributed_executor_backend} for V1")
+        os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
Why disable this? I thought this should be able to work with the Ray executor
It does not work with the Ray executor. But I think it makes sense that this option is for MP only, since Ray is a different executor.
I don't think this should have anything to do with the executor that's used. #9826 is the PR that introduced it and has a nice diagram of what's going on.
What happens if we try to run the Ray executor with the AsyncLLM?
When using VLLM_ENABLE_V1_MULTIPROCESSING, it actually caused Ray worker initialization to hang, because of an uninitialized Ray environment in the new process.
We should fix this, since AsyncLLM allows detokenization to be overlapped with the forward pass, and is the default in V1 now.
We probably want to fix this by initializing the Ray environment in the entry point for process P1 that owns the executor:
Lines 239 to 276 in 48edab8:

@staticmethod
def run_engine_core(*args, **kwargs):
    """Launch EngineCore busy loop in background process."""

    # Signal handler used for graceful termination.
    # SystemExit exception is only raised once to allow this and worker
    # processes to terminate without error
    shutdown_requested = False

    # Ensure we can serialize transformer config after spawning
    maybe_register_config_serialize_by_value()

    def signal_handler(signum, frame):
        nonlocal shutdown_requested
        if not shutdown_requested:
            shutdown_requested = True
            raise SystemExit()

    # Either SIGTERM or SIGINT will terminate the engine_core
    signal.signal(signal.SIGTERM, signal_handler)
    signal.signal(signal.SIGINT, signal_handler)

    engine_core = None
    try:
        engine_core = EngineCoreProc(*args, **kwargs)
        engine_core.run_busy_loop()
    except SystemExit:
        logger.debug("EngineCore interrupted.")
    except BaseException as e:
        logger.exception(e)
        raise e
    finally:
        if engine_core is not None:
            engine_core.shutdown()
        engine_core = None
I don't think this should be a blocker for this PR but could you look into fixing this in a follow-up soon?
Thanks, sounds good. I will follow up soon.
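For illustration, here is a minimal sketch of what such a follow-up could look like, assuming the executor backend is reachable from the config passed into run_engine_core; the helper name and plumbing are hypothetical and not part of this PR:

```python
import ray


def _maybe_init_ray(vllm_config) -> None:
    # Hypothetical helper: give the spawned EngineCore process a Ray context
    # before the Ray executor constructs its workers, so worker initialization
    # does not hang under VLLM_ENABLE_V1_MULTIPROCESSING.
    backend = vllm_config.parallel_config.distributed_executor_backend
    if backend == "ray" and not ray.is_initialized():
        # Attach to the cluster started by the parent process rather than
        # creating a new, empty local cluster.
        ray.init(address="auto")
```

Calling something like this near the top of run_engine_core, before EngineCoreProc is constructed, is one way the Ray executor could work with AsyncLLM's background process.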
@ruisearch42 can you explain the difference between the
Good question.
That makes sense, thanks. To clarify: can the ray compiled graph only be used for
Hi, yeah, the V1 executor only supports SPMD mode, and the non-SPMD code path is cleaned up. Also, in the future, the PP implementation will not use a virtual engine, so the structuring/interface of the executor will be different.
If the data size is large or requires GPU-GPU data transfer, then perhaps we can optimize with Compiled Graphs. But AFAIK that is not the case, so using the normal Ray Core API should be good.
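For context, here is a toy sketch (not vLLM code) of the two call styles being discussed, using the public Ray APIs; the Worker actor and its execute_model method are illustrative stand-ins:

```python
import ray
from ray.dag import InputNode, MultiOutputNode


@ray.remote
class Worker:  # stand-in for a TP worker
    def execute_model(self, req):
        return f"output for {req!r}"


workers = [Worker.remote() for _ in range(2)]

# Plain Ray core API (the _run_workers style): one RPC per worker per call,
# fine for small or infrequent control-plane messages.
print(ray.get([w.execute_model.remote("req") for w in workers]))

# Ray Compiled Graph (the _compiled_ray_dag style): build and compile the DAG
# once, then reuse it for every step, which reduces per-call overhead on the
# hot execute_model path.
with InputNode() as inp:
    dag = MultiOutputNode([w.execute_model.bind(inp) for w in workers])
compiled_dag = dag.experimental_compile()
print(ray.get(compiled_dag.execute("req")))
```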
Nice, looks good to me!
Hey @youkaichao, we plan to merge the PR soon, could you please take a look?
Force-pushed from b37ab39 to 33e5599.
Support Ray Compiled Graphs based executor in V1
Perf results on L4 GPU:
(Prefix caching was turned off for both because there is an issue in the main branch that is being fixed right now.)
Perf is at parity; Ray is about 0.7% slower.
VLLM_USE_V1=1 python3 benchmarks/benchmark_latency.py --model meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 2 --num-iters-warmup 5 --num-iters 20 --batch-size 8 --input-len 128 --output-len 256 --max-model-len 2048 --no-enable-prefix-caching --distributed-executor-backend ray
Avg latency: 9.42691584636923 seconds
10% percentile latency: 9.391981961019336 seconds
25% percentile latency: 9.410091992700472 seconds
50% percentile latency: 9.425441902829334 seconds
75% percentile latency: 9.445964987040497 seconds
90% percentile latency: 9.456209814222529 seconds
99% percentile latency: 9.459444067198783 seconds
VLLM_USE_V1=1 python3 benchmarks/benchmark_latency.py --model meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 2 --num-iters-warmup 5 --num-iters 20 --batch-size 8 --input-len 128 --output-len 256 --max-model-len 2048 --no-enable-prefix-caching
Avg latency: 9.356813629437237 seconds
10% percentile latency: 9.326405790261925 seconds
25% percentile latency: 9.342471541371197 seconds
50% percentile latency: 9.360302247805521 seconds
75% percentile latency: 9.37355133111123 seconds
90% percentile latency: 9.380363538581879 seconds
99% percentile latency: 9.386704193898476 seconds
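For reference, here is a minimal offline-inference sketch of the same setup through the Python API, mirroring the benchmark flags above; treat it as an assumption-laden example rather than part of this PR:

```python
import os

os.environ["VLLM_USE_V1"] = "1"  # V1 engine, as in the commands above

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,
    max_model_len=2048,
    enable_prefix_caching=False,
    # "ray" selects the executor added in this PR instead of the
    # multiprocessing backend.
    distributed_executor_backend="ray",
)

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(max_tokens=32),
)
print(outputs[0].outputs[0].text)
```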