
[V1] TP Ray executor #11107

Merged: 19 commits merged into vllm-project:main on Dec 23, 2024
Conversation

@ruisearch42 (Collaborator) commented Dec 11, 2024

Support a Ray Compiled Graphs based executor in V1

Perf results on L4 GPU:
(Prefix caching was turned off for both runs because there is an issue in the main branch that is currently being fixed.)
Perf is at parity; Ray is about 0.7% slower.

VLLM_USE_V1=1 python3 benchmarks/benchmark_latency.py --model meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 2 --num-iters-warmup 5 --num-iters 20 --batch-size 8 --input-len 128 --output-len 256 --max-model-len 2048 --no-enable-prefix-caching --distributed-executor-backend ray
Avg latency: 9.42691584636923 seconds
10% percentile latency: 9.391981961019336 seconds
25% percentile latency: 9.410091992700472 seconds
50% percentile latency: 9.425441902829334 seconds
75% percentile latency: 9.445964987040497 seconds
90% percentile latency: 9.456209814222529 seconds
99% percentile latency: 9.459444067198783 seconds

VLLM_USE_V1=1 python3 benchmarks/benchmark_latency.py --model meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 2 --num-iters-warmup 5 --num-iters 20 --batch-size 8 --input-len 128 --output-len 256 --max-model-len 2048 --no-enable-prefix-caching
Avg latency: 9.356813629437237 seconds
10% percentile latency: 9.326405790261925 seconds
25% percentile latency: 9.342471541371197 seconds
50% percentile latency: 9.360302247805521 seconds
75% percentile latency: 9.37355133111123 seconds
90% percentile latency: 9.380363538581879 seconds
99% percentile latency: 9.386704193898476 seconds


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge

🚀

@ruisearch42 (Collaborator, Author):

/ready

@ruisearch42 changed the title from "[WIP][V1] TP Ray executor" to "[V1] TP Ray executor" on Dec 13, 2024
@ruisearch42 marked this pull request as ready for review on December 13, 2024 16:04
@comaniac (Collaborator) left a comment

Overall LGTM

(Resolved review threads on vllm/v1/executor/ray_utils.py and vllm/v1/executor/ray_executor.py)
@comaniac (Collaborator) left a comment

@comaniac added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Dec 16, 2024
@ruisearch42 force-pushed the v1_tp_raycg branch 3 times, most recently from 333aca4 to 7807eb1, on December 17, 2024 16:22
@DarkLight1337 (Member):

Please merge from main to resolve the CI failure.

@WoosukKwon (Collaborator):

@tlrmchlsmth @youkaichao Can you please take a look?

@youkaichao (Member):

We should wait for #11256. I don't want to duplicate the code; large parts of it can be shared.

@comaniac (Collaborator) commented Dec 18, 2024

We should wait for #11256. I don't want to duplicate the code; large parts of it can be shared.

I don't think we should be blocked by #11256, for the following reasons:

  1. [core] platform agnostic executor via collective_rpc #11256 touches a larger scope and does not seem ready yet, while this PR is limited in scope and is ready. Even if there is some temporary code duplication, we shouldn't let one PR block a ready-to-merge one.
  2. This PR is for v1. Even if it could reuse code from v0 after [core] platform agnostic executor via collective_rpc #11256 is merged, I don't see any problem with merging this PR first. On the other hand, please feel free to remove the duplicated code in [core] platform agnostic executor via collective_rpc #11256 after this PR.

@tlrmchlsmth (Collaborator) left a comment

One question: Why does execute_model use _compiled_ray_dag, while the other IPC calls use _run_workers? Why not use _compiled_ray_dag everywhere?

@@ -130,7 +130,7 @@ def test_models_distributed(
     # Import VLLM_USE_V1 dynamically to handle patching
     from vllm.envs import VLLM_USE_V1
     if VLLM_USE_V1 and distributed_executor_backend != "mp":
-        pytest.skip(f"Skip {distributed_executor_backend} for V1")
+        os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
Collaborator:

Why disable this? I thought this should be able to work with the Ray executor

Collaborator (Author):

It does not work with the Ray executor. But I think it makes sense that this option is for MP only, since Ray is a different executor.

Collaborator:

I don't think this should have anything to do with the executor that's used. #9826 is the PR that introduced it and has a nice diagram of what's going on.

What happens if we try to run the Ray executor with the AsyncLLM?

Collaborator (Author):

When VLLM_ENABLE_V1_MULTIPROCESSING was enabled, Ray worker initialization actually hung, because of an uninitialized Ray environment in the new process.

@tlrmchlsmth (Collaborator) commented Dec 20, 2024

We should fix this, since AsyncLLM allows detokenization to be overlapped with the forward pass, and is the default in V1 now.

We probably want to fix this by initializing the ray environment in the entry point for process P1 that owns the executor:

vllm/v1/engine/core.py (lines 239 to 276 in 48edab8):

    @staticmethod
    def run_engine_core(*args, **kwargs):
        """Launch EngineCore busy loop in background process."""

        # Signal handler used for graceful termination.
        # SystemExit exception is only raised once to allow this and worker
        # processes to terminate without error
        shutdown_requested = False

        # Ensure we can serialize transformer config after spawning
        maybe_register_config_serialize_by_value()

        def signal_handler(signum, frame):
            nonlocal shutdown_requested
            if not shutdown_requested:
                shutdown_requested = True
                raise SystemExit()

        # Either SIGTERM or SIGINT will terminate the engine_core
        signal.signal(signal.SIGTERM, signal_handler)
        signal.signal(signal.SIGINT, signal_handler)

        engine_core = None
        try:
            engine_core = EngineCoreProc(*args, **kwargs)
            engine_core.run_busy_loop()

        except SystemExit:
            logger.debug("EngineCore interrupted.")

        except BaseException as e:
            logger.exception(e)
            raise e

        finally:
            if engine_core is not None:
                engine_core.shutdown()
            engine_core = None

I don't think this should be a blocker for this PR but could you look into fixing this in a follow-up soon?
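
For illustration only, a minimal sketch of the kind of follow-up fix being suggested, assuming the spawned EngineCore process merely needs to attach to the Ray cluster started by the parent process (the helper name and its exact placement inside run_engine_core are hypothetical, not part of this PR):

import ray

def ensure_ray_initialized() -> None:
    """Hypothetical helper to call near the top of run_engine_core so the
    spawned EngineCore process attaches to the Ray cluster started by the
    parent process, instead of hanging on Ray worker initialization."""
    if not ray.is_initialized():
        # address="auto" connects to an already-running cluster on this node.
        ray.init(address="auto", ignore_reinit_error=True)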

Collaborator (Author):

Thanks, sounds good. I will follow up soon.

(Resolved review threads on vllm/v1/executor/ray_executor.py)
@youkaichao (Member):

@ruisearch42 can you explain the difference between vllm/v1/executor/ray_executor.py and vllm/executor/ray_gpu_executor.py, for the sake of review?

@ruisearch42 (Collaborator, Author):

One question: Why does execute_model use _compiled_ray_dag, while the other IPC calls use _run_workers? Why not use _compiled_ray_dag everywhere?

Good question. execute_model is the method that is called repeatedly, so we set up a Ray Compiled Graph for repeated execution of the same functionality, which avoids recreating the underlying Ray objects on each execution. _run_workers is only used at initialization time (although it is called a few times), so there is not much benefit to using the Ray Compiled Graph API there, and we can simply use the Ray Core API.
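
For illustration only (not the code in this PR), a minimal sketch of the two patterns being contrasted, assuming Ray's ray.dag InputNode / MultiOutputNode compiled-graph API; the TPWorker class and its method names are placeholders:

import ray
from ray.dag import InputNode, MultiOutputNode

@ray.remote
class TPWorker:  # placeholder stand-in for the Ray-wrapped vLLM worker
    def init_device(self):
        # One-time setup work; called only during initialization.
        return "initialized"

    def execute_model_spmd(self, scheduler_output):
        # Would run the forward pass for this TP rank.
        return f"rank output for {scheduler_output}"

workers = [TPWorker.remote() for _ in range(2)]

# Initialization-time calls: plain Ray Core is enough (the _run_workers
# style), since these methods only run a handful of times.
ray.get([worker.init_device.remote() for worker in workers])

# Hot path: build and compile the graph once, then reuse it on every step,
# so the underlying Ray objects are not recreated per execute_model call.
with InputNode() as scheduler_output:
    dag = MultiOutputNode([
        worker.execute_model_spmd.bind(scheduler_output) for worker in workers
    ])
compiled_dag = dag.experimental_compile()

# Each step only feeds a new input through the same compiled graph
# (depending on the Ray version, the result may also expose .get()).
outputs = ray.get(compiled_dag.execute("scheduler output for step 0"))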

@tlrmchlsmth (Collaborator):

One question: Why does execute_model use _compiled_ray_dag, while the other IPC calls use _run_workers? Why not use _compiled_ray_dag everywhere?

Good question. execute_model is the method that is called repeatedly, so we set up a Ray Compiled Graph for repeated execution of the same functionality, which avoids recreating the underlying Ray objects on each execution. _run_workers is only used at initialization time (although it is called a few times), so there is not much benefit to using the Ray Compiled Graph API there, and we can simply use the Ray Core API.

That makes sense, thanks. To clarify: can the Ray compiled graph only be used for execute_model? I.e., when we add other hot-path functions, will they need their own Ray compiled graphs?

@ruisearch42 (Collaborator, Author):

@ruisearch42 can you explain the difference between vllm/v1/executor/ray_executor.py and vllm/executor/ray_gpu_executor.py, for the sake of review?

Hi, yeah: the V1 executor only supports SPMD mode, and the non-SPMD code path is cleaned up. Also, in the future, the PP implementation will not use a virtual engine, so the structure/interface of the executor will be different.

@ruisearch42 (Collaborator, Author):

Yes, the Ray compiled graph is only used for distributed model execution (worker.execute_model) so far, and there is no plan to include other parts. So other functionality should be orthogonal.

What about other "hot path" operations like add/remove lora?

If the data size is large or requires GPU-GPU data transfer, then perhaps we can optimize with Compiled Graphs. But AFAIK that is not the case here, so using the normal Ray Core API should be good.
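
As a hedged illustration of that plain Ray Core pattern for small control-plane calls (reusing the placeholder workers list from the earlier sketch; the helper and the add_lora method name below are illustrative only, not the executor's actual API):

import ray

# Fan the same small call out to every worker and wait for all of them; no
# compiled graph is needed since the payload is tiny and there is no GPU-GPU
# data transfer. `workers` is the placeholder actor list from the sketch above.
def run_on_all_workers(workers, method_name, *args, **kwargs):
    refs = [
        getattr(worker, method_name).remote(*args, **kwargs)
        for worker in workers
    ]
    return ray.get(refs)

# e.g. run_on_all_workers(workers, "add_lora", lora_request)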

(Additional resolved review threads on vllm/v1/executor/ray_utils.py and vllm/v1/executor/ray_executor.py)
@tlrmchlsmth (Collaborator) left a comment

Nice, looks good to me!

@ruisearch42 (Collaborator, Author) commented Dec 20, 2024

Hey @youkaichao, we plan to merge this PR soon; could you please take a look?

Signed-off-by: Rui Qiao <[email protected]> (on each of the 19 commits)
@comaniac enabled auto-merge (squash) on December 23, 2024 19:11
@comaniac merged commit a491d6f into vllm-project:main on Dec 23, 2024
51 checks passed
ayylemao pushed a commit to ayylemao/vllm that referenced this pull request Dec 24, 2024
BKitor pushed a commit to BKitor/vllm that referenced this pull request Dec 30, 2024
Labels: ready (ONLY add when PR is ready to merge/full CI is needed)
Projects: none yet
6 participants