[Bugfix] Fix ray instance detect issue #9439

yma11 · 2024-10-17T00:49:45Z

Fix Ray instance detection so that it first tries to connect to the most recently launched instance and, if none is found, creates a new one with num_gpus=parallel_config.world_size.

github-actions · 2024-10-17T00:49:58Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

vllm/executor/ray_utils.py

yma11 · 2024-10-21T02:28:11Z

@youkaichao can you help review this change? Thanks.

comaniac

Overall LGTM. One question: should we also change the other branch?

comaniac · 2024-10-22T06:08:03Z

vllm/executor/ray_utils.py

-                 num_gpus=parallel_config.world_size)
+        # Try to connect existing ray instance and create a new one if not found
+        try:
+            ray.init('auto')


Use double quotes for consistency.

yma11 · 2024-10-22T07:09:23Z

Overall LGTM. One question: should we also change the other branch?

Agreed. I unified the init logic so that it applies to all platforms. Please take another look. Thanks.

comaniac · 2024-10-22T07:25:37Z

The updated code doesn't seem equivalent to the original one? Currently, for non-hip and non-xpu cases, we don't init Ray with all GPUs.

yma11 · 2024-10-22T09:40:14Z

The updated code doesn't seem equivalent to the original one? Currently, for non-hip and non-xpu cases, we don't init Ray with all GPUs.

For non-hip and non-xpu cases, it will eventually create a local instance with the detected GPUs if it fails to connect to an existing cluster, based on the explanation.
Actually, I intended to fix the error "When connecting to an existing cluster, num_cpus and num_gpus must not be provided." in the xpu case. It happens when a valid ray_address and num_gpus are both given. I want to respect both values, but it seems the conflict can't be resolved. Maybe it's more reasonable to call ray.init(address=ray_address, ignore_reinit_error=True) for all platforms. num_gpus=parallel_config.world_size is expected to take effect only when a new local instance is created, but it's not very meaningful in that case. What do you think?
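
To make the conflict described above concrete, here is a minimal sketch (not the PR's code) of a helper that never passes num_gpus together with an explicit address. The name build_ray_init_kwargs is hypothetical, and the helper only assembles keyword arguments, so it runs without Ray installed.

```python
from typing import Any, Dict, Optional


def build_ray_init_kwargs(ray_address: Optional[str],
                          world_size: int) -> Dict[str, Any]:
    """Hypothetical helper: choose keyword arguments for ray.init().

    An explicit address and num_gpus are never combined, because that
    combination is what triggers the error quoted in the comment above.
    """
    kwargs: Dict[str, Any] = {"ignore_reinit_error": True}
    if ray_address is not None:
        # Connecting to a named cluster: resources are owned by that cluster.
        kwargs["address"] = ray_address
    else:
        # No address given: request world_size GPUs for a new local instance.
        # Caveat (assumption): Ray may still attach to an existing cluster,
        # for example via the RAY_ADDRESS environment variable, in which
        # case num_gpus would conflict again.
        kwargs["num_gpus"] = world_size
    return kwargs
```

With Ray available, the result would be passed as ray.init(**build_ray_init_kwargs(ray_address, parallel_config.world_size)). The caveat in the comment is one reason a plain "address is None" check may not be enough on its own.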

comaniac · 2024-10-22T15:29:55Z

Sounds reasonable to me, but cc @rkooo567 @richardliaw to double check.

youkaichao · 2024-10-24T03:10:43Z

@yma11 please resolve the conflict

yma11 · 2024-10-24T03:47:08Z

@youkaichao Thanks for the reminder. @comaniac I switched the fix back to changing only the hip and xpu code path, since there is a possible issue on these platforms. When no Ray cluster exists and a new instance is launched, Ray may not detect the correct number of GPUs, which leaves no GPU resources available for Ray worker allocation. So we need to pass num_gpus as an argument in this case. That's why this specific code path exists here. FYI, and thanks for your review.
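
The connect-first, launch-on-failure flow described in this comment can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged code: init_ray_with_fallback and its return labels are hypothetical, init_fn stands in for ray.init so the sketch runs without Ray installed, and it assumes that a failed "auto" connection surfaces as a ConnectionError.

```python
from typing import Any, Callable, Optional


def init_ray_with_fallback(init_fn: Callable[..., Any],
                           ray_address: Optional[str],
                           world_size: int) -> str:
    """Try an existing Ray instance first; otherwise start a new one.

    init_fn is a stand-in for ray.init (hypothetical injection point).
    Returns "connected" or "launched"; the labels are illustrative only.
    """
    try:
        # "auto" asks Ray to attach to an already running instance.
        init_fn("auto")
        return "connected"
    except ConnectionError:
        # Assumption: no running instance is reported as ConnectionError.
        # Launch a new instance and state the GPU count explicitly, since
        # auto-detection may report no GPUs on hip/xpu platforms.
        init_fn(address=ray_address,
                ignore_reinit_error=True,
                num_gpus=world_size)
        return "launched"
```

With Ray installed, the call would look like init_ray_with_fallback(ray.init, ray_address, parallel_config.world_size). Injecting the init function keeps the control flow testable with a fake that raises ConnectionError for "auto".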

Signed-off-by: yan ma <[email protected]>

youkaichao · 2024-10-27T23:33:26Z

@DarkLight1337 could you check whether the error is related to this PR, or whether it already occurs on the main branch?

DarkLight1337 · 2024-10-28T03:03:12Z

It is a failure from the main branch that has since been fixed. You can force merge this.

russellb suggested changes Oct 17, 2024

yma11 force-pushed the ray-fix branch from 8e65f36 to 2147ca5 Compare October 18, 2024 07:21

yma11 force-pushed the ray-fix branch from 2147ca5 to 13f04c4 Compare October 22, 2024 06:03

comaniac reviewed Oct 22, 2024

youkaichao assigned rkooo567 Oct 24, 2024

rkooo567 approved these changes Oct 24, 2024

yma11 force-pushed the ray-fix branch from 74b9123 to a57b4bc Compare October 24, 2024 03:39

yma11 force-pushed the ray-fix branch 2 times, most recently from bc652ab to af18da6 Compare October 24, 2024 11:55

Fix ray instance detect issue

af18da6

Signed-off-by: yan ma <[email protected]>

comaniac added the ready label (ONLY add when PR is ready to merge/full CI is needed) Oct 24, 2024

comaniac enabled auto-merge (squash) October 24, 2024 15:27

Merge branch 'main' into ray-fix

55e6b39

comaniac merged commit 2adb440 into vllm-project:main Oct 28, 2024
58 checks passed

HollowMan6 mentioned this pull request Oct 28, 2024

[Bugfix] No num_gpus for ROCm and XPU when connecting to a ray cluster #8781

Closed

cooleel pushed a commit to cooleel/vllm that referenced this pull request Oct 28, 2024

[Bugfix] Fix ray instance detect issue (vllm-project#9439)

c6e592d

Signed-off-by: Shanshan Wang <[email protected]>

cooleel pushed a commit to cooleel/vllm that referenced this pull request Oct 28, 2024

[Bugfix] Fix ray instance detect issue (vllm-project#9439)

30ba48f

Signed-off-by: Shanshan Wang <[email protected]>

FerdinandZhong pushed a commit to FerdinandZhong/vllm that referenced this pull request Oct 29, 2024

[Bugfix] Fix ray instance detect issue (vllm-project#9439)

499cad6

Signed-off-by: qishuai <[email protected]>

rasmith pushed a commit to rasmith/vllm that referenced this pull request Oct 30, 2024

[Bugfix] Fix ray instance detect issue (vllm-project#9439)

a26f8ea

Signed-off-by: Randall Smith <[email protected]>

NickLucche pushed a commit to NickLucche/vllm that referenced this pull request Oct 31, 2024

[Bugfix] Fix ray instance detect issue (vllm-project#9439)

672c0f0

Signed-off-by: NickLucche <[email protected]>

NickLucche pushed a commit to NickLucche/vllm that referenced this pull request Oct 31, 2024

[Bugfix] Fix ray instance detect issue (vllm-project#9439)

fa8ba4c

Signed-off-by: NickLucche <[email protected]>

lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Nov 4, 2024

[Bugfix] Fix ray instance detect issue (vllm-project#9439)

bd811bd

lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Nov 4, 2024

[Bugfix] Fix ray instance detect issue (vllm-project#9439)

4645cd6

Signed-off-by: Linkun Chen <[email protected]>

sumitd2 pushed a commit to sumitd2/vllm that referenced this pull request Nov 14, 2024

[Bugfix] Fix ray instance detect issue (vllm-project#9439)

9ad4845

Signed-off-by: Sumit Dubey <[email protected]>

KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024

[Bugfix] Fix ray instance detect issue (vllm-project#9439)

cc43f98

mfournioux pushed a commit to mfournioux/vllm that referenced this pull request Nov 20, 2024

[Bugfix] Fix ray instance detect issue (vllm-project#9439)

e8758a1

Signed-off-by: Maxime Fournioux <[email protected]>

tlrmchlsmth pushed a commit to neuralmagic/vllm that referenced this pull request Nov 23, 2024

[Bugfix] Fix ray instance detect issue (vllm-project#9439)

2244c34

Signed-off-by: Tyler Michael Smith <[email protected]>

sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request Dec 13, 2024

[Bugfix] Fix ray instance detect issue (vllm-project#9439)

c211880
