[bugfix] some bugs maybe fail to run #896

Open
wants to merge 1 commit into base: main

Conversation

@NINGBENZHE commented May 19, 2025

What this PR does / why we need it?

Fixes a bug where graph mode is handled identically for the prefill and decode (P and D) roles, along with some other bugs.

Does this PR introduce any user-facing change?

No user-facing change.

How was this patch tested?

Verified with the existing end-to-end tests.

port = int(os.environ.get("MASTER_PORT", answer)) # type: ignore
port = int(os.environ.get("VLLM_DP_MASTER_PORT", answer)) # type: ignore
Collaborator

Would using envs.VLLM_DP_MASTER_PORT be better?

Author

fixed
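
For context, a minimal sketch of the suggested change, assuming the installed vLLM version defines VLLM_DP_MASTER_PORT in vllm/envs.py; the default of 0 below is illustrative, while the diff above uses a local "answer" value:

import os

import vllm.envs as envs

# Reading the raw environment variable: the caller has to supply the default
# and do the int conversion at every call site.
port_from_environ = int(os.environ.get("VLLM_DP_MASTER_PORT", 0))

# Reading through vllm.envs: the parsing and the default live in one place,
# which is what the review comment suggests.
port_from_envs = envs.VLLM_DP_MASTER_PORT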

from torch.distributed import ProcessGroup
from torch.distributed.distributed_c10d import (Backend, PrefixStore,
_get_default_timeout,
is_nccl_available)
from torch.distributed.rendezvous import rendezvous
from vllm.config import ParallelConfig

_DP_GROUP = None
Collaborator

vLLM already has a process group for DP; why do we need to add this one here?

Author

This is used to decide whether the prefill process should execute dummy_run. The native stateless process group does not expose a global variable we can query for that.
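
A hedged sketch of the pattern described here: cache the stateless DP process group in a module-level global when it is created, so later code (such as the check deciding whether prefill needs a dummy_run) can query it. The helper names _set_dp_group and get_dp_group are illustrative, not necessarily the ones used in this PR.

from typing import Optional

from torch.distributed import ProcessGroup

_DP_GROUP: Optional[ProcessGroup] = None


def _set_dp_group(group: ProcessGroup) -> None:
    # Called once, right after the stateless DP group is created.
    global _DP_GROUP
    _DP_GROUP = group


def get_dp_group() -> Optional[ProcessGroup]:
    # Returns None when data parallelism was never initialized.
    return _DP_GROUP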

@@ -21,12 +21,18 @@ def get_etp_group() -> GroupCoordinator:
    return _ETP


def model_parallel_initialized():
    return (_ETP is not None and _EP is not None)
Collaborator

I think EP can be used without ETP, so this check would break that scenario.

Author

No. Even if ETP is not enabled, the communication groups are still created.
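
A self-contained sketch of the author's answer: initialization creates both coordinators even when ETP is effectively disabled (ETP size 1), so the combined check also holds for EP-only runs. The stand-in class and init function below are illustrative, not the PR's actual code.

from typing import Optional


class _FakeGroupCoordinator:
    """Stand-in for vLLM's GroupCoordinator, for illustration only."""

    def __init__(self, size: int) -> None:
        self.size = size


_EP: Optional[_FakeGroupCoordinator] = None
_ETP: Optional[_FakeGroupCoordinator] = None


def model_parallel_initialized() -> bool:
    return (_ETP is not None and _EP is not None)


def init_model_parallel(ep_size: int, etp_size: int = 1) -> None:
    global _EP, _ETP
    _EP = _FakeGroupCoordinator(ep_size)
    # Created unconditionally, even for etp_size == 1 ("EP without ETP"),
    # which is why the check above does not break that scenario.
    _ETP = _FakeGroupCoordinator(etp_size)


init_model_parallel(ep_size=8, etp_size=1)
assert model_parallel_initialized()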

Signed-off-by: ningbenzhe1 <[email protected]>
@NINGBENZHE reopened this May 28, 2025