[bugfix] some bugs maybe fail to run #896

Open
wants to merge 1 commit into base: main

Conversation

@NINGBENZHE commented May 19, 2025

What this PR does / why we need it?

Fixes a bug where graph mode is handled identically for the prefill and decode (P and D) roles, along with some other bugs.

Does this PR introduce any user-facing change?

No user-facing change.

How was this patch tested?

Verified with the existing end-to-end tests.

port = int(os.environ.get("MASTER_PORT", answer)) # type: ignore
port = int(os.environ.get("VLLM_DP_MASTER_PORT", answer)) # type: ignore
Collaborator

Would using envs.VLLM_DP_MASTER_PORT be better?

Author

fixed
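
For context, a minimal sketch of the suggested change, assuming the installed vLLM version defines VLLM_DP_MASTER_PORT in vllm/envs.py; the default of 0 below is illustrative, while the diff above uses a local "answer" value:

import os

import vllm.envs as envs

# Reading the raw environment variable: the caller has to supply the default
# and do the int conversion at every call site.
port_from_environ = int(os.environ.get("VLLM_DP_MASTER_PORT", 0))

# Reading through vllm.envs: the parsing and the default live in one place,
# which is what the review comment suggests.
port_from_envs = envs.VLLM_DP_MASTER_PORT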

from torch.distributed import ProcessGroup
from torch.distributed.distributed_c10d import (Backend, PrefixStore,
_get_default_timeout,
is_nccl_available)
from torch.distributed.rendezvous import rendezvous
from vllm.config import ParallelConfig

_DP_GROUP = None
Collaborator

vLLM already has a process group for DP; why do we need to add this one here?

Author

This is used to decide whether the prefill process should execute dummy_run. The native stateless process group does not expose a global variable we can query for that.
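
A hedged sketch of the pattern described here: cache the stateless DP process group in a module-level global when it is created, so later code (such as the check deciding whether prefill needs a dummy_run) can query it. The helper names _set_dp_group and get_dp_group are illustrative, not necessarily the ones used in this PR.

from typing import Optional

from torch.distributed import ProcessGroup

_DP_GROUP: Optional[ProcessGroup] = None


def _set_dp_group(group: ProcessGroup) -> None:
    # Called once, right after the stateless DP group is created.
    global _DP_GROUP
    _DP_GROUP = group


def get_dp_group() -> Optional[ProcessGroup]:
    # Returns None when data parallelism was never initialized.
    return _DP_GROUP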

@@ -21,12 +21,18 @@ def get_etp_group() -> GroupCoordinator:
    return _ETP


def model_parallel_initialized():
    return (_ETP is not None and _EP is not None)
Collaborator

I think EP can be used without ETP, so this check would break that scenario.

Author

No. Even if ETP is not enabled, the communication groups are still created.
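
A self-contained sketch of the author's answer: initialization creates both coordinators even when ETP is effectively disabled (ETP size 1), so the combined check also holds for EP-only runs. The stand-in class and init function below are illustrative, not the PR's actual code.

from typing import Optional


class _FakeGroupCoordinator:
    """Stand-in for vLLM's GroupCoordinator, for illustration only."""

    def __init__(self, size: int) -> None:
        self.size = size


_EP: Optional[_FakeGroupCoordinator] = None
_ETP: Optional[_FakeGroupCoordinator] = None


def model_parallel_initialized() -> bool:
    return (_ETP is not None and _EP is not None)


def init_model_parallel(ep_size: int, etp_size: int = 1) -> None:
    global _EP, _ETP
    _EP = _FakeGroupCoordinator(ep_size)
    # Created unconditionally, even for etp_size == 1 ("EP without ETP"),
    # which is why the check above does not break that scenario.
    _ETP = _FakeGroupCoordinator(etp_size)


init_model_parallel(ep_size=8, etp_size=1)
assert model_parallel_initialized()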

Signed-off-by: ningbenzhe1 <[email protected]>
@NINGBENZHE reopened this May 28, 2025