Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Begin refactoring executor_base ABC #9392

Closed
wants to merge 2 commits into from

Conversation

jberkhahn
Copy link
Contributor

This was started in support of mypy'ing the remaining libs that need it. It's been extremely difficult to add mypy to anything that touches the executor, because the various diffrent kinds of backends are structured differently, but everything is generally referred to via the abstract base type and then methods are called blindly because we "know" which kind of executor we're dealing with, even if the code doesn't. In addition, the various executor implementations often implement similar functionality decomposed differently - so the same bit of functionality often exists in different places or is referenced differently in different backends.

This is the beginning of a reactor that aims to create more abstract method declarations in executor_base, to allow executor code to be statically type checked, as well as to hopefully let things be structured in a more consistent manner that is easy to understand. This does occasionally mean that a particular backend will have a kind of dummy implementation that doesn't do much, as with the ray_tpu_executor here.

This PR just starts with the _create_worker method, which I've changed to init the driver worker, but not init the device or load the model across all various backend implementations.

Copy link

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@@ -50,6 +50,15 @@ def __init__(
def _init_executor(self) -> None:
pass

@abstractmethod
def _create_worker(self,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function name starting with underline should be an internal function in general practice. I'm not sure if it's a suitable abstract function name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

many of the executors already have a _create_worker method, and it seems to be used internally, so I think this name fits?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jberkhahn apologies I was looking at this again and I'm actually also unsure of why we would add this in the base class, since it's only ever used in a private context. There isn't any place where we call _create_worker on a generic executor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

declaring it here let's me force a standardized set of args to make sure all the _create_worker methods do roughly the same steps in the different executors. currently the separation of functionality is a bit different in different executors. I realize declaring a private method part of an interface is a bit unusual, there are spots where we do reach in and call private methods, so I'm going to have to declare those in the ABC at some point.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case does that make sense to just promote it to create_worker?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i could do that. the specific instance i'm referring to where stuff reaches in and calls private methods is stuff like the calls to _run_workers in vllm/engine/llm_engine.py

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, i went and changed it to create_worker. (incidentally, cpu_executor has a now private method for boostrapping it's worker, which now makes more sense with this naming scheme)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jberkhahn I'm sorry I still can't wrap my head around why we want this in the top level interface unless there's somewhere in the code that calls it on arbitrary executors (which could include other non-abstract methods in ExecutorBase).

Does this fix an existing typing issue?

It makes sense to standardize how things are done across the implementations but given the current scoping of what's the concern of each particular impl, it's not "wrong" for them to do this differently.

So concretely my suggestions would be to remove it here and keep the private _create_worker naming across the classes.

# The driver dummy worker does not actually use any resources.
# It holds the resource for the driver worker.
self.driver_dummy_worker: Optional[RayWorkerWrapper] = None
# The remaining workers are the actual ray actors.
return None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is... definitely confusing for me, as I have no other context about how this code works.

Should the code from _init_workers_ray really be here under create_worker since that's what actually creates the driver_dummy_worker?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woops, left a bit on the cutting room floor here, sorry. This is one of the parts that is kind of janky, tho. The ray tpu doesn't have a driver worker, so it just sets it to None, which I extracted into the _create_worker method to have an implementation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jberkhahn the changes in this file also don't make sense to me. _create_worker is implemented in the TPUExecutor superclass, there's no need to override it here. It's just not used in this class.

Also I don't see what the benefit is in changing this line:

self.driver_dummy_worker: Optional[RayWorkerWrapper] = None

it seems clearer as it is currently. Compare with the RayGPUExecutor class hierarchy which is similar.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, i went ahead and removed this bit, it wasn't making things clearer

@jberkhahn jberkhahn force-pushed the refactor_executor branch 5 times, most recently from 110ceeb to 90b46aa Compare October 24, 2024 23:12
@jberkhahn jberkhahn force-pushed the refactor_executor branch 2 times, most recently from 2826f97 to 0fc499d Compare November 1, 2024 22:47
Copy link

mergify bot commented Nov 2, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @jberkhahn please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 2, 2024
@mergify mergify bot removed the needs-rebase label Nov 5, 2024
@jberkhahn jberkhahn force-pushed the refactor_executor branch 3 times, most recently from 4411794 to 01a40cb Compare November 5, 2024 22:55
@jberkhahn
Copy link
Contributor Author

had to fix merge conflicts because of #9938, some interesting work in that one. might be a good idea to move all the create_worker logic over into worker for all the backend types at some point?

@jberkhahn jberkhahn closed this Nov 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants