Begin refactoring executor_base ABC #9392

jberkhahn · 2024-10-15T22:24:20Z

This was started in support of mypy'ing the remaining libs that need it. It's been extremely difficult to add mypy to anything that touches the executor, because the various kinds of backends are structured differently, but everything is generally referred to via the abstract base type and then methods are called blindly because we "know" which kind of executor we're dealing with, even if the code doesn't. In addition, the various executor implementations often implement similar functionality decomposed differently - so the same bit of functionality often exists in different places or is referenced differently in different backends.

This is the beginning of a refactor that aims to create more abstract method declarations in executor_base, to allow executor code to be statically type checked, as well as to hopefully let things be structured in a more consistent manner that is easy to understand. This does occasionally mean that a particular backend will have a kind of dummy implementation that doesn't do much, as with the ray_tpu_executor here.

This PR just starts with the _create_worker method, which I've changed so that, across all the backend implementations, it inits the driver worker but does not init the device or load the model.
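
A minimal sketch of the idea described above, not the code from this PR: once a method is declared on the abstract base, a caller that only holds the base type can be type checked, and a backend without a driver worker gets a dummy implementation. The `Worker` class, the `rank` parameter, the `*Sketch` class names, and `make_driver` are all hypothetical; the real `_create_worker` signature is not shown in this conversation.

```python
from abc import ABC, abstractmethod
from typing import Optional


class Worker:
    """Hypothetical stand-in for a backend worker."""

    def __init__(self, rank: int) -> None:
        self.rank = rank
        # Creating the worker does not init the device or load the model.
        self.device_initialized = False
        self.model_loaded = False


class ExecutorBase(ABC):
    """Simplified stand-in for the executor base class (assumed shape)."""

    @abstractmethod
    def _create_worker(self, rank: int = 0) -> Optional[Worker]:
        """Construct the driver worker only (hypothetical signature)."""
        raise NotImplementedError


class CPUExecutorSketch(ExecutorBase):
    def _create_worker(self, rank: int = 0) -> Optional[Worker]:
        return Worker(rank)


class NoDriverExecutorSketch(ExecutorBase):
    def _create_worker(self, rank: int = 0) -> Optional[Worker]:
        # Dummy implementation for a backend that has no driver worker.
        return None


def make_driver(executor: ExecutorBase) -> Optional[Worker]:
    # A type checker such as mypy can verify this call because the method
    # is declared on the base type, not only on the concrete subclasses.
    return executor._create_worker(rank=0)
```

Without the declaration on `ExecutorBase`, a type checker would reject the call in `make_driver`, since the base type would not be known to have that method.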

github-actions · 2024-10-15T22:24:33Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

comaniac · 2024-10-16T00:54:33Z

vllm/executor/executor_base.py

@@ -50,6 +50,15 @@ def __init__(
    def _init_executor(self) -> None:
        pass

+    @abstractmethod
+    def _create_worker(self,


In general practice, a function name starting with an underscore denotes an internal function. I'm not sure if it's a suitable name for an abstract function?

many of the executors already have a _create_worker method, and it seems to be used internally, so I think this name fits?

@jberkhahn apologies I was looking at this again and I'm actually also unsure of why we would add this in the base class, since it's only ever used in a private context. There isn't any place where we call _create_worker on a generic executor.

declaring it here lets me force a standardized set of args to make sure all the _create_worker methods do roughly the same steps in the different executors. currently the separation of functionality is a bit different in different executors. I realize declaring a private method as part of an interface is a bit unusual, but there are spots where we do reach in and call private methods, so I'm going to have to declare those in the ABC at some point.

In this case does it make sense to just promote it to create_worker?

i could do that. the specific instance i'm referring to where stuff reaches in and calls private methods is stuff like the calls to _run_workers in vllm/engine/llm_engine.py

ok, i went and changed it to create_worker. (incidentally, cpu_executor now has a private method for bootstrapping its worker, which makes more sense with this naming scheme)

@jberkhahn I'm sorry I still can't wrap my head around why we want this in the top level interface unless there's somewhere in the code that calls it on arbitrary executors (which could include other non-abstract methods in ExecutorBase).

Does this fix an existing typing issue?

It makes sense to standardize how things are done across the implementations but given the current scoping of what's the concern of each particular impl, it's not "wrong" for them to do this differently.

So concretely my suggestion would be to remove it here and keep the private _create_worker naming across the classes.
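
To make the trade-off in this thread concrete, here is a small sketch (an illustration under assumed names, not vLLM code) of what declaring `create_worker` as abstract on the base class buys at runtime: a subclass that keeps only its own private `_create_worker` helper can no longer be instantiated. The class names, the `rank` parameter, and `try_instantiate` are hypothetical.

```python
from abc import ABC, abstractmethod


class ExecutorBase(ABC):
    """Simplified stand-in for the executor base class (assumed shape)."""

    @abstractmethod
    def create_worker(self, rank: int) -> object:
        """Every backend must provide this (hypothetical signature)."""


class PrivateOnlyExecutor(ExecutorBase):
    # Keeps a private helper but never implements the abstract method.
    def _create_worker(self, rank: int) -> object:
        return {"rank": rank}


class CompleteExecutor(ExecutorBase):
    def create_worker(self, rank: int) -> object:
        return {"rank": rank}


def try_instantiate(cls: type) -> str:
    """Report whether a class can be instantiated."""
    try:
        cls()
    except TypeError:
        # Raised by ABC machinery when abstract methods are unimplemented.
        return "TypeError"
    return "ok"
```

Note that `abc` only checks that the method name is overridden; it does not compare signatures at runtime. Checking that the argument lists are compatible is left to a static checker such as mypy. If nothing ever calls the method through the base type, the declaration adds this enforcement but fixes no typing error, which is the objection raised above.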

joerunde · 2024-10-17T21:08:47Z

vllm/executor/ray_tpu_executor.py

        # The driver dummy worker does not actually use any resources.
        # It holds the resource for the driver worker.
-        self.driver_dummy_worker: Optional[RayWorkerWrapper] = None
-        # The remaining workers are the actual ray actors.
+        return None


This is... definitely confusing for me, as I have no other context about how this code works.

Should the code from _init_workers_ray really be here under create_worker since that's what actually creates the driver_dummy_worker?

woops, left a bit on the cutting room floor here, sorry. This is one of the parts that is kind of janky, tho. The ray tpu doesn't have a driver worker, so it just sets it to None, which I extracted into the _create_worker method to have an implementation.

@jberkhahn the changes in this file also don't make sense to me. _create_worker is implemented in the TPUExecutor superclass, there's no need to override it here. It's just not used in this class.

Also I don't see what the benefit is in changing this line:

self.driver_dummy_worker: Optional[RayWorkerWrapper] = None

it seems clearer as it is currently. Compare with the RayGPUExecutor class hierarchy which is similar.

ok, i went ahead and removed this bit, it wasn't making things clearer
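
The inheritance point raised in this thread can be sketched as follows. This is a simplified illustration with hypothetical `*Sketch` classes and a made-up return value, not the real executors: a subclass inherits `_create_worker` from its superclass without overriding it, and the annotated `driver_dummy_worker` assignment stays inline where the Ray workers are set up.

```python
from typing import Optional


class RayWorkerWrapperSketch:
    """Hypothetical stand-in for RayWorkerWrapper."""


class TPUExecutorSketch:
    def _create_worker(self, rank: int = 0) -> str:
        # Placeholder return value; the real method builds a worker object.
        return f"tpu-worker-{rank}"


class RayTPUExecutorSketch(TPUExecutorSketch):
    # No _create_worker override: the superclass implementation is
    # inherited even though this class never calls it.

    def _init_workers_ray(self) -> None:
        # The driver dummy worker does not actually use any resources.
        # It holds the resource for the driver worker.
        self.driver_dummy_worker: Optional[RayWorkerWrapperSketch] = None
```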

mergify · 2024-11-02T14:36:31Z

This pull request has merge conflicts that must be resolved before it can be
merged. @jberkhahn please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

jberkhahn · 2024-11-05T23:15:28Z

had to fix merge conflicts because of #9938, some interesting work in that one. might be a good idea to move all the create_worker logic over into worker for all the backend types at some point?

jberkhahn requested review from WoosukKwon, zhuohan123, youkaichao, alexm-neuralmagic, comaniac and njhill as code owners October 15, 2024 22:24

jberkhahn force-pushed the refactor_executor branch from c703862 to dd4b28c Compare October 15, 2024 22:28

comaniac reviewed Oct 16, 2024

joerunde reviewed Oct 17, 2024

joerunde reviewed Oct 17, 2024

jberkhahn force-pushed the refactor_executor branch 5 times, most recently from 110ceeb to 90b46aa Compare October 24, 2024 23:12

jberkhahn force-pushed the refactor_executor branch 2 times, most recently from 2826f97 to 0fc499d Compare November 1, 2024 22:47

mergify bot added the needs-rebase label Nov 2, 2024

jberkhahn force-pushed the refactor_executor branch from 0fc499d to e349434 Compare November 5, 2024 22:06

mergify bot removed the needs-rebase label Nov 5, 2024

jberkhahn force-pushed the refactor_executor branch 3 times, most recently from 4411794 to 01a40cb Compare November 5, 2024 22:55

Refactor executor_base ABC to have contain unified create_worker method declaration

3a1e3f4

Signed-off-by: jberkhahn <[email protected]>

jberkhahn force-pushed the refactor_executor branch from 01a40cb to 3a1e3f4 Compare November 5, 2024 23:07

yapf refactor vllm/executor and appease ruff

4277988

Signed-off-by: jberkhahn <[email protected]>

jberkhahn closed this Nov 7, 2024

jberkhahn commented Oct 15, 2024

github-actions bot commented Oct 15, 2024

comaniac Oct 16, 2024

jberkhahn Oct 18, 2024

njhill Oct 29, 2024

jberkhahn Oct 31, 2024

comaniac Nov 1, 2024

jberkhahn Nov 1, 2024

jberkhahn Nov 1, 2024

njhill Nov 6, 2024

joerunde Oct 17, 2024

jberkhahn Oct 18, 2024

njhill Oct 23, 2024

jberkhahn Nov 1, 2024

mergify bot commented Nov 2, 2024

jberkhahn commented Nov 5, 2024
