Refactor code according to upstream changes #62
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
Force-pushed from 2e1758d to 3aa8196
Still needs some work; moved back to draft.
### sequence level processing -> batch level processing

In this PR, the code for preparing the input tensors for the AIU is completely rewritten based on the assumption that we have to finish the current decoding on the AIU before doing another prefill. Changes:

* [rewriting](https://github.ibm.com/ai-foundation/vllm/pull/62/commits/cea122c220b18e3de3dce95faa5e03fe3efe0835) `sendnn_model_runner.py`, `sendnn_worker.py` and `sendnn.py` based on the above constraint (see the sketch after this list).
* [removing](https://github.ibm.com/ai-foundation/vllm/pull/62/commits/6869231d83734d3c03ffd15bc6754c1857d063cc) the class variable `self._padded_batch_size`, since another solution has been implemented.
* [removing](https://github.ibm.com/ai-foundation/vllm/pull/62/commits/ff9ebf6923fd9ac6c99e64dfffc7763f6c194399) the unused `input_block_ids`, since the AIU does not support paged attention yet.
* [removing](https://github.ibm.com/ai-foundation/vllm/pull/62/commits/a6d63899bf3d9fae59edde414b8bd2a3c56bc8c7) some unused function arguments in model loading.
* [removing](https://github.ibm.com/ai-foundation/vllm/pull/62/commits/4527300ee9be4dd1fb76007fb6e0862b97d51676) the unused function `_get_model_architecture()` and the global variable `_SENDNN_SUPPORTED_MODELS`.

The code has been tested in client/server mode with `llama 194m` and `granite 3b` on `AIU` and `CPU`.
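As a rough illustration of that constraint, here is a minimal sketch of the resulting two-phase loop. All names (`scheduler`, `runner`, and their methods) are hypothetical, not the actual vLLM Spyre API:

```python
# Hypothetical sketch: prefill and decode never interleave on the AIU.
# A new prefill is only admitted once every sequence in the current
# batch has finished decoding.


def engine_step(scheduler, runner):
    if scheduler.has_running_batch():
        # All sequences of the current batch take one decode step
        # together; no new requests are admitted in the meantime.
        return runner.decode_step(scheduler.running_batch())
    # The previous batch has fully finished: prefill a whole new batch.
    new_batch = scheduler.take_waiting_requests()
    return runner.prefill(new_batch)
```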
All the changes make sense to me. I'll run one of the embedding benchmarks to validate the embedding model branch of the changes.
I'm getting the same results:
LGTM
Great work! I added some minor comments, but LGTM overall.
Tested on CPU with inductor compilation using the tests in `tests/spyre` for llama194m and gpt 3b. On Spyre, the online mode for granite 3b, 8b and 13b has been tested successfully.
            self.parallel_config,
            self.scheduler_config,
            self.device_config,
            self.is_driver_worker)
        self._env_initialized = False

    def init_distributed_environment(self) -> None:
Would it make sense to rename that function to better distinguish between the imported function `init_distributed_environment` from `vllm.distributed` and our own function `self.init_distributed_environment` defined in line 62?
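For context, the overlap looks roughly like this (a standalone sketch with stand-in definitions, not the actual vLLM code):

```python
def init_distributed_environment(world_size: int, rank: int) -> None:
    """Stand-in for the function imported from vllm.distributed."""
    print(f"initializing torch.distributed: rank {rank} of {world_size}")


class SpyreWorker:

    def __init__(self, world_size: int, rank: int) -> None:
        self.world_size = world_size
        self.rank = rank

    # Same name as the module-level function: readers have to inspect
    # each call site to tell the method apart from the import.
    def init_distributed_environment(self) -> None:
        init_distributed_environment(self.world_size, self.rank)


SpyreWorker(world_size=1, rank=0).init_distributed_environment()
```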
Yes, seems reasonable to me.
addressed here
@@ -86,11 +74,6 @@ def init_distributed_environment(self) -> None:
        # A small all_reduce for warmup.
        torch.distributed.all_reduce(torch.zeros(1).cpu())
out of curiosity: why is that needed here?
It probably isn't strictly needed; it's just a verification check that the distributed setup is working.
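As a minimal standalone sketch (single process, `gloo` backend assumed) of why such a warmup collective is useful: it exercises the whole communication path once, so a misconfigured process group hangs or raises at startup instead of mid-inference.

```python
import torch
import torch.distributed as dist

# Tiny single-process setup just to make the sketch runnable.
dist.init_process_group(backend="gloo",
                        init_method="tcp://127.0.0.1:29500",
                        rank=0,
                        world_size=1)

# The warmup collective: sums zeros across all ranks. If any rank
# cannot reach the others, this call blocks or errors out, surfacing
# setup problems before real work starts.
t = torch.zeros(1)
dist.all_reduce(t)
assert t.item() == 0.0  # world_size * 0 == 0 on success
```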
@yannicks1 Yes please do that, as a separate PR. Thanks!
Okay, I will address the function renaming and argument pruning in a separate PR after this PR is merged. From my side this PR is ready to be merged!
LGTM
@tdoublep @sducouedic I addressed my two minor comments in this PR, which should be merged BEFORE merging this PR.
Looks good.
This PR improves code readability by:

- [commit 1](0c9bec0): removing the batch size argument in the `load_model` function [here](https://github.com/IBM/vllm/blob/124f3a961d1a9ce2628c01fe56dfcc589a49c8dd/vllm/worker/spyre_worker.py#L135), since it is unused in `SpyreModelRunner` ([here](https://github.com/IBM/vllm/blob/124f3a961d1a9ce2628c01fe56dfcc589a49c8dd/vllm/worker/spyre_model_runner.py#L104)) as well as in `SpyreEmbeddingModelRunner` ([here](https://github.com/IBM/vllm/blob/124f3a961d1a9ce2628c01fe56dfcc589a49c8dd/vllm/worker/spyre_embedding_model_runner.py#L52)); see the sketch below.
- ~~[commit 2](beadbd5): function renaming to better distinguish between the imported function `init_distributed_environment` from `vllm.distributed` and our own function `self.init_distributed_environment`.~~ (**reverted since this seems to be a convention introduced in other worker classes**)

Signed-off-by: Yannick Schnider <[email protected]>
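For illustration only (the parameter names here are assumptions; see the linked lines for the real signatures), the first commit boils down to dropping a parameter that no runner consumed:

```python
class SpyreModelRunner:

    # Before (assumed shape): a batch_size argument was threaded
    # through even though it was never read by any runner.
    #
    # def load_model(self, prompt_lens, num_decode_tokens, batch_size): ...

    # After: the unused parameter is gone and call sites get simpler.
    def load_model(self, prompt_lens, num_decode_tokens) -> None:
        ...
```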
Force-pushed from 857b127 to 85a8d55
This PR reworks our code according to some important upstream changes. In particular, there is no longer any need to have a separate `SpyreExecutor` and `MultiprocessingSpyreExecutor`: upstream has added generic classes for this that work across different platforms. Actually, it simplifies our code quite a lot. The model runner classes now inherit from `ModelRunnerBase`, and we need to define a `ModelInputForSpyre` class accordingly.

This is currently passing all CPU tests, but it needs to be tested on Spyre and needs careful review, since it is quite a big change.

Note: the target for this PR is a branch `upstream-2025-01-17` containing upstream changes merged into our current branch. I've done it like this so it is easier to review the changes. If this PR is approved, we can then merge the changes into `upstream-2025-01-17` and then merge that one into main.
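To make the `ModelInputForSpyre` point concrete, here is a hedged sketch of what such an input class can look like. The field names are assumptions for illustration; the real class derives from vLLM's `ModelRunnerInputBase` and carries whatever tensors the Spyre runner prepares per step:

```python
import dataclasses
from typing import Any, Dict, Optional

import torch


@dataclasses.dataclass(frozen=True)
class ModelInputForSpyre:
    # Illustrative fields only; see the PR for the actual contents.
    input_tokens: Optional[torch.Tensor] = None
    input_positions: Optional[torch.Tensor] = None
    input_masks: Optional[torch.Tensor] = None
    is_prompt: bool = False

    def as_broadcastable_tensor_dict(self) -> Dict[str, Any]:
        # Flattened so the driver worker can broadcast the prepared
        # inputs to the other workers in one shot.
        return {
            "input_tokens": self.input_tokens,
            "input_positions": self.input_positions,
            "input_masks": self.input_masks,
            "is_prompt": self.is_prompt,
        }

    @classmethod
    def from_broadcastable_tensor_dict(
            cls, tensor_dict: Dict[str, Any]) -> "ModelInputForSpyre":
        return cls(**tensor_dict)
```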