Refactor code according to upstream changes #62
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
Force-pushed from 2e1758d to 3aa8196
Still needs some work; moved back to draft.
### sequence level processing -> batch level processing

In this PR, the code for preparing the input tensors for the AIU is completely rewritten based on the assumption that we have to finish the current decoding on the AIU before doing another prefill. Changes:

* [rewriting](https://github.ibm.com/ai-foundation/vllm/pull/62/commits/cea122c220b18e3de3dce95faa5e03fe3efe0835) `sendnn_model_runner.py`, `sendnn_worker.py` and `sendnn.py` based on the above constraint (see the sketch after this list).
* [removing](https://github.ibm.com/ai-foundation/vllm/pull/62/commits/6869231d83734d3c03ffd15bc6754c1857d063cc) the class variable `self._padded_batch_size`, since another solution has been implemented.
* [removing](https://github.ibm.com/ai-foundation/vllm/pull/62/commits/ff9ebf6923fd9ac6c99e64dfffc7763f6c194399) the unused `input_block_ids`, since the AIU does not support paged attention yet.
* [removing](https://github.ibm.com/ai-foundation/vllm/pull/62/commits/a6d63899bf3d9fae59edde414b8bd2a3c56bc8c7) some unused function arguments in model loading.
* [removing](https://github.ibm.com/ai-foundation/vllm/pull/62/commits/4527300ee9be4dd1fb76007fb6e0862b97d51676) the unused function `_get_model_architecture()` and the global variable `_SENDNN_SUPPORTED_MODELS`.

The code has been tested in client/server mode with `llama 194m` and `granite 3b` on `AIU` and `CPU`.
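As a rough illustration of that constraint, here is a minimal sketch of the resulting two-phase loop. All names (`scheduler`, `runner`, and their methods) are hypothetical, not the actual vLLM Spyre API:

```python
# Hypothetical sketch: prefill and decode never interleave on the AIU.
# A new prefill is only admitted once every sequence in the current
# batch has finished decoding.


def engine_step(scheduler, runner):
    if scheduler.has_running_batch():
        # All sequences of the current batch take one decode step
        # together; no new requests are admitted in the meantime.
        return runner.decode_step(scheduler.running_batch())
    # The previous batch has fully finished: prefill a whole new batch.
    new_batch = scheduler.take_waiting_requests()
    return runner.prefill(new_batch)
```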
All the changes make sense to me. I'll run one of the embedding benchmarks to validate the embedding model branch of the changes.
I'm getting the same results:
LGTM
Great work! I added some minor comments, but LGTM overall.
Tested on CPU with inductor compilation using the tests in `tests/spyre` for llama194m and gpt 3b. On Spyre, the online mode for granite 3b, 8b and 13b has been tested successfully.
            self.parallel_config,
            self.scheduler_config,
            self.device_config,
            self.is_driver_worker)
        self._env_initialized = False

    def init_distributed_environment(self) -> None:
Would it make sense to rename that function to better distinguish between the imported function `init_distributed_environment` from `vllm.distributed` and our own function `self.init_distributed_environment` defined in line 62?
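For context, the overlap looks roughly like this (a standalone sketch with stand-in definitions, not the actual vLLM code):

```python
def init_distributed_environment(world_size: int, rank: int) -> None:
    """Stand-in for the function imported from vllm.distributed."""
    print(f"initializing torch.distributed: rank {rank} of {world_size}")


class SpyreWorker:

    def __init__(self, world_size: int, rank: int) -> None:
        self.world_size = world_size
        self.rank = rank

    # Same name as the module-level function: readers have to inspect
    # each call site to tell the method apart from the import.
    def init_distributed_environment(self) -> None:
        init_distributed_environment(self.world_size, self.rank)


SpyreWorker(world_size=1, rank=0).init_distributed_environment()
```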
Yes, seems reasonable to me.
addressed here
@@ -86,11 +74,6 @@ def init_distributed_environment(self) -> None:
        # A small all_reduce for warmup.
        torch.distributed.all_reduce(torch.zeros(1).cpu())
out of curiosity: why is that needed here?
It probably isn't strictly needed; it's just a verification check that the distributed setup is working.
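As a minimal standalone sketch (single process, `gloo` backend assumed) of why such a warmup collective is useful: it exercises the whole communication path once, so a misconfigured process group hangs or raises at startup instead of mid-inference.

```python
import torch
import torch.distributed as dist

# Tiny single-process setup just to make the sketch runnable.
dist.init_process_group(backend="gloo",
                        init_method="tcp://127.0.0.1:29500",
                        rank=0,
                        world_size=1)

# The warmup collective: sums zeros across all ranks. If any rank
# cannot reach the others, this call blocks or errors out, surfacing
# setup problems before real work starts.
t = torch.zeros(1)
dist.all_reduce(t)
assert t.item() == 0.0  # world_size * 0 == 0 on success
```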
@yannicks1 Yes please do that, as a separate PR. Thanks!
Okay, I will address the function renaming and argument pruning in a separate PR after this PR is merged. From my side this PR is ready to be merged!
LGTM
@tdoublep @sducouedic I addressed my two minor comments in this PR, which should be merged BEFORE merging this PR.
Looks good.
This PR improves code readability by:

- [commit 1](0c9bec0): removing the batch size argument in the `load_model` function [here](https://github.com/IBM/vllm/blob/124f3a961d1a9ce2628c01fe56dfcc589a49c8dd/vllm/worker/spyre_worker.py#L135), since it is unused in `SpyreModelRunner` ([here](https://github.com/IBM/vllm/blob/124f3a961d1a9ce2628c01fe56dfcc589a49c8dd/vllm/worker/spyre_model_runner.py#L104)) as well as in `SpyreEmbeddingModelRunner` ([here](https://github.com/IBM/vllm/blob/124f3a961d1a9ce2628c01fe56dfcc589a49c8dd/vllm/worker/spyre_embedding_model_runner.py#L52)); see the sketch below.
- ~~[commit 2](beadbd5): function renaming to better distinguish between the imported function `init_distributed_environment` from `vllm.distributed` and our own function `self.init_distributed_environment`.~~ (**reverted since this seems to be a convention introduced in other worker classes**)

Signed-off-by: Yannick Schnider <[email protected]>
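For illustration only (the parameter names here are assumptions; see the linked lines for the real signatures), the first commit boils down to dropping a parameter that no runner consumed:

```python
class SpyreModelRunner:

    # Before (assumed shape): a batch_size argument was threaded
    # through even though it was never read by any runner.
    #
    # def load_model(self, prompt_lens, num_decode_tokens, batch_size): ...

    # After: the unused parameter is gone and call sites get simpler.
    def load_model(self, prompt_lens, num_decode_tokens) -> None:
        ...
```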
Force-pushed from 857b127 to 85a8d55
This PR reworks our code according to some important upstream changes. In particular, there is no longer any need to have a separate `SpyreExecutor` and `MultiprocessingSpyreExecutor`: upstream has added generic classes for this that work across different platforms. Actually, it simplifies our code quite a lot. The model runner classes now inherit from `ModelRunnerBase`, and we need to define a `ModelInputForSpyre` class accordingly.

This is currently passing all CPU tests, but it needs to be tested on Spyre and needs careful review, since it is quite a big change.

Note: the target for this PR is a branch `upstream-2025-01-17` containing upstream changes merged into our current branch. I've done it like this so it is easier to review the changes. If this PR is approved, we can then merge the changes into `upstream-2025-01-17` and then merge that one into main.
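To make the `ModelInputForSpyre` point concrete, here is a hedged sketch of what such an input class can look like. The field names are assumptions for illustration; the real class derives from vLLM's `ModelRunnerInputBase` and carries whatever tensors the Spyre runner prepares per step:

```python
import dataclasses
from typing import Any, Dict, Optional

import torch


@dataclasses.dataclass(frozen=True)
class ModelInputForSpyre:
    # Illustrative fields only; see the PR for the actual contents.
    input_tokens: Optional[torch.Tensor] = None
    input_positions: Optional[torch.Tensor] = None
    input_masks: Optional[torch.Tensor] = None
    is_prompt: bool = False

    def as_broadcastable_tensor_dict(self) -> Dict[str, Any]:
        # Flattened so the driver worker can broadcast the prepared
        # inputs to the other workers in one shot.
        return {
            "input_tokens": self.input_tokens,
            "input_positions": self.input_positions,
            "input_masks": self.input_masks,
            "is_prompt": self.is_prompt,
        }

    @classmethod
    def from_broadcastable_tensor_dict(
            cls, tensor_dict: Dict[str, Any]) -> "ModelInputForSpyre":
        return cls(**tensor_dict)
```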