Changes LLMRayActor to use vllm's AsyncLLMEngine #1016
base: main
Conversation
…tures are complete.
…list[CompletionOutput].
…ration
The issue was that after a tool call, we would loop back and try to generate again with the same sub_request_id. This caused vLLM to reject or hang on the duplicate request ID. Now we append '_iterN' to create unique IDs for each generation attempt within the same request.
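A minimal sketch of that fix, with hypothetical names (make_iteration_id and the example ids below are illustrative, not the PR's actual code):

```python
# Sketch: give each generation attempt within one logical request a unique
# vLLM request id, so a retry after a tool call never reuses an id the
# engine has already seen. Names here are hypothetical.

def make_iteration_id(sub_request_id: str, iteration: int) -> str:
    # First attempt keeps the original id; later attempts get "_iter1", "_iter2", ...
    return sub_request_id if iteration == 0 else f"{sub_request_id}_iter{iteration}"


if __name__ == "__main__":
    sub_request_id = "step12_prompt3"  # example id, not taken from the repo
    print([make_iteration_id(sub_request_id, i) for i in range(3)])
    # ['step12_prompt3', 'step12_prompt3_iter1', 'step12_prompt3_iter2']
```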
    refs = [
        engine.update_weight.remote(
-           name, dtype=param.dtype, shape=shape, empty_cache=count == num_params
+           name, dtype=str(param.dtype), shape=shape, empty_cache=count == num_params
This is needed to support the async engine's serialization, which, for some unknown reason, goes through MessagePack and not pickle.
-   assert dtype == self.model_config.dtype, f"mismatch dtype: src {dtype}, dst {self.model_config.dtype}"
+   assert dtype == str(self.model_config.dtype), (
    weight = torch.empty(shape, dtype=dtype, device="cuda")
As discussed earlier, this is done because of the weird serialization format in vLLM's async engine.
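To make the string round-trip concrete, here is a small sketch (the helper name and its use are assumptions, not necessarily the PR's code): the sender ships str(param.dtype) because torch.dtype objects don't survive MessagePack, and the receiver maps the string back to a real torch.dtype before allocating the buffer.

```python
# Sketch (assumed helper, not necessarily what the PR does): convert the
# dtype string sent over MessagePack back into a torch.dtype on the receiver.
import torch


def dtype_from_string(dtype_str: str) -> torch.dtype:
    # str(torch.bfloat16) == "torch.bfloat16", so strip the "torch." prefix
    # and look the attribute up on the torch module.
    dtype = getattr(torch, dtype_str.removeprefix("torch."))
    assert isinstance(dtype, torch.dtype), f"not a dtype: {dtype_str}"
    return dtype


if __name__ == "__main__":
    wire_dtype = str(torch.bfloat16)  # what update_weight now sends
    weight = torch.empty((4, 8), dtype=dtype_from_string(wire_dtype))
    print(weight.dtype)  # torch.bfloat16
```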
By moving to the async engine, we get a much cleaner central processing loop: each request is processed in a dedicated function, and we rely on vLLM to handle the batching.
SGLang and TensorRT-LLM both use async engines, so moving to vLLM's async API will make it much easier to test out those engines.
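For illustration, a minimal sketch of that per-request pattern (not the actual LLMRayActor code; the model name and request ids are placeholders): each request runs in its own coroutine, and vLLM's AsyncLLMEngine batches whatever is in flight.

```python
# Sketch of the per-request processing pattern the async engine enables.
# Illustrative only: not the PR's LLMRayActor, and the model is a placeholder.
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine


async def process_one_request(engine: AsyncLLMEngine, prompt: str, request_id: str) -> str:
    # engine.generate streams partial RequestOutputs; keep the last (finished) one.
    final = None
    async for output in engine.generate(prompt, SamplingParams(max_tokens=64), request_id):
        final = output
    return final.outputs[0].text


async def main() -> None:
    engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="facebook/opt-125m"))
    prompts = ["Hello,", "The capital of France is"]
    # One coroutine per request; batching of concurrent requests happens inside vLLM.
    texts = await asyncio.gather(
        *(process_one_request(engine, p, f"req-{i}") for i, p in enumerate(prompts))
    )
    print(texts)


if __name__ == "__main__":
    asyncio.run(main())
```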
Runs with inflight_updates=True:
And with inflight_updates=False: