added gemma2 9b and 27b vllm with streaming #318
Conversation
vLLM's next release will add support for Gemma 2 9B/27B. Until then you'd have to build vLLM from source on top of a PyTorch image, which takes 30+ minutes to deploy. vllm-project/vllm#5806
Left a few small suggestions, but looking forward to having 27B; folks have been asking for it.
logger.info(f"tensor parallelism: {model_metadata['tensor_parallel']}") | ||
logger.info(f"max num seqs: {model_metadata['max_num_seqs']}") | ||
|
||
self.model_args = AsyncEngineArgs( |
A potential improvement is to move everything to config, e.g. as in this example:
https://github.com/vshulman/truss-examples/tree/main/ultravox-vllm
We can merge without this change, as other vLLM examples also support only a partial list of arguments.
Looking through it, that example uses the vLLM OpenAI-compatible server instead of explicitly instantiating the vLLM AsyncLLMEngine for the model.
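For context, a minimal sketch of the direct-engine approach this truss takes, as opposed to launching vLLM's standalone OpenAI-compatible server; the model id here is just a placeholder for illustration:

```python
from vllm import AsyncEngineArgs, AsyncLLMEngine

# Instantiate the async engine directly inside the truss's model code,
# rather than running `python -m vllm.entrypoints.openai.api_server`.
# "google/gemma-2-27b-it" is a placeholder model id, not taken from this PR.
engine_args = AsyncEngineArgs(
    model="google/gemma-2-27b-it",
    tensor_parallel_size=2,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```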
100% -- I just think the same kwargs pattern can apply here. The benefit I see is that, going forward, all it would take to pass a new argument into vLLM (either the standalone OpenAI server or the Python API above) is adding it to config.yaml.
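A rough sketch of what that kwargs pattern could look like here, assuming the engine arguments live under a hypothetical `engine_args` key in config.yaml's `model_metadata`:

```python
from vllm import AsyncEngineArgs


def build_engine_args(model_metadata: dict) -> AsyncEngineArgs:
    # model_metadata is the dict parsed from config.yaml. "engine_args" is a
    # hypothetical key holding arbitrary vLLM engine arguments, e.g.
    #   engine_args:
    #     model: google/gemma-2-27b-it
    #     tensor_parallel_size: 2
    #     max_num_seqs: 16
    # With this, a new vLLM option only needs a config.yaml edit, not a code change.
    return AsyncEngineArgs(**model_metadata.get("engine_args", {}))
```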
… template. TODO: confirm 27B working