
added gemma2 9b and 27b vllm with streaming #318

Merged: 11 commits merged into basetenlabs:main from gemma2 on Jul 16, 2024
Conversation

@dsingal0 (Contributor) commented on Jul 2, 2024

No description provided.

@dsingal0 (Contributor, Author) commented on Jul 2, 2024

vLLM's next release will add support for Gemma 2 9B/27B. Until then, you'd have to build vLLM from source on top of a PyTorch image, which takes 30+ minutes to deploy. See vllm-project/vllm#5806.

@dsingal0 changed the title from "added gemma2 9b and 27b with streaming using local-gemma" to "added gemma2 9b and 27b with streaming" on Jul 7, 2024
@vshulman (Contributor) left a comment
Left a few small suggestions, but looking forward to having 27B; folks are eager for it.

logger.info(f"tensor parallelism: {model_metadata['tensor_parallel']}")
logger.info(f"max num seqs: {model_metadata['max_num_seqs']}")

self.model_args = AsyncEngineArgs(
@vshulman (Contributor):
A potential improvement is to move everything to config, e.g. as in this example:
https://github.com/vshulman/truss-examples/tree/main/ultravox-vllm

@vshulman (Contributor):
We can merge without this change, as other vLLM examples also support only a partial list of arguments.

@dsingal0 (Contributor, Author):
Looking through it, it looks like that example uses the vLLM OpenAI server instead of explicitly instantiating the vLLM AsyncLLMEngine for the model.

@vshulman (Contributor):
100% -- I just think the same kwargs pattern can apply here. The benefit I see is that, going forward, all it would take to pass a new argument into vLLM, whether to the standalone OpenAI server or the Python API above, is adding it to config.yaml.
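
For readers skimming this thread, here is a minimal sketch of the kwargs pattern being discussed, assuming the vLLM engine arguments live under a hypothetical engine_args key in config.yaml and reach the model as model_metadata. The key names and Truss wiring below are illustrative, not the code in this PR:

```python
# Sketch only: config-driven engine construction for a Truss-style model class.
# Assumes config.yaml exposes model_metadata["engine_args"] whose keys mirror
# AsyncEngineArgs fields (model, tensor_parallel_size, max_num_seqs, ...).
from vllm import AsyncEngineArgs, AsyncLLMEngine


class Model:
    def __init__(self, **kwargs):
        self._config = kwargs["config"]
        self._engine = None

    def load(self):
        engine_kwargs = self._config["model_metadata"]["engine_args"]
        # Because the arguments are splatted from config, passing a new vLLM
        # flag later is a config.yaml change rather than a code change.
        self._engine = AsyncLLMEngine.from_engine_args(
            AsyncEngineArgs(**engine_kwargs)
        )
```

The same dictionary could equally be rendered into command-line flags if the standalone OpenAI-compatible server were used instead, which is the portability being pointed at here.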

gemma2/gemma2-27b-it/model/model.py (review thread, outdated, resolved)
gemma2/gemma2-27b-it/config.yaml (review thread, outdated, resolved)
@vshulman (Contributor) left a comment
Do you think this pattern is expected? It looks like the instruct response completes part of the request (! 😀), which makes me think there is potentially a templating issue?

[screenshot]

gemma2/gemma2-27b-it-vllm/model/model.py (review thread, outdated, resolved)

@dsingal0 (Contributor, Author) commented:
> Do you think this pattern is expected? It looks like the instruct response completes part of the request (! 😀), which makes me think there is potentially a templating issue?
>
> [screenshot]

I believe this is because the vLLM implementation didn't use the chat template. I'm not familiar with that, but will give it a try.
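
For context, a minimal sketch of applying a chat template with the Hugging Face tokenizer before handing the prompt to vLLM; the model id and message layout below are illustrative assumptions, not code from this PR:

```python
# Sketch: render an incoming request through the Gemma 2 chat template so the
# model responds as the assistant instead of continuing the user's text.
from transformers import AutoTokenizer

# Assumes the gated google/gemma-2-27b-it repo is accessible and provides a chat_template.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")

messages = [{"role": "user", "content": "Write a haiku about streaming tokens."}]

# add_generation_prompt=True appends the model-turn header after the user turn.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# `prompt` (a string) can then be passed to the vLLM engine's generate call
# together with SamplingParams.
```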

@dsingal0 changed the title from "added gemma2 9b and 27b with streaming" to "[DRAFT] added gemma2 9b and 27b with streaming" on Jul 15, 2024
@dsingal0 changed the title from "[DRAFT] added gemma2 9b and 27b with streaming" to "added gemma2 9b and 27b vllm with streaming" on Jul 16, 2024
@dsingal0 merged commit a95eeee into basetenlabs:main on Jul 16, 2024
2 checks passed
@dsingal0 deleted the gemma2 branch on Jul 16, 2024 at 16:44