Updated inline vllm inference provider #880
base: main
Conversation
Wonderful PR, thank you!
I have a few comments inline.
"messages": converted_messages, | ||
"tools": converted_tools, | ||
"tool_choice": converted_tool_choice, | ||
"stream": request.stream, |
nit: a bit more idiomatic python to write

request_options = {
    "model": ...,
    **sampling_options,
    **guided_decoding_options,
    **logprob_options
}
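For illustration, a self-contained sketch of how that unpacking pattern composes the final request options (the option-group names come from the comment above; the values are invented):

```python
# Hypothetical option groups; in the provider these would be derived from the
# request's sampling params, guided-decoding config, and logprobs settings.
sampling_options = {"temperature": 0.7, "top_p": 0.9}
guided_decoding_options = {}          # an empty dict simply contributes nothing
logprob_options = {"logprobs": 1}

request_options = {
    "model": "example-model",
    **sampling_options,
    **guided_decoding_options,
    **logprob_options,
}
# Later entries win on key collisions; here there are none.
print(request_options)
```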
# OpenAI's APIs don't know about.
# vLLM's OpenAI-compatible API also handles repetition penalties wrong.
# For now, translate repetition penalties into a format that vLLM's broken
# API will handle correctly. Two wrongs make a right...
:)
):
    converted_tool_choice = "auto"

# TODO: Figure out what to do with the tool_prompt_format argument.
so this is rather important actually when the underlying model is a llama model. See https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/utils/inference/prompt_adapter.py#L286-L297 for how we try to adapt the tool formatting to the underlying llama model. each llama model is a special snowflake :/
my recommendation therefore is to treat llama models specially when routing to vLLM. when you detect a model is a llama model (we use metadata.llama_model from the model registration info elsewhere for this purpose), you should route it to the "raw" completions API and keep control of prompt formatting within the Stack. otherwise, you can invoke the path you have implemented.
let me know your thoughts.
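A minimal sketch of the routing described here, assuming a registry keyed by model ID; the function and parameter names are illustrative placeholders, not code from the PR:

```python
from typing import Any, Awaitable, Callable, Dict

async def route_chat_completion(
    request: Any,
    registered_models: Dict[str, Dict[str, Any]],
    llama_formatted_completion: Callable[[Any], Awaitable[Any]],
    openai_compatible_chat: Callable[[Any], Awaitable[Any]],
) -> Any:
    # metadata["llama_model"] is the registration hint mentioned above; the
    # registry shape here is an assumption for the sake of the sketch.
    metadata = registered_models.get(request.model, {}).get("metadata", {})
    if metadata.get("llama_model"):
        # Llama models: keep prompt/tool formatting inside the Stack and send a
        # pre-formatted prompt to vLLM's raw completions API.
        return await llama_formatted_completion(request)
    # Other models: use the OpenAI-compatible chat path implemented in this PR.
    return await openai_compatible_chat(request)
```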
I can certainly take that route for now.
It would be good in the longer term to have the different Llama model tool formats fully integrated into the vLLM engine. That way systems that use a vLLM-only inference stack will see consistent results with Llama Stack.
import vllm.sampling_params

############################################################################
# llama_models imports go here
nit: I don't think these comments are very useful
Will remove these comments.
############################################################################
# vLLM imports go here
#
# We deep-import the names that don't conflict with Llama Stack names
👍
return None

# Llama Stack currently implements fewer types of constrained
# decoding than vLLM does. Translate the types that exist and
🙏
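For context, a rough sketch of the kind of translation that comment describes; the response-format shapes on the Llama Stack side are simplified assumptions, while guided_json and guided_grammar are the extra options vLLM's OpenAI-compatible layer accepts:

```python
from typing import Any, Dict, Optional

def _response_format_to_vllm_options(
    response_format: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    # No constrained decoding requested.
    if response_format is None:
        return {}
    format_type = response_format.get("type")
    if format_type == "json_schema":
        return {"guided_json": response_format["json_schema"]}
    if format_type == "grammar":
        return {"guided_grammar": response_format["bnf"]}
    # Fail loudly on anything we can't translate rather than silently dropping it.
    raise NotImplementedError(f"Response format {format_type!r} not supported")
```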
vllm_top_p = 1.0
vllm_temperature = 0.0

# vLLM allows top-p and top-k at the same time.
haha what is the implementation in that case? I wasn't aware this combination could make sense!
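For reference, a minimal illustration of that combination using vLLM's sampling parameters (values are arbitrary); both filters restrict the candidate set before sampling:

```python
from vllm.sampling_params import SamplingParams

# Both knobs can be set at once: top_k limits candidates to the k most likely
# tokens, and top_p further restricts them to a probability nucleus.
params = SamplingParams(
    temperature=0.7,
    top_k=40,   # -1 would disable top-k
    top_p=0.9,
)
```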
guided_decoding_backend="lm-format-enforcer",
)
###########################################################################
# METHODS INHERITED FROM UNDOCUMENTED IMPLICIT MYSTERY BASE CLASS
hehehe, legit feedback and taken! I will fix this by making a ProviderBase and documenting the lifecycle properly.
Updating this comment accordingly.
if self.resolved_model_id is not None:
    if resolved_model_id != self.resolved_model_id:
        raise ValueError(
            f"Attempted to serve two LLMs (ids "
I believe your line width is rather small and isn't really idiomatic of how the rest of the Stack code looks. Could I ask you to make it a bit wider?
Sure, can do. From the llama_reference provider code, it looks like the standard line width is 100. Is that correct?
Callback that is called when the server removes an inference endpoint
from an inference provider.

The semantics of this callback are not clear. How should model_id
yeah that's fair. we will add appropriate documentation. model_id is the same ID you got in register_model().
the behavioral semantics are left up to the provider -- it basically means the Stack will no longer recognize this model. If the provider wants to do any resource deallocation (e.g., sending an API call to unwind a deployment), this callback is the place to initiate it.
Thanks for that clarification. I'll update the pydoc comment and add some code here to remove the selected model ID and shut down the connection to vLLM if no more IDs are being served.
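A sketch of that cleanup under the stated plan; the attribute names (served_model_ids, engine) are placeholders, not the PR's actual fields:

```python
from typing import Any, Optional, Set

class _InlineVllmProviderSketch:
    """Illustrative skeleton only, not the class in this PR."""

    def __init__(self) -> None:
        self.served_model_ids: Set[str] = set()
        self.engine: Optional[Any] = None  # stands in for the captive vLLM engine

    async def unregister_model(self, model_id: str) -> None:
        # Forget the ID that the Stack no longer recognizes.
        self.served_model_ids.discard(model_id)
        # If nothing is being served any more, drop the captive engine so its
        # resources can be reclaimed. (Real teardown may involve more steps.)
        if not self.served_model_ids:
            self.engine = None
```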
# first one.
if len(vllm_result.choices) == 0:
    raise ValueError(
        "Don't know how to convert response object without any " "responses"
Double quotes need to be escaped here
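For what it's worth, the two adjacent literals above concatenate implicitly; writing the message as a single literal avoids the quoting confusion (a sketch, assuming this is the intended text):

```python
raise ValueError("Don't know how to convert response object without any responses")
```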
def _log(msg: str, level: str):
    if _BYPASS_LOGGING:
        time_str = datetime.datetime.now().strftime("%H:%M:%S")
        print(f"{time_str}: {msg}")
    match level:
        case "info":
            logger.info(msg)
        case "debug":
            logger.debug(msg)
This seems a bit hacky. We should probably fix and improve the logging separately
    None if sampling_params.max_tokens == 0 else sampling_params.max_tokens
),
# Assume that vLLM's default stop token will work
# stop_token_ids=[tokenizer.eos_token_id],
Is this still needed?
What does this PR do?
This PR updates the inline vLLM inference provider in several significant ways:
- Models are now registered via the .../models API instead of hard-coding the model's full name into the provider's YAML configuration.
- The provider routes the chat_completions API to a captive (i.e. called directly in-process, not via HTTPS) instance of vLLM's OpenAI-compatible server.
- The logprobs parameter and completions API are also working.

Test Plan

Existing tests in llama_stack/providers/tests/inference/test_text_inference.py have good coverage of the new functionality and can be invoked with pytest against that file.