Updated inline vllm inference provider #880
base: main
Conversation
Wonderful PR, thank you!
I have a few comments inline.
"messages": converted_messages, | ||
"tools": converted_tools, | ||
"tool_choice": converted_tool_choice, | ||
"stream": request.stream, |
nit: a bit more idiomatic python to write

request_options = {
    "model": ...,
    **sampling_options,
    **guided_decoding_options,
    **logprob_options
}
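For illustration, a self-contained sketch of how that unpacking pattern composes the final request options (the option-group names come from the comment above; the values are invented):

```python
# Hypothetical option groups; in the provider these would be derived from the
# request's sampling params, guided-decoding config, and logprobs settings.
sampling_options = {"temperature": 0.7, "top_p": 0.9}
guided_decoding_options = {}          # an empty dict simply contributes nothing
logprob_options = {"logprobs": 1}

request_options = {
    "model": "example-model",
    **sampling_options,
    **guided_decoding_options,
    **logprob_options,
}
# Later entries win on key collisions; here there are none.
print(request_options)
```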
# OpenAI's APIs don't know about.
# vLLM's OpenAI-compatible API also handles repetition penalties wrong.
# For now, translate repetition penalties into a format that vLLM's broken
# API will handle correctly. Two wrongs make a right...
:)
):
    converted_tool_choice = "auto"

# TODO: Figure out what to do with the tool_prompt_format argument.
so this is rather important actually when the underlying model is a llama model. See https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/utils/inference/prompt_adapter.py#L286-L297 for how we try to adapt the tool formatting to the underlying llama model. each llama model is a special snowflake :/
my recommendation therefore is to treat llama models specially when routing to vLLM. when you detect a model is a llama model (we use metadata.llama_model from the model registration info elsewhere for this purpose), you should route it to the "raw" completions API and keep control of prompt formatting within the Stack. otherwise, you can invoke the path you have implemented.
let me know your thoughts.
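A minimal sketch of the routing described here, assuming a registry keyed by model ID; the function and parameter names are illustrative placeholders, not code from the PR:

```python
from typing import Any, Awaitable, Callable, Dict

async def route_chat_completion(
    request: Any,
    registered_models: Dict[str, Dict[str, Any]],
    llama_formatted_completion: Callable[[Any], Awaitable[Any]],
    openai_compatible_chat: Callable[[Any], Awaitable[Any]],
) -> Any:
    # metadata["llama_model"] is the registration hint mentioned above; the
    # registry shape here is an assumption for the sake of the sketch.
    metadata = registered_models.get(request.model, {}).get("metadata", {})
    if metadata.get("llama_model"):
        # Llama models: keep prompt/tool formatting inside the Stack and send a
        # pre-formatted prompt to vLLM's raw completions API.
        return await llama_formatted_completion(request)
    # Other models: use the OpenAI-compatible chat path implemented in this PR.
    return await openai_compatible_chat(request)
```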
I can certainly take that route for now.
It would be good in the longer term to have the different Llama model tool formats fully integrated into the vLLM engine. That way systems that use a vLLM-only inference stack will see consistent results with Llama Stack.
import vllm.sampling_params

############################################################################
# llama_models imports go here
nit: I don't think these comments are very useful
Will remove these comments.
############################################################################
# vLLM imports go here
#
# We deep-import the names that don't conflict with Llama Stack names
👍
return None

# Llama Stack currently implements fewer types of constrained
# decoding than vLLM does. Translate the types that exist and
🙏
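For context, a rough sketch of the kind of translation that comment describes; the response-format shapes on the Llama Stack side are simplified assumptions, while guided_json and guided_grammar are the extra options vLLM's OpenAI-compatible layer accepts:

```python
from typing import Any, Dict, Optional

def _response_format_to_vllm_options(
    response_format: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    # No constrained decoding requested.
    if response_format is None:
        return {}
    format_type = response_format.get("type")
    if format_type == "json_schema":
        return {"guided_json": response_format["json_schema"]}
    if format_type == "grammar":
        return {"guided_grammar": response_format["bnf"]}
    # Fail loudly on anything we can't translate rather than silently dropping it.
    raise NotImplementedError(f"Response format {format_type!r} not supported")
```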
vllm_top_p = 1.0
vllm_temperature = 0.0

# vLLM allows top-p and top-k at the same time.
haha what is the implementation in that case? I wasn't aware this combination could make sense!
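For reference, a minimal illustration of that combination using vLLM's sampling parameters (values are arbitrary); both filters restrict the candidate set before sampling:

```python
from vllm.sampling_params import SamplingParams

# Both knobs can be set at once: top_k limits candidates to the k most likely
# tokens, and top_p further restricts them to a probability nucleus.
params = SamplingParams(
    temperature=0.7,
    top_k=40,   # -1 would disable top-k
    top_p=0.9,
)
```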
guided_decoding_backend="lm-format-enforcer",
)
###########################################################################
# METHODS INHERITED FROM UNDOCUMENTED IMPLICIT MYSTERY BASE CLASS
hehehe, legit feedback and taken! I will fix this by making a ProviderBase and documenting the lifecycle properly.
Updating this comment accordingly.
if self.resolved_model_id is not None:
    if resolved_model_id != self.resolved_model_id:
        raise ValueError(
            f"Attempted to serve two LLMs (ids "
I believe your line width is rather small and isn't really idiomatic of how the rest of the Stack code looks. Could I ask you to make it a bit wider?
Sure, can do. From the llama_reference provider code, it looks like the standard line width is 100. Is that correct?
Callback that is called when the server removes an inference endpoint
from an inference provider.

The semantics of this callback are not clear. How should model_id
yeah that's fair. we will add appropriate documentation. model_id is the same ID you got in register_model().
the behavioral semantics are left up to the provider -- it basically means the Stack will no longer recognize this model. If the provider wants to do any resource deallocation (e.g., sending an API call to unwind a deployment), this callback is the place to initiate it.
Thanks for that clarification. I'll update the pydoc comment and add some code here to remove the selected model ID and shut down the connection to vLLM if no more IDs are being served.
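A sketch of that cleanup under the stated plan; the attribute names (served_model_ids, engine) are placeholders, not the PR's actual fields:

```python
from typing import Any, Optional, Set

class _InlineVllmProviderSketch:
    """Illustrative skeleton only, not the class in this PR."""

    def __init__(self) -> None:
        self.served_model_ids: Set[str] = set()
        self.engine: Optional[Any] = None  # stands in for the captive vLLM engine

    async def unregister_model(self, model_id: str) -> None:
        # Forget the ID that the Stack no longer recognizes.
        self.served_model_ids.discard(model_id)
        # If nothing is being served any more, drop the captive engine so its
        # resources can be reclaimed. (Real teardown may involve more steps.)
        if not self.served_model_ids:
            self.engine = None
```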
# first one.
if len(vllm_result.choices) == 0:
    raise ValueError(
        "Don't know how to convert response object without any " "responses"
Double quotes need to be escaped here
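For what it's worth, the two adjacent literals above concatenate implicitly; writing the message as a single literal avoids the quoting confusion (a sketch, assuming this is the intended text):

```python
raise ValueError("Don't know how to convert response object without any responses")
```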
def _log(msg: str, level: str):
    if _BYPASS_LOGGING:
        time_str = datetime.datetime.now().strftime("%H:%M:%S")
        print(f"{time_str}: {msg}")
    match level:
        case "info":
            logger.info(msg)
        case "debug":
            logger.debug(msg)
This seems a bit hacky. We should probably fix and improve the logging separately
    None if sampling_params.max_tokens == 0 else sampling_params.max_tokens
),
# Assume that vLLM's default stop token will work
# stop_token_ids=[tokenizer.eos_token_id],
Is this still needed?
What does this PR do?
This PR updates the inline vLLM inference provider in several significant ways:
- Models are now registered via the .../models API instead of hard-coding the model's full name into the provider's YAML configuration.
- The provider routes the chat_completions API to a captive (i.e. called directly in-process, not via HTTPS) instance of vLLM's OpenAI-compatible server.
- The logprobs parameter and completions API are also working.

Test Plan

Existing tests in llama_stack/providers/tests/inference/test_text_inference.py have good coverage of the new functionality and can be invoked with pytest against that file.