llamaindex vllm error #12955
Comments
Hey @union-cmd! 🚀 Good to see you back here. I hope you're doing awesome. I'm diving into your issue now and will be back with an answer in a jiffy! 🦙
I have already installed the 'llama-index-llms-vllm' package with pip, but I am still encountering this error.
@union-cmd that's the wrong import. It should be from llama_index.llms.vllm import VllmServer
I followed the official documentation when writing this, so the documentation might need to be adjusted.
It appears there's a discrepancy between the documentation you followed and the correct import path. The import should be from llama_index.llms.vllm import VllmServer; ensure your code reflects this change.
If the problem persists, consider providing more details about your setup, including the Python version, operating system, and the installed package versions.
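For reference, a minimal sketch of the script from the issue with only the import path corrected (api_url and parameters are taken from the original script; a later comment suggests the URL may need to point at /v1/completions):

# corrected import: VllmServer lives in llama_index.llms.vllm,
# not llama_index.core.llms.vllm
from llama_index.llms.vllm import VllmServer

llm = VllmServer(api_url="http://localhost:8000", max_new_tokens=8000, temperature=0)
res = llm.complete("hello")
print(res)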
Feel free to make a PR :)
Hi, I'm working on a similar use case (RAG based on a knowledge graph using VllmServer) and I got this error:
KeyError Traceback (most recent call last)
11 frames
KeyError: 'text'
Could you help me figure this out?
Hello, I got the same error. It turns out the URL is not correct; it should be "http://localhost:8000/v1/completions". The vLLM server backend raised a 404 error, but the code did not check for it and continued executing, so no data was found, hence the error.
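If it helps, a quick sanity check of that endpoint with a raw request (the model name is a placeholder; a 404 status would confirm the wrong path):

import requests

# the OpenAI-compatible vLLM server serves completions at /v1/completions;
# a 404 here means the path (or port) is wrong
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": "your-model-name", "prompt": "hello", "max_tokens": 16},
)
print(resp.status_code)
print(resp.json())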
I got the same error.
I've fixed this bug, but it's not the optimal solution:
Step 1: Rewrite the VllmServer
# Rewrite VllmServer to call the OpenAI-compatible chat endpoint directly via the openai client
from typing import Any

from openai import OpenAI

from llama_index.core.base.llms.types import CompletionResponse, CompletionResponseGen
from llama_index.core.llms.callbacks import llm_completion_callback
from llama_index.llms.vllm import VllmServer


class VllmCustomServer(VllmServer):
    @llm_completion_callback()
    def complete(
        self, prompt: str, formatted: bool = False, **kwargs: Any
    ) -> CompletionResponse:
        # api_url must be the OpenAI-compatible base, e.g. "http://localhost:8000/v1"
        client = OpenAI(
            base_url=self.api_url,
            api_key="EMPTY",
        )
        # NOTE: the sampling parameters in self._model_kwargs (temperature,
        # max_tokens, ...) are not forwarded in this workaround; only the
        # prompt is sent to the chat completions endpoint.
        completion = client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        output = completion.choices[0].message
        return CompletionResponse(text=output.content)
    @llm_completion_callback()
    def stream_complete(
        self, prompt: str, formatted: bool = False, **kwargs: Any
    ) -> CompletionResponseGen:
        client = OpenAI(
            base_url=self.api_url,
            api_key="EMPTY",
        )
        # As in complete(), self._model_kwargs is not forwarded here.
        completion = client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )

        def gen() -> CompletionResponseGen:
            # Accumulate the streamed deltas so each yielded response carries
            # the full text so far plus the newly received delta.
            response_str = ""
            for chunk in completion:
                delta = chunk.choices[0].delta.content
                if delta:
                    response_str += delta
                    yield CompletionResponse(text=response_str, delta=delta)
        return gen()

Step 2: customize the retrieval process
# Core code as follows; you should adapt it to your own project
retriever = index.as_retriever(similarity_top_k=4)
nodes_with_score = retriever.retrieve(prompt)
context_str = format_source(nodes_with_score)
prompt_str = f"""You are a helpful assistant; answer the question based on the context.
The user's question is:
------------------- question -----------------
{prompt}
----------------- end of question ----------------
Below is background information you can refer to.
------------------- context -----------------
{context_str}
----------------- end of context ----------------
"""
if stream:
    return llm.stream_complete(prompt_str)
else:
    return llm.complete(prompt_str)
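For completeness, a hedged usage sketch for the VllmCustomServer class from Step 1 (model name and URL are placeholders; the OpenAI client expects the /v1 base URL, not /v1/completions):

# assumes the VllmCustomServer class from Step 1 is defined or imported
llm = VllmCustomServer(
    model="your-model-name",             # must match the model name served by vLLM
    api_url="http://localhost:8000/v1",  # OpenAI-compatible base URL
)
print(llm.complete("hello").text)

for chunk in llm.stream_complete("tell me a joke"):
    print(chunk.delta, end="", flush=True)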
Can I check which version this fix is for? I am currently having issues performing inference with my vLLM server, as posted in this thread.
I have the same problem. The thing is (or at least in my case) that the model is missing from the sampling_params:

@property
def _model_kwargs(self) -> Dict[str, Any]:
    base_kwargs = {
        "model": self.model,
        "temperature": self.temperature,
        "max_tokens": self.max_new_tokens,
        "n": self.n,
        "frequency_penalty": self.frequency_penalty,
        "presence_penalty": self.presence_penalty,
        "use_beam_search": self.use_beam_search,
        "best_of": self.best_of,
        "ignore_eos": self.ignore_eos,
        "stop": self.stop,
        "logprobs": self.logprobs,
        "top_k": self.top_k,
        "top_p": self.top_p,
    }
    return {**base_kwargs}
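If you'd rather not patch the installed package, a sketch of the same change applied via a subclass (this assumes the parent's _model_kwargs matches the snippet above; field names may differ between versions):

from typing import Any, Dict

from llama_index.llms.vllm import VllmServer


class VllmServerWithModel(VllmServer):
    @property
    def _model_kwargs(self) -> Dict[str, Any]:
        # start from the parent's sampling parameters and add the missing model name
        kwargs = dict(super()._model_kwargs)
        kwargs["model"] = self.model
        return kwargs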
On the client side, look at get_response (def get_response(response: requests.Response) -> List[str]) in llms/vllm/utils.py and add an extra print for debugging. The message returned from the server should help debug the problem. Please post your results.
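A sketch of what that debug print could look like (the body of get_response here is paraphrased and may differ slightly between package versions):

import json
from typing import List

import requests


def get_response(response: requests.Response) -> List[str]:
    data = json.loads(response.content)
    # debug: print exactly what the server returned before indexing into it;
    # an error payload here explains the later KeyError: 'text'
    print(response.status_code, data)
    return data["text"]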
I created a fork with a possible fix for this issue. Can someone test it in their application? See https://github.com/doscherda/vllm_llama_index/. The specific fix is at doscherda@f640306.
Feature Description
from llama_index.core.llms.vllm import VllmServer
from llama_index.core.llms import ChatMessage
llm = VllmServer(api_url="http://localhost:8000", max_new_tokens=8000, temperature=0)
res = llm.complete("hello")
print(res)
I executed the script above, combining 'llamaindex' and 'vllm', and encountered the following error:
ModuleNotFoundError: No module named 'llama_index.core.llms.vllm'
Reason
No response
Value of Feature
No response