
llamaindex vllm error #12955

Open
union-cmd opened this issue Apr 19, 2024 · 14 comments
Labels: enhancement (New feature or request), triage (Issue needs to be triaged/prioritized)

Comments

@union-cmd

Feature Description

from llama_index.core.llms.vllm import VllmServer
from llama_index.core.llms import ChatMessage

llm = VllmServer(api_url="http://localhost:8000", max_new_tokens=8000, temperature=0)
res = llm.complete("hello")
print(res)

I executed the script above, which combines llama_index and vLLM, and encountered the following error:

from llama_index.core.llms.vllm import VllmServer

ModuleNotFoundError: No module named 'llama_index.core.llms.vllm'

Reason

No response

Value of Feature

No response

union-cmd added the enhancement (New feature or request) and triage (Issue needs to be triaged/prioritized) labels on Apr 19, 2024

dosubot bot commented Apr 19, 2024

Hey @union-cmd! 🚀 Good to see you back here. I hope you're doing awesome. I'm diving into your issue now and will be back with an answer in a jiffy! 🦙

@union-cmd
Author

I have already installed the 'llama-index-llms-vllm' package with pip, but I am still encountering this error.

@logan-markewich
Collaborator

@union-cmd that's the wrong import

Should be

from llama_index.llms.vllm import VllmServer
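For reference, a minimal sketch of the corrected usage, assuming the llama-index-llms-vllm integration package is installed (the api_url value just mirrors the original snippet; later comments in this thread discuss the exact endpoint path):

# pip install llama-index-llms-vllm
from llama_index.llms.vllm import VllmServer  # note: not llama_index.core.llms.vllm

# api_url mirrors the original snippet; point it at your running vLLM server.
llm = VllmServer(api_url="http://localhost:8000", max_new_tokens=8000, temperature=0)
res = llm.complete("hello")
print(res)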

@union-cmd
Author

I followed the official documentation when writing this, so the documentation might need to be updated.
https://docs.llamaindex.ai/en/stable/examples/llm/vllm/

(screenshot of the documentation page attached)


dosubot bot commented Apr 19, 2024

It appears there's a discrepancy between the documentation you followed and the correct import path for VllmServer. The correct import statement should be:

from llama_index.llms.vllm import VllmServer

Ensure your code reflects this change. If you're still facing issues, here are a few steps to troubleshoot:

  1. Verify the llama_index package is installed with pip install llama_index.
  2. Check your Python environment to ensure it includes the llama_index package. You can do this by printing sys.path.
  3. Ensure there are no naming conflicts in your working directory, such as files or directories named llama_index.
  4. If using a Jupyter notebook, try restarting the kernel after installing new packages.

If the problem persists, consider providing more details about your setup, including the Python version, operating system, and the output of pip show llama_index. This will help in diagnosing the issue more accurately.
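As a hedged sketch of checks 2 and 3 above, the following prints which interpreter is running and whether the vLLM integration module resolves on its path (the module name is the one used elsewhere in this thread):

import importlib.util
import sys

# Which interpreter (and therefore which environment) is running this code?
print(sys.executable)
print(sys.path)

# Does the vLLM integration module resolve in this environment? None means it is not
# installed or is shadowed by a local file/directory named llama_index.
print(importlib.util.find_spec("llama_index.llms.vllm"))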


@logan-markewich
Collaborator

Feel free to make a pr :)

@Manel-Hik

Manel-Hik commented Apr 19, 2024

Hi, I'm working on a similar use case: RAG over a knowledge graph using VllmServer.
I defined my VllmServer like this:

llm = VllmServer(
    api_url=my_api_url,
    model=model_name,
    max_new_tokens=1024,
    temperature=0,
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
)

Then I built the index with KnowledgeGraphIndex like this:

# Construct the Knowledge Graph Index (it can take a while).
# max_triplets_per_chunk governs the number of relationship triplets processed per data chunk.
index = KnowledgeGraphIndex.from_documents(
    documents,
    max_triplets_per_chunk=3,
    storage_context=storage_context,
    include_embeddings=True,
)
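(For completeness, a minimal sketch of one way such a storage_context can be created; the SimpleGraphStore here is an assumption, since the actual graph store used is not shown.)

from llama_index.core import StorageContext
from llama_index.core.graph_stores import SimpleGraphStore

# Hypothetical storage context; substitute your actual graph store.
graph_store = SimpleGraphStore()
storage_context = StorageContext.from_defaults(graph_store=graph_store)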

I got this error:

KeyError                                  Traceback (most recent call last)
in <cell line: 2>()
      1 # Construct the Knowledge Graph Index (it can take a while)
----> 2 index = KnowledgeGraphIndex.from_documents(
      3     documents,
      4     max_triplets_per_chunk=3,
      5     storage_context=storage_context,

11 frames
/usr/local/lib/python3.10/dist-packages/llama_index/llms/vllm/utils.py in get_response(response)
      7 def get_response(response: requests.Response) -> List[str]:
      8     data = json.loads(response.content)
----> 9     return data["text"]
     10
     11

KeyError: 'text'

Could you help me figure this out? I'm using llama_index version 0.10.30.
Thanks in advance.

@buaaflyaway


Hello, I got the same error. It turns out the URL is not correct: it should be "http://localhost:8000/v1/completions". The vLLM server backend raised a 404 error, but the client code does not check for that error and keeps executing, so no data is found, hence the KeyError.
However, when I use "http://localhost:8000/v1/completions" as the api_url, I get a 400 Bad Request error instead. I don't know why yet; still working on a fix.
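For illustration, a hedged sketch of what client-side error checking could look like; get_response lives in llama_index/llms/vllm/utils.py, but the defensive handling below is an assumption, not the shipped implementation:

import json
from typing import List

import requests


def get_response(response: requests.Response) -> List[str]:
    # Surface HTTP errors (e.g. the 404/400 from the vLLM server) instead of
    # failing later with a KeyError on the missing "text" field.
    response.raise_for_status()
    data = json.loads(response.content)
    if "text" not in data:
        raise ValueError(f"Unexpected response payload from vLLM server: {data}")
    return data["text"]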

@Vincewz

Vincewz commented Jun 17, 2024

I got the same error.

@buaaflyaway

buaaflyaway commented Jun 19, 2024

I've fixed this bug, but it's not the optimal solution:

  1. I rewrote VllmServer to use the "http://localhost:8000/v1/completions" endpoint instead of "generate".
  2. I tried to retrieve the generation directly through the rewritten class, but the order of the tokens came out scrambled, possibly because the rewrite is incomplete.
  3. Therefore, I customized the retrieval process instead, and that made it work.

Step 1: Rewrite the VllmServer class

# Rewrite of VllmServer to call the OpenAI-compatible endpoint exposed by vLLM.
from typing import Any

from openai import OpenAI

from llama_index.core.base.llms.types import CompletionResponse, CompletionResponseGen
from llama_index.core.llms.callbacks import llm_completion_callback
from llama_index.llms.vllm import VllmServer


class VllmCustomServer(VllmServer):
    @llm_completion_callback()
    def complete(
        self, prompt: str, formatted: bool = False, **kwargs: Any
    ) -> CompletionResponse:
        # api_url is expected to be the OpenAI-compatible base URL,
        # e.g. "http://localhost:8000/v1".
        client = OpenAI(
            base_url=self.api_url,
            api_key="EMPTY",
        )
        # Sampling parameters are collected here but not yet forwarded to the chat call
        # in this rewrite.
        params = {**self._model_kwargs, **kwargs}

        completion = client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        output = completion.choices[0].message
        return CompletionResponse(text=output.content)

    @llm_completion_callback()
    def stream_complete(
        self, prompt: str, formatted: bool = False, **kwargs: Any
    ) -> CompletionResponseGen:
        client = OpenAI(
            base_url=self.api_url,
            api_key="EMPTY",
        )
        # Collected but not yet forwarded, as in complete() above.
        params = {**self._model_kwargs, **kwargs}

        completion = client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )

        def gen() -> CompletionResponseGen:
            # Accumulate streamed deltas so each yielded response carries the full
            # text so far plus the newly received delta.
            response_str = ""
            for chunk in completion:
                if not chunk.choices:
                    continue
                delta = chunk.choices[0].delta.content
                if delta:
                    response_str += delta
                    yield CompletionResponse(text=response_str, delta=delta)

        return gen()

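A hypothetical instantiation of the class above might look like this; the model name and URL are placeholders, and api_url must point at the OpenAI-compatible base path served by vLLM:

# Hypothetical usage; model name and URL are placeholders.
llm = VllmCustomServer(
    api_url="http://localhost:8000/v1",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_new_tokens=1024,
    temperature=0,
)
print(llm.complete("hello").text)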
Step 2: Customize the retrieval process

# Core code as follows; adapt it to your own project.
# (This snippet sits inside the caller's query function, hence the return statements;
#  format_source is a project-specific helper that renders the retrieved nodes.)
retriever = index.as_retriever(similarity_top_k=4)
nodes_with_score = retriever.retrieve(prompt)
context_str = format_source(nodes_with_score)
prompt_str = f"""You are a helpful assistant; answer the question using the context.
The user's question is:
------------------- question -----------------
{prompt}
----------------- end of question ----------------

Below is background information you can refer to.
------------------- context -----------------
{context_str}
----------------- end of context ----------------
"""
if stream:
    return llm.stream_complete(prompt_str)
else:
    return llm.complete(prompt_str)

@xKwan

xKwan commented Jun 27, 2024

May I check which llama_index version the fix above is for? I am currently having issues performing inference with my vLLM server, posted in this thread:
#14420

@kobiche
Contributor

kobiche commented Jul 15, 2024

I have the same problem. The thing is (at least in my case) that the model is missing from the sampling_params sent to the server; _model_kwargs needs to include it, for example:

    @property
    def _model_kwargs(self) -> Dict[str, Any]:
        base_kwargs = {
            "model": self.model,
            "temperature": self.temperature,
            "max_tokens": self.max_new_tokens,
            "n": self.n,
            "frequency_penalty": self.frequency_penalty,
            "presence_penalty": self.presence_penalty,
            "use_beam_search": self.use_beam_search,
            "best_of": self.best_of,
            "ignore_eos": self.ignore_eos,
            "stop": self.stop,
            "logprobs": self.logprobs,
            "top_k": self.top_k,
            "top_p": self.top_p,
        }
        return {**base_kwargs}
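If that is the cause, one possible workaround (an assumption, not a confirmed fix) is to pass the model name explicitly as a per-call keyword argument, since VllmServer merges call-time kwargs into the request parameters:

# Hypothetical workaround: forward the model name per call so the request payload
# carries the "model" field that the /v1/completions endpoint expects.
response = llm.complete("hello", model="meta-llama/Meta-Llama-3-8B-Instruct")
print(response.text)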

@doscherda
Contributor

On your client side, look in llms/vllm/utils.py:

def get_response(response: requests.Response) -> List[str]:
    data = json.loads(response.content)
    return data["text"]

Add an extra print for debug:

def get_response(response: requests.Response) -> List[str]:
    data = json.loads(response.content)
    print("RESPONSE DATA IS: ", data)
    return data["text"]

The message returned from the server should help debug the problem.
I suspect you will see a message that the "model" keyword is required but is missing.

Please post your results.

@doscherda
Contributor

I created a fork with a possible fix for this issue. Can someone test it using their application? See https://github.com/doscherda/vllm_llama_index/ The specific fix is at: doscherda@f640306
