
llamaindex vllm error #12955

Open
union-cmd opened this issue Apr 19, 2024 · 14 comments
Labels: enhancement (New feature or request), triage (Issue needs to be triaged/prioritized)

Comments

@union-cmd

Feature Description

from llama_index.core.llms.vllm import VllmServer
from llama_index.core.llms import ChatMessage

llm = VllmServer(api_url="http://localhost:8000", max_new_tokens=8000, temperature=0)
res = llm.complete("hello")
print(res)

I executed the script above, which combines llama_index and vLLM, and encountered the following error:

from llama_index.core.llms.vllm import VllmServer

ModuleNotFoundError: No module named 'llama_index.core.llms.vllm'

Reason

No response

Value of Feature

No response

union-cmd added the enhancement (New feature or request) and triage (Issue needs to be triaged/prioritized) labels on Apr 19, 2024

dosubot bot commented Apr 19, 2024

Hey @union-cmd! 🚀 Good to see you back here. I hope you're doing awesome. I'm diving into your issue now and will be back with an answer in a jiffy! 🦙

@union-cmd
Author

I have already installed the 'llama-index-llms-vllm' package with pip, but I am still encountering this error.

@logan-markewich
Collaborator

@union-cmd that's the wrong import

Should be

from llama_index.llms.vllm import VllmServer
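For reference, a minimal sketch of the corrected usage, assuming the llama-index-llms-vllm integration package is installed (the api_url value just mirrors the original snippet; later comments in this thread discuss the exact endpoint path):

# pip install llama-index-llms-vllm
from llama_index.llms.vllm import VllmServer  # note: not llama_index.core.llms.vllm

# api_url mirrors the original snippet; point it at your running vLLM server.
llm = VllmServer(api_url="http://localhost:8000", max_new_tokens=8000, temperature=0)
res = llm.complete("hello")
print(res)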

@union-cmd
Author

I followed the official documentation when writing this, so the documentation might need to be updated.
https://docs.llamaindex.ai/en/stable/examples/llm/vllm/

(screenshot of the documentation page attached)


dosubot bot commented Apr 19, 2024

It appears there's a discrepancy between the documentation you followed and the correct import path for VllmServer. The correct import statement should be:

from llama_index.llms.vllm import VllmServer

Ensure your code reflects this change. If you're still facing issues, here are a few steps to troubleshoot:

  1. Verify the llama_index package is installed with pip install llama_index.
  2. Check your Python environment to ensure it includes the llama_index package. You can do this by printing sys.path.
  3. Ensure there are no naming conflicts in your working directory, such as files or directories named llama_index.
  4. If using a Jupyter notebook, try restarting the kernel after installing new packages.

If the problem persists, consider providing more details about your setup, including the Python version, operating system, and the output of pip show llama_index. This will help in diagnosing the issue more accurately.
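As a hedged sketch of checks 2 and 3 above, the following prints which interpreter is running and whether the vLLM integration module resolves on its path (the module name is the one used elsewhere in this thread):

import importlib.util
import sys

# Which interpreter (and therefore which environment) is running this code?
print(sys.executable)
print(sys.path)

# Does the vLLM integration module resolve in this environment? None means it is not
# installed or is shadowed by a local file/directory named llama_index.
print(importlib.util.find_spec("llama_index.llms.vllm"))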


@logan-markewich
Collaborator

Feel free to make a pr :)

@Manel-Hik

Manel-Hik commented Apr 19, 2024

Hi, I'm working on a similar use case: RAG over a knowledge graph using VllmServer.
I defined my VllmServer like this:

llm = VllmServer(
    api_url=my_api_url,
    model=model_name,
    max_new_tokens=1024,
    temperature=0,
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
)

Then I built the index with KnowledgeGraphIndex like this:

# Construct the Knowledge Graph Index (it can take a while).
# max_triplets_per_chunk governs the number of relationship triplets processed per data chunk.
index = KnowledgeGraphIndex.from_documents(
    documents,
    max_triplets_per_chunk=3,
    storage_context=storage_context,
    include_embeddings=True,
)
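(For completeness, a minimal sketch of one way such a storage_context can be created; the SimpleGraphStore here is an assumption, since the actual graph store used is not shown.)

from llama_index.core import StorageContext
from llama_index.core.graph_stores import SimpleGraphStore

# Hypothetical storage context; substitute your actual graph store.
graph_store = SimpleGraphStore()
storage_context = StorageContext.from_defaults(graph_store=graph_store)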

I got this error:

KeyError                                  Traceback (most recent call last)
in <cell line: 2>()
      1 # Construct the Knowledge Graph Index (it can take a while)
----> 2 index = KnowledgeGraphIndex.from_documents(
      3     documents,
      4     max_triplets_per_chunk=3,
      5     storage_context=storage_context,

11 frames
/usr/local/lib/python3.10/dist-packages/llama_index/llms/vllm/utils.py in get_response(response)
      7 def get_response(response: requests.Response) -> List[str]:
      8     data = json.loads(response.content)
----> 9     return data["text"]
     10
     11

KeyError: 'text'

Could you help me figure this out? I'm using llama_index version 0.10.30.
Thanks in advance.

@buaaflyaway


Hello, I got the same error. It turns out the URL is not correct: it should be "http://localhost:8000/v1/completions". The vLLM server backend raised a 404 error, but the client code does not check for that error and keeps executing, so no data is found, hence the KeyError.
However, when I use "http://localhost:8000/v1/completions" as the api_url, I get a 400 Bad Request error instead. I don't know why yet; still working on a fix.
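For illustration, a hedged sketch of what client-side error checking could look like; get_response lives in llama_index/llms/vllm/utils.py, but the defensive handling below is an assumption, not the shipped implementation:

import json
from typing import List

import requests


def get_response(response: requests.Response) -> List[str]:
    # Surface HTTP errors (e.g. the 404/400 from the vLLM server) instead of
    # failing later with a KeyError on the missing "text" field.
    response.raise_for_status()
    data = json.loads(response.content)
    if "text" not in data:
        raise ValueError(f"Unexpected response payload from vLLM server: {data}")
    return data["text"]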

@Vincewz

Vincewz commented Jun 17, 2024

I got the same error.

@buaaflyaway

buaaflyaway commented Jun 19, 2024

I've fixed this bug, but it's not the optimal solution:

  1. I rewrote VllmServer to use the "http://localhost:8000/v1/completions" endpoint instead of "generate".
  2. I tried to retrieve the generation directly through the rewritten class, but the order of the tokens came out scrambled, possibly because the rewrite is incomplete.
  3. Therefore, I customized the retrieval process instead, and that made it work.

Step 1: Rewrite the VllmServer class

# Rewrite of VllmServer to call the OpenAI-compatible endpoint exposed by vLLM.
from typing import Any

from openai import OpenAI

from llama_index.core.base.llms.types import CompletionResponse, CompletionResponseGen
from llama_index.core.llms.callbacks import llm_completion_callback
from llama_index.llms.vllm import VllmServer


class VllmCustomServer(VllmServer):
    @llm_completion_callback()
    def complete(
        self, prompt: str, formatted: bool = False, **kwargs: Any
    ) -> CompletionResponse:
        # api_url is expected to be the OpenAI-compatible base URL,
        # e.g. "http://localhost:8000/v1".
        client = OpenAI(
            base_url=self.api_url,
            api_key="EMPTY",
        )
        # Sampling parameters are collected here but not yet forwarded to the chat call
        # in this rewrite.
        params = {**self._model_kwargs, **kwargs}

        completion = client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        output = completion.choices[0].message
        return CompletionResponse(text=output.content)

    @llm_completion_callback()
    def stream_complete(
        self, prompt: str, formatted: bool = False, **kwargs: Any
    ) -> CompletionResponseGen:
        client = OpenAI(
            base_url=self.api_url,
            api_key="EMPTY",
        )
        # Collected but not yet forwarded, as in complete() above.
        params = {**self._model_kwargs, **kwargs}

        completion = client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )

        def gen() -> CompletionResponseGen:
            # Accumulate streamed deltas so each yielded response carries the full
            # text so far plus the newly received delta.
            response_str = ""
            for chunk in completion:
                if not chunk.choices:
                    continue
                delta = chunk.choices[0].delta.content
                if delta:
                    response_str += delta
                    yield CompletionResponse(text=response_str, delta=delta)

        return gen()

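A hypothetical instantiation of the class above might look like this; the model name and URL are placeholders, and api_url must point at the OpenAI-compatible base path served by vLLM:

# Hypothetical usage; model name and URL are placeholders.
llm = VllmCustomServer(
    api_url="http://localhost:8000/v1",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_new_tokens=1024,
    temperature=0,
)
print(llm.complete("hello").text)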
Step 2: Customize the retrieval process

# Core code as follows; adapt it to your own project.
# (This snippet sits inside the caller's query function, hence the return statements;
#  format_source is a project-specific helper that renders the retrieved nodes.)
retriever = index.as_retriever(similarity_top_k=4)
nodes_with_score = retriever.retrieve(prompt)
context_str = format_source(nodes_with_score)
prompt_str = f"""You are a helpful assistant; answer the question using the context.
The user's question is:
------------------- question -----------------
{prompt}
----------------- end of question ----------------

Below is background information you can refer to.
------------------- context -----------------
{context_str}
----------------- end of context ----------------
"""
if stream:
    return llm.stream_complete(prompt_str)
else:
    return llm.complete(prompt_str)

@xKwan

xKwan commented Jun 27, 2024

May I check which llama_index version the fix above is for? I am currently having issues performing inference with my vLLM server, posted in this thread:
#14420

@kobiche
Contributor

kobiche commented Jul 15, 2024

I have the same problem. The thing is (at least in my case) that the model is missing from the sampling_params sent to the server; _model_kwargs needs to include it, for example:

    @property
    def _model_kwargs(self) -> Dict[str, Any]:
        base_kwargs = {
            "model": self.model,
            "temperature": self.temperature,
            "max_tokens": self.max_new_tokens,
            "n": self.n,
            "frequency_penalty": self.frequency_penalty,
            "presence_penalty": self.presence_penalty,
            "use_beam_search": self.use_beam_search,
            "best_of": self.best_of,
            "ignore_eos": self.ignore_eos,
            "stop": self.stop,
            "logprobs": self.logprobs,
            "top_k": self.top_k,
            "top_p": self.top_p,
        }
        return {**base_kwargs}
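If that is the cause, one possible workaround (an assumption, not a confirmed fix) is to pass the model name explicitly as a per-call keyword argument, since VllmServer merges call-time kwargs into the request parameters:

# Hypothetical workaround: forward the model name per call so the request payload
# carries the "model" field that the /v1/completions endpoint expects.
response = llm.complete("hello", model="meta-llama/Meta-Llama-3-8B-Instruct")
print(response.text)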

@doscherda
Contributor

On your client side, look in llms/vllm/utils.py:

def get_response(response: requests.Response) -> List[str]:
    data = json.loads(response.content)
    return data["text"]

Add an extra print for debug:

def get_response(response: requests.Response) -> List[str]:
    data = json.loads(response.content)
    print("RESPONSE DATA IS: ", data)
    return data["text"]

The message returned from the server should help debug the problem.
I suspect you will see a message that the "model" keyword is required but is missing.

Please post your results.

@doscherda
Contributor

I created a fork with a possible fix for this issue. Can someone test it using their application? See https://github.com/doscherda/vllm_llama_index/ The specific fix is at: doscherda@f640306
