[Bug]: Unexpected Responses from Gemma-2-9b-it #7152

Closed · eric8607242 opened this issue Aug 5, 2024 · 4 comments
Labels: bug (Something isn't working)

eric8607242 commented Aug 5, 2024

Your current environment

transformers==4.43.2
flashinfer == 0.1.2+cu121torch2.4
pytorch == 2.4.0

vllm == -e git+https://github.com/vllm-project/vllm.git@cc08fc7225616aeb6709a2e75e5ac47ace124985#egg=vllm
vllm == 2.6.1

Hardware: H100 x 1

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

🐛 Describe the bug

I’m trying to use the current main branch of vllm for inference with gemma-2-9b-it, but the output I’m getting is not as expected: there is a significant discrepancy compared to the results obtained with Hugging Face, which are far more reasonable.

Below is the bash script I used to launch the vllm OpenAI inference server.

VLLM_ATTENTION_BACKEND=FLASHINFER python3 -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 8124 --dtype "auto" --model gemma-2-9b-it \
    -tp 1 --gpu-memory-utilization 0.85 --max-model-len 4096 --disable-sliding-window

Here is the Python code I used with the OpenAI package:

import openai

client = openai.OpenAI(
    base_url="http://0.0.0.0:8124/v1",
    api_key="sk-no-key-required",
)

completion = client.chat.completions.create(
    model="gemma-2-9b-it",
    messages=[{"role": "user", "content": "What is K8s?"}],
)

After running the program, I received the following extremely weird response:

         (I)      (I)\n         I\n         I\n         I   I   I\n         I   I   I\n         I   I   I\n       I\n       I\n       I\n\n    I     I\n    I     I \n    I     I\n    I     I\n    I     I\n    I     I\n   I\n\n\n\n

However, when I use the same prompt for inference with pure Hugging Face (with the exact same hyperparameters), I get a more reasonable output, as shown below:

K8s (pronounced "kay-eight-ess") is a shortened version of **Kubernetes**, which is an open-source container orchestration system. 

Think of Kubernetes as a conductor for an orchestra of containers. 

**Here's a breakdown:**

* **Containers:** Imagine containers as individual instruments in an orchestra. Each container holds a specific application or service, along with all its dependencies.
* **Orchestration:** Kubernetes acts as the conductor, managing and coordinating these containers. It ensures that:
    * **Containers are running:** Kubernetes automatically starts, stops, and restarts containers as needed.
    * **Containers are scaled:** Kubernetes can automatically increase or decrease the number of running containers based on demand.
    * **Containers are healthy:** Kubernetes monitors the health of containers and takes action if they fail.
    * **Containers communicate:** Kubernetes helps containers communicate with each other and with external services.

**Why is Kubernetes important?**

* **Efficiency:** Kubernetes allows you to run applications more efficiently by utilizing resources effectively and automating tasks.
* **Scalability:** Kubernetes makes it easy to scale applications up or down as needed, ensuring they can handle changing workloads.
* **Reliability:** Kubernetes ensures that applications are always available by automatically restarting failed containers and distributing workloads across multiple machines.
* **Portability:** Kubernetes applications can run on any platform that supports Kubernetes, making them highly portable.

**In short, Kubernetes simplifies the deployment, management, and scaling of containerized applications, making it a powerful tool for modern software development.**

Here is the pure Hugging Face inference code:

from transformers import AutoTokenizer, AutoModelForCausalLM
path_to_model = "gemma-2-9b-it"

tokenizer = AutoTokenizer.from_pretrained(path_to_model)
model = AutoModelForCausalLM.from_pretrained(path_to_model, device_map="auto")

input_text = "What is K8s?"
chat = [
    { "role": "user", "content": input_text},
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_length=512)
print(tokenizer.decode(outputs[0]))

By the way, the NVIDIA NIM playground (https://build.nvidia.com/google/gemma-2-9b-it) also returns correct responses, similar to the pure Hugging Face inference.

I have verified that the model weights are correct and that the chat template has been successfully applied. Additionally, the tokens for both inferences are identical, yet I still received different results.
Has anyone else encountered the same issue? How can it be resolved?
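
For context, a minimal sketch (not from the original report) of the kind of token check described above, assuming the Hugging Face tokenizer is used to render the chat template and inspect the exact token ids; comparing them against what the server actually receives (e.g. from its request logs) is not shown:

from transformers import AutoTokenizer

# Hypothetical check: render the chat template exactly as in the HF script
# above and print the token ids that should reach the model.
tokenizer = AutoTokenizer.from_pretrained("gemma-2-9b-it")
chat = [{"role": "user", "content": "What is K8s?"}]

prompt = tokenizer.apply_chat_template(
    chat, tokenize=False, add_generation_prompt=True
)
# The Gemma chat template typically inserts <bos> itself, so avoid adding
# special tokens a second time here.
token_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]

print(repr(prompt))
print(token_ids)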

@eric8607242 eric8607242 added the bug Something isn't working label Aug 5, 2024
wwydmanski commented Aug 5, 2024

Does it work well for other models? I'm having very similar problems after building the current version (c0d8f1636c58f5464e512eaabfed5aa29f2c5b7d) from source, and I hit them with every model I try.

For example, Mistral NeMo gives me the following output after prompting it with "Hello, what's your name?":

I’m not a morning person,” I said, as I sat down at the table with my coffee.……..(1)__________

eric8607242 commented Aug 5, 2024

@wwydmanski Hi,
I received the correct response from gemma-2-9b-it when I removed VLLM_ATTENTION_BACKEND=FLASHINFER and launched the server using the following script:

python3 -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 8124 \
    --dtype "auto" \
    --model gemma-2-9b-it \
    -tp 1 --gpu-memory-utilization 0.85 --max-model-len 4096 --disable-sliding-window

However, I have no idea why FLASHINFER causes such unexpected output.
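
A minimal sketch (not part of the original comment) for reproducing the same request offline with vLLM's default attention backend, i.e. without VLLM_ATTENTION_BACKEND=FLASHINFER set, using the vllm Python API (LLM, SamplingParams):

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "gemma-2-9b-it"

# Render the same chat prompt that was sent to the OpenAI-compatible server.
tokenizer = AutoTokenizer.from_pretrained(model_path)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is K8s?"}],
    tokenize=False,
    add_generation_prompt=True,
)

# Default attention backend: FLASHINFER is not requested via the environment.
llm = LLM(model=model_path, max_model_len=4096, gpu_memory_utilization=0.85)
outputs = llm.generate([prompt], SamplingParams(max_tokens=512))
print(outputs[0].outputs[0].text)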

wwydmanski commented Aug 5, 2024
After a quick check, it looks like FLASHINFER has been causing output errors since commit 954f7305a106058815bd7e47f5b9d585d8764c05. I think something went wrong in PR #7008.

eric8607242 commented Aug 5, 2024

Hi @wwydmanski,
Thanks for your response. Since the original issue (the unreasonable responses) has been resolved, I'm closing this issue.
