[Bug]: Unexpected Responses from Gemma-2-9b-it #7152

Closed · eric8607242 opened this issue Aug 5, 2024 · 4 comments
Labels: bug (Something isn't working)

eric8607242 commented Aug 5, 2024

Your current environment

transformers==4.43.2
flashinfer == 0.1.2+cu121torch2.4
pytorch == 2.4.0

vllm == -e git+https://github.com/vllm-project/vllm.git@cc08fc7225616aeb6709a2e75e5ac47ace124985#egg=vllm
vllm == 2.6.1

Hardware: H100 x 1

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

🐛 Describe the bug

I’m trying to use the current main branch of vllm for inference with gemma-2-9b-it, but the output I’m getting is not as expected: there is a significant discrepancy compared to the results obtained with Hugging Face, which are far more reasonable.

Below is the bash script I used to launch the vllm OpenAI inference server.

VLLM_ATTENTION_BACKEND=FLASHINFER python3 -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 8124 --dtype "auto" --model gemma-2-9b-it \
    -tp 1 --gpu-memory-utilization 0.85 --max-model-len 4096 --disable-sliding-window

Here is the Python code I used with the OpenAI package:

import openai

client = openai.OpenAI(
    base_url="http://0.0.0.0:8124/v1",
    api_key="sk-no-key-required",
)

completion = client.chat.completions.create(
    model="gemma-2-9b-it",
    messages=[{"role": "user", "content": "What is K8s?"}],
)

After running the program, I received the following extremely weird response:

         (I)      (I)\n         I\n         I\n         I   I   I\n         I   I   I\n         I   I   I\n       I\n       I\n       I\n\n    I     I\n    I     I \n    I     I\n    I     I\n    I     I\n    I     I\n   I\n\n\n\n

However, when I use the same prompt for inference with pure Hugging Face (with the exact same hyperparameters), I get a more reasonable output, as shown below:

K8s (pronounced "kay-eight-ess") is a shortened version of **Kubernetes**, which is an open-source container orchestration system. 

Think of Kubernetes as a conductor for an orchestra of containers. 

**Here's a breakdown:**

* **Containers:** Imagine containers as individual instruments in an orchestra. Each container holds a specific application or service, along with all its dependencies.
* **Orchestration:** Kubernetes acts as the conductor, managing and coordinating these containers. It ensures that:
    * **Containers are running:** Kubernetes automatically starts, stops, and restarts containers as needed.
    * **Containers are scaled:** Kubernetes can automatically increase or decrease the number of running containers based on demand.
    * **Containers are healthy:** Kubernetes monitors the health of containers and takes action if they fail.
    * **Containers communicate:** Kubernetes helps containers communicate with each other and with external services.

**Why is Kubernetes important?**

* **Efficiency:** Kubernetes allows you to run applications more efficiently by utilizing resources effectively and automating tasks.
* **Scalability:** Kubernetes makes it easy to scale applications up or down as needed, ensuring they can handle changing workloads.
* **Reliability:** Kubernetes ensures that applications are always available by automatically restarting failed containers and distributing workloads across multiple machines.
* **Portability:** Kubernetes applications can run on any platform that supports Kubernetes, making them highly portable.

**In short, Kubernetes simplifies the deployment, management, and scaling of containerized applications, making it a powerful tool for modern software development.**

Here is the pure Hugging Face inference code:

from transformers import AutoTokenizer, AutoModelForCausalLM
path_to_model = "gemma-2-9b-it"

tokenizer = AutoTokenizer.from_pretrained(path_to_model)
model = AutoModelForCausalLM.from_pretrained(path_to_model, device_map="auto")

input_text = "What is K8s?"
chat = [
    { "role": "user", "content": input_text},
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_length=512)
print(tokenizer.decode(outputs[0]))

By the way, the NVIDIA NIM playground (https://build.nvidia.com/google/gemma-2-9b-it) also returns correct responses, similar to the pure Hugging Face inference.

I have verified that the model weights are correct and that the chat template has been successfully applied. Additionally, the tokens for both inferences are identical, yet I still received different results.
Has anyone else encountered the same issue? How can it be resolved?
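
For context, a minimal sketch (not from the original report) of the kind of token check described above, assuming the Hugging Face tokenizer is used to render the chat template and inspect the exact token ids; comparing them against what the server actually receives (e.g. from its request logs) is not shown:

from transformers import AutoTokenizer

# Hypothetical check: render the chat template exactly as in the HF script
# above and print the token ids that should reach the model.
tokenizer = AutoTokenizer.from_pretrained("gemma-2-9b-it")
chat = [{"role": "user", "content": "What is K8s?"}]

prompt = tokenizer.apply_chat_template(
    chat, tokenize=False, add_generation_prompt=True
)
# The Gemma chat template typically inserts <bos> itself, so avoid adding
# special tokens a second time here.
token_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]

print(repr(prompt))
print(token_ids)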

@eric8607242 eric8607242 added the bug Something isn't working label Aug 5, 2024
wwydmanski commented Aug 5, 2024

Does it work well for other models? I'm having very similar problems after building the current version (c0d8f1636c58f5464e512eaabfed5aa29f2c5b7d) from source, and I hit them with every model I try.

For example, Mistral NeMo gives me the following output after prompting it with "Hello, what's your name?":

I’m not a morning person,” I said, as I sat down at the table with my coffee.……..(1)__________

eric8607242 commented Aug 5, 2024

@wwydmanski Hi,
I received the correct response from gemma-2-9b-it when I removed VLLM_ATTENTION_BACKEND=FLASHINFER and launched the server using the following script:

python3 -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 8124 \
    --dtype "auto" \
    --model gemma-2-9b-it \
    -tp 1 --gpu-memory-utilization 0.85 --max-model-len 4096 --disable-sliding-window

However, I have no idea why FLASHINFER causes such unexpected output.
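
A minimal sketch (not part of the original comment) for reproducing the same request offline with vLLM's default attention backend, i.e. without VLLM_ATTENTION_BACKEND=FLASHINFER set, using the vllm Python API (LLM, SamplingParams):

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "gemma-2-9b-it"

# Render the same chat prompt that was sent to the OpenAI-compatible server.
tokenizer = AutoTokenizer.from_pretrained(model_path)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is K8s?"}],
    tokenize=False,
    add_generation_prompt=True,
)

# Default attention backend: FLASHINFER is not requested via the environment.
llm = LLM(model=model_path, max_model_len=4096, gpu_memory_utilization=0.85)
outputs = llm.generate([prompt], SamplingParams(max_tokens=512))
print(outputs[0].outputs[0].text)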

wwydmanski commented Aug 5, 2024
After a quick check, it looks like FLASHINFER has been causing output errors since commit 954f7305a106058815bd7e47f5b9d585d8764c05. I think something went wrong in PR #7008.

eric8607242 commented Aug 5, 2024

Hi @wwydmanski,
Thanks for your response. Since the original issue (the unreasonable responses) has been resolved, I'm closing this issue.
