[Bug]: Unexpected Responses from Gemma-2-9b-it #7152
Comments
Does it work well for other models? I'm also having very similar problems when building the current version (e.g., Mistral NeMo gives me the following output after prompting it with
Your current environment
Hardware: 1x H100
🐛 Describe the bug
I’m trying to use the current main branch of vLLM for inference with gemma-2-9b-it, but the output I’m getting is not as expected: there is a significant discrepancy compared to the results obtained using Hugging Face, where the inference results are reasonable.
Below is the bash script I used to launch the vLLM OpenAI inference server.
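Roughly, the launch looks like the following (the model path, port, and dtype shown here are placeholders rather than the exact flags used):

```bash
# Launch the vLLM OpenAI-compatible server (placeholder flags, not the exact script)
python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-2-9b-it \
    --port 8000 \
    --dtype bfloat16
```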
Here is the Python code I used with the OpenAI package:
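In outline, it is the standard OpenAI chat-completions client pointed at the local server (the prompt and sampling parameters below are placeholders):

```python
# Query the local vLLM OpenAI-compatible endpoint (placeholder prompt and parameters)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="google/gemma-2-9b-it",
    messages=[{"role": "user", "content": "Write a short poem about the sea."}],  # placeholder prompt
    temperature=0.0,
    max_tokens=256,
)
print(response.choices[0].message.content)
```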
After running the program, I received the following extremely weird response:
However, when I use the same prompt for inference with pure Hugging Face (with the exact same hyperparameters), I get a more reasonable output, as shown below:
Here is the pure Hugging Face inference code:
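It follows the usual Transformers generate() flow (again, the prompt and generation settings are placeholders):

```python
# Plain Hugging Face inference using the same chat template (placeholder prompt/settings)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-9b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Write a short poem about the sea."}]  # placeholder prompt
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```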
By the way, I received similarly correct responses from the NVIDIA NIM playground: https://build.nvidia.com/google/gemma-2-9b-it
I have verified that the model weights are correct and that the chat template has been successfully applied. Additionally, the tokens for both inferences are identical, yet I still received different results.
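One way to do that kind of check (the prompt here is again a placeholder) is to render the chat template on the Hugging Face side and inspect the resulting token ids:

```python
# Sanity check: render the chat template and inspect its token ids (placeholder prompt)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
messages = [{"role": "user", "content": "Write a short poem about the sea."}]  # placeholder prompt

prompt_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

print(prompt_text)  # the exact prompt string, including Gemma's <start_of_turn> markers
print(prompt_ids)   # token ids to compare against the prompt the server is fed
```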
Has anyone else encountered the same issue? How can it be resolved?