
[Badcase]: Gibberish output of Qwen2.5-3B-Instruct with Q2_K quantization, Llama.cpp #1237

Open · 4 tasks done
simmonssong opened this issue Mar 17, 2025 · 3 comments

@simmonssong

Model Series

Qwen2.5

What are the models used?

Qwen2.5-3B-Instruct

What is the scenario where the problem happened?

Qwen2.5-3B-Instruct with Q2_K quantization, Llama.cpp

Is this badcase known and can it be solved using available techniques?

  • I have followed the GitHub README.
  • I have checked the Qwen documentation and cannot find a solution there.
  • I have checked the documentation of the related framework and cannot find useful information.
  • I have searched the issues and there is not a similar one.

Information about environment

  • OS: Windows 10 & 11
  • Python: 3.13
  • Device: CPU

Description

Steps to reproduce

I tried converting Qwen2.5-3B-Instruct to Q2_K quantization on two different machines. The output of the compressed model is always nonsense:
,“||9"363的76...5 31367244一246“),).请-264“3))-64))5761595431636843467435565846"):4843)"),\n5353"34“ ), 3\"6)) the"24\n\n964

It seems that this only happens with Q2_K.

  • Platforms:
    Windows 10 with llama.cpp build b4846.
    Windows 11 with llama.cpp build b4520.

  • Original model:
    https://huggingface.co/Qwen/Qwen2.5-3B-Instruct

  • Conversion script:
    python convert_hf_to_gguf.py ***\Qwen2.5-3B-Instruct --outfile ***\Qwen2.5-3B-Instruct-FP16.gguf

  • Quantization script:
    llama-quantize.exe ***\Qwen2.5-3B-Instruct-FP16.gguf ***\Qwen2.5-3B-Instruct-Q2_K.gguf Q2_K

  • Model testing script:
    llama-cli.exe -m ***\Qwen2.5-3B-Instruct-Q2_K.gguf

Expected results

The model is expected to produce ordinary, coherent text.

Attempts to fix

I have tried several ways to fix this, but none helped, including:

  1. Using different machines and different versions of llama.cpp.
  2. Using different Qwen models, e.g., 7B-Instruct, 1.5B-Instruct, 14B-Instruct.
  3. Using different llama.cpp quantization types, e.g., Q8_0, Q5_0, Q4_0, Q3_K, Q4_K, Q5_K, Q6_K.

Anything else helpful for investigation

None.

@jklj077 (Collaborator) commented Mar 17, 2025

You should provide an importance matrix for Q2_K. Refer to the documentation for how to do that: https://qwen.readthedocs.io/en/latest/run_locally/llama.cpp.html (some parts are a little outdated).
Alternatively, use other quant types such as IQ2_XS or IQ2_XXS; see https://www.github.com/ggml-org/llama.cpp/pull/4897 for reference (an importance matrix is also needed there).

Otherwise, use the ready-made Q2_K quants from https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-GGUF or others.
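
For reference, a minimal sketch of that workflow with llama.cpp's llama-imatrix tool, assuming a recent build where the binary has that name and with calibration.txt standing in for any representative text file (check llama-imatrix --help on your build, since options have changed across versions):

    llama-imatrix.exe -m ***\Qwen2.5-3B-Instruct-FP16.gguf -f calibration.txt -o imatrix.dat
    llama-quantize.exe --imatrix imatrix.dat ***\Qwen2.5-3B-Instruct-FP16.gguf ***\Qwen2.5-3B-Instruct-Q2_K.gguf Q2_K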

@simmonssong (Author)

The importance matrix is optional for Q2_K quantization. I don't think the gibberish output is caused by the missing importance matrix: an importance matrix lowers perplexity somewhat, but the perplexity of gibberish output seems to be effectively infinite.
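
That should be checkable, e.g. with llama.cpp's llama-perplexity tool, assuming a test corpus such as wikitext-2's wiki.test.raw is available (a usable Q2_K quant should score a finite, if elevated, perplexity, while a broken one diverges):

    llama-perplexity.exe -m ***\Qwen2.5-3B-Instruct-Q2_K.gguf -f wiki.test.raw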

@simmonssong (Author)

BTW, computing the importance matrix is too slow on a laptop. Do you have official importance matrices?
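
(Even capping the amount of calibration data processed, e.g. with the --chunks option if the build supports it, it still takes a long time on CPU:

    llama-imatrix.exe -m ***\Qwen2.5-3B-Instruct-FP16.gguf -f calibration.txt -o imatrix.dat --chunks 100)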
