Bug: QWEN2 quantization GGML_ASSERT #7805
Interestingly, when I try Q2_K (and it seems any other K quant; I tried Q4_K_S to check) I see a different error:
(repeats several times)
Files for recreating it locally are up here: https://huggingface.co/bartowski/Qwen2-7B-Instruct-GGUF/tree/main
If 'Quill' is indeed the leaked version of Qwen2, then something might have changed in llama.cpp that broke Qwen2 conversions to GGUF. I am using a Quill GGUF from Mradermacher's repository from about six days ago, and it works fine. This assumes that the official release of Qwen2 isn't altered from Quill.
I think that the imatrix contains NaN values.
@slaren are you referring to the --no-ppl option by chance, or did you mean something else? When I remove --no-ppl, the imatrix output does look like this: llama_cpp_gpu-1 | [1]nan,[2]nan,[3]nan,[4]nan,[5]nan,[6]nan,[7]nan,[8]nan,[9]nan,
Yes. If you ever get NaN there, something is wrong and the resulting imatrix won't be usable.
Yikes.. so what the hell am I doing wrong that's not affecting others (or is it affecting others, but it's just not obvious because they didn't calculate an imatrix?)
I even tried updating to master and using different datasets.
If it happens while using CUDA, it usually means that the model is producing activations that cannot be represented in a float16. We have workarounds to force float32 compute in these cases for the known models that need it, but it is very unusual.
Ooo, so that's gotta be it then.. strangely I converted to F32 instead of F16, but I assume that's not enough? Trying to run my F32 conversion with CUDA gives pure garbage; running it with -ngl 0 gives actual results.
It may work with flash attention (-fa).
The problem seems to be in the KV. This patch should allow it to work:

```diff
diff --git a/llama.cpp b/llama.cpp
index 32264a00..6324af70 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -7009,7 +7009,7 @@ static struct ggml_tensor * llm_build_kqv(
     struct ggml_tensor * kq = ggml_mul_mat(ctx, k, q);
     cb(kq, "kq", il);
 
-    if (model.arch == LLM_ARCH_PHI2 || model.arch == LLM_ARCH_PHI3 || model.arch == LLM_ARCH_GPTNEOX) {
+    if (model.arch == LLM_ARCH_PHI2 || model.arch == LLM_ARCH_PHI3 || model.arch == LLM_ARCH_GPTNEOX || model.arch == LLM_ARCH_QWEN2) {
         // for this arch, we need to perform the KQ multiplication with F32 precision, otherwise we get NaNs
         // ref: https://github.com/ggerganov/llama.cpp/pull/4490#issuecomment-1859055847
         ggml_mul_mat_set_prec(kq, GGML_PREC_F32);
@@ -15376,6 +15376,10 @@ static void llama_model_quantize_internal(const std::string & fname_inp, c
```

However, it also seems to work with flash attention even without this change.
As always, super appreciative of the investigation and solve @slaren, you're awesome <3
Assuming the problem is overflows rather than underflows, I think the most likely reason is that my FlashAttention implementations initially scale Q with the softmax scale (1/sqrt(head_dim)), so the intermediate KQ values stay small enough to avoid overflowing FP16.
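For intuition, here is a small standalone C++ sketch of that ordering argument (the head dimension of 128 and activation magnitude of 30 are made-up numbers, not actual kernel code): applying the 1/sqrt(d) scale to Q before the dot product keeps the intermediate well inside the FP16 range, while applying it only to the final result lets the raw accumulation exceed the FP16 maximum of 65504.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const float FP16_MAX = 65504.0f;          // largest finite float16 value
    const int   d        = 128;               // hypothetical head dimension
    const float scale    = 1.0f / std::sqrt((float)d);

    // Hypothetical large (but individually FP16-representable) activations.
    std::vector<float> q(d, 30.0f), k(d, 30.0f);

    // Variant 1: accumulate raw Q*K, scale afterwards (overflow risk in FP16).
    float kq_raw = 0.0f;
    for (int i = 0; i < d; ++i) kq_raw += q[i] * k[i];

    // Variant 2: pre-scale Q before the dot product, FlashAttention-style.
    float kq_prescaled = 0.0f;
    for (int i = 0; i < d; ++i) kq_prescaled += (q[i] * scale) * k[i];

    std::printf("raw KQ        = %g (%s FP16 range)\n", kq_raw,
                kq_raw > FP16_MAX ? "outside" : "inside");
    std::printf("pre-scaled KQ = %g (%s FP16 range)\n", kq_prescaled,
                kq_prescaled > FP16_MAX ? "outside" : "inside");
    return 0;
}
```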
@slaren would ROCm also be experiencing this bug, or is it just in the CUDA code? Curious if other odd bugs out there can be explained by this.
@dillfrescott can you try adding -fa and see if that helps?
@bartowski1182 Damn, that totally fixed it! What does that flag do??
It enables flash attention; apparently it runs the operations in a different order, which prevents the issue.
Interesting
This is what I do. Are you using a GPU? It might be a CUDA issue, I'm not sure. This also happens in Ollama for some users, but my friend inputs
@JustinLin610 I am using a 4090 and have CUDA version 12.5 installed. It usually works with every other model, so I'm just not sure why.
Any reason why, after enabling flash_attn for the imatrix calculation, the time to complete increased from about 2.5h to 5h (qwen2-72b)?
Not him, but the fundamental issue is that the KQ matrix has values outside the FP16 range. On Volta or newer, or RX 7000 or newer, llama.cpp does all matrix multiplications with a batch size > 64 using FP16 cuBLAS/HIPBLAS, so you get garbage results. The fix works by instead calculating the KQ matrix with FP32 precision. The huge values from the KQ matrix are then fed to softmax, which brings them into the 0-1 range again and fixes the numerical issues.

You should be able to fix the numerical issues with better performance on Ampere or newer by calculating the KQ matrix as BF16 instead of FP16 (with some precision loss which is probably negligible if you're going to apply softmax anyways). Annoyingly, there is to my knowledge no way to make cuBLAS return the result as FP32 (even though that is the only output format that tensor cores actually support), because then we would also save some time on the conversion back to FP32.
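To make the softmax point concrete, here is a small self-contained C++ sketch (not llama.cpp code; the values are invented): raw KQ entries can exceed the FP16 maximum of 65504, yet once the softmax is computed at full float precision the outputs land back in [0, 1].

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Numerically stable softmax computed in float32.
static std::vector<float> softmax_f32(const std::vector<float> & x) {
    const float max_x = *std::max_element(x.begin(), x.end());
    std::vector<float> y(x.size());
    float sum = 0.0f;
    for (size_t i = 0; i < x.size(); ++i) {
        y[i] = std::exp(x[i] - max_x);   // subtract the max to avoid overflow in exp
        sum += y[i];
    }
    for (float & v : y) v /= sum;
    return y;
}

int main() {
    // Hypothetical raw KQ row: values this large overflow to +inf in float16
    // (FP16 max is 65504), which is what ends up producing NaN downstream.
    std::vector<float> kq_row = { 90000.0f, 89950.0f, 12.0f, -40.0f };

    std::vector<float> p = softmax_f32(kq_row);
    for (size_t i = 0; i < p.size(); ++i) {
        std::printf("p[%zu] = %g\n", i, p[i]);   // every output is within [0, 1]
    }
    return 0;
}
```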
It depends on what hardware you have; on AMD, large-batch FA performance is bad. I have received some reports about bad performance with partial offloading, but so far I have never been able to reproduce this. And note that I think that just adding
For normal inference FA works fine. It's running on an RTX 3090 on Linux, so only 12 layers are in VRAM.

With -fa:

Without:
Hello. |
Do you have your quants uploaded somewhere? |
My example is here. |
(edit: ignore, it's because I have a P100; bad report)

Hmm, this is odd.. Trying to run a Qwen2 quant (https://huggingface.co/bartowski/Tess-v2.5-Qwen2-72B-GGUF/blob/main/Tess-v2.5-Qwen2-72B-Q2_K.gguf) with GPU offloading yields a new assert: GGML_ASSERT: ggml-cuda/dmmv.cu:665: false

Based on the assert, my guess is that it's because ffn_down.weight was quantized to IQ4_NL, but obviously I'm not positive. Not offloading works fine; redownloading Q4_K_M to test since I see that didn't use IQ4_NL.

@slaren any idea why this wasn't an issue in the past? Is there something special in Qwen2 that makes it want to use those quant types? Also it's strange because I thought IQ quants were fully supported on CUDA, so I don't get why those asserts exist.

edit: as suspected, no issues with Q4_K_M
Also gonna pull @JohannesGaessler back in since he seems to know the area very well.
If you are on a P100, or on Maxwell or older, that is expected. For all quants other than the legacy quants and k-quants there is only an MMVQ implementation (which needs Pascal, excluding the P100, or newer) but no DMMV implementation. If you are using a GPU that should be supported, that is a bug.
Ah dammit, thank you, that's silly of me: I have a 3090 and a P100, and this pushed it onto that card. You say P100 or Maxwell/older; does that imply the P40 is fine?
Most k- and i-quants have a block size of 256, but this model's FFN dimension is not a multiple of 256, so those tensors fall back to a type with a smaller block size, such as IQ4_NL.
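As a rough illustration of that constraint (a sketch, not the actual quantization logic; the 256-element super-block size and Qwen2-72B's FFN dimension of 29568 are the assumptions here): a row size that is not a multiple of 256 cannot be stored in a k- or i-quant, so the quantizer has to pick a type with a 32-element block instead.

```cpp
#include <cstdio>

// Sketch of the block-size constraint: k-/i-quants use 256-element
// super-blocks, while IQ4_NL and the legacy quants use 32-element blocks.
static const char * pick_fallback(long long row_size) {
    if (row_size % 256 == 0) return "k-/i-quant (256-element blocks fit)";
    if (row_size %  32 == 0) return "fallback with 32-element blocks (e.g. IQ4_NL or Q4_0)";
    return "unquantized";
}

int main() {
    const long long qwen2_72b_ffn = 29568;  // assumed Qwen2-72B FFN dimension
    std::printf("29568 %% 256 = %lld -> %s\n",
                qwen2_72b_ffn % 256, pick_fallback(qwen2_72b_ffn));
    return 0;
}
```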
Yes, P100s have compute capability 6.0; all other Pascal GPUs (including P40s) have compute capability 6.1, which is the minimum CC for the __dp4a instruction that MMVQ relies on.
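A rough C++ sketch of that capability check (the constants follow the comment above, not the actual llama.cpp source):

```cpp
#include <cstdio>

// Compute capability encoded as major*100 + minor*10, e.g. 6.1 -> 610.
// MMVQ relies on the __dp4a instruction, which needs CC >= 6.1;
// the P100 is CC 6.0, so only the legacy/k-quant DMMV path works there.
static bool supports_mmvq(int cc) {
    return cc >= 610;
}

int main() {
    const int p100 = 600, p40 = 610, rtx3090 = 860;
    std::printf("P100:     MMVQ %s\n", supports_mmvq(p100)    ? "yes" : "no (DMMV only)");
    std::printf("P40:      MMVQ %s\n", supports_mmvq(p40)     ? "yes" : "no (DMMV only)");
    std::printf("RTX 3090: MMVQ %s\n", supports_mmvq(rtx3090) ? "yes" : "no (DMMV only)");
    return 0;
}
```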
Thanks both slaren and Johannes, that's good info, I appreciate it :D

Back on the original subject of this issue, is there a reason your proposed change wasn't opened as a PR? I can open one if that would help; I'm just unsure if the proposed fix was deemed not appropriate.
I'm testing the IQ3_XXS and it seems to work very well.
Do you mean the fix to use FP32 precision for the attention? I didn't open a PR because the fix would affect all models with the qwen2 architecture, and I recall reading that this model has other issues; if this model is not very good, it may not be worth applying a patch that will decrease the performance of all the models with the qwen2 architecture. But I may be wrong about that; if you think it is worth it, the fix could be merged.
The thing is, we now have a lot of fine-tunes of Qwen2 72B coming out, and presumably they all have this issue (I haven't re-verified), so I figured it would make sense. But maybe it is worth double-checking; I didn't realize there would be a big performance hit.
Thanks for testing. If anyone is interested in it, I would greatly appreciate it if you could attempt to reproduce the steps.
Thanks!
Tried creating the imatrix from Q8_0, but it came out the same as with F32: I get NaN when quantising.
Thank you! Oh no. I wonder if this method depends on the imatrix dataset...
@grapevine-AI Your imatrix is corrupt BTW:
I have made some progress though: by quantizing the BF16 to F16 and flushing anything below ±1e-24 to ±0, I can now make fully working quants (with #7825 and #7955 applied) that do not require Flash Attention! However, creating a fully activated imatrix still does not seem feasible...
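For reference, a minimal sketch of that kind of flush-to-zero preprocessing (a hypothetical helper operating on float weights before the F16 conversion, not the commenter's actual script):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Flush magnitudes below a threshold to signed zero before converting
// the weights to F16 (the conversion step itself is not shown here).
static void flush_tiny_to_zero(std::vector<float> & w, float threshold = 1e-24f) {
    for (float & v : w) {
        if (std::fabs(v) < threshold) {
            v = std::copysign(0.0f, v);   // keep the sign, drop the magnitude
        }
    }
}

int main() {
    std::vector<float> w = { 3.0e-25f, -1.0e-30f, 0.5f, -2.0f };
    flush_tiny_to_zero(w);
    for (float v : w) std::printf("%g ", v);
    std::printf("\n");   // prints: 0 -0 0.5 -2
    return 0;
}
```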
It appears that the Qwen2-72B model stopped functioning correctly after release b3091. Result in release b3091:
Result in the next release b3130:
In the latest release that I checked, b3291, the model still doesn't work correctly. The GGUF model was downloaded from here. I started the server from llama-b3091-bin-win-cuda-cu12.2.0-x64.zip with this command:
The GGUF works normally in KoboldCpp v1.69.1 only if "Use FlashAttention" is checked and "Use QuantMatMul (mmq)" is unchecked. Is there an argument to server.exe that disables MMQ? |
If at all possible, please try to reproduce the issue using only llama.cpp code, and use git bisect to identify the exact commit that introduced the problem. You can disable MMQ by compiling with the corresponding CMake option.

Also, this very much sounds like a different problem than what was discussed previously here. Instead of commenting on an existing issue, please open a new issue. Otherwise there is a large risk that your issue will not get attention from the right people.
Thank you! I've found the appropriate issue for my problem: Issue #8025 |
Necro-ing this thread, but just chipping in: IQ4_NL is probably not an ideal fallback, as it breaks compatibility with any backend that does not have I-quant support (e.g. Vulkan) and leaves people unsure why certain K-quants just don't work when they're supposed to. I'd rather Q4_0 or Q8_0 be the default fallback for k-quants. This is just my 2 cents.
Hm, it shouldn't break compatibility - the current implementation will fall back to CPU computation when the backend does not support a certain type. It would be slow, but it would still work.
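A very rough conceptual sketch of that fallback behaviour (hypothetical names, not ggml's actual scheduler API): an operation on a tensor type the preferred backend cannot handle simply gets routed to the CPU backend, so the quant still runs, just more slowly.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Conceptual sketch: pick the first backend that supports the tensor type,
// with the CPU backend last in the preference list as the universal fallback.
struct Backend {
    std::string name;
    bool (*supports_type)(const std::string & type);
};

static bool vulkan_supports(const std::string & t) { return t != "IQ4_NL"; } // hypothetical rule: IQ4_NL unsupported
static bool cpu_supports(const std::string &)      { return true; }          // CPU handles every type

static const Backend * pick_backend(const std::vector<Backend> & prefs, const std::string & type) {
    for (const Backend & b : prefs) {
        if (b.supports_type(type)) return &b;
    }
    return nullptr;
}

int main() {
    std::vector<Backend> prefs = { {"Vulkan", vulkan_supports}, {"CPU", cpu_supports} };
    std::printf("Q4_K   -> %s\n", pick_backend(prefs, "Q4_K")->name.c_str());
    std::printf("IQ4_NL -> %s\n", pick_backend(prefs, "IQ4_NL")->name.c_str()); // slower, but still works
    return 0;
}
```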
This issue was closed because it has been inactive for 14 days since being marked as stale. |
What happened?
When attempting to quantize Qwen2 7B instruct to IQ2_XS I get the following assert:
Anything I can provide to debug? Uploading the F32 file and imatrix now for recreation.
Attempting IQ2_S now, will update if it fails in the same way.
Update: it fails in the same way on the same block.
Name and Version
Version b3086, Ubuntu 22.04
What operating system are you seeing the problem on?
Linux
Relevant log output
(PS: is this a high severity or medium/low?)