[BUG] [Qwen] Draft model produce garbage output #674
Comments
Do you get the same results without
Just tried. It works even better overall than with tensor parallelism. What's the trick? :D
The TP implementation in ExLlama is a little half-baked. The problem is that CUDA isn't great for controlling multiple devices in parallel from a single process, so there needs to be a big rewrite to split inference into multiple processes. There's also a lot of device synchronization stuff that needs to be optimized. It will happen at some point, and until then TP is only sort of situationally useful.
After fairly long testing, I still find that ollama/llama.cpp output outperforms exl2 by far. While exl2 runs ~20% faster without a draft model and almost 100% faster with one, its output in both cases is often corrupted by small mistakes that accumulate quickly. I can't figure out why it is so much worse. I compared iq4_XS and 4.0bpw quants for 72B, and Q6_K and 6.0bpw for 32B; the results are the same. Maybe I'm missing something? Could it be a template issue? Or anything obvious and simple to mitigate?
Do they use the same tokenizer?
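One rough way to answer the tokenizer question for two local model directories (for example the 72B model and its draft model) is to compare their token-to-id maps. The sketch below assumes each directory contains a Hugging Face style tokenizer.json with a "model" → "vocab" mapping and an optional "added_tokens" list; the function names and directory arguments are hypothetical and not part of ExLlama or TabbyAPI.

```python
import json
from pathlib import Path


def load_token_ids(model_dir):
    """Return a token -> id map read from <model_dir>/tokenizer.json.

    Assumes the Hugging Face tokenizers JSON layout: a "model" object
    holding a "vocab" dict, plus an optional "added_tokens" list.
    """
    path = Path(model_dir) / "tokenizer.json"
    data = json.loads(path.read_text(encoding="utf-8"))
    token_ids = dict(data["model"]["vocab"])
    for entry in data.get("added_tokens", []):
        token_ids[entry["content"]] = entry["id"]
    return token_ids


def tokenizer_mismatches(dir_a, dir_b):
    """Return {token: (id_in_a, id_in_b)} for every token whose id differs.

    A token missing from one side is reported with None on that side.
    """
    a = load_token_ids(dir_a)
    b = load_token_ids(dir_b)
    return {
        token: (a.get(token), b.get(token))
        for token in a.keys() | b.keys()
        if a.get(token) != b.get(token)
    }
```

An empty result means both directories map every token to the same id, which is the property a draft model needs to share with the model it drafts for.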
OS
Windows
GPU Library
CUDA 12.x
Python version
3.12
Pytorch version
2.4.1+cu121
Model
Qwen/Qwen2.5-72B-Instruct
Describe the bug
Qwen 2.5 72B Instruct with a draft model, whether it's Qwen 2.5 0.5B or 1.5B, produces garbage. Sometimes it takes a few requests to lose context, but in a long-context conversation (15900 tokens) it always ends up going insane and inconsistent, with repetitions and garbage. It goes from "okay, that's expected" through "a lot of typos" and "wow, such chinese!" to infinite repetition of somewhat related trash.
The same model without a draft model produces consistent and good output (with slower tps 😄).
Reproduction steps
Using TabbyAPI
config.yml
Load model
Qwen2.5-72B-Instruct-exl2 - 4.0bpw from exllama 2.4.3
Qwen2.5-0.5b-instruct-exl2 - 4.0bpw from exllama 2.4.3
Quants were created from the original models, downloaded at the same time today from the official Qwen repository.
Qwen2.5-72B-Instruct-exl2 without draft model works fine
Generate chat completions
I'm using Open Web UI, but I think it doesn't matter a lot.
Here is the output of the TabbyAPI generation settings (everything at default):
Expected behavior
The draft model should not affect the output quality of the base model.
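To illustrate why this expectation is reasonable, here is a minimal sketch of greedy speculative decoding with toy models. It is not ExLlama's implementation: `target_next` and `draft_next` are hypothetical callables that map a token list to the next token id, and the target is called once per position here, whereas a real engine would score all drafted positions in a single forward pass. Sampling with temperature uses a different acceptance rule that is not shown.

```python
def greedy_decode(target_next, prompt, max_new_tokens):
    """Plain greedy decoding with the target model only."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens[len(prompt):]


def speculative_greedy_decode(target_next, draft_next, prompt,
                              max_new_tokens, num_draft=4):
    """Greedy decoding where a draft model proposes tokens in advance.

    Every emitted token is the target's own greedy choice; the draft
    only decides how many positions get checked per round.
    """
    tokens = list(prompt)
    produced = 0
    while produced < max_new_tokens:
        # 1. The draft model proposes num_draft tokens autoregressively.
        proposal = []
        for _ in range(num_draft):
            proposal.append(draft_next(tokens + proposal))

        # 2. The target verifies the proposal position by position.
        for guess in proposal:
            if produced >= max_new_tokens:
                break
            correct = target_next(tokens)
            tokens.append(correct)  # always keep the target's token
            produced += 1
            if correct != guess:
                # First disagreement: discard the rest of the proposal.
                break
    return tokens[len(prompt):]
```

Under these assumptions a bad draft model can only reduce the speedup, never change the text, so output that degrades only when a draft model is loaded suggests a problem somewhere in the verification path or in how token ids are shared between the two models.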
Logs
No response
Additional context
I've noticed in the model configs that Qwen 72B has a slightly bigger vocab_size than Qwen 0.5B. It looks like Qwen models from 0.5B to 14B have "vocab_size": 151936, while 32B and 72B have "vocab_size": 152064. I don't know if this may affect generation.
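The difference between the two reported values is 152064 - 151936 = 128. A small sketch for checking this on local copies of the models is shown below; it assumes each model directory contains a Hugging Face style config.json with a vocab_size field, and the function names and directory arguments are hypothetical.

```python
import json
from pathlib import Path


def read_vocab_size(model_dir):
    """Return the vocab_size field from <model_dir>/config.json."""
    path = Path(model_dir) / "config.json"
    config = json.loads(path.read_text(encoding="utf-8"))
    return int(config["vocab_size"])


def compare_vocab_sizes(target_dir, draft_dir):
    """Summarize how the target and draft vocab_size values differ."""
    target = read_vocab_size(target_dir)
    draft = read_vocab_size(draft_dir)
    return {
        "target": target,
        "draft": draft,
        "difference": target - draft,
        # Token ids in this range exist only in the larger vocabulary.
        "ids_only_in_larger": range(min(target, draft), max(target, draft)),
    }
```

Whether the extra ids are ever produced depends on the tokenizer, so a size difference alone does not prove the two models are incompatible; it only shows that their logit vectors have different lengths and cannot be compared element by element without handling the extra entries.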