b2254 #90
Conversation
* iq4_nl: squash commits for easier rebase
* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on Metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scalar dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels
* Resurrecting iq3_xs. After all the experimentation, nothing was better than this.
* Minor PPL improvement via a block scale fudge factor
* Minor improvement via 3 neighbours
* iq3_xs: working scalar and AVX2 dot products
* iq3_xs: ARM_NEON dot product - works but extremely slow (10 t/s)
* iq3_xs: working Metal implementation
* Adding IQ3_M - IQ3_XS mix with mostly Q4_K
* iq3_xs: a 3.4375 bpw variant
* iq3_xs: make CUDA work for new version
* iq3_xs: make scalar and AVX2 work for new version
* iq3_s: make ARM_NEON work with new version
* iq3_xs: make new version work on Metal. Performance is very similar to Q3_K_S.
* iq3_xs: tiny Metal speed improvement
* iq3_xs: tiny Metal speed improvement
* Fix stupid warning
* Q3_K_XS now uses a mix of IQ3_XS and IQ3_XXS
* iq3_xs: rename to iq3_s
* iq3_s: make tests pass
* Move Q3_K_XS mix to 3.25 bpw
* Attempt to fix failing tests
* Another attempt to fix the Windows builds
* Attempt to fix ROCm
* ROCm again
* iq3_s: partial fix for QK_K = 64
* iq3_s: make it work on Metal for QK_K = 64. Pleasant surprise: the code was super-block-size independent, so all it took was deleting some QK_K == 256 guards.
* Will this fix ROCm?

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
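For readers unfamiliar with the scheme these commits build up, the core idea of a non-linear 4-bit quant such as IQ4_NL is: split the weights into blocks of 32, store one scale per block, and map each weight to the nearest entry of a small non-uniform value table (the kvalues_iq4nl grid adjusted above). Below is a minimal, self-contained sketch of that idea; the struct layout and the 16-entry table values are illustrative assumptions for this sketch, not the actual ggml definitions.

```c
// Sketch of a non-linear 4-bit "lookup" quantization over blocks of 32,
// in the spirit of IQ4_NL. Table values and block layout are illustrative
// assumptions, not the real ggml structs.
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 32

// Assumed non-uniform grid of 16 reconstruction values:
// denser near zero, sparser at the extremes.
static const int8_t kvalues[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113
};

typedef struct {
    float   d;                    // per-block scale
    uint8_t qs[BLOCK_SIZE / 2];   // 32 x 4-bit indices, two per byte
} block_q4_nl;

// Quantize one block: pick a scale, then map each value to the nearest grid entry.
static void quantize_block(const float *x, block_q4_nl *b) {
    float amax = 0.0f;
    for (int i = 0; i < BLOCK_SIZE; ++i) {
        if (fabsf(x[i]) > amax) amax = fabsf(x[i]);
    }
    b->d = amax / 127.0f;                      // map the grid onto [-amax, amax]
    const float id = b->d ? 1.0f / b->d : 0.0f;
    for (int i = 0; i < BLOCK_SIZE; i += 2) {
        uint8_t lo = 0, hi = 0;
        float best0 = INFINITY, best1 = INFINITY;
        // nearest-neighbour search over the 16-entry grid
        for (int j = 0; j < 16; ++j) {
            const float d0 = fabsf(x[i]     * id - kvalues[j]);
            const float d1 = fabsf(x[i + 1] * id - kvalues[j]);
            if (d0 < best0) { best0 = d0; lo = (uint8_t) j; }
            if (d1 < best1) { best1 = d1; hi = (uint8_t) j; }
        }
        b->qs[i / 2] = lo | (hi << 4);
    }
}

// Dequantize: index into the grid and rescale.
static void dequantize_block(const block_q4_nl *b, float *y) {
    for (int i = 0; i < BLOCK_SIZE; i += 2) {
        const uint8_t q = b->qs[i / 2];
        y[i]     = b->d * kvalues[q & 0x0F];
        y[i + 1] = b->d * kvalues[q >> 4];
    }
}

int main(void) {
    float x[BLOCK_SIZE], y[BLOCK_SIZE];
    for (int i = 0; i < BLOCK_SIZE; ++i) x[i] = sinf(0.3f * i);   // toy data
    block_q4_nl b;
    quantize_block(x, &b);
    dequantize_block(&b, y);
    for (int i = 0; i < 4; ++i) printf("%8.4f -> %8.4f\n", x[i], y[i]);
    return 0;
}
```

The upstream kernels differ in detail (packed scales and hand-optimized AVX2/NEON/CUDA/Metal dot products), but this scale-plus-lookup structure is the part the commit list above keeps iterating on.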
@Nexesenex Could I request that you build a cuBLAS 12.3 Ampere exe for your latest fork? It's still way faster on Ampere, and the new LR builds are definitely slower; I don't think this issue was ever fixed. I'm running a single RTX 3070, and in some cases your Ampere fork gives me double the speed on the same quant versus the latest LR builds with cuBLAS 12.4. :/ Sadly, I'm having issues building it myself, and honestly I think a reliably-built executable would be a net positive for Ampere users! Thank you for your time and efforts in any case ❤️
Ok @brokofankone, I will make my next build on 12.3 instead of 12.4. Give me your feedback once you test it!
Awesome, thank you!!
@brokofankone it's online, ready for testing!
Massive thanks for the build! But now I'm really confused 😮 Your previous Ampere build is twice as fast on the same quant, same kobold preset, same everything, swiping on the same prompt "Write a story" over and over again. With a 9B Q4_K_M and IQ3_XXS I saw no difference between 12.4 (latest Lost Ruins) and 12.3 (your latest). But on your old Ampere build, this 11B Q4_K_M quant is much faster, almost twice as fast. On your new Ampere build it's slightly faster, and more consistently so than the 12.4 build, but that could just be noise, unlike the gap between the old build and the new one. I'm not sure what in your previous build causes this massive improvement, but it is there, and with some quants it's quite pronounced. Notably, the processing speed on your latest build versus Lost Ruins looks the same, but on the old one the BLAS processing is slower, yet it comes out much faster overall.
Well, I might have a little idea about what's up, but I didn't take the time to run tests when I published the last few days' builds.
Thank you for doing the build and looking into it, much appreciated :)) The example above uses one quant to show the improvement, but I've had other cases where I noticed an almost 2x speed-up in favor of your old build versus the previous three Lost Ruins versions, so I'd say it's not just a fluke; I've been noticing it for a while. You seem to have an idea, which gives me a lot of hope ❤️🔥
Well, I obtain similar results between my old fas builds and my new builds of this month, and whatever I try actually slows things down rather than helping.
Thanks for trying it out! I'll update the Studio driver today and see if there is a difference. Do I need to somehow update CUDA separately from the Nvidia Studio driver?
Updated the Nvidia Studio driver (I think it was one release behind) and updated the CUDA Toolkit to 12.4. Same results: the old 1.59 build is 2x faster (I get 14-17 t/s on the new builds and up to 27-30 t/s on the old one with the Q4 11B quant). I think I might just stay on the old build for specific ("trusted") quants; it's just too much of a gain ultimately. I guess if you aren't seeing any difference at all, it may be some weird quirk of my system setup. Thanks a ton for helping out, for the new build, and for your testing! 🙏 Could this be... a GGUF arch thing? My quant here is reported as llama (since the Solar builds are technically based on the llama arch at the lowest level). Maybe Mistral isn't affected, because the 9B models I use for testing are all Mistral-based. The 11B is a Solar-based merge reported as llama.
On KCPP Bench, I reach PP 1000 t/s and TG 45 t/s on both the 1.59 and 1.62 builds at 8k context, with your model fully offloaded.
Thanks again for trying this out. I'm leaning towards some kind of llama change causing this; I was sure it was cuBLAS 12.3, since the issue was about Ampere and that's what I'm using, but now it seems that's not really the case. As for MMQ, I can't run K-quants without it. MMQ apparently has its own quirks like that: it makes quants like IQ faster, but with Q4_K_M it won't even start outputting at all, so the tests I've been doing were all with it turned on.
You're welcome. |