b2254 #90

Merged (2 commits, Feb 24, 2024)

Conversation

Nexesenex (Owner)

No description provided.

ikawrakow and others added 2 commits February 24, 2024 16:23
* iq4_nl: squash commits for easier rebase

* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scalar dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels
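
The iq4_nl items above describe a non-linear 4-bit format: each block stores a scale plus 4-bit indices into a small table of non-uniformly spaced values (the kvalues_iq4nl mentioned in the commits), so dequantization is just a table lookup scaled per block. Below is a minimal C sketch of that idea, assuming blocks of 32 and an illustrative 16-entry table; the actual table values, scale encoding, and bit packing in the PR differ.

```c
#include <stdint.h>
#include <stdio.h>
#include <math.h>

#define BLOCK 32  /* the commits above switch to blocks of 32 */

/* Illustrative non-linear lookup table: 16 values, denser near zero.
 * The real kvalues_iq4nl table in the PR uses different values. */
static const int8_t kvalues[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113
};

/* Quantize one block: pick a per-block scale, then map each weight
 * to the nearest table entry (brute-force nearest neighbour). */
static void quantize_block(const float *x, float *scale, uint8_t *idx) {
    float amax = 0.0f;
    for (int i = 0; i < BLOCK; ++i)
        if (fabsf(x[i]) > amax) amax = fabsf(x[i]);
    *scale = amax / 127.0f;        /* put the largest weight near the table edge */
    const float id = *scale > 0.0f ? 1.0f / *scale : 0.0f;
    for (int i = 0; i < BLOCK; ++i) {
        const float v = x[i] * id;
        int best = 0;
        float bestd = fabsf(v - kvalues[0]);
        for (int j = 1; j < 16; ++j) {
            const float d = fabsf(v - kvalues[j]);
            if (d < bestd) { bestd = d; best = j; }
        }
        idx[i] = (uint8_t) best;   /* 4-bit index; the real format packs two per byte */
    }
}

/* Dequantize: table lookup times the block scale. */
static void dequantize_block(float scale, const uint8_t *idx, float *y) {
    for (int i = 0; i < BLOCK; ++i)
        y[i] = scale * kvalues[idx[i]];
}

int main(void) {
    float x[BLOCK], y[BLOCK], scale;
    uint8_t idx[BLOCK];
    for (int i = 0; i < BLOCK; ++i) x[i] = sinf(0.3f * (float) i);  /* toy data */
    quantize_block(x, &scale, idx);
    dequantize_block(scale, idx, y);
    for (int i = 0; i < 4; ++i) printf("%8.4f -> %8.4f\n", x[i], y[i]);
    return 0;
}
```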

* Resurrecting iq3_xs

After all the experimentation, nothing was better than this.

* Minor PPL improvement via a block scale fudge factor

* Minor improvement via 3 neighbours

* iq3_xs: working scalar and AVX2 dot products

* iq3_xs: ARM_NEON dot product - works but extremely slow (10 t/s)

* iq3_xs: working Metal implementation

* Adding IQ3_M - IQ3_XS mix with mostly Q4_K

* iq3_xs: a 3.4375 bpw variant

* iq3_xs: make CUDA work for new version

* iq3_xs: make scalar and AVX2 work for new version

* iq3_s: make ARM_NEON work with new version

* iq3_xs: make new version work on metal

Performance is very similar to Q3_K_S

* iq3_xs: tiny Metal speed improvement

* iq3_xs: tiny Metal speed improvement

* Fix stupid warning

* Q3_K_XS now uses a mix of IQ3_XS and IQ3_XXS

* iq3_xs: rename to iq3_s

* iq3_s: make tests pass

* Move Q3_K_XS mix to 3.25 bpw

* Attempt to fix failing tests

* Another attempt to fix the Windows builds

* Attempt to fix ROCm

* ROCm again

* iq3_s: partial fix for QK_K = 64

* iq3_s: make it work on metal for QK_K = 64

Pleasant surprise: the code was super-block-size independent,
so all it took was deleting some QK_K == 256 guards.

* Will this fix ROCm?

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
(#5699)

* server: #5655 - continue to update other slots on embedding concurrent request.

* server: tests: add multi users embeddings as fixed

* server: tests: adding OAI compatible embedding concurrent endpoint

* server: tests: adding OAI compatible embedding with multiple inputs
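
For context, the #5699 change is about the server continuing to advance its other slots while one slot services an embedding request, instead of stalling them. The C sketch below only models that scheduling idea with made-up types and names; the real llama.cpp server loop is structured quite differently.

```c
#include <stdbool.h>
#include <stdio.h>

#define N_SLOTS 4

/* Hypothetical slot state, loosely modelling the idea in the commit:
 * every scheduler tick advances every busy slot, whether it is doing
 * text generation or an embedding request, instead of letting one
 * embedding request block the others. */
typedef struct {
    bool busy;
    bool is_embedding;
    int  remaining_steps;
} slot_t;

static void update_slots(slot_t *slots, int n) {
    for (int i = 0; i < n; ++i) {
        if (!slots[i].busy) continue;
        /* advance this slot by one decode/embedding step */
        if (--slots[i].remaining_steps == 0) {
            printf("slot %d finished (%s)\n", i,
                   slots[i].is_embedding ? "embedding" : "generation");
            slots[i].busy = false;
        }
    }
}

int main(void) {
    slot_t slots[N_SLOTS] = {
        { true,  true,  2 },   /* embedding request */
        { true,  false, 3 },   /* concurrent text generation */
        { true,  false, 1 },   /* another generation request */
        { false, false, 0 },   /* idle slot */
    };
    for (int tick = 0; tick < 4; ++tick)
        update_slots(slots, N_SLOTS);
    return 0;
}
```
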
Nexesenex merged commit 8c4c1d0 into Nexesenex:_master_up on Feb 24, 2024
8 of 10 checks passed
@brokofankone

@Nexesenex Could I ask you to build a CuBLAS 12.3 Ampere exe of your latest fork? It's still way faster on Ampere, and the new LR builds are definitely slower; I don't think this issue was fixed at all. I'm running a single RTX 3070, and in some cases your Ampere fork gives me double the speed on the same quant versus the latest LR builds with CuBLAS 12.4. :/

Sadly, I'm having issues building it myself, and tbh I think it would be a net positive for Ampere users to have a reliably-built executable! Thank you for your time and efforts in any case ❤️

@Nexesenex (Owner, Author)

Ok @brokofankone, I will make my next build on 12.3 instead of 12.4.

Give me your feedback once you test it!

@brokofankone

> Ok @brokofankone, I will make my next build on 12.3 instead of 12.4.
>
> Give me your feedback once you test it!

Awesome, thank you!!

@Nexesenex (Owner, Author)

@brokofankone it's online, ready for testing!

@brokofankone commented Apr 11, 2024

> @brokofankone it's online, ready for testing!

Massive thanks for the build!

But now I'm really confused 😮

[screenshot of speed comparison]

Your previous Ampere build is twice as fast on the same quant, same Kobold preset, same everything, swiping on the same prompt "Write a story" over and over again. With the 9B Q4_K_M and IQ3_XXS I saw no difference between 12.4 (latest LostRuins) and 12.3 (your latest).

But on your old Ampere build, this 11B Q4_K_M quant is way faster, almost twice as fast. On your new Ampere build it's slightly faster, and more consistently so than the 12.4 build, but that could just be noise, unlike the old-versus-new comparison. I'm not sure what in your previous build is causing this massive improvement, but it is there, and with some quants it's quite pronounced.

Notably, the processing speed of your latest build versus LostRuins' looks the same, but on the old one the BLAS processing is slower, yet it comes out much faster overall.

@Nexesenex (Owner, Author) commented Apr 11, 2024

Well, I might have a little idea about what's up, but I didn't take the time to run tests when I published the last few days' builds.
Thanks for the CuBLAS 12.3/12.4 comparison, I'll revert to 12.3 from now on.
As for the main problem, let me dig a bit into the recent CUDA changes in KoboldCPP.

@brokofankone

> Well, I might have a little idea about what's up, but I didn't take the time to run tests when I published the last few days' builds. Thanks for the CuBLAS 12.3/12.4 comparison, I'll revert to 12.3 from now on. As for the main problem, let me dig a bit into the recent CUDA changes in KoboldCPP.

Thank you for doing the build and looking into it, much appreciated :)) The example above uses one quant to show the improvement, but I've had other cases where I noticed an almost 2x speed-up in favor of your old build versus the previous three LostRuins versions, so I'd say it's not just a fluke; I've been noticing it for a while. You seem to have an idea, which gives me a lot of hope ❤️‍🔥

@Nexesenex (Owner, Author)

Well, I obtain similar results between my old builds and my new builds of this month, and whatever I try actually slows things down rather than helping.
I also have a Ryzen 5xxx and a GeForce 30xx, so I'd suggest you update your drivers to the latest version; beyond that, I don't know.

@brokofankone

> Well, I obtain similar results between my old builds and my new builds of this month, and whatever I try actually slows things down rather than helping. I also have a Ryzen 5xxx and a GeForce 30xx, so I'd suggest you update your drivers to the latest version; beyond that, I don't know.

Thanks for trying it out! I'll update the Studio driver today and see if there is a difference. Do I need to somehow update CUDA separately from the NVIDIA Studio driver?

@brokofankone commented Apr 12, 2024

@Nexesenex

Updated the Nvidia Studio driver (I think it was one release behind), and updated the CUDA Toolkit to 12.4.

Same results: the old 1.59 build is 2x faster (I'm getting 14-17 t/s on the new ones and up to 27-30 t/s on the old one with the Q4 11B quant).

I think I might just stay on the old build for specific ("trusted") quants; it's just too much of a gain to give up. I guess if you aren't seeing any difference at all, it may be some weird quirk of my system setup.

Thanks a ton for helping out, the new build and your testing! 🙏

Could this be... a GGUF architecture thing? My quant here is reported as llama (since the Solar models are technically based on the llama architecture at the lowest level). Maybe Mistral is not affected, because the 9B models I use for testing are all Mistral-based. The 11B is a Solar-based merge reported as llama.

@Nexesenex (Owner, Author) commented Apr 12, 2024

On KCPP Bench, I reach PP 1000 t/s and TG 45 t/s on both the 1.59 and 1.62 builds at 8k context, with your model fully offloaded.
And TG 55 t/s at 2k context.
More precisely, 1.59d is broadly 2% faster in generation and 3% slower in PP than 1.62x, both on CuBLAS 12.3.
There might be something to it, because these slight differences are repeatable, but it doesn't affect my experience at all.
And there are so many conflicts now between the old commits and the new ones that I can't sort this out; it might just be related to deeper changes in LlamaCPP rather than to LostRuins's modifications.
Be sure to check that there's no MMQ argument in your command line; it might have been ignored in 1.59 and used in 1.62. Your huge perf differences make me think of either MMQ activation (slower than non-MMQ execution on Ampere at batch size 32 and beyond), or row split being used instead of layer split, which is best for Ampere (but that's multi-GPU only, so it should not affect you).
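
For readers unfamiliar with the MMQ remark: the CUDA backend can either run its own quantized matmul kernels (MMQ) or dequantize and fall back to a cuBLAS GEMM, and which is faster depends on the GPU and the batch size. The C sketch below is purely illustrative of such a dispatch heuristic; the threshold value and the selection logic are assumptions here, not the actual ggml-cuda code.

```c
#include <stdbool.h>
#include <stdio.h>

/* Illustrative threshold only: the comment above suggests MMQ stops
 * paying off on Ampere around batch size 32. */
#define MMQ_BATCH_THRESHOLD 32

typedef enum { PATH_MMQ, PATH_CUBLAS } mm_path_t;

/* Pick a matmul path for a quantized weight matrix: forced MMQ
 * (e.g. an explicit command-line flag) wins, otherwise choose by
 * batch size. */
static mm_path_t pick_mm_path(bool mmq_forced, int batch_size) {
    if (mmq_forced) return PATH_MMQ;
    return batch_size < MMQ_BATCH_THRESHOLD ? PATH_MMQ : PATH_CUBLAS;
}

int main(void) {
    const int batches[] = { 1, 8, 32, 512 };
    for (int i = 0; i < 4; ++i) {
        const mm_path_t p = pick_mm_path(false, batches[i]);
        printf("batch %3d -> %s\n", batches[i],
               p == PATH_MMQ ? "MMQ" : "dequantize + cuBLAS GEMM");
    }
    return 0;
}
```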

@brokofankone

Thanks again for trying this out. I'm leaning towards some kind of llama change causing this. I was sure it was CuBLAS 12.3, since the issue was about Ampere and that's what I'm using, but now it seems that's not really the case.

As for MMQ, I cannot run K-quants without it. MMQ apparently has its own quirks like that: it makes quants like IQ faster, but Q4_K_M will not even start outputting at all without it, so both of these tests have been run with it turned on.

@Nexesenex (Owner, Author)

You're welcome.
I dug out my March 16 internal compile; see if it helps.

Nexesenex pushed a commit that referenced this pull request Dec 22, 2024
TG-128 (LLaMA-3.1-8B) goes to 52.5 t/s, up from 48.4 t/s.

Co-authored-by: Iwan Kawrakow <[email protected]>