CUDA: stream-k decomposition for MMQ #8018
Conversation
Here are the results on the RTX 2060, data for 1B and 7B models:

```sh
LLAMA_CUDA=1 ./scripts/compare-commits.sh master da1db13d6aaabd0889a1bc1c7755a462621bd1d6 \
    -m models/llama-7b-v2/ggml-model-q4_0.gguf \
    -m models/llama-7b-v2/ggml-model-q4_k.gguf \
    -m models/tinyllama-1b/ggml-model-q4_0.gguf \
    -m models/tinyllama-1b/ggml-model-q8_0.gguf \
    -p 2048 -ub 16,32,64,128,256,512,1024,2048 -ngl 99 -n 0
```

Performance vs. master cuBLAS (`ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no`)
```sh
LLAMA_CUDA_FORCE_MMQ=1 LLAMA_CUDA=1 ./scripts/compare-commits.sh master da1db13d6aaabd0889a1bc1c7755a462621bd1d6 \
    -m models/llama-7b-v2/ggml-model-q4_0.gguf \
    -m models/llama-7b-v2/ggml-model-q4_k.gguf \
    -m models/tinyllama-1b/ggml-model-q4_0.gguf \
    -m models/tinyllama-1b/ggml-model-q8_0.gguf \
    -p 2048 -ub 16,32,64,128,256,512,1024,2048 -ngl 99 -n 0
```

Performance vs. master MMQ (`ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes`)
Thank you! That looks fine to me.
I cannot reproduce the failing server test from the CI on my local machine. But the CI server tests are run with the CPU backend anyways, right? So this PR should have no effect on them.
It's not related; the error was first seen in #7993.
I got these errors with
The issue should be fixed now. The TL;DR is that I had not considered an edge case for very small matrices.

More specifically, for very small matrices there were instances where the fixup kernel assumed an MMQ CUDA block had written something to the fixup buffer when in reality it had simply returned, because the k slice assigned to it had a length of zero (there just wasn't enough work to distribute). The fixup kernel then read undefined memory from the pool buffer and added it to the final result. If the memory was zero, that caused no issues; if it contained random garbage, the result was NaN. So whether or not the defect manifested as a bug depended heavily on the exact setup the tests were run with.

I think it would make sense to add a test or debug mode that fills the pool buffers with NaN before returning them. That way, if any kernel uses memory that was not previously written to, it should become detectable (unless there already exists a tool to detect this kind of problem that I am not aware of).
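A minimal sketch of what such a debug mode could look like (the `pool_alloc_poisoned` wrapper and its interface are hypothetical, not the actual ggml-cuda pool API; this only illustrates the idea of poisoning fresh buffers):

```cuda
#include <cuda_runtime.h>

// Fill a freshly allocated pool buffer with quiet NaNs so that any kernel
// reading memory it never wrote produces NaN in its output instead of
// silently consuming stale data.
__global__ void poison_buffer_f32(float * buf, const size_t n) {
    const size_t i = (size_t)blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) {
        buf[i] = __int_as_float(0x7fc00000); // quiet NaN bit pattern
    }
}

// Hypothetical debug-mode wrapper around the pool allocation:
static float * pool_alloc_poisoned(const size_t n, cudaStream_t stream) {
    float * buf;
    cudaMallocAsync(&buf, n*sizeof(float), stream); // stand-in for the real pool
    const int block_size = 256;
    const int grid_size  = (int)((n + block_size - 1)/block_size);
    poison_buffer_f32<<<grid_size, block_size, 0, stream>>>(buf, n);
    return buf;
}
```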
Anything that improves the testing capabilities would be good. I plan to do some work on that.

I tested the performance with the 3090 Ti and it is similar to your results with the 3090, as usual. I also tried with a 3080 and it is not always an improvement. However, this GPU is my display device and the results are not as repeatable as with the 3090 Ti.
There seems to be
* CUDA: stream-k decomposition for MMQ
* fix undefined memory reads for small matrices
This PR implements "stream-k decomposition" as described in this paper for the MMQ kernels. The idea is that instead of dividing the output tensor into tiles and assigning those tiles to the streaming multiprocessors in waves, you assign partial tiles to the streaming multiprocessors and then combine them with an additional "fixup" kernel. Notably, the number of tiles that need fixing is constant and equal to the number of streaming multiprocessors (vs. e.g. splitting all tiles).
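As a rough illustration of the scheduling idea, here is a toy, self-contained example on a much simpler problem (row sums instead of quantized matrix multiplication); all names are illustrative and none of this is the PR's actual kernel. The flattened iteration space is split evenly over a fixed grid of roughly one block per SM; a block that enters a row mid-way stashes its partial sum in a per-block fixup buffer, and a deterministic fixup pass folds the partials into the output:

```cuda
// Toy stream-k decomposition: row sums of an M x K matrix.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void rowsum_stream_k(const float * A, float * out, float * fixup,
                                const int M, const int K) {
    const long long total = (long long)M*K;
    long long it        = total*(long long)(blockIdx.x + 0)/gridDim.x; // first iteration owned
    const long long it1 = total*(long long)(blockIdx.x + 1)/gridDim.x; // one past the last

    while (it < it1) { // bounds are uniform across the block, so no divergence
        const int row = (int)(it / K);
        const int k0  = (int)(it % K);                           // first element we own
        const int k1  = (int)min((long long)K, k0 + (it1 - it)); // one past the last

        float sum = 0.0f;
        for (int k = k0 + threadIdx.x; k < k1; k += blockDim.x) {
            sum += A[(long long)row*K + k];
        }
        for (int offset = 16; offset > 0; offset >>= 1) { // warp reduction, blockDim.x == 32
            sum += __shfl_down_sync(0xffffffff, sum, offset);
        }

        if (threadIdx.x == 0) {
            if (k0 == 0) {
                out[row] = sum;          // this block owns the start of the row
            } else {
                fixup[blockIdx.x] = sum; // partial row: stash for the fixup pass
            }
        }
        it += k1 - k0; // only a block's FIRST tile can start mid-row
    }
    // A block whose range is empty (it >= it1 from the start) writes nothing
    // at all -- the edge case behind the bug discussed above.
}

// Single-threaded for brevity; it re-derives each block's range, so it only
// reads fixup entries that were actually written, and in a fixed order.
__global__ void rowsum_fixup(float * out, const float * fixup,
                             const int M, const int K, const int nblocks) {
    const long long total = (long long)M*K;
    for (int i = 0; i < nblocks; ++i) {
        const long long it0 = total*(long long)(i + 0)/nblocks;
        const long long it1 = total*(long long)(i + 1)/nblocks;
        if (it0 < it1 && it0 % K != 0) { // block i started a row mid-way
            out[it0 / K] += fixup[i];
        }
    }
}

int main() {
    const int M = 37, K = 1000; // deliberately awkward sizes
    int nsm;
    cudaDeviceGetAttribute(&nsm, cudaDevAttrMultiProcessorCount, 0);

    float *A, *out, *fixup;
    cudaMallocManaged(&A,     sizeof(float)*M*K);
    cudaMallocManaged(&out,   sizeof(float)*M);
    cudaMallocManaged(&fixup, sizeof(float)*nsm);
    for (int i = 0; i < M*K; ++i) { A[i] = 1.0f; } // every row sums to K

    rowsum_stream_k<<<nsm, 32>>>(A, out, fixup, M, K);
    rowsum_fixup<<<1, 1>>>(out, fixup, M, K, nsm);
    cudaDeviceSynchronize();

    printf("out[0] = %.0f, expected %d\n", out[0], K);
    return 0;
}
```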
Performance vs. master MMQ
Performance vs. master cuBLAS
The performance increase for small batch sizes (where you suffer more from tail effects) is quite noticeable; for large batch sizes it's only a few percent. You could potentially get more performance if you were to make the implementation nondeterministic with atomic adds (on my desktop with the RTX 3090 the matrix multiplication kernels need 70-760 µs, the fixup kernel needs ~20 µs). For my P40 and RX 6800 the stream-k kernel was slower, so there is no change for those cards.
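For contrast, a nondeterministic variant of the toy example above (again illustrative, not the PR's code) would drop the fixup buffer and fixup pass entirely and have every block atomically accumulate into a zero-initialized output; floating-point addition is not associative, so the result can then differ bit-for-bit between runs:

```cuda
// Nondeterministic alternative: no fixup buffer, no fixup kernel.
// out[] must be zeroed (e.g. with cudaMemset) before launch; the order in
// which blocks land varies between runs, so bit-exact reproducibility is lost.
__global__ void rowsum_stream_k_atomic(const float * A, float * out,
                                       const int M, const int K) {
    const long long total = (long long)M*K;
    long long it        = total*(long long)(blockIdx.x + 0)/gridDim.x;
    const long long it1 = total*(long long)(blockIdx.x + 1)/gridDim.x;

    while (it < it1) {
        const int row = (int)(it / K);
        const int k0  = (int)(it % K);
        const int k1  = (int)min((long long)K, k0 + (it1 - it));

        float sum = 0.0f;
        for (int k = k0 + threadIdx.x; k < k1; k += blockDim.x) {
            sum += A[(long long)row*K + k];
        }
        for (int offset = 16; offset > 0; offset >>= 1) { // blockDim.x == 32
            sum += __shfl_down_sync(0xffffffff, sum, offset);
        }
        if (threadIdx.x == 0) {
            atomicAdd(&out[row], sum); // accumulation order is nondeterministic
        }
        it += k1 - k0;
    }
}
```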
@ggerganov IIRC you have an RTX 2060; can you check the performance of this PR vs. master (both with `LLAMA_CUDA_FORCE_MMQ`)? Compared to my GPUs that GPU has significantly fewer SMs, so I would expect this PR to provide comparatively less benefit (but still some overhead from the fixup kernel, which may outweigh the speedup).