Mask cache performance optimization for vLLM #939
Conversation
My worry with limiting the number of tokens used for the hash is that we might return identical masks for different lists of tokens; it is not impossible that the first 2048 allowed tokens are the same while the subsequent ones differ.
I've made additional optimizations and determined that caching based solely on the FSM state id is sufficient. This eliminates the need for the sometimes expensive hashing of the allowed input tokens. As a result, on our A100 and H100 hardware, generation is nearly as fast as it is without a logits_processor.
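For context, here is a minimal sketch of what caching keyed by FSM state id might look like. This is not the PR's actual code; the class name, the `fsm.allowed_token_ids` interface, and the assumption of one 1-D logits tensor per call are all illustrative:

```python
import math
from typing import Dict, List

import torch


class CachedMaskLogitsProcessor:
    """Sketch: reuse one on-device mask tensor per FSM state.

    The FSM state id fully determines the set of allowed tokens, so it
    can serve directly as the cache key, avoiding any hashing of the
    (possibly long) allowed-token list.
    """

    def __init__(self, fsm):
        # Assumed interface: fsm.allowed_token_ids(state_id) -> List[int]
        self.fsm = fsm
        self._mask_cache: Dict[int, torch.Tensor] = {}

    def __call__(self, state_id: int, logits: torch.Tensor) -> torch.Tensor:
        mask = self._mask_cache.get(state_id)
        if mask is None:
            # Build the additive mask once per FSM state and keep it on
            # the same device as the logits, so later steps reuse it
            # without rebuilding or transferring it.
            mask = torch.full_like(logits, -math.inf)
            allowed: List[int] = self.fsm.allowed_token_ids(state_id)
            mask[allowed] = 0.0
            self._mask_cache[state_id] = mask
        return logits + mask
```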
Could you run …
Could you please rebase onto …
Force-pushed from cf9105c to c247c31.
Force-pushed from c247c31 to 1385324.
Rebase is done. Don't know what happened there.

Benchmark

I also ran some benchmarks on 128 samples using an H100 PCIe GPU and LLaMA3 8B:
In addition, here are plots of GPU utilization during the benchmark:

[Plots: GPU utilization with no cache, cold cache, and warm cache]

As depicted in the plots, without mask caching the GPU utilization remains very low. A fully warmed-up cache results in an improvement of over 17x and a nearly fully used GPU (in terms of both utilization and power consumption). I would estimate that the difference could be even larger for greater numbers of samples.
Looks good to me, thank you for contributing!
Problem
The current implementation allocates a mask for every token during generation, which significantly impacts performance.
Proposed Solution
To improve performance, we can cache the mask on the device, since it depends only on the allowed tokens from the FSM. Additionally, limiting the input to the hash function to the first 2k tokens results in a notable speedup.
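As a rough sketch of this idea (illustrative only; the helper name, the key construction, and the length guard are assumptions, not the PR's code), the cache key could be built from the first 2048 allowed token ids:

```python
from typing import Dict, List, Tuple

import torch

HASH_PREFIX_LEN = 2048  # assumed cutoff from the proposal ("first 2k tokens")

_mask_cache: Dict[Tuple[int, int], torch.Tensor] = {}


def get_mask(allowed_tokens: List[int], vocab_size: int,
             device: torch.device) -> torch.Tensor:
    # Hash only a prefix of the allowed-token list; the list length is
    # added to the key as a cheap, illustrative guard against lists that
    # share the prefix but differ in size.
    key = (hash(tuple(allowed_tokens[:HASH_PREFIX_LEN])), len(allowed_tokens))
    mask = _mask_cache.get(key)
    if mask is None:
        # Allocate the mask directly on the target device so it never
        # needs to be rebuilt or transferred for this set of tokens.
        mask = torch.full((vocab_size,), float("-inf"), device=device)
        mask[allowed_tokens] = 0.0
        _mask_cache[key] = mask
    return mask
```

Note that this avoids re-allocating a mask per generated token, but two different allowed-token lists sharing the same first 2048 entries and the same length would still collide, which is the risk discussed below.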
Discussion
While using only the first 2k tokens for the hash may introduce cache collisions, the likelihood of such collisions is very low.
TODO