
Mask Cache Performance Optimization for vLLM #939

Merged (5 commits) on Jun 16, 2024

Conversation

@paul-grundmann (Contributor) commented Jun 3, 2024

Problem

The current implementation allocates a mask for every token during generation, which significantly impacts performance.

Proposed Solution

To improve performance, we can cache the mask on the device, since it depends only on the set of tokens the FSM currently allows. Additionally, limiting the input of the hash function to the first 2k allowed tokens results in a notable speedup.
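In rough outline, the idea looks like this (a minimal sketch rather than the actual patch; `mask_cache`, `_hash_allowed_tokens`, and `get_mask` are illustrative names):

```python
import torch

# Illustrative cache: hash of the allowed tokens -> mask kept on the GPU.
mask_cache: dict[int, torch.Tensor] = {}


def _hash_allowed_tokens(allowed_tokens: list[int]) -> int:
    # Only hash the first 2048 ids so computing the key stays cheap even
    # when the FSM allows a large fraction of the vocabulary.
    return hash(tuple(allowed_tokens[:2048]))


def get_mask(allowed_tokens: list[int], vocab_size: int, device: torch.device) -> torch.Tensor:
    key = _hash_allowed_tokens(allowed_tokens)
    mask = mask_cache.get(key)
    if mask is None:
        # Build the mask once and keep it on the device, instead of
        # re-allocating and re-transferring it for every generated token.
        mask = torch.full((vocab_size,), float("-inf"), device=device)
        mask[torch.tensor(allowed_tokens, device=device)] = 0.0
        mask_cache[key] = mask
    return mask
```

The cached mask is then simply added to the logits inside the logits processor instead of being rebuilt for every token.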

Discussion

While hashing only the first 2k tokens could in principle cause cache collisions, the likelihood of such a collision is very low.

TODO

  • Provide measurements of the performance impact

@rlouf (Member) commented Jun 6, 2024

My worry with limiting the number of tokens used for the hash is that we might return identical masks for different lists of tokens; it is not excluded that the first 2048 allowed tokens would be the same while the subsequent ones differ.

@paul-grundmann (Contributor, Author) commented

I've made additional optimizations and determined that caching based solely on the FSM state id is sufficient. This eliminates the need for the sometimes expensive hashing of the allowed input tokens. As a result, on our A100 and H100 hardware, generation is nearly as fast as it is without a logits_processor.
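A minimal sketch of the state-id-keyed variant (illustrative only; the FSM interface and names such as `allowed_token_ids` are placeholders, not necessarily the real Outlines/vLLM API):

```python
import torch


class CachedMaskLogitsProcessor:
    """Illustrative sketch: one cached mask per FSM state id."""

    def __init__(self, fsm):
        self.fsm = fsm
        self._mask_cache: dict[int, torch.Tensor] = {}  # state id -> mask on device

    def __call__(self, state_id: int, logits: torch.Tensor) -> torch.Tensor:
        mask = self._mask_cache.get(state_id)
        if mask is None:
            # Placeholder for however the FSM exposes its allowed tokens.
            allowed = self.fsm.allowed_token_ids(state_id)
            mask = torch.full_like(logits, float("-inf"))
            mask[torch.tensor(allowed, device=logits.device)] = 0.0
            # The mask stays on the same device as the logits for reuse.
            self._mask_cache[state_id] = mask
        return logits + mask
```

Because the state id uniquely determines the set of allowed tokens, it can serve as the cache key directly, with no hashing of token lists at all.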

@rlouf added the structured generation (Linked to structured generation) and vLLM (Things involving vLLM support) labels on Jun 11, 2024
@rlouf (Member) commented Jun 11, 2024

Could you run pre-commit locally?

@lapp0 (Contributor) commented Jun 13, 2024

Could you please rebase onto main? Your diff is currently the entire repo.

@paul-grundmann (Contributor, Author) commented

Rebase is done. Don't know what happened there.

Benchmark

I also ran some benchmarks on 128 samples with LLaMA 3 8B on an H100 PCIe GPU:

Without Cache:
128/128 [03:46<00:00,  1.77s/it, Generation Speed: 81.87 toks/s]

Cold Cache:
128/128 [00:41<00:00,  3.11it/s, Generation Speed: 451.46 toks/s]

Warm Cache:
128/128 [00:13<00:00,  9.74it/s, Generation Speed: 1412.09 toks/s]

In addition, here are plots of the GPU utilization during the benchmarks:

[Plot: GPU utilization without cache]

[Plot: GPU utilization with a cold cache]

[Plot: GPU utilization with a fully warmed-up cache]

As the plots show, GPU utilization remains very low without mask caching. A fully warmed-up cache yields a speedup of over 17x and a nearly fully utilized GPU (in terms of both utilization and power consumption). I would expect the gap to be even larger for bigger sample counts.

@rlouf (Member) commented Jun 15, 2024

Looks good to me, thank you for contributing!

@rlouf merged commit 0c1935a into dottxt-ai:main on Jun 16, 2024. 7 checks passed.