Mask cache performance optimization for vLLM #939
Conversation
My worry with limiting the number of tokens used for the hash is that we might return identical masks for different lists of tokens; it is not impossible that the first 2048 allowed tokens are the same while the subsequent ones differ.
I've made additional optimizations and determined that caching based solely on the FSM state id is sufficient. This eliminates the need for the sometimes expensive hashing of the allowed input tokens. As a result, on our A100 and H100 hardware, generation is nearly as fast as it is without a logits_processor.
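For context, here is a minimal sketch of what caching keyed by FSM state id might look like. This is not the PR's actual code; the class name, the `fsm.allowed_token_ids` interface, and the assumption of one 1-D logits tensor per call are all illustrative:

```python
import math
from typing import Dict, List

import torch


class CachedMaskLogitsProcessor:
    """Sketch: reuse one on-device mask tensor per FSM state.

    The FSM state id fully determines the set of allowed tokens, so it
    can serve directly as the cache key, avoiding any hashing of the
    (possibly long) allowed-token list.
    """

    def __init__(self, fsm):
        # Assumed interface: fsm.allowed_token_ids(state_id) -> List[int]
        self.fsm = fsm
        self._mask_cache: Dict[int, torch.Tensor] = {}

    def __call__(self, state_id: int, logits: torch.Tensor) -> torch.Tensor:
        mask = self._mask_cache.get(state_id)
        if mask is None:
            # Build the additive mask once per FSM state and keep it on
            # the same device as the logits, so later steps reuse it
            # without rebuilding or transferring it.
            mask = torch.full_like(logits, -math.inf)
            allowed: List[int] = self.fsm.allowed_token_ids(state_id)
            mask[allowed] = 0.0
            self._mask_cache[state_id] = mask
        return logits + mask
```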
Could you run …
Could you please rebase onto …
Force-pushed from cf9105c to c247c31.
Force-pushed from c247c31 to 1385324.
Rebase is done. Don't know what happened there.

Benchmark

I also ran some benchmarks on 128 samples using an H100 PCIe GPU and LLaMA3 8B:
In addition, here are plots of GPU utilization during the benchmark:

[Plots: GPU utilization with no cache, cold cache, and warm cache]

As depicted in the plots, without mask caching the GPU utilization remains very low. A fully warmed-up cache results in an improvement of over 17x and a nearly fully used GPU (in terms of both utilization and power consumption). I would estimate that the difference could be even larger for greater numbers of samples.
Looks good to me, thank you for contributing!
Problem
The current implementation allocates a mask for every token during generation, which significantly impacts performance.
Proposed Solution
To improve performance, we can cache the mask on the device, since it depends only on the allowed tokens from the FSM. Additionally, limiting the input to the hash function to the first 2k tokens results in a notable speedup.
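As a rough sketch of this idea (illustrative only; the helper name, the key construction, and the length guard are assumptions, not the PR's code), the cache key could be built from the first 2048 allowed token ids:

```python
from typing import Dict, List, Tuple

import torch

HASH_PREFIX_LEN = 2048  # assumed cutoff from the proposal ("first 2k tokens")

_mask_cache: Dict[Tuple[int, int], torch.Tensor] = {}


def get_mask(allowed_tokens: List[int], vocab_size: int,
             device: torch.device) -> torch.Tensor:
    # Hash only a prefix of the allowed-token list; the list length is
    # added to the key as a cheap, illustrative guard against lists that
    # share the prefix but differ in size.
    key = (hash(tuple(allowed_tokens[:HASH_PREFIX_LEN])), len(allowed_tokens))
    mask = _mask_cache.get(key)
    if mask is None:
        # Allocate the mask directly on the target device so it never
        # needs to be rebuilt or transferred for this set of tokens.
        mask = torch.full((vocab_size,), float("-inf"), device=device)
        mask[allowed_tokens] = 0.0
        _mask_cache[key] = mask
    return mask
```

Note that this avoids re-allocating a mask per generated token, but two different allowed-token lists sharing the same first 2048 entries and the same length would still collide, which is the risk discussed below.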
Discussion
While using only the first 2k tokens for the hash may introduce cache collisions, the likelihood of such collisions is very low.
TODO