[V1] Prefix caching #9668
Conversation
This is amazing! Do you happen to have any performance benchmarks?
I plan to run some tests today and will report back once I have something.
Benchmark
Benchmark Prefix Caching
Note that I disabled the warmup phase in this script because after warming up we would be benchmarking exactly the same requests, which is not practical.
Command
Benchmark Serving
Server command
Client command
Full Results
After some thought, I feel the current approach of lazily removing reused blocks may be inefficient when the cache hit rate is high. It might still be better to implement a simple LRU cache with a doubly linked list. I'll try it and benchmark later.
QQ: How much perf do we lose if we enable prefix caching but get a 0% cache hit rate?
I did a serving benchmark before and didn't observe an obvious performance regression with a 0% cache hit rate. I could run another experiment later.
This is an awesome writeup, thanks @comaniac, makes a lot of sense to me. One thing we could think about is separating the cache maintenance operations out as something that can be done in parallel with the forward pass, e.g. instead of a DLL (though I guess a DLL should also be efficient, so perhaps there's negligible value in that).
Yeah, a DLL is definitely efficient in terms of time complexity, but adding a node requires creating a Python object every time. I'm afraid this may introduce non-negligible latency overhead. Updating the cache asynchronously is an interesting idea and I'll think about that!
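For reference, a doubly-linked-list free queue could look roughly like the sketch below; `DLLNode` and `DoublyLinkedFreeList` are hypothetical names for illustration, not classes from this PR. Every `push` creates a new Python node object, which is the per-operation overhead being discussed.

```python
class DLLNode:
    """One list node per free block. Creating this object on every push is
    the Python-level overhead discussed above."""
    __slots__ = ("block_id", "prev", "next")

    def __init__(self, block_id):
        self.block_id = block_id
        self.prev = None
        self.next = None


class DoublyLinkedFreeList:
    """Free blocks kept in LRU order, with O(1) push and O(1) removal."""

    def __init__(self):
        self.head = None    # least recently freed (evicted first)
        self.tail = None    # most recently freed
        self.nodes = {}     # block_id -> DLLNode, enables O(1) removal

    def push(self, block_id):
        node = DLLNode(block_id)   # a new Python object per freed block
        self.nodes[block_id] = node
        if self.tail is None:
            self.head = self.tail = node
        else:
            node.prev = self.tail
            self.tail.next = node
            self.tail = node

    def remove(self, block_id):
        """Remove a block that got a cache hit while sitting in the free list."""
        node = self.nodes.pop(block_id)
        if node.prev is None:
            self.head = node.next
        else:
            node.prev.next = node.next
        if node.next is None:
            self.tail = node.prev
        else:
            node.next.prev = node.prev
```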
Updated: Thanks @njhill for the tips. I'm now using a separate thread to perform the costly O(n) operations asynchronously at the end of each schedule step, so that they can run in parallel with the forward pass. Here are the benchmark results:
Server command
Client command (37% hit rate)
Full Results
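For illustration only, the thread-based approach described above could be structured roughly like this sketch (the `AsyncCacheMaintainer` name and the integration points are my assumptions, not the PR's actual code): the O(n) bookkeeping is submitted to a single worker thread at the end of a schedule step and joined before the next step touches the block pool.

```python
from concurrent.futures import ThreadPoolExecutor


class AsyncCacheMaintainer:
    """Runs O(n) cache bookkeeping off the critical path so it can overlap
    with the GPU forward pass (hypothetical sketch)."""

    def __init__(self):
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._pending = None

    def schedule(self, maintenance_fn, *args):
        # Called at the end of a schedule step, right before the forward pass
        # is launched; the maintenance work runs concurrently with it.
        self._pending = self._executor.submit(maintenance_fn, *args)

    def wait(self):
        # Called before the next scheduling step reads or mutates the block
        # pool, so the maintenance never races with scheduling.
        if self._pending is not None:
            self._pending.result()
            self._pending = None
```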
Thanks @comaniac. I think the plan is to have SPMD-style workers, where even with a single GPU the worker will run in a separate process from the scheduler. We can then move the async maintenance to happen between sending/receiving the input/output for each step. Alternatively this could be achieved with a callback, similar to what's done in V0 with the async output processing, but I think @WoosukKwon wanted to avoid that if possible. Otherwise, having a separate thread might interfere with the subsequent critical-loop processing before it reaches the GPU forward pass.
cc @tlrmchlsmth re: workers
Yeah, if we have an async scheduler then it makes sense, but at this moment we don't have that (and I believe we will have a sync scheduler anyway). What you mentioned can definitely be the case (we asynchronously process operations before entering the forward pass on the GPU). I don't have a better idea right now, though. Which approach do you think is preferable then: the previous approach with some extra overhead when the hit rate is high, or the DLL approach that introduces more code and data structures?
Wow, these results look good even at a low cache hit rate
```python
# block_hashes is a chain of block hashes. If a block hash is not
# in the cached_block_hash_to_id, the following block hashes are
# not computed yet for sure.
if cached_block := self._get_cached_block(block_hash):
```
I found one corner case that I had to handle carefully in V0:
- If a sequence has 3 blocks, [b0, b1, b2], and b0, b1 are both cached, but already evicted.
- There are only 2 blocks that could be allocated (i.e. the already freed b0, b1).
When determining if the sequence can be allocated, IIUC, the current impl would:
- See that there are 2 cached blocks (b0, b1)
- Calculate that only b2 contains new tokens
- Decide it is allocatable?
But in fact, this would run out of blocks, because as one allocates b0 and b1, there are no more blocks left for b2.
> If a sequence has 3 blocks, [b0, b1, b2], and b0, b1 are both cached, but already evicted.

Note that the term "evict" means the block is no longer in the free queue nor in the cached block map: it has been re-allocated to store new tokens. I guess what you meant is that b0 and b1 are in the free queue but not yet evicted? In that case, yes, we could allocate b0 and b1 to another request. If b0 and b1 are evicted, then a new request won't hit the cache.

> But in fact, this would run out of blocks, because as one allocates b0 and b1, there are no more blocks left for b2.

This should not happen, because when we reuse b0 and b1 for request A, both of them will be removed from the free queue (and num_free_blocks decreases by 2).
Oh yeah, I meant in the free queue. I think I might still be missing something in the V1 impl. I guess the scenario I wanted to clarify is when b0 and b1 are in the free queue.
Re-take at #9972
This PR adds prefix caching to V1.
Data Structure
Algorithms
Allocate Slots
When a request is scheduled for the first time, allocate_slots() is used to allocate blocks based on the currently scheduled prompt tokens. If the prompt is chunked due to chunked prefill, we only allocate blocks for the scheduled tokens. In addition to the scheduled tokens, we also pre-allocate empty blocks to reduce allocation overheads.

With prefix caching, when we attempt to allocate a full block, we compute its block hash and query the cached block map. There are 3 possible outcomes:
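As I read it from Notes 1 and 2 below, the three cases are: a hit on a cached block still in use, a hit on a cached block sitting in the free block queue, and a miss. A minimal sketch of this decision, using illustrative names only (`Block`, `cached_blocks`, `free_queue`, `lazy_removed` are assumptions, not the PR's identifiers):

```python
from dataclasses import dataclass


@dataclass
class Block:
    block_id: int
    ref_cnt: int = 0


def allocate_full_block(block_hash, cached_blocks, free_queue, lazy_removed):
    """Illustrative sketch of the three outcomes when allocating a full block."""
    block = cached_blocks.get(block_hash)          # dict: block hash -> Block
    if block is not None:
        if block.ref_cnt > 0:
            # Outcome 1: cache hit on a block still in use -- just share it.
            block.ref_cnt += 1
        else:
            # Outcome 2: cache hit on a block sitting in the free block queue
            # -- reuse it and mark it for lazy removal (see Note 1 below).
            lazy_removed.add(block.block_id)
            block.ref_cnt = 1
        return block
    # Outcome 3: cache miss -- pop a fresh block from the free queue; its
    # token IDs and hash are filled in by the caller (see Note 2 below).
    # (Evicting the popped block's old hash from the cache is omitted here.)
    new_block = free_queue.popleft()               # deque of free Blocks
    new_block.ref_cnt = 1
    return new_block
```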
Note 1: When we get a cache hit on a block that is in the free block queue, we put the block in a "lazy remove set" instead of immediately removing it from the queue, because removing an arbitrary element from the queue takes O(N). Instead, when we allocate a new block and the block at the front of the queue is marked for lazy removal, we pop it and move on to the next one.
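A minimal sketch of the lazy-removal idea, assuming a plain `collections.deque` plus a set (the `FreeBlockQueue` name and its methods are illustrative, not the PR's actual class):

```python
from collections import deque


class FreeBlockQueue:
    """Free block IDs in eviction (LRU) order, with O(1) 'lazy' removal."""

    def __init__(self):
        self._queue = deque()
        self._lazy_removed = set()     # IDs of blocks reused via a cache hit

    def push(self, block_id):
        # Freed blocks go to the back; the front is evicted first.
        self._queue.append(block_id)

    def mark_reused(self, block_id):
        # O(1) instead of an O(N) deque.remove(): just remember the ID.
        self._lazy_removed.add(block_id)

    def pop_free_block(self):
        # Skip blocks that were reused (cache hit) after being freed.
        while True:
            block_id = self._queue.popleft()   # IndexError if out of blocks
            if block_id in self._lazy_removed:
                self._lazy_removed.discard(block_id)
                continue
            return block_id
```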
Note 2: On a cache miss, we allocate a new block and add the token IDs to it so that its hash can be constructed. The block is also added to the cache once it is full.
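The chained hash in Note 2 could look roughly like the helper below (an assumption for illustration; the real implementation may hash additional fields, and Python's built-in `hash()` is used here only for brevity):

```python
def hash_block_tokens(prev_block_hash, token_ids):
    """Hash of a full block, chained with the previous block's hash, so that
    the same token IDs at different positions in a prompt hash differently."""
    return hash((prev_block_hash, tuple(token_ids)))
```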
Append Slots
When a request is scheduled again, append_slots() is used to allocate more blocks if needed. This can happen during continued chunked prefill or decode. Here are the steps of append_slots:
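Only as a rough sketch of what such an append path might look like, under my own assumptions (not the PR's actual steps; helpers such as `preallocated_blocks`, `newly_full_blocks`, and `cache_full_block` are hypothetical):

```python
def append_slots(request, num_new_tokens, block_size, block_pool):
    """Hypothetical sketch; not the PR's actual implementation."""
    # 1) Use the free slots left in the last (partial) block.
    free_slots_in_last_block = -request.num_computed_tokens % block_size
    remaining = max(0, num_new_tokens - free_slots_in_last_block)

    # 2) Allocate any additional blocks, preferring pre-allocated ones.
    num_new_blocks = (remaining + block_size - 1) // block_size
    for _ in range(num_new_blocks):
        if request.preallocated_blocks:
            block = request.preallocated_blocks.pop()
        else:
            block = block_pool.pop_free_block()
        request.block_table.append(block)  # the block table is append-only

    # 3) Blocks that just became full are hashed and added to the cache.
    for block in request.newly_full_blocks():
        block_pool.cache_full_block(block)
```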
Free

When a request is done, the reference count of each of its blocks is decreased by 1. If a block now has 0 references, it is freed (pushed to the free block queue). Note that since we allocate new blocks by popping from the free block queue, the block order in the queue is also the eviction order. Since we use an LRU eviction policy, blocks are evicted in the order in which they were freed.
We maintain the above order by pushing a request's freed blocks to the queue in reverse order, so that the blocks toward the end of the request (which are less likely to be reused as a shared prefix) are evicted before the earlier ones.
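A minimal sketch of this free path, reusing the hypothetical `FreeBlockQueue` from the Note 1 sketch above (the `request` and `block` fields are likewise illustrative):

```python
def free_request_blocks(request, free_queue):
    """Return a finished request's blocks to the free queue in reverse order,
    so tail blocks (least likely to be reused as a shared prefix) are evicted
    before the earlier prefix blocks."""
    for block in reversed(request.block_table):
        block.ref_cnt -= 1
        if block.ref_cnt == 0:
            # The block's hash stays in the cache until the block is actually
            # evicted (popped from the free queue), so it can still be hit.
            free_queue.push(block.block_id)
```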
Get Computed Blocks
Before calling allocate_slots(), the scheduler calls get_computed_block_ids() to know how many blocks hit the cache. This function simply computes the hashes of full blocks and queries the cache for existing block IDs. It won't allocate any block or change any block metadata.
Duplication

Since V1 prepares inputs incrementally, the block table is append-only. This results in potential duplication, as shown below. Suppose we have 2 identical requests (the same prompt with greedy sampling) arriving at different times:
Time 1
Time 2
Time 3
Time 4
At time 4, block 4 becomes full and has the same hash and content as block 3. In the vLLM V0 block manager, we would free block 4 and assign block 3 to req2 in the next step. However, we cannot do this in V1 because the block table is append-only. As a result, at this moment the cache will contain duplicated blocks (block 3 and block 4 with the same content).
We consider this acceptable in practical use cases, because:
cc @WoosukKwon @zhuohan123