
[misc][optimization] optimize data structure in allocator #5968

Status: Open — wants to merge 23 commits into base: main
Conversation

@youkaichao (Member)

The data structure used in the allocator can be made more efficient.

We can use a plain list to track ref counts, given that block ids are consecutive within one allocator.

This simple change already brings some performance benefit:

before: Throughput: 38.93 requests/s, 19932.08 tokens/s

after: Throughput: 39.15 requests/s, 20044.03 tokens/s
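The idea can be sketched as follows. This is a hypothetical `ListRefCounter` to illustrate the technique, not the actual vLLM class; the method names and the `block_index_start`/`block_index_end` parameters are assumptions, and the PR diff is the source of truth:

```python
class ListRefCounter:
    """Ref counter backed by a plain list instead of a dict.

    Valid only because block ids in one allocator form a consecutive
    range [block_index_start, block_index_end), so (id - start) is a
    dense list index and no hashing is needed.
    """

    def __init__(self, block_index_start: int, block_index_end: int):
        self._start = block_index_start
        # One slot per block id in the allocator's range, initially 0.
        self._refcounts = [0] * (block_index_end - block_index_start)

    def incr(self, block_id: int) -> int:
        self._refcounts[block_id - self._start] += 1
        return self._refcounts[block_id - self._start]

    def decr(self, block_id: int) -> int:
        assert self._refcounts[block_id - self._start] > 0
        self._refcounts[block_id - self._start] -= 1
        return self._refcounts[block_id - self._start]

    def get(self, block_id: int) -> int:
        return self._refcounts[block_id - self._start]
```

A dict-based counter pays for hashing and bucket lookups on every access; the list version is a single index computation, which is where the small throughput gain above comes from.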

@cadedaniel (Collaborator)

what batch size is the perf optimization measured at?

@cadedaniel (Collaborator)

OK can we get @alexm-neuralmagic to review this? I want to make sure this optimization is compatible with #5602

@cadedaniel (Collaborator) left a comment:

Looks good, comments are all about cleaning up the code / reducing coupling between interfaces and implementation.

Let's make sure it fits with @alexm-neuralmagic's PR.

Resolved review threads on vllm/core/block/common.py and vllm/core/block/cpu_gpu_block_allocator.py.
Comment on lines +94 to +97
num_blocks: int
block_size: int
block_index_start: int
block_index_end: int
Collaborator:

Let's define these as @property so implementations aren't coupled to the interface beyond the required bits.
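A minimal sketch of the suggestion, assuming a hypothetical allocator class (names are illustrative, not the actual vLLM code): the constructor stores private fields and `@property` exposes them read-only.

```python
class NaiveBlockAllocator:
    """Sketch: fields exposed via read-only @property accessors."""

    def __init__(self, num_blocks: int, block_size: int):
        # Private storage; only the class itself may write these.
        self._num_blocks = num_blocks
        self._block_size = block_size

    @property
    def num_blocks(self) -> int:
        return self._num_blocks

    @property
    def block_size(self) -> int:
        return self._block_size
```

Because no setter is defined, an assignment like `allocator.num_blocks = 32` raises `AttributeError` at runtime, and subclasses can override the property with a different implementation behind the same API.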

Member (Author):

I think @property is overused in the project. It also has a runtime cost, and you need to keep a private self._num_blocks. That's not good in my opinion.

Collaborator:

The runtime cost of @property is negligible if it helps create a better, more maintainable API.

Member (Author):

I don't think @property is more maintainable here. In fact, I think it adds unnecessary complexity: we need to define self._num_blocks as well as @property def num_blocks. If a future class needs to define @property, it can do so, but we don't need to do it here.

Collaborator:

We need to prioritize code quality in the block manager because the software design requirements are so complex. Otherwise we will trend towards the V1 block manager and its problems: untestable, hard to modify, considering the exponential complexity as prefix caching x CoW x swap x lookahead x sliding window are combined.

The concrete things @property buys us here:

  • We don't allow users to write to these fields (only read). This lets implementations trust that their fields are not modified, making them easier to reason about. We don't want to have to closely review every open source PR that goes into the block manager to ensure people don't write to these fields.
  • We allow future implementations to provide different implementations of the same API. This enables performance optimizations (like how this PR required minimal API changes) or different features. We shouldn't trade this off without good reason.

You are correct that this adds overhead (LOAD_ATTR is replaced by a method call). If this significantly impacts ITL, then we need to reconsider the approach. Unfortunately, if we cannot maintain good design due to Python performance in a component that requires good design, it means we have to rewrite it in C++. Let's motivate this decision with data before we compromise on design requirements.

You are correct that there is more code (storing private fields in constructor and returning them in property). This is required to allow only reads to those fields and is well worth the tradeoff. I think it's a small amount of code and confined to single files :)
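The overhead in question is easy to measure directly. A rough micro-benchmark sketch (class names are illustrative; absolute timings vary by machine and interpreter, so no expected numbers are claimed):

```python
import timeit

class Plain:
    """Field read via plain attribute access (LOAD_ATTR)."""

    def __init__(self) -> None:
        self.num_blocks = 16

class WithProperty:
    """Same field read through a @property (descriptor call)."""

    def __init__(self) -> None:
        self._num_blocks = 16

    @property
    def num_blocks(self) -> int:
        return self._num_blocks

plain, prop = Plain(), WithProperty()
# Time a million reads of each; the difference per read is tens of
# nanoseconds, which is the cost being debated here.
t_plain = timeit.timeit(lambda: plain.num_blocks, number=1_000_000)
t_prop = timeit.timeit(lambda: prop.num_blocks, number=1_000_000)
print(f"attribute: {t_plain:.3f}s, property: {t_prop:.3f}s")
```

Whether this per-read cost matters depends on how hot the path is; measuring it against end-to-end ITL, as suggested above, is the way to decide.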

Collaborator:

We can use alternative methods to make a field read-only, like type checking. The downside of using mypy is that it does not have great check coverage and can miss some cases, whereas defining the API with @property ensures users can't go wrong when they're using it.
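A sketch of that type-checking alternative, assuming a hypothetical config class: `typing.Final` marks the field read-only for mypy, but nothing is enforced at runtime, which is exactly the gap being described.

```python
from typing import Final

class AllocatorConfig:
    """Field marked read-only only for the static type checker."""

    def __init__(self, num_blocks: int) -> None:
        # mypy treats writes to this attribute after __init__ as errors.
        self.num_blocks: Final[int] = num_blocks

cfg = AllocatorConfig(16)
# mypy would reject the next line ("Cannot assign to final attribute"),
# but plain Python happily executes it -- the check exists only statically.
cfg.num_blocks = 32  # type: ignore[misc]
print(cfg.num_blocks)  # -> 32
```

So `Final` catches mistakes only for contributors who actually run mypy; a `@property` without a setter fails at runtime for everyone.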

Member (Author):

I think we are overreacting here. Let's make our lives easier. I don't want to add unnecessary complexity where we need to write so many lines just to define a field. Who will modify these fields?

Member (Author):

To summarize, I don't think forcing @property.getter is necessary until I see someone violate this rule, which is very rare in my understanding.

Collaborator:

Unfortunately, it is necessary to maintain the good design of block manager v2. I don't want to regress on that without good reason, and the <1% performance improvement of this PR is not a good enough reason.

Member (Author):

Okay, please go ahead then. I don't have the bandwidth to follow up on this PR.

Resolved review thread on vllm/core/block/prefix_caching_block.py.
@alexm-neuralmagic (Contributor)

Took a look at the PR, and it's a nice optimization: using a list instead of a dict for a sequence of consecutive ids. It works without any issues with the block_manager_v2 optimizations I did, since I simply use the RefCount as is and the API did not change (just the underlying implementation). Thanks for doing this, @youkaichao!

@youkaichao (Member, Author)

> what batch size is the perf optimization measured at?

python benchmarks/benchmark_throughput.py --output-len 256 --input 256 --model meta-llama/Llama-2-7b-hf -tp 8

@youkaichao youkaichao requested a review from Yard1 July 1, 2024 22:04
@cadedaniel (Collaborator)

Sorry, what QPS is this? Just trying to understand when this optimization begins to be impactful.

@youkaichao (Member, Author)

> Sorry, what QPS is this? just trying to understand when this optimization begins to be impactful

Need to check with benchmarks/benchmark_throughput.py. Since this benchmarks throughput, I assume it is a single batch running for 256 tokens.

@cadedaniel (Collaborator)

OK I looked through the benchmark and this is batch size 1000.

@cadedaniel (Collaborator)

Hm, if the improvement is <1% at batch size 1000, I wonder if we need this right now. It might be due to variance, and the property change could offset an improvement this small.

@youkaichao (Member, Author)

> Hm if the improvement is <1% at batch size 1000, I wonder if we need this right now. it might be due to variance, and the property change could impact this small of an improvement.

I think this is a why-not question. When we know all block ids inside an allocator lie in a continuous range, why should we pay the cost of using a dict to store the ref counts?

@njhill (Member)

njhill commented Jul 2, 2024

I like the simplification aspect of this but agree if it hampers the design / clean abstractions then it should be justified performance-wise.

Also note that tp=8 with a 7b model will magnify these kinds of performance differences in relative terms, but it is not a realistic production configuration; presumably the percentage gain would be quite a bit smaller with, e.g., tp=4 and a 70b model.

Labels: none yet
Projects: none yet
5 participants