
[misc][optimization] optimize data structure in allocator #5968

Status: Open — wants to merge 23 commits into base: main
Conversation

@youkaichao (Member)

The data structure used in the allocator can be made more efficient.

We can use a plain list to track ref counts, given that block ids are consecutive within one allocator.

This simple change already brings some performance benefit:

before: Throughput: 38.93 requests/s, 19932.08 tokens/s

after: Throughput: 39.15 requests/s, 20044.03 tokens/s
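The idea can be sketched as follows. This is a hypothetical `ListRefCounter` to illustrate the technique, not the actual vLLM class; the method names and the `block_index_start`/`block_index_end` parameters are assumptions, and the PR diff is the source of truth:

```python
class ListRefCounter:
    """Ref counter backed by a plain list instead of a dict.

    Valid only because block ids in one allocator form a consecutive
    range [block_index_start, block_index_end), so (id - start) is a
    dense list index and no hashing is needed.
    """

    def __init__(self, block_index_start: int, block_index_end: int):
        self._start = block_index_start
        # One slot per block id in the allocator's range, initially 0.
        self._refcounts = [0] * (block_index_end - block_index_start)

    def incr(self, block_id: int) -> int:
        self._refcounts[block_id - self._start] += 1
        return self._refcounts[block_id - self._start]

    def decr(self, block_id: int) -> int:
        assert self._refcounts[block_id - self._start] > 0
        self._refcounts[block_id - self._start] -= 1
        return self._refcounts[block_id - self._start]

    def get(self, block_id: int) -> int:
        return self._refcounts[block_id - self._start]
```

A dict-based counter pays for hashing and bucket lookups on every access; the list version is a single index computation, which is where the small throughput gain above comes from.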

@cadedaniel (Collaborator)

what batch size is the perf optimization measured at?

@cadedaniel (Collaborator)

OK can we get @alexm-neuralmagic to review this? I want to make sure this optimization is compatible with #5602

@cadedaniel (Collaborator) left a comment:

Looks good, comments are all about cleaning up the code / reducing coupling between interfaces and implementation.

Let's make sure it fits with @alexm-neuralmagic's PR.

Resolved review threads on vllm/core/block/common.py and vllm/core/block/cpu_gpu_block_allocator.py.
Comment on lines +94 to +97
num_blocks: int
block_size: int
block_index_start: int
block_index_end: int
Collaborator:

Let's define these as @property so implementations aren't coupled to the interface beyond the required bits.
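A minimal sketch of the suggestion, assuming a hypothetical allocator class (names are illustrative, not the actual vLLM code): the constructor stores private fields and `@property` exposes them read-only.

```python
class NaiveBlockAllocator:
    """Sketch: fields exposed via read-only @property accessors."""

    def __init__(self, num_blocks: int, block_size: int):
        # Private storage; only the class itself may write these.
        self._num_blocks = num_blocks
        self._block_size = block_size

    @property
    def num_blocks(self) -> int:
        return self._num_blocks

    @property
    def block_size(self) -> int:
        return self._block_size
```

Because no setter is defined, an assignment like `allocator.num_blocks = 32` raises `AttributeError` at runtime, and subclasses can override the property with a different implementation behind the same API.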

Member (Author):

I think @property is overused in the project. It also has a runtime cost, and you need to keep a private self._num_blocks. That's not good in my opinion.

Collaborator:

The runtime cost of @property is negligible if it helps create a better, more maintainable API.

Member (Author):

I don't think @property is more maintainable here. In fact, I think it adds unnecessary complexity: we need to define self._num_blocks as well as @property def num_blocks. If a future class needs to define @property, it can do so, but we don't need to do it here.

Collaborator:

We need to prioritize code quality in the block manager because the software design requirements are so complex. Otherwise we will trend towards the V1 block manager and its problems: untestable, hard to modify, considering the exponential complexity as prefix caching x CoW x swap x lookahead x sliding window are combined.

The concrete things @property buys us here:

  • We don't allow users to write to these fields (only read). This lets implementations trust that their fields are not modified, making them easier to reason about. We don't want to have to closely review every open source PR that goes into the block manager to ensure people don't write to these fields.
  • We allow future implementations to provide different implementations of the same API. This enables performance optimizations (like how this PR required minimal API changes) or different features. We shouldn't trade this off without good reason.

You are correct that this adds overhead (LOAD_ATTR is replaced by a method call). If this significantly impacts ITL, then we need to reconsider the approach. Unfortunately, if we cannot maintain good design due to Python performance in a component that requires good design, it means we have to rewrite it in C++. Let's motivate this decision with data before we compromise on design requirements.

You are correct that there is more code (storing private fields in constructor and returning them in property). This is required to allow only reads to those fields and is well worth the tradeoff. I think it's a small amount of code and confined to single files :)
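The overhead in question is easy to measure directly. A rough micro-benchmark sketch (class names are illustrative; absolute timings vary by machine and interpreter, so no expected numbers are claimed):

```python
import timeit

class Plain:
    """Field read via plain attribute access (LOAD_ATTR)."""

    def __init__(self) -> None:
        self.num_blocks = 16

class WithProperty:
    """Same field read through a @property (descriptor call)."""

    def __init__(self) -> None:
        self._num_blocks = 16

    @property
    def num_blocks(self) -> int:
        return self._num_blocks

plain, prop = Plain(), WithProperty()
# Time a million reads of each; the difference per read is tens of
# nanoseconds, which is the cost being debated here.
t_plain = timeit.timeit(lambda: plain.num_blocks, number=1_000_000)
t_prop = timeit.timeit(lambda: prop.num_blocks, number=1_000_000)
print(f"attribute: {t_plain:.3f}s, property: {t_prop:.3f}s")
```

Whether this per-read cost matters depends on how hot the path is; measuring it against end-to-end ITL, as suggested above, is the way to decide.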

Collaborator:

We can use alternative methods to make a field read-only, like type checking. The downside of using mypy is that it does not have great check coverage and can miss some cases, whereas defining the API with @property ensures users can't go wrong when they're using it.
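A sketch of that type-checking alternative, assuming a hypothetical config class: `typing.Final` marks the field read-only for mypy, but nothing is enforced at runtime, which is exactly the gap being described.

```python
from typing import Final

class AllocatorConfig:
    """Field marked read-only only for the static type checker."""

    def __init__(self, num_blocks: int) -> None:
        # mypy treats writes to this attribute after __init__ as errors.
        self.num_blocks: Final[int] = num_blocks

cfg = AllocatorConfig(16)
# mypy would reject the next line ("Cannot assign to final attribute"),
# but plain Python happily executes it -- the check exists only statically.
cfg.num_blocks = 32  # type: ignore[misc]
print(cfg.num_blocks)  # -> 32
```

So `Final` catches mistakes only for contributors who actually run mypy; a `@property` without a setter fails at runtime for everyone.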

Member (Author):

I think we are overreacting here. Let's make our lives easier. I don't want to add unnecessary complexity where we need to write so many lines just to define a field. Who will modify these fields?

Member (Author):

To summarize, I don't think forcing @property.getter is necessary until I see someone violate this rule, which is very rare in my understanding.

Collaborator:

Unfortunately, it is necessary to maintain the good design of block manager v2. I don't want to regress on that without good reason, and the <1% performance improvement of this PR is not a good enough reason.

Member (Author):

Okay, please go ahead then. I don't have the bandwidth to follow up on this PR.

Resolved review thread on vllm/core/block/prefix_caching_block.py.
@alexm-neuralmagic (Contributor)

Took a look at the PR, and it's a nice optimization: using a list instead of a dict for a sequence of consecutive ids. It works without any issues with the block_manager_v2 optimizations I did, since I simply use the RefCount as is and the API did not change (just the underlying implementation). Thanks for doing this, @youkaichao!

@youkaichao (Member, Author)

> what batch size is the perf optimization measured at?

python benchmarks/benchmark_throughput.py --output-len 256 --input 256 --model meta-llama/Llama-2-7b-hf -tp 8

@youkaichao youkaichao requested a review from Yard1 July 1, 2024 22:04
@cadedaniel (Collaborator)

Sorry, what QPS is this? Just trying to understand when this optimization begins to be impactful.

@youkaichao (Member, Author)

> Sorry, what QPS is this? just trying to understand when this optimization begins to be impactful

Need to check with benchmarks/benchmark_throughput.py. Since this benchmarks throughput, I assume it is a single batch running for 256 tokens.

@cadedaniel (Collaborator)

OK I looked through the benchmark and this is batch size 1000.

@cadedaniel (Collaborator)

Hm, if the improvement is <1% at batch size 1000, I wonder if we need this right now. It might be due to variance, and the property change could offset an improvement this small.

@youkaichao (Member, Author)

> Hm if the improvement is <1% at batch size 1000, I wonder if we need this right now. it might be due to variance, and the property change could impact this small of an improvement.

I think this is a why-not question. When we know all block ids inside an allocator lie in a continuous range, why should we pay the cost of using a dict to store the ref counts?

@njhill (Member)

njhill commented Jul 2, 2024

I like the simplification aspect of this but agree if it hampers the design / clean abstractions then it should be justified performance-wise.

Also note that tp=8 with a 7b model will magnify these kinds of performance differences in relative terms, but it is not a realistic production configuration; presumably the percentage gain would be quite a bit smaller with, e.g., tp=4 and a 70b model.

Labels: none yet
Projects: none yet
5 participants