[NPUW] Share kvcache between prefill and generate when chunking is enabled #32642
Conversation
```cpp
auto tokens_in_past_chunks = kvcache_desc.num_stored_tokens - m_tokens_in_present_chunk;
if (tokens_in_past_chunks > 0) {
    auto prefill_past_kv = m_prefill_request->get_tensor(m_prefill_in_ports.at(input_name));
    auto prefill_past_kv_chunks = uu::make_tensor_slice(prefill_past_kv,
```
There is an issue with the changes in `copy_kvcache()` (segfault / double free or corruption / `.as()` access).
```cpp
                                                        pre_kv_dim,
                                                        0u,
                                                        static_cast<uint32_t>(tokens_in_past_chunks));
auto prefill_past_kv_chunks = make_tensor_slice(prefill_past_kv,
```
@smirnov-alexey
Given that the past KV is now shared, don't we need a temp buffer to reorder the data, like this?
https://github.com/dmatveev/openvino/pull/19/files#diff-b96c675e99b3f5c4633066fdd94631c676fd3d2b21e1c8c008abfbf893cc4bc9R546
Otherwise, the results will be incorrect. Data corruption will happen at line 482:
`uu::copy_tensor_by_dim(prefill_past_kv_chunks, kvcache_past_kv_chunks, pre_kv_dim, gen_kv_dim);`
Copying from buffer A into buffer A causes data corruption. That's why we need a temp buffer here.
I think if we can now create a strided tensor on line 297, then we can omit the copy here and omit `uu::copy_tensor_by_dim` for the past KV tensors, since we only need it for the present one?
However, the old code might need to be preserved for the case `pre_kv_dim != past_kv_dim`.
```diff
-uu::copy_tensor_by_dim(prefill_past_kv_chunks, kvcache_past_kv_chunks, pre_kv_dim, gen_kv_dim);
+if (!m_past_kv_bound) {
+    uu::copy_tensor_by_dim(prefill_past_kv_chunks, kvcache_past_kv_chunks, pre_kv_dim, gen_kv_dim);
```
@smirnov-alexey Can you get correct results?
I'm seeing an accuracy issue caused by missing the changes here:
https://github.com/dmatveev/openvino/pull/19/files#diff-b96c675e99b3f5c4633066fdd94631c676fd3d2b21e1c8c008abfbf893cc4bc9
As far as I know, the stride parameter was only recently introduced and supported. Is it compatible with earlier SW (drivers / compiler)?
Stride is now supported within NPUW as per https://github.com/openvinotoolkit/openvino/pull/32025/files#diff-080aa03dd745b82397d17423d554abaa2c9d0bd57c66796470a237b607798232R203 (we do the copy on host if there is a stride). With this PR I get accurate results for both 1k and 4k inputs.
We can get correct results with the tensor strides fix in 4ab490b. Please note that pyramid attention needs to be changed accordingly, as below, because the past KV tensor is not contiguous when it is shared. Do you think the changes can be included in this PR? Thanks!
Done, thanks!

build_jenkins
intelgaoxiong left a comment:
LGTM. Thank you, @smirnov-alexey!
Depends on lazy I/O #32277.
KV-cache sharing taken from dmatveev#19 (kudos to Xiong).