Conversation

@smirnov-alexey (Contributor)
Depends on lazy I/O #32277.
KV-cache sharing taken from dmatveev#19 (kudos to Xiong).

```cpp
auto tokens_in_past_chunks = kvcache_desc.num_stored_tokens - m_tokens_in_present_chunk;
if (tokens_in_past_chunks > 0) {
    auto prefill_past_kv = m_prefill_request->get_tensor(m_prefill_in_ports.at(input_name));
    auto prefill_past_kv_chunks = uu::make_tensor_slice(prefill_past_kv,
```
@smirnov-alexey (Contributor, Author) · Oct 31, 2025
There is an issue with the changes in copy_kvcache() (segfault / double free or corruption / .as() access).

@smirnov-alexey (Contributor, Author):
The issue was coming from #32277. Fixed in a54d529

@dmatveev dmatveev added this to the 2026.0 milestone Oct 31, 2025
```cpp
                                                        pre_kv_dim,
                                                        0u,
                                                        static_cast<uint32_t>(tokens_in_past_chunks));
auto prefill_past_kv_chunks = make_tensor_slice(prefill_past_kv,
```
@intelgaoxiong (Contributor) · Nov 4, 2025
@smirnov-alexey
Given that past KV is shared now, don't we need a temp buffer to reorder the data, like this?
https://github.com/dmatveev/openvino/pull/19/files#diff-b96c675e99b3f5c4633066fdd94631c676fd3d2b21e1c8c008abfbf893cc4bc9R546

Otherwise, the results will be incorrect.

Data corruption will happen at line 482:

```cpp
uu::copy_tensor_by_dim(prefill_past_kv_chunks, kvcache_past_kv_chunks, pre_kv_dim, gen_kv_dim);
```

Copying from buffer A to buffer A will cause data corruption. That's why we need a temp buffer here.

@smirnov-alexey (Contributor, Author):

I think that if we can now create a strided tensor on line 297, then we can omit the copy here and omit uu::copy_tensor_by_dim for the past KV tensors, as we only need one for the present?

@smirnov-alexey (Contributor, Author):

However, the old code might need to be preserved for the case pre_kv_dim != past_kv_dim:


```cpp
uu::copy_tensor_by_dim(prefill_past_kv_chunks, kvcache_past_kv_chunks, pre_kv_dim, gen_kv_dim);
if (!m_past_kv_bound) {
    uu::copy_tensor_by_dim(prefill_past_kv_chunks, kvcache_past_kv_chunks, pre_kv_dim, gen_kv_dim);
```
Contributor:

@smirnov-alexey Can you get correct results? I'm seeing an accuracy issue caused by missing the changes here:
https://github.com/dmatveev/openvino/pull/19/files#diff-b96c675e99b3f5c4633066fdd94631c676fd3d2b21e1c8c008abfbf893cc4bc9

As far as I know, the stride parameter was only recently introduced and supported. Is it compatible with earlier software (drivers / compiler)?

@smirnov-alexey (Contributor, Author):

Stride is now supported within NPUW as per https://github.com/openvinotoolkit/openvino/pull/32025/files#diff-080aa03dd745b82397d17423d554abaa2c9d0bd57c66796470a237b607798232R203 (we do a copy on host if there is a stride). With this PR I get accurate results for both 1k and 4k inputs.

@intelgaoxiong (Contributor) · Nov 9, 2025

We can get correct results with the tensor strides fix in 4ab490b. Thank you, @smirnov-alexey!

Please note that pyramid attention needs to be changed accordingly, like below, because the past KV tensor is not continuous when it's shared.

Do you think the changes can be included in this PR? Thanks!

```diff
diff --git a/src/plugins/intel_npu/src/plugin/npuw/base_sync_infer_request.cpp b/src/plugins/intel_npu/src/plugin/npuw/base_sync_infer_request.cpp
index c626ac3cdd..bf728e9775 100644
--- a/src/plugins/intel_npu/src/plugin/npuw/base_sync_infer_request.cpp
+++ b/src/plugins/intel_npu/src/plugin/npuw/base_sync_infer_request.cpp
@@ -800,7 +800,7 @@ void ov::npuw::IBaseInferRequest::bind_pyramid_attention_inputs(std::size_t idx,
             LOG_BLOCK();
 
             // Optimization for the last chunk: Direct tensor reuse when shapes match
-            if (static_cast<int64_t>(input_shape[param.dim]) == past_len) {
+            if (static_cast<int64_t>(input_shape[param.dim]) == past_len && input->is_continuous()) {
                 request->set_tensor(iport, input);
                 continue;
             }
```

@smirnov-alexey (Contributor, Author):
> We can get correct results with the tensor strides fix in 4ab490b. Thank u! @smirnov-alexey
>
> Please note that pyramid attention need to be changed accordingly like below because the past kv tensor is not continuous when it's shared.
>
> Do you think the changes can be included in this PR? Thanks!
Done, thanks!

@smirnov-alexey (Contributor, Author):
build_jenkins

@intelgaoxiong (Contributor) left a review comment:

LGTM.
Thank u! @smirnov-alexey
