[NPUW] Share kvcache between prefill and generate when chunking is enabled #32642
Conversation
```cpp
auto tokens_in_past_chunks = kvcache_desc.num_stored_tokens - m_tokens_in_present_chunk;
if (tokens_in_past_chunks > 0) {
    auto prefill_past_kv = m_prefill_request->get_tensor(m_prefill_in_ports.at(input_name));
    auto prefill_past_kv_chunks = uu::make_tensor_slice(prefill_past_kv,
```
There is an issue with the changes in `copy_kvcache()` (segfault / double free or corruption / `.as()` access).
```cpp
                                                        pre_kv_dim,
                                                        0u,
                                                        static_cast<uint32_t>(tokens_in_past_chunks));
auto prefill_past_kv_chunks = make_tensor_slice(prefill_past_kv,
```
@smirnov-alexey
Given that the past KV is now shared, don't we need a temp buffer to reorder the data, like this?
https://github.com/dmatveev/openvino/pull/19/files#diff-b96c675e99b3f5c4633066fdd94631c676fd3d2b21e1c8c008abfbf893cc4bc9R546
Otherwise, the results will be incorrect. Data corruption will happen at line 482:
`uu::copy_tensor_by_dim(prefill_past_kv_chunks, kvcache_past_kv_chunks, pre_kv_dim, gen_kv_dim);`
Copying from buffer A into buffer A causes data corruption. That's why we need a temp buffer here.
I think if we can now create a strided tensor on line 297, then we can omit the copy here and omit `uu::copy_tensor_by_dim` for the past KV tensors, since we only need it for the present one?
However, the old code might need to be preserved for the case `pre_kv_dim != past_kv_dim`.
```diff
-uu::copy_tensor_by_dim(prefill_past_kv_chunks, kvcache_past_kv_chunks, pre_kv_dim, gen_kv_dim);
+if (!m_past_kv_bound) {
+    uu::copy_tensor_by_dim(prefill_past_kv_chunks, kvcache_past_kv_chunks, pre_kv_dim, gen_kv_dim);
```
@smirnov-alexey Can you get correct results?
I'm seeing an accuracy issue caused by missing the changes here:
https://github.com/dmatveev/openvino/pull/19/files#diff-b96c675e99b3f5c4633066fdd94631c676fd3d2b21e1c8c008abfbf893cc4bc9
As far as I know, the stride parameter was only recently introduced and supported. Is it compatible with earlier SW (drivers / compiler)?
Stride is now supported within NPUW as per https://github.com/openvinotoolkit/openvino/pull/32025/files#diff-080aa03dd745b82397d17423d554abaa2c9d0bd57c66796470a237b607798232R203 (we do the copy on host if there is a stride). With this PR I get accurate results for both 1k and 4k inputs.
We can get correct results with the tensor strides fix in 4ab490b. Please note that pyramid attention needs to be changed accordingly, as below, because the past KV tensor is not contiguous when it is shared. Do you think the changes can be included in this PR? Thanks!
Done, thanks!

build_jenkins
intelgaoxiong left a comment:
LGTM. Thank you, @smirnov-alexey!
Depends on lazy I/O #32277.
KV-cache sharing taken from dmatveev#19 (kudos to Xiong).