
Static llm pipeline: stateful model #1240

Conversation

@AsyaPronina (Contributor) commented Nov 20, 2024

@github-actions github-actions bot added category: LLM LLM pipeline (stateful, static) category: samples GenAI samples labels Nov 20, 2024
@AsyaPronina AsyaPronina marked this pull request as draft November 20, 2024 19:26
Comment on lines 806 to 883
-    int64_t position_ids_data = prompt_len -1;
-    std::vector<int64_t> attention_mask_data(1, prompt_len);
+    int64_t position_ids_data = prompt_len - 1;
+    std::vector<int64_t> attention_mask_data(prompt_len - 1, 1);

(The fix above: the old line built a one-element attention mask holding the value prompt_len, while the new line builds prompt_len - 1 elements, each set to 1.)

Contributor:
LOL, @TolyaTalamanov !!

Collaborator:

udoli ("delete it").

@AsyaPronina AsyaPronina force-pushed the at/static-llm-pipeline-dynamic-shape-model branch from 6cdd518 to cc34616 Compare November 27, 2024 15:41
@github-actions github-actions bot removed the category: samples GenAI samples label Nov 30, 2024
@AsyaPronina AsyaPronina marked this pull request as ready for review November 30, 2024 00:45
@AsyaPronina AsyaPronina force-pushed the at/static-llm-pipeline-dynamic-shape-model branch from 306ab4a to 7c8ff06 Compare November 30, 2024 03:10
github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this pull request Nov 30, 2024
### Details:
 - *item1*
 - *...*
 
### Related PRs:
- GenAI: openvinotoolkit/openvino.genai#1240
### Tickets:
 - *ticket-id*

---------

Co-authored-by: TolyaTalamanov <[email protected]>
@dmatveev dmatveev added this to the 2025.0 milestone Dec 5, 2024
@dmatveev dmatveev self-assigned this Dec 5, 2024
const ov::AnyMap& config);
};

class SMStaticLLMPipeline : public LLMPipelineImplBase {

Contributor:

SMStaticLLMPipeline sounds misleading; let's not focus on the point that it is a "single model" pipeline (people are used to that meaning something different here).

The CPU/GPU pipeline is called Stateful*, if I get it right.

So, as this one is still static, let's call it StaticStatefulLLMPipeline?

Contributor Author:
Thanks, sure!

Contributor Author:
Fixed!

Contributor:
Hope you read the next comment. :)

Comment on lines 11 to 13
namespace genai {

struct StaticLLMPipelineFactory {

Contributor:

I think a better job could be done with namespaces here: clearly statik:: could be a namespace, so we'd get statik::StatelessLLMPipeline (the old class) and statik::StatefulLLMPipeline (the new one).

statik:: is picked to avoid a clash with the keyword; it could be Static:: too.
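
For illustration, a minimal sketch of the suggested layout; the namespace name below matches what later appears in the related OpenVINO PR title (static_llm::StatefulLLMPipeline), while the class bodies are elided:

namespace ov {
namespace genai {
namespace static_llm {  // spelled this way to avoid clashing with the `static` keyword

// existing two-model implementation
class StatelessLLMPipeline : public LLMPipelineImplBase { /* ... */ };

// new single-model implementation introduced in this PR
class StatefulLLMPipeline : public LLMPipelineImplBase { /* ... */ };

}  // namespace static_llm
}  // namespace genai
}  // namespace ov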

Contributor Author:
Thank you!

Contributor Author:
Fixed!

Contributor Author (@AsyaPronina, Dec 24, 2024):

@dmatveev @TolyaTalamanov, please help disambiguate: we allow the user to pass a dynamic, stateful OpenVINO model into our new pipeline, which hides details such as converting the model to static shapes and making it stateless. Should we then still name the pipeline static_llm::StatefulLLMPipeline, even though it works with static, stateless models inside? Or can it really be named Stateful because the LLMCompiledModel it creates no longer exposes extra state inputs and outputs to the user? (Although by that logic, I am not sure the pipeline is still static.)

Collaborator:

> I think a better job could be done with namespaces here: clearly statik:: could be a namespace, so we'd get statik::StatelessLLMPipeline (the old class) and statik::StatefulLLMPipeline (the new one).
>
> statik:: is picked to avoid a clash with the keyword; it could be Static:: too.

Why not just npu_impl?

Collaborator:

@AsyaPronina The new pipeline is definitely stateful, as the pipeline no longer handles dynamism and states itself. Nobody cares what is inside.

Contributor:

> Why not just npu_impl?

Previously there was nothing about NPU

Collaborator:
device NPU, config NPUW


update_config(properties, {"NPU_USE_NPUW", "YES"});
update_config(properties, {"NPUW_LLM", "YES"});
update_config(properties, {"NPUW_LLM_MODEL_DESC", model_desc_to_string(model_desc)});

Contributor:

Since it is C++, can we use a C++ structure directly here? Or is the option exposed as a string only?

Contributor:

A downside: when the option is a string, a change in the option structure will never break the build, but a change in a structure type will.

Contributor Author (@AsyaPronina, Dec 20, 2024):

We can, but it is not obvious where to define this structure, since the OpenVINO NPUW code is not exposed as a public API. However, we can pass it as a map<std::string, std::string> or map<std::string, ov::Any> if you think that is better. The current implementation via std::string ensures that compiled_model.get_property("NPUW_LLM_MODEL_DESC") will print something meaningful (but that is not a requirement).
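
For reference, a hedged sketch of the map-based alternative mentioned above. The ModelConfigDesc field names are assumptions for illustration, not the actual NPUW contract:

// Sketch only: pass the model description as an ov::AnyMap instead of a serialized string.
ov::AnyMap model_desc_map = {
    { "type",                model_desc.type                },  // assumed field
    { "name_or_path",        model_desc.name_or_path        },  // assumed field
    { "num_key_value_heads", model_desc.num_key_value_heads }   // assumed field
};
update_config(properties, {"NPUW_LLM_MODEL_DESC", model_desc_map});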

Contributor Author:
What do you think?

Contributor:

> We can, but it is not obvious where to define this structure, since the OpenVINO NPUW code is not exposed as a public API.

So you say our NPU parameters are not exported via headers?

Collaborator:

Our StaticLLMPipeline parameters are documented strings, not exposed through headers.

Comment on lines 636 to 637
const uint32_t kMaxPromptLen = pop_int_and_cast(properties, "MAX_PROMPT_LEN").value_or(1024u);
const uint32_t kMinResponseLen = pop_int_and_cast(properties, "MIN_RESPONSE_LEN").value_or(128u);

Contributor:

See the code above: it was aligned to 64. It probably makes sense to unify how these options are handled between the two classes (without overdesign).
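
For reference, a minimal sketch of the 64-alignment being discussed; the align_to helper is hypothetical, and as the reply below notes, the actual alignment currently happens inside NPUW's LLMCompiledModel constructor:

// Hypothetical helper: round a length up to the next multiple of 64.
constexpr uint32_t kAlignment = 64u;
const auto align_to = [](uint32_t value, uint32_t alignment) {
    return (value + alignment - 1) / alignment * alignment;
};
const uint32_t aligned_max_prompt_len   = align_to(kMaxPromptLen,   kAlignment);
const uint32_t aligned_min_response_len = align_to(kMinResponseLen, kAlignment);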

Contributor Author:

Yes, I do the alignment inside the LLMCompiledModel constructor in NPUW. Do you think I should remove it there and do the alignment here instead? I put it in LLMCompiledModel since I thought it might be an implementation detail.

Contributor Author:
What do you think?

Comment on lines +682 to +793
auto decode_start_time = std::chrono::steady_clock::now();
DecodedResults decoded_results = {m_tokenizer.decode(encoded_results.tokens), encoded_results.scores};
auto decode_stop_time = std::chrono::steady_clock::now();

Contributor:

I'd highly recommend using something like https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_npu/src/plugin/npuw/perf.cpp#L9 - but that is clearly not for this PR (cc: @TolyaTalamanov).
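
For context, a rough sketch of the kind of small timing helper being referred to; this is a stand-in for illustration, not the API of the linked npuw perf.cpp:

#include <chrono>

// Minimal scoped timer (illustrative only).
struct ScopedTimer {
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    double ms() const {
        return std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - start).count();
    }
};

// Used in the spirit of the snippet above:
ScopedTimer decode_timer;
DecodedResults decoded_results = {m_tokenizer.decode(encoded_results.tokens), encoded_results.scores};
const double decode_ms = decode_timer.ms();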

Collaborator:

Of course, I don't mind; it should be done for the stateful implementation as well.

const std::string& device,
const ov::AnyMap& config) {
auto properties = config;
const auto use_sm_pipeline = pop_or_default(properties, "USE_SM_PIPELINE", false);

Contributor:

Shouldn't it be false or NO? Or maybe it shouldn't be a binary option at all?

We also need to be careful about the option name choice here, and to be honest I don't have a good name in mind.

It shouldn't be a public-looking option, that's for sure. Maybe it shouldn't be a configurable option at all but an environment variable, like we did for memory allocation - but that would complicate testing in the existing environments.

NPU_PIPELINE = STATEFUL (as opposed to today's STATELESS)?
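
A hedged sketch of how the suggested option could be read in the pipeline factory; the option value, default, and constructor arguments below follow the suggestion and are illustrative, not the final implementation:

// Illustrative only: parse the suggested NPU_PIPELINE option, defaulting to today's behaviour.
const auto pipeline_kind = pop_or_default(properties, "NPU_PIPELINE", std::string("STATELESS"));
if (pipeline_kind == "STATEFUL") {
    // new single-model (stateful) implementation
    return std::make_unique<StaticStatefulLLMPipeline>(models_path, tokenizer, device, properties);
}
// existing two-model (stateless) implementation
return std::make_unique<StaticLLMPipeline>(models_path, tokenizer, device, properties);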

Contributor Author:
Thank you! Will fix this

Contributor Author:

The user can set false or NO; OpenVINO's ov::Any can parse either to a bool.

Contributor Author:
Fixed!

@AsyaPronina AsyaPronina force-pushed the at/static-llm-pipeline-dynamic-shape-model branch from bef8ca8 to c3ba7e2 Compare December 31, 2024 15:03
github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this pull request Jan 2, 2025
### Details:
- *Added parsing of passed `NPUW_LLM_PREFILL_CONFIG` and
`NPUW_LLM_GENERATE_CONFIG` options*
 - *Added parsing of passed `NPUW_LLM_PAD_TOKEN_ID`*

### Tickets:
 - *EISW-149349*
 - *EISW-149350*

### Related PRs:
- OpenVINO GenAI:
openvinotoolkit/openvino.genai#1240
@@ -632,12 +657,290 @@ void copy_columns_by_row_chunks(const ov::Tensor& src, ov::Tensor& dst) {
}
}

enum NpuPipeline {

Collaborator:
How about these:

  • NPUPipelineType
  • NPUPipelineKind
  • NPUPipelineImpl

Contributor:
WHY NPU AGAIN

Contributor:

Anyway, we do pass NPUW options directly in the pipeline code, so why not. Any of these work, just not Npu, please.

Collaborator:
StaticPipelineType, etc
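
A small sketch of what the rename suggested above could look like (names illustrative):

enum class StaticPipelineType {
    STATEFUL,   // new single-model pipeline
    STATELESS   // existing two-model pipeline
};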

Contributor Author:

Should the NPU_PIPELINE option also be changed to STATIC_PIPELINE?

ov::AnyMap properties = config;

auto compiled = setupAndCompileModel(model, model_desc, properties);
m_request = compiled->create_infer_request();

Collaborator:
extra space?

return std::make_shared<ov::CompiledModel>(genai::utils::singleton_core().compile_model(model, "NPU", pipeline_config));
}

DecodedResults StatefulLLMPipeline::generate(

Collaborator:

Isn't this a complete copy-paste from StatelessLLMPipeline? Does it make sense to re-use it?

Collaborator:

I mean, maybe isolate it into a common utility function (simple and without inheritance).

Contributor Author:

In my opinion a simple function won't work, as it would have to call a method of the class inside (generate() for the encoder), and I think lambdas won't look good. I suggest introducing a base class in a separate task.

Collaborator:
Let's discuss it first

int64_t input_ids_data = -1;
int64_t position_ids_data = prompt_len - 1;
std::vector<int64_t> attention_mask_data(prompt_len - 1, 1);
m_request.set_tensor("input_ids", ov::Tensor(ov::element::i64, ov::Shape{1,1}, (void*)&input_ids_data));

Collaborator:

I'm wondering, can we set the tensors once and then only update them?

m_request.set_tensor("input_ids", input_ids);
while (true) {
    // write directly into m_request.get_tensor("input_ids") here
}

Collaborator (@TolyaTalamanov, Jan 3, 2025):

Or even like this:

input_ids_data = -1;
m_request.set_tensor("input_ids", ov::Tensor(ov::element::i64, ov::Shape{1, 1}, (void*)&input_ids_data));

while (true) {
    input_ids_data = last_token;
    m_request.infer();
}

Collaborator:
Won't work with attention mask though

Contributor Author:

The attention_mask shape changes each iteration; for the other tensors we can do it, but it will look a bit shady, so at least a comment should be provided.
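
To illustrate the point about attention_mask, a sketch of why its tensor has to be re-set each iteration while the other inputs could be pre-set once (illustrative, not the PR's exact loop body):

// The mask grows by one element per generated token, so its shape changes
// every iteration and a fresh tensor has to be set each time.
attention_mask_data.push_back(1);
m_request.set_tensor("attention_mask",
                     ov::Tensor(ov::element::i64,
                                ov::Shape{1, attention_mask_data.size()},
                                reinterpret_cast<void*>(attention_mask_data.data())));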

int64_t input_ids_data = -1;
int64_t position_ids_data = prompt_len - 1;
std::vector<int64_t> attention_mask_data(prompt_len - 1, 1);
m_request.set_tensor("input_ids", ov::Tensor(ov::element::i64, ov::Shape{1,1}, (void*)&input_ids_data));

Collaborator:
reinterpret_cast?

Contributor Author:
why?

Collaborator:

It's just more preferable than C-style casts, for several reasons.
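
For clarity, the suggested change applied to the line quoted above (a sketch of the cast style only):

m_request.set_tensor("input_ids",
                     ov::Tensor(ov::element::i64, ov::Shape{1, 1},
                                reinterpret_cast<void*>(&input_ids_data)));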

@@ -660,7 +963,7 @@ StaticLLMPipeline::StaticLLMPipeline(
const auto use_blobs = pop_or_default(properties, "USE_BLOBS", false);
if (!use_blobs) {
ModelConfigDesc model_desc = get_modeldesc_from_json(models_path / "config.json");
auto model = genai::utils::singleton_core().read_model(models_path / "openvino_model.xml", {}, properties);

Contributor:

Yes, please keep the master branch's version of read_model.

@dmatveev dmatveev changed the title Static llm pipeline dynamic shape model Static llm pipeline: stateful model Jan 3, 2025
AsyaPronina and others added 4 commits January 5, 2025 15:41
commit c52bd12
Author: Anastasiya Pronina <[email protected]>
Date:   Tue Dec 24 02:27:49 2024 +0000

    Fixed merge

commit ef82087
Merge: b00d987 3496d45
Author: Anastasiya Pronina <[email protected]>
Date:   Tue Dec 24 02:00:10 2024 +0000

    Merge branch 'master' of https://github.com/openvinotoolkit/openvino.genai into at/static-llm-pipeline-dynamic-shape-model

commit b00d987
Author: Anastasiya Pronina <[email protected]>
Date:   Tue Dec 24 00:40:00 2024 +0000

    Fixed according to review comments

commit 07f2b43
Author: Anastasiya Pronina <[email protected]>
Date:   Fri Dec 20 01:15:26 2024 +0000

    Pass PREFILL/GENERATE configs, pad_token_id and support chat mode

commit 7c8ff06
Author: Anastasiya Pronina <[email protected]>
Date:   Sat Nov 30 00:43:38 2024 +0000

    Removed testing definitions in sample

commit b2fc44b
Author: Anastasiya Pronina <[email protected]>
Date:   Sat Nov 30 00:39:24 2024 +0000

    NPUW_MODEL_DESC property made as string

commit cc34616
Author: Anastasiya Pronina <[email protected]>
Date:   Mon Nov 25 02:33:14 2024 +0000

    Fixed all typos comparing to dual-model GenAI pipeline

commit e0416c6
Author: TolyaTalamanov <[email protected]>
Date:   Sun Nov 17 16:50:35 2024 +0000

    Snapshot

commit d0b0298
Author: TolyaTalamanov <[email protected]>
Date:   Thu Nov 14 15:49:40 2024 +0000

    Snapshot
@AsyaPronina AsyaPronina force-pushed the at/static-llm-pipeline-dynamic-shape-model branch from a9cae71 to 65d000b Compare January 5, 2025 17:30
}
// Using model_str and weights_tensor with blobs is meaningless.
if (use_blobs) {
OPENVINO_THROW("Blobs cannot be used with model string and weights tensor");

Contributor:

condition + throw = OPENVINO_ASSERT(!use_blobs, "Blobs cannot be used with model string and weights tensor")

The same applies in other places.

Contributor Author (@AsyaPronina, Jan 5, 2025):

Done. Anatolii suggested OPENVINO_THROW instead; @TolyaTalamanov, what should the final option be?

Contributor Author:

Sorry, I was wrong: I had used OPENVINO_ASSERT(!use_blobs && "Blobs cannot be used with model string and weights tensor") instead of OPENVINO_ASSERT(!use_blobs, "Blobs cannot be used with model string and weights tensor"). Fixed now.
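
Spelling out the difference described above (OPENVINO_ASSERT throws on failure; only the second form reports the intended message):

// Wrong: the string literal is folded into the condition (always truthy),
// so it is not used as the assertion message.
OPENVINO_ASSERT(!use_blobs && "Blobs cannot be used with model string and weights tensor");

// Right: the message is a separate argument and is reported when the check fails.
OPENVINO_ASSERT(!use_blobs, "Blobs cannot be used with model string and weights tensor");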

Collaborator:

I'd prefer an exception rather than an assert in cases where the error is important for the user...

Contributor:

The assert here is the same as an exception, but at the C++ syntax level you write a single line instead of condition + throw.

It's not C's assert, which works only in debug mode.

Collaborator:
Ah ok, then it makes sense

github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this pull request Jan 6, 2025
…atic_llm::StatefulLLMPipeline (#28267)

### Details:
- Refactoring LLMCompiledModel according to comments in
openvinotoolkit/openvino.genai#1240

### Tickets:
 - *ticket-id*
if (last_token == config.eos_token_id && !config.ignore_eos) {
break;
}
}

Contributor:

It looks like the Sampler should be used here, similar to #1431.

Collaborator:

Yes, I believe in the next iteration.

@TolyaTalamanov TolyaTalamanov added this pull request to the merge queue Jan 6, 2025
Merged via the queue into openvinotoolkit:master with commit db71b36 Jan 6, 2025
59 checks passed
Labels: category: LLM (LLM pipeline (stateful, static)), category: NPU
Projects: none yet
5 participants