
Static llm pipeline: stateful model #1240

Conversation

@AsyaPronina (Contributor) commented Nov 20, 2024

@github-actions github-actions bot added category: LLM LLM pipeline (stateful, static) category: samples GenAI samples labels Nov 20, 2024
@AsyaPronina AsyaPronina marked this pull request as draft November 20, 2024 19:26
Comment on lines 806 to 883
-    int64_t position_ids_data = prompt_len -1;
-    std::vector<int64_t> attention_mask_data(1, prompt_len);
+    int64_t position_ids_data = prompt_len - 1;
+    std::vector<int64_t> attention_mask_data(prompt_len - 1, 1);

(The fix above: the old line built a one-element attention mask holding the value prompt_len, while the new line builds prompt_len - 1 elements, each set to 1.)

Contributor:
LOL, @TolyaTalamanov !!

Collaborator:

udoli ("delete it").

@AsyaPronina AsyaPronina force-pushed the at/static-llm-pipeline-dynamic-shape-model branch from 6cdd518 to cc34616 Compare November 27, 2024 15:41
@github-actions github-actions bot removed the category: samples GenAI samples label Nov 30, 2024
@AsyaPronina AsyaPronina marked this pull request as ready for review November 30, 2024 00:45
@AsyaPronina AsyaPronina force-pushed the at/static-llm-pipeline-dynamic-shape-model branch from 306ab4a to 7c8ff06 Compare November 30, 2024 03:10
github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this pull request Nov 30, 2024
### Details:
 - *item1*
 - *...*
 
### Related PRs:
- GenAI: openvinotoolkit/openvino.genai#1240
### Tickets:
 - *ticket-id*

---------

Co-authored-by: TolyaTalamanov <[email protected]>
@dmatveev dmatveev added this to the 2025.0 milestone Dec 5, 2024
@dmatveev dmatveev self-assigned this Dec 5, 2024
const ov::AnyMap& config);
};

class SMStaticLLMPipeline : public LLMPipelineImplBase {

Contributor:

SMStaticLLMPipeline sounds misleading; let's not focus on the point that it is a "single model" pipeline (people are used to that meaning something different here).

The CPU/GPU pipeline is called Stateful*, if I get it right.

So, as this one is still static, let's call it StaticStatefulLLMPipeline?

Contributor Author:
Thanks, sure!

Contributor Author:
Fixed!

Contributor:
Hope you read the next comment. :)

Comment on lines 11 to 13
namespace genai {

struct StaticLLMPipelineFactory {

Contributor:

I think a better job could be done with namespaces here: clearly statik:: could be a namespace, so we'd get statik::StatelessLLMPipeline (the old class) and statik::StatefulLLMPipeline (the new one).

statik:: is picked to avoid a clash with the keyword; it could be Static:: too.
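
For illustration, a minimal sketch of the suggested layout; the namespace name below matches what later appears in the related OpenVINO PR title (static_llm::StatefulLLMPipeline), while the class bodies are elided:

namespace ov {
namespace genai {
namespace static_llm {  // spelled this way to avoid clashing with the `static` keyword

// existing two-model implementation
class StatelessLLMPipeline : public LLMPipelineImplBase { /* ... */ };

// new single-model implementation introduced in this PR
class StatefulLLMPipeline : public LLMPipelineImplBase { /* ... */ };

}  // namespace static_llm
}  // namespace genai
}  // namespace ov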

Contributor Author:
Thank you!

Contributor Author:
Fixed!

Contributor Author (@AsyaPronina, Dec 24, 2024):

@dmatveev @TolyaTalamanov, please help disambiguate: we allow the user to pass a dynamic, stateful OpenVINO model into our new pipeline, which hides details such as converting the model to static shapes and making it stateless. Should we then still name the pipeline static_llm::StatefulLLMPipeline, even though it works with static, stateless models inside? Or can it really be named Stateful because the LLMCompiledModel it creates no longer exposes extra state inputs and outputs to the user? (Although by that logic, I am not sure the pipeline is still static.)

Collaborator:

> I think a better job could be done with namespaces here: clearly statik:: could be a namespace, so we'd get statik::StatelessLLMPipeline (the old class) and statik::StatefulLLMPipeline (the new one).
>
> statik:: is picked to avoid a clash with the keyword; it could be Static:: too.

Why not just npu_impl?

Collaborator:

@AsyaPronina The new pipeline is definitely stateful, as the pipeline no longer handles dynamism and states itself. Nobody cares what is inside.

Contributor:

> Why not just npu_impl?

Previously there was nothing about NPU

Collaborator:
device NPU, config NPUW


update_config(properties, {"NPU_USE_NPUW", "YES"});
update_config(properties, {"NPUW_LLM", "YES"});
update_config(properties, {"NPUW_LLM_MODEL_DESC", model_desc_to_string(model_desc)});

Contributor:

Since it is C++, can we use a C++ structure directly here? Or is the option exposed as a string only?

Contributor:

A downside: when the option is a string, a change in the option structure will never break the build, but a change in a structure type will.

Contributor Author (@AsyaPronina, Dec 20, 2024):

We can, but it is not obvious where to define this structure, since the OpenVINO NPUW code is not exposed as a public API. However, we can pass it as a map<std::string, std::string> or map<std::string, ov::Any> if you think that is better. The current implementation via std::string ensures that compiled_model.get_property("NPUW_LLM_MODEL_DESC") will print something meaningful (but that is not a requirement).
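
For reference, a hedged sketch of the map-based alternative mentioned above. The ModelConfigDesc field names are assumptions for illustration, not the actual NPUW contract:

// Sketch only: pass the model description as an ov::AnyMap instead of a serialized string.
ov::AnyMap model_desc_map = {
    { "type",                model_desc.type                },  // assumed field
    { "name_or_path",        model_desc.name_or_path        },  // assumed field
    { "num_key_value_heads", model_desc.num_key_value_heads }   // assumed field
};
update_config(properties, {"NPUW_LLM_MODEL_DESC", model_desc_map});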

Contributor Author:
What do you think?

Contributor:

> We can, but it is not obvious where to define this structure, since the OpenVINO NPUW code is not exposed as a public API.

So you say our NPU parameters are not exported via headers?

Collaborator:

Our StaticLLMPipeline parameters are documented strings, not exposed through headers.

Comment on lines 636 to 637
const uint32_t kMaxPromptLen = pop_int_and_cast(properties, "MAX_PROMPT_LEN").value_or(1024u);
const uint32_t kMinResponseLen = pop_int_and_cast(properties, "MIN_RESPONSE_LEN").value_or(128u);

Contributor:

See the code above: it was aligned to 64. It probably makes sense to unify how these options are handled between the two classes (without overdesign).
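
For reference, a minimal sketch of the 64-alignment being discussed; the align_to helper is hypothetical, and as the reply below notes, the actual alignment currently happens inside NPUW's LLMCompiledModel constructor:

// Hypothetical helper: round a length up to the next multiple of 64.
constexpr uint32_t kAlignment = 64u;
const auto align_to = [](uint32_t value, uint32_t alignment) {
    return (value + alignment - 1) / alignment * alignment;
};
const uint32_t aligned_max_prompt_len   = align_to(kMaxPromptLen,   kAlignment);
const uint32_t aligned_min_response_len = align_to(kMinResponseLen, kAlignment);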

Contributor Author:

Yes, I do the alignment inside the LLMCompiledModel constructor in NPUW. Do you think I should remove it there and do the alignment here instead? I put it in LLMCompiledModel since I thought it might be an implementation detail.

Contributor Author:
What do you think?

Comment on lines +682 to +793
auto decode_start_time = std::chrono::steady_clock::now();
DecodedResults decoded_results = {m_tokenizer.decode(encoded_results.tokens), encoded_results.scores};
auto decode_stop_time = std::chrono::steady_clock::now();

Contributor:

I'd highly recommend using something like https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_npu/src/plugin/npuw/perf.cpp#L9 - but that is clearly not for this PR (cc: @TolyaTalamanov).
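
For context, a rough sketch of the kind of small timing helper being referred to; this is a stand-in for illustration, not the API of the linked npuw perf.cpp:

#include <chrono>

// Minimal scoped timer (illustrative only).
struct ScopedTimer {
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    double ms() const {
        return std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - start).count();
    }
};

// Used in the spirit of the snippet above:
ScopedTimer decode_timer;
DecodedResults decoded_results = {m_tokenizer.decode(encoded_results.tokens), encoded_results.scores};
const double decode_ms = decode_timer.ms();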

Collaborator:

Of course, I don't mind; it should be done for the stateful implementation as well.

const std::string& device,
const ov::AnyMap& config) {
auto properties = config;
const auto use_sm_pipeline = pop_or_default(properties, "USE_SM_PIPELINE", false);

Contributor:

Shouldn't it be false or NO? Or maybe it shouldn't be a binary option at all?

We also need to be careful about the option name choice here, and to be honest I don't have a good name in mind.

It shouldn't be a public-looking option, that's for sure. Maybe it shouldn't be a configurable option at all but an environment variable, like we did for memory allocation - but that would complicate testing in the existing environments.

NPU_PIPELINE = STATEFUL (as opposed to today's STATELESS)?
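
A hedged sketch of how the suggested option could be read in the pipeline factory; the option value, default, and constructor arguments below follow the suggestion and are illustrative, not the final implementation:

// Illustrative only: parse the suggested NPU_PIPELINE option, defaulting to today's behaviour.
const auto pipeline_kind = pop_or_default(properties, "NPU_PIPELINE", std::string("STATELESS"));
if (pipeline_kind == "STATEFUL") {
    // new single-model (stateful) implementation
    return std::make_unique<StaticStatefulLLMPipeline>(models_path, tokenizer, device, properties);
}
// existing two-model (stateless) implementation
return std::make_unique<StaticLLMPipeline>(models_path, tokenizer, device, properties);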

Contributor Author:
Thank you! Will fix this

Contributor Author:

The user can set false or NO; OpenVINO's ov::Any can parse either to a bool.

Contributor Author:
Fixed!

@AsyaPronina AsyaPronina force-pushed the at/static-llm-pipeline-dynamic-shape-model branch from bef8ca8 to c3ba7e2 Compare December 31, 2024 15:03
github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this pull request Jan 2, 2025
### Details:
- *Added parsing of passed `NPUW_LLM_PREFILL_CONFIG` and
`NPUW_LLM_GENERATE_CONFIG` options*
 - *Added parsing of passed `NPUW_LLM_PAD_TOKEN_ID`*

### Tickets:
 - *EISW-149349*
 - *EISW-149350*

### Related PRs:
- OpenVINO GenAI:
openvinotoolkit/openvino.genai#1240
@@ -632,12 +657,290 @@ void copy_columns_by_row_chunks(const ov::Tensor& src, ov::Tensor& dst) {
}
}

enum NpuPipeline {

Collaborator:
How about these:

  • NPUPipelineType
  • NPUPipelineKind
  • NPUPipelineImpl

Contributor:
WHY NPU AGAIN

Contributor:

Anyway, we do pass NPUW options directly in the pipeline code, so why not. Any of these work, just not Npu, please.

Collaborator:
StaticPipelineType, etc
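
A small sketch of what the rename suggested above could look like (names illustrative):

enum class StaticPipelineType {
    STATEFUL,   // new single-model pipeline
    STATELESS   // existing two-model pipeline
};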

Contributor Author:

Should the NPU_PIPELINE option also be changed to STATIC_PIPELINE?

ov::AnyMap properties = config;

auto compiled = setupAndCompileModel(model, model_desc, properties);
m_request = compiled->create_infer_request();

Collaborator:
extra space?

return std::make_shared<ov::CompiledModel>(genai::utils::singleton_core().compile_model(model, "NPU", pipeline_config));
}

DecodedResults StatefulLLMPipeline::generate(

Collaborator:

Isn't this a complete copy-paste from StatelessLLMPipeline? Does it make sense to re-use it?

Collaborator:

I mean, maybe isolate it into a common utility function (simple and without inheritance).

Contributor Author:

In my opinion a simple function won't work, as it would have to call a method of the class inside (generate() for the encoder), and I think lambdas won't look good. I suggest introducing a base class in a separate task.

Collaborator:
Let's discuss it first

int64_t input_ids_data = -1;
int64_t position_ids_data = prompt_len - 1;
std::vector<int64_t> attention_mask_data(prompt_len - 1, 1);
m_request.set_tensor("input_ids", ov::Tensor(ov::element::i64, ov::Shape{1,1}, (void*)&input_ids_data));

Collaborator:

I'm wondering, can we set the tensors once and then only update them?

m_request.set_tensor("input_ids", input_ids);
while (true) {
    // write directly into m_request.get_tensor("input_ids") here
}

Collaborator (@TolyaTalamanov, Jan 3, 2025):

Or even like this:

input_ids_data = -1;
m_request.set_tensor("input_ids", ov::Tensor(ov::element::i64, ov::Shape{1, 1}, (void*)&input_ids_data));

while (true) {
    input_ids_data = last_token;
    m_request.infer();
}

Collaborator:
Won't work with attention mask though

Contributor Author:

The attention_mask shape changes each iteration; for the other tensors we can do it, but it will look a bit shady, so at least a comment should be provided.
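
To illustrate the point about attention_mask, a sketch of why its tensor has to be re-set each iteration while the other inputs could be pre-set once (illustrative, not the PR's exact loop body):

// The mask grows by one element per generated token, so its shape changes
// every iteration and a fresh tensor has to be set each time.
attention_mask_data.push_back(1);
m_request.set_tensor("attention_mask",
                     ov::Tensor(ov::element::i64,
                                ov::Shape{1, attention_mask_data.size()},
                                reinterpret_cast<void*>(attention_mask_data.data())));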

int64_t input_ids_data = -1;
int64_t position_ids_data = prompt_len - 1;
std::vector<int64_t> attention_mask_data(prompt_len - 1, 1);
m_request.set_tensor("input_ids", ov::Tensor(ov::element::i64, ov::Shape{1,1}, (void*)&input_ids_data));

Collaborator:
reinterpret_cast?

Contributor Author:
why?

Collaborator:

It's just more preferable than C-style casts, for several reasons.
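
For clarity, the suggested change applied to the line quoted above (a sketch of the cast style only):

m_request.set_tensor("input_ids",
                     ov::Tensor(ov::element::i64, ov::Shape{1, 1},
                                reinterpret_cast<void*>(&input_ids_data)));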

@@ -660,7 +963,7 @@ StaticLLMPipeline::StaticLLMPipeline(
const auto use_blobs = pop_or_default(properties, "USE_BLOBS", false);
if (!use_blobs) {
ModelConfigDesc model_desc = get_modeldesc_from_json(models_path / "config.json");
auto model = genai::utils::singleton_core().read_model(models_path / "openvino_model.xml", {}, properties);

Contributor:

Yes, please keep the master branch's version of read_model.

@dmatveev dmatveev changed the title Static llm pipeline dynamic shape model Static llm pipeline: stateful model Jan 3, 2025
AsyaPronina and others added 4 commits January 5, 2025 15:41
commit c52bd12
Author: Anastasiya Pronina <[email protected]>
Date:   Tue Dec 24 02:27:49 2024 +0000

    Fixed merge

commit ef82087
Merge: b00d987 3496d45
Author: Anastasiya Pronina <[email protected]>
Date:   Tue Dec 24 02:00:10 2024 +0000

    Merge branch 'master' of https://github.com/openvinotoolkit/openvino.genai into at/static-llm-pipeline-dynamic-shape-model

commit b00d987
Author: Anastasiya Pronina <[email protected]>
Date:   Tue Dec 24 00:40:00 2024 +0000

    Fixed according to review comments

commit 07f2b43
Author: Anastasiya Pronina <[email protected]>
Date:   Fri Dec 20 01:15:26 2024 +0000

    Pass PREFILL/GENERATE configs, pad_token_id and support chat mode

commit 7c8ff06
Author: Anastasiya Pronina <[email protected]>
Date:   Sat Nov 30 00:43:38 2024 +0000

    Removed testing definitions in sample

commit b2fc44b
Author: Anastasiya Pronina <[email protected]>
Date:   Sat Nov 30 00:39:24 2024 +0000

    NPUW_MODEL_DESC property made as string

commit cc34616
Author: Anastasiya Pronina <[email protected]>
Date:   Mon Nov 25 02:33:14 2024 +0000

    Fixed all typos comparing to dual-model GenAI pipeline

commit e0416c6
Author: TolyaTalamanov <[email protected]>
Date:   Sun Nov 17 16:50:35 2024 +0000

    Snapshot

commit d0b0298
Author: TolyaTalamanov <[email protected]>
Date:   Thu Nov 14 15:49:40 2024 +0000

    Snapshot
@AsyaPronina AsyaPronina force-pushed the at/static-llm-pipeline-dynamic-shape-model branch from a9cae71 to 65d000b Compare January 5, 2025 17:30
}
// Using model_str and weights_tensor with blobs is meaningless.
if (use_blobs) {
OPENVINO_THROW("Blobs cannot be used with model string and weights tensor");

Contributor:

condition + throw = OPENVINO_ASSERT(!use_blobs, "Blobs cannot be used with model string and weights tensor")

The same applies in other places.

Contributor Author (@AsyaPronina, Jan 5, 2025):

Done. Anatolii suggested OPENVINO_THROW instead; @TolyaTalamanov, what should the final option be?

Contributor Author:

Sorry, I was wrong: I had used OPENVINO_ASSERT(!use_blobs && "Blobs cannot be used with model string and weights tensor") instead of OPENVINO_ASSERT(!use_blobs, "Blobs cannot be used with model string and weights tensor"). Fixed now.
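
Spelling out the difference described above (OPENVINO_ASSERT throws on failure; only the second form reports the intended message):

// Wrong: the string literal is folded into the condition (always truthy),
// so it is not used as the assertion message.
OPENVINO_ASSERT(!use_blobs && "Blobs cannot be used with model string and weights tensor");

// Right: the message is a separate argument and is reported when the check fails.
OPENVINO_ASSERT(!use_blobs, "Blobs cannot be used with model string and weights tensor");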

Collaborator:

I'd prefer an exception rather than an assert in cases where the error is important for the user...

Contributor:

The assert here is the same as an exception, but at the C++ syntax level you write a single line instead of condition + throw.

It's not C's assert, which works only in debug mode.

Collaborator:
Ah ok, then it makes sense

github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this pull request Jan 6, 2025
…atic_llm::StatefulLLMPipeline (#28267)

### Details:
- Refactoring LLMCompiledModel according to comments in
openvinotoolkit/openvino.genai#1240

### Tickets:
 - *ticket-id*
if (last_token == config.eos_token_id && !config.ignore_eos) {
break;
}
}

Contributor:

It looks like the Sampler should be used here, similar to #1431.

Collaborator:

Yes, I believe in the next iteration.

@TolyaTalamanov TolyaTalamanov added this pull request to the merge queue Jan 6, 2025
Merged via the queue into openvinotoolkit:master with commit db71b36 Jan 6, 2025
59 checks passed
Labels: category: LLM (LLM pipeline (stateful, static)), category: NPU
Projects: none yet
5 participants