Static llm pipeline: stateful model #1240
Conversation
src/cpp/src/llm_pipeline_static.cpp
Outdated
- int64_t position_ids_data = prompt_len -1;
- std::vector<int64_t> attention_mask_data(1, prompt_len);
+ int64_t position_ids_data = prompt_len - 1;
+ std::vector<int64_t> attention_mask_data(prompt_len - 1, 1);
LOL, @TolyaTalamanov !!
udoli
force-pushed from 6cdd518 to cc34616
force-pushed from 306ab4a to 7c8ff06
### Details:
- *item1*
- *...*

### Related PRs:
- GenAI: *openvinotoolkit/openvino.genai#1240

### Tickets:
- *ticket-id*

Co-authored-by: TolyaTalamanov <[email protected]>
src/cpp/src/llm_pipeline_static.hpp
Outdated
    const ov::AnyMap& config);
};

class SMStaticLLMPipeline : public LLMPipelineImplBase {
SMStaticLLMPipeline sounds misleading; let's not focus on the point that it is a "single model" pipeline (as people are used to that meaning something different here). The CPU/GPU pipeline is called Stateful*, if I get it right. Since this one is still static, let's call it StaticStatefulLLMPipeline?
Thanks, sure!
Fixed!
Hope you read the next comment. :)
src/cpp/src/llm_pipeline_static.hpp
Outdated
namespace genai {

struct StaticLLMPipelineFactory {
I think there could be a better namespace job done here: clearly statik:: could be a namespace, so we'd get statik::StatelessLLMPipeline (the old class) and statik::StatefulLLMPipeline (the new one). statik:: is picked to avoid the clash with the keyword; it could be Static:: too.
Thank you!
Fixed!
@dmatveev @TolyaTalamanov, please help to disambiguate: we allow the user to pass a dynamic stateful OpenVINO model into our new pipeline, where we hide things like converting the model to static and making it stateless. Should we still name the pipeline static_llm::StatefulLLMPipeline in this case, given that it works with static and stateless models inside? Or can it really be named Stateful, because the LLMCompiledModel it creates doesn't expose to the user the additional inputs and outputs that correspond to states? (However, I don't know if by this logic the pipeline is still static.)
> I think there could be a better namespace job done here: clearly statik:: could be a namespace, so we'd get statik::StatelessLLMPipeline (the old class) and statik::StatefulLLMPipeline (the new one). statik:: is picked to avoid the clash with the keyword; it could be Static:: too.

Why not just npu_impl?
@AsyaPronina The new pipeline is definitely stateful, as the pipeline doesn't handle dynamism and states any longer. Nobody cares what is inside.
> Why not just npu_impl?

Previously there was nothing about NPU.
device NPU, config NPUW
src/cpp/src/llm_pipeline_static.cpp
Outdated
update_config(properties, {"NPU_USE_NPUW", "YES"});
update_config(properties, {"NPUW_LLM", "YES"});
update_config(properties, {"NPUW_LLM_MODEL_DESC", model_desc_to_string(model_desc)});
since it is C++, can we use a C++ structure directly here? Or the option is exposed as string only?
A downside - change in the option structure when it is a string will never break the build, but a change in the structure type will.
We can, but it is not obvious where to define this structure, since the OpenVINO NPUW code is not exposed as a public API. However, we can pass it as map<std::string, std::string> or map<std::string, ov::Any>, if you think that is better. The current implementation via std::string ensures that compiled_model.get_property("NPUW_LLM_MODEL_DESC") will print something meaningful (but that is not a requirement).
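The serialization the comment describes could look like the sketch below. The real `model_desc_to_string` and `ModelConfigDesc` fields are not shown in this thread, so both the field names and the output format here are assumptions for illustration only:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical descriptor; the actual ModelConfigDesc may differ.
struct ModelConfigDesc {
    std::string type;
    std::string name_or_path;
    int num_key_value_heads = 0;
};

// Flattens the descriptor into a printable string, so that
// get_property("NPUW_LLM_MODEL_DESC") shows something human-readable.
std::string model_desc_to_string(const ModelConfigDesc& d) {
    std::ostringstream os;
    os << "type=" << d.type
       << ",name_or_path=" << d.name_or_path
       << ",num_key_value_heads=" << d.num_key_value_heads;
    return os.str();
}
```

The tradeoff mentioned above applies: a string survives option-structure changes without breaking the build, while a typed struct would fail to compile on mismatch.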
What do you think?
> We can, but it is not obvious where to define this structure, since the OpenVINO NPUW code is not exposed as a public API.

So you say our NPU parameters are not exported via headers?
Our StaticLLMPipeline parameters are documented strings, not exposed through headers.
src/cpp/src/llm_pipeline_static.cpp
Outdated
const uint32_t kMaxPromptLen = pop_int_and_cast(properties, "MAX_PROMPT_LEN").value_or(1024u);
const uint32_t kMinResponseLen = pop_int_and_cast(properties, "MIN_RESPONSE_LEN").value_or(128u);
See the above code; it was aligned to 64. It probably makes sense to unify how these options are handled between the two classes (without overdesign).
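The alignment being discussed is the usual round-up-to-a-multiple idiom. As a sketch (the function name and placement are assumptions; the PR does this inside LLMCompiledModel in NPUW):

```cpp
#include <cassert>
#include <cstdint>

// Rounds x up to the next multiple of `align`; `align` must be a power of two,
// e.g. align_up(1000, 64) yields 1024.
constexpr uint32_t align_up(uint32_t x, uint32_t align) {
    return (x + align - 1) & ~(align - 1);
}
```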
Yes, I did the alignment inside the LLMCompiledModel constructor in NPUW. Do you think I need to remove it there and instead do the alignment here? I did it in LLMCompiledModel since I thought it might be an implementation detail.
What do you think?
auto decode_start_time = std::chrono::steady_clock::now();
DecodedResults decoded_results = {m_tokenizer.decode(encoded_results.tokens), encoded_results.scores};
auto decode_stop_time = std::chrono::steady_clock::now();
I'd highly recommend using something like https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_npu/src/plugin/npuw/perf.cpp#L9 - but it is clearly not for this PR (cc: @TolyaTalamanov)
Of course, I don't mind; it should be done for the stateful implementation as well.
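Setting the linked NPUW perf utility aside, the start/stop pattern in the snippet above can be wrapped in a small RAII helper so timings cannot be forgotten on early returns. This is a sketch, not the NPUW perf API:

```cpp
#include <cassert>
#include <chrono>

// Accumulates the elapsed wall time of its scope, in milliseconds, into `sink`.
class ScopedTimerMs {
public:
    explicit ScopedTimerMs(double& sink)
        : m_sink(sink), m_start(std::chrono::steady_clock::now()) {}
    ~ScopedTimerMs() {
        auto stop = std::chrono::steady_clock::now();
        m_sink += std::chrono::duration<double, std::milli>(stop - m_start).count();
    }
private:
    double& m_sink;
    std::chrono::steady_clock::time_point m_start;
};
```

Usage: wrap the decode call in a scope holding a `ScopedTimerMs` bound to the metric field; the destructor records the duration.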
src/cpp/src/llm_pipeline_static.cpp
Outdated
    const std::string& device,
    const ov::AnyMap& config) {
    auto properties = config;
    const auto use_sm_pipeline = pop_or_default(properties, "USE_SM_PIPELINE", false);
Shouldn't it be false or NO? Or maybe it shouldn't be a binary option at all? We also need to be careful about the option name choice here, and tbh I don't have a good name in mind. It shouldn't be a public-looking option, that's for sure. Maybe it shouldn't be a configurable option at all but an env var, like we did for memory allocation - but that would complicate testing in the existing environments. NPU_PIPELINE = STATEFUL (as opposed to today's STATELESS)?
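The NPU_PIPELINE = STATEFUL idea could be parsed along these lines. The helper below is hypothetical (the PR uses its own pop_or_default over ov::AnyMap); it only sketches the enum-valued option with STATELESS as today's default:

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

enum class StaticPipelineKind { STATEFUL, STATELESS };

// Pops "NPU_PIPELINE" from the config map; absent means today's default,
// STATELESS. Any other value is rejected loudly.
StaticPipelineKind pop_pipeline_kind(std::map<std::string, std::string>& cfg) {
    auto it = cfg.find("NPU_PIPELINE");
    if (it == cfg.end()) return StaticPipelineKind::STATELESS;
    const std::string value = it->second;
    cfg.erase(it);
    if (value == "STATEFUL")  return StaticPipelineKind::STATEFUL;
    if (value == "STATELESS") return StaticPipelineKind::STATELESS;
    throw std::invalid_argument("NPU_PIPELINE must be STATEFUL or STATELESS, got: " + value);
}
```

An enum-valued option avoids the boolean naming problem entirely and leaves room for future pipeline kinds.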
Thank you! Will fix this
The user can set either false or NO; OpenVINO's ov::Any can parse both to bool.
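For illustration, the tolerant conversion being described behaves roughly like the function below. This is not the actual ov::Any implementation, which handles more spellings and lives inside OpenVINO; it only mimics the false/NO acceptance mentioned in the comment:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Sketch of a string-to-bool option parser accepting both C++-style and
// OpenVINO-style boolean spellings; the real ov::Any conversion may differ.
bool parse_bool_option(const std::string& s) {
    if (s == "true"  || s == "YES") return true;
    if (s == "false" || s == "NO")  return false;
    throw std::invalid_argument("not a boolean option: " + s);
}
```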
Fixed!
force-pushed from 34c3fc0 to c52bd12
force-pushed from bef8ca8 to c3ba7e2
### Details:
- *Added parsing of passed `NPUW_LLM_PREFILL_CONFIG` and `NPUW_LLM_GENERATE_CONFIG` options*
- *Added parsing of passed `NPUW_LLM_PAD_TOKEN_ID`*

### Tickets:
- *EISW-149349*
- *EISW-149350*

### Related PRs:
- OpenVINO GenAI: openvinotoolkit/openvino.genai#1240
src/cpp/src/llm_pipeline_static.cpp
Outdated
@@ -632,12 +657,290 @@ void copy_columns_by_row_chunks(const ov::Tensor& src, ov::Tensor& dst) {
    }
}

enum NpuPipeline {
How about these:
- NPUPipelineType
- NPUPipelineKind
- NPUPipelineImpl
WHY NPU AGAIN
Anyway, we do pass NPUW options directly in the pipeline code, so why not. Anything works, but not Npu, please.
StaticPipelineType, etc.
Should the option NPU_PIPELINE also be changed to STATIC_PIPELINE?
src/cpp/src/llm_pipeline_static.cpp
Outdated
ov::AnyMap properties = config;

auto compiled = setupAndCompileModel(model, model_desc, properties);
m_request = compiled->create_infer_request();
extra space?
    return std::make_shared<ov::CompiledModel>(genai::utils::singleton_core().compile_model(model, "NPU", pipeline_config));
}

DecodedResults StatefulLLMPipeline::generate(
Isn't it a complete copy-paste from StatelessLLMPipeline? Does it make sense to re-use it?
I mean, maybe isolate into common util function (simple and w/o inheritance)
A simple function won't work, in my opinion, as it would have to call a method of the class inside (generate() for the encoder), and lambdas won't look good, I think. I suggest doing a base class in a separate task.
Let's discuss it first
src/cpp/src/llm_pipeline_static.cpp
Outdated
int64_t input_ids_data = -1;
int64_t position_ids_data = prompt_len - 1;
std::vector<int64_t> attention_mask_data(prompt_len - 1, 1);
m_request.set_tensor("input_ids", ov::Tensor(ov::element::i64, ov::Shape{1,1}, (void*)&input_ids_data));
Wondering, can we set the tensors once and then only update them?

m_request.set_tensor("input_ids", input_ids);
for (;;) {
    // write into m_request.get_tensor("input_ids") here
}
Or even like this:

int64_t input_ids_data = -1;
m_request.set_tensor("input_ids", ov::Tensor(ov::element::i64, ov::Shape{1,1}, (void*)&input_ids_data));
for (;;) {
    input_ids_data = last_token;
    m_request.infer();
}
Won't work with attention mask though
The attention_mask shape changes per iteration; for the other tensors we can, but it will look a bit shady, so at least a comment should be provided.
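The "set once, update in place" idea from this thread relies on the tensor keeping a view over caller-owned memory, the way ov::Tensor can wrap an external host buffer (zero-copy). The sketch below demonstrates the pattern with a stand-in request object instead of the real OpenVINO API, which is an assumption for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for an infer request holding a raw view over user memory,
// the way ov::Tensor(element_type, shape, host_ptr) wraps an external buffer.
struct FakeRequest {
    int64_t* input_ids = nullptr;   // bound once, written to per iteration
    std::vector<int64_t> consumed;  // records what each infer() call saw
    void infer() { consumed.push_back(*input_ids); }
};

// Binds the buffer once ("set_tensor"), then only updates the variable
// before each infer(). No re-binding per step is needed.
std::vector<int64_t> run_loop(const std::vector<int64_t>& tokens) {
    int64_t input_ids_data = -1;
    FakeRequest req;
    req.input_ids = &input_ids_data;   // set once
    for (int64_t t : tokens) {
        input_ids_data = t;            // update in place
        req.infer();
    }
    return req.consumed;
}
```

As noted above, this works only for tensors whose shape is fixed across iterations; a growing attention_mask still needs a fresh tensor each step.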
src/cpp/src/llm_pipeline_static.cpp
Outdated
int64_t input_ids_data = -1;
int64_t position_ids_data = prompt_len - 1;
std::vector<int64_t> attention_mask_data(prompt_len - 1, 1);
m_request.set_tensor("input_ids", ov::Tensor(ov::element::i64, ov::Shape{1,1}, (void*)&input_ids_data));
reinterpret_cast?
why?
It's just preferable to C-style casts for several reasons.
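Two of those reasons: a named cast is greppable, and it refuses conversions a C-style cast performs silently (stripping const requires an explicit const_cast on top). A sketch of the suggested rewrite (the helper function is illustrative, not from the PR):

```cpp
#include <cassert>
#include <cstdint>

// The C-style cast in the diff, (void*)&input_ids_data, written explicitly.
void* bind_tensor_data(int64_t& value) {
    return reinterpret_cast<void*>(&value);
}
// Note: if `value` were const, (void*)&value would still compile, silently
// casting constness away; reinterpret_cast alone would be rejected.
```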
src/cpp/src/llm_pipeline_static.cpp
Outdated
int64_t input_ids_data = -1;
int64_t position_ids_data = prompt_len - 1;
std::vector<int64_t> attention_mask_data(prompt_len - 1, 1);
m_request.set_tensor("input_ids", ov::Tensor(ov::element::i64, ov::Shape{1,1}, (void*)&input_ids_data));
Won't work with attention mask though
src/cpp/src/llm_pipeline_static.cpp
Outdated
@@ -660,7 +963,7 @@ StaticLLMPipeline::StaticLLMPipeline(
const auto use_blobs = pop_or_default(properties, "USE_BLOBS", false);
if (!use_blobs) {
    ModelConfigDesc model_desc = get_modeldesc_from_json(models_path / "config.json");
    auto model = genai::utils::singleton_core().read_model(models_path / "openvino_model.xml", {}, properties);
@ilya-lavrenov, it was changed by you last time:
https://github.com/openvinotoolkit/openvino.genai/blame/master/src/cpp/src/llm_pipeline_static.cpp#L663
yes, please, keep master's branch version of read_model
src/cpp/src/llm_pipeline_static.cpp
Outdated
@@ -660,7 +963,7 @@ StaticLLMPipeline::StaticLLMPipeline(
const auto use_blobs = pop_or_default(properties, "USE_BLOBS", false);
if (!use_blobs) {
    ModelConfigDesc model_desc = get_modeldesc_from_json(models_path / "config.json");
    auto model = genai::utils::singleton_core().read_model(models_path / "openvino_model.xml", {}, properties);
yes, please, keep master's branch version of read_model
commit c52bd12
Author: Anastasiya Pronina <[email protected]>
Date: Tue Dec 24 02:27:49 2024 +0000
    Fixed merge

commit ef82087
Merge: b00d987 3496d45
Author: Anastasiya Pronina <[email protected]>
Date: Tue Dec 24 02:00:10 2024 +0000
    Merge branch 'master' of https://github.com/openvinotoolkit/openvino.genai into at/static-llm-pipeline-dynamic-shape-model

commit b00d987
Author: Anastasiya Pronina <[email protected]>
Date: Tue Dec 24 00:40:00 2024 +0000
    Fixed according to review comments

commit 07f2b43
Author: Anastasiya Pronina <[email protected]>
Date: Fri Dec 20 01:15:26 2024 +0000
    Pass PREFILL/GENERATE configs, pad_token_id and support chat mode

commit 7c8ff06
Author: Anastasiya Pronina <[email protected]>
Date: Sat Nov 30 00:43:38 2024 +0000
    Removed testing definitions in sample

commit b2fc44b
Author: Anastasiya Pronina <[email protected]>
Date: Sat Nov 30 00:39:24 2024 +0000
    NPUW_MODEL_DESC property made as string

commit cc34616
Author: Anastasiya Pronina <[email protected]>
Date: Mon Nov 25 02:33:14 2024 +0000
    Fixed all typos comparing to dual-model GenAI pipeline

commit e0416c6
Author: TolyaTalamanov <[email protected]>
Date: Sun Nov 17 16:50:35 2024 +0000
    Snapshot

commit d0b0298
Author: TolyaTalamanov <[email protected]>
Date: Thu Nov 14 15:49:40 2024 +0000
    Snapshot
Co-authored-by: Ilya Lavrenov <[email protected]>
force-pushed from a9cae71 to 65d000b
src/cpp/src/llm_pipeline_static.cpp
Outdated
}
// Using model_str and weights_tensor with blobs is meaningless.
if (use_blobs) {
    OPENVINO_THROW("Blobs cannot be used with model string and weights tensor");
condition + throw = OPENVINO_ASSERT(!use_blobs, "Blobs cannot be used with model string and weights tensor"). The same in other places.
Did this; Anatolii suggested OPENVINO_THROW instead. @TolyaTalamanov, what should be the final option?
Sorry, I was wrong: I had used OPENVINO_ASSERT(!use_blobs && "Blobs cannot be used with model string and weights tensor") instead of OPENVINO_ASSERT(!use_blobs, "Blobs cannot be used with model string and weights tensor"). Fixed now.
I'd prefer having an Exception rather than an Assert in cases where the error is important for the user...
The assert is the same as an exception, but at the C++ syntax level you write a single line instead of condition + throw. It's not C's assert, which works only in debug mode.
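That point can be made concrete with a simplified imitation of such a macro. This is not the real OPENVINO_ASSERT, which also captures file/line context and stringifies arbitrary message streams; it only shows that the macro throws an exception rather than aborting like C's assert:

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Simplified stand-in: checks the condition and throws with the message,
// collapsing `if (!cond) THROW(...)` into one line at the call site.
#define MY_ASSERT(cond, msg)                                       \
    do {                                                           \
        if (!(cond)) {                                             \
            std::ostringstream os_;                                \
            os_ << "Check '" << #cond << "' failed: " << (msg);    \
            throw std::runtime_error(os_.str());                   \
        }                                                          \
    } while (false)
```

Because it throws, the check stays active in release builds and the error can be caught and reported to the user.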
Ah ok, then it makes sense
…atic_llm::StatefulLLMPipeline (#28267)

### Details:
- *Refactoring LLMCompiledModel according to comments in openvinotoolkit/openvino.genai#1240*

### Tickets:
- *ticket-id*
    if (last_token == config.eos_token_id && !config.ignore_eos) {
        break;
    }
}
looks like Sampler should be used here similar to #1431
Yes, I believe in the next iteration.