Switch NPU LLM execution to ov::genai::StatefulLLMPipeline #1677
base: master
Conversation
Force-pushed from bd7ca15 to fc14f8e
Force-pushed from 2b519fd to a9c60e3
Depends on: #1748
Force-pushed from e7dcacf to f7861fe
Force-pushed from 4dcd044 to 949825e
Force-pushed from 949825e to b300855
src/cpp/src/utils.cpp
Outdated
    // NB: OPTIMIZE_SIZE is only possible when model_path is defined!
    // Otherwise, switch to OPTIMIZE_SPEED
    if (cache_mode.has_value() && *cache_mode == CacheMode::OPTIMIZE_SIZE && !model_path.empty()) {
        compiled = ov::genai::utils::singleton_core().compile_model(model_path, "NPU", properties);
This way we will read the model from model_path twice, won't we?
I believe your point is the same as @ilya-lavrenov's. @smirnov-alexey please explain :)
@AsyaPronina Actually, we read the model once in the OPTIMIZE_SPEED case and don't explicitly read it by default.
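To make the control flow being discussed concrete, here is a minimal sketch of the branch (the else arm is an assumption inferred from the comments, not the actual patch; `model` is the ov::Model already read by the common code):

    // Sketch only: how the two cache modes diverge, per the discussion above.
    if (cache_mode.has_value() && *cache_mode == CacheMode::OPTIMIZE_SIZE && !model_path.empty()) {
        // Weightless path: compile straight from disk so a cached blob can
        // be imported without an extra read_model() pass.
        compiled = ov::genai::utils::singleton_core().compile_model(model_path, "NPU", properties);
    } else {
        // OPTIMIZE_SPEED (or no model_path): reuse the ov::Model that the
        // common code already read and transformed.
        compiled = ov::genai::utils::singleton_core().compile_model(model, "NPU", properties);
    }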
LGTM, thanks! Minor comments left.
src/cpp/src/utils.cpp
Outdated
    kv_desc.min_response_len = pop_int_and_cast(properties, "MIN_RESPONSE_LEN").value_or(128u);
    update_npu_config(properties, model, kv_pos, kv_desc);
    auto cache_mode = get_option<CacheMode>(config, "CACHE_MODE");
    // NB: OPTIMIZE_SIZE is only possible when model_path is defined!
@smirnov-alexey will this limitation be fixed in the near future? It would be great to avoid a custom path for NPU (passing an extra models_path) through a chain of calls.
@ilya-lavrenov As I understand it, this is how it works on the OpenVINO side: https://docs.openvino.ai/2025/api/c_cpp_api/classov_1_1_core.html

"This can be more efficient than using the Core::read_model + Core::compile_model(model_in_memory_object) flow, especially for cases when caching is enabled and a cached model is available."
I suppose those are different things. You need to check CACHE_DIR instead of CACHE_MODE, then?
compile_model with a path is definitely preferable, but the CACHE_MODE optimization (is it about the weightless cache? I suppose it should work even with CACHE_MODE = OPTIMIZE_SPEED for NPU) should also work with an ov::Model, IMO.
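For reference, a minimal sketch of how the two properties mentioned here are set (standard ov::Core API; the model and cache paths are placeholders):

    #include <openvino/openvino.hpp>

    int main() {
        ov::Core core;
        // CACHE_DIR turns on blob caching for subsequent compile_model() calls.
        core.set_property(ov::cache_dir("model_cache"));
        // CACHE_MODE picks the blob flavor; OPTIMIZE_SIZE is the "weightless"
        // variant debated in this thread.
        auto compiled = core.compile_model("model.xml", "NPU",
                                           ov::cache_mode(ov::CacheMode::OPTIMIZE_SIZE));
        return 0;
    }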
BTW, the model has already been read via read_model using the "common" code, and some transformations are applied. So it looks strange that you ignore this model and read your own one inside the plugin.
@ilya-lavrenov You're right, it's an artifact of copy-paste, will fix, thanks!
@ilya-lavrenov I believe read_model() + compile_model(model) doesn't work. I previously discussed it with the GPU plugin team, and I think the only way to import a model via CACHE_DIR is using compile_model(model_path). It has something to do with https://github.com/openvinotoolkit/openvino/pull/27162/files
Please correct me if I'm wrong, but I couldn't achieve weightless caching with CACHE_DIR and an in-memory model. If I recall correctly, WEIGHTS_PATH is not set in this case.
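To spell out the two flows being contrasted (a sketch with placeholder paths; whether flow B can hit the weightless cache is exactly what is in question here):

    ov::Core core;
    core.set_property(ov::cache_dir("model_cache"));  // placeholder path

    // Flow A: compile directly from the path. The core knows where the
    // weights live, so a weightless blob can be restored on import.
    auto compiled_a = core.compile_model("model.xml", "NPU");

    // Flow B: read first, then compile the in-memory ov::Model. Per the
    // comment above, WEIGHTS_PATH is not set here, so the weightless cache
    // cannot re-attach the weights on import.
    auto model = core.read_model("model.xml");
    auto compiled_b = core.compile_model(model, "NPU");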
I think we can enable the weightless blob for an already-read ov::Model. See openvinotoolkit/openvino#29101; could you please try that PR?
I'll try
Unfortunately we get increased memory consumption
Yet another attempt openvinotoolkit/openvino#29107 😄
Co-authored-by: Ilya Lavrenov <[email protected]>
…envino.genai into at/uniform-llm
    static_config = { **default_config, 'STATIC_PIPELINE': 'STATEFUL' }

    # Test both, static and generic pipelines
    pipeline_configs = [default_config, static_config]
Perhaps it will be renamed to NPU tests.
        'NPUW_ONLINE_PIPELINE': 'NONE'
    } | get_default_llm_properties()

    static_config = { **default_config, 'STATIC_PIPELINE': 'STATEFUL' }
Do we still need 'STATIC_PIPELINE': 'STATEFUL'?
src/cpp/src/utils.cpp
Outdated
    auto cache_mode = get_option<CacheMode>(config, ov::cache_mode.name());
    // NB: OPTIMIZE_SIZE is only possible when model_path is defined!
    // Otherwise, switch to OPTIMIZE_SPEED
    if (cache_mode.has_value() && *cache_mode == CacheMode::OPTIMIZE_SIZE && !model_path.empty()) {
@TolyaTalamanov the previous behavior was: in the default case we just use model_path (weightless), and the user would need to explicitly specify OPTIMIZE_SPEED. Please return to this behavior; there is no need for the user to specify OPTIMIZE_SIZE here, it's our default option.
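A sketch of the requested default (an assumption of intent, not the final patch): treat an unset CACHE_MODE like OPTIMIZE_SIZE, so only OPTIMIZE_SPEED has to be spelled out:

    // Sketch: weightless compilation from model_path is the default;
    // OPTIMIZE_SPEED must be requested explicitly to opt out of it.
    auto cache_mode = get_option<CacheMode>(config, ov::cache_mode.name());
    const bool speed = cache_mode.has_value() && *cache_mode == CacheMode::OPTIMIZE_SPEED;
    if (!speed && !model_path.empty()) {
        compiled = ov::genai::utils::singleton_core().compile_model(model_path, "NPU", properties);
    } else {
        compiled = ov::genai::utils::singleton_core().compile_model(model, "NPU", properties);
    }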
Sorry, changed it now, could you check?
    @@ -139,8 +139,8 @@ ov::genai::LLMPipeline::LLMPipeline(
             m_pimpl = std::make_unique<ContinuousBatchingAdapter>(models_path, tokenizer, scheduler_config, device, device_properties);
         }

    -    if (m_pimpl == nullptr && device == "NPU") {
    +    if (m_pimpl == nullptr && device == "NPU" && properties.count("STATIC_PIPELINE")) {
             m_pimpl = static_llm::LLMPipelineFactory::create(models_path, tokenizer, device, properties);
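With this change the static pipeline becomes opt-in; a minimal usage sketch (the model path is a placeholder):

    #include <iostream>
    #include "openvino/genai/llm_pipeline.hpp"

    int main() {
        // Without STATIC_PIPELINE, NPU now runs the generic StatefulLLMPipeline;
        // setting the property requests the legacy static implementation.
        ov::genai::LLMPipeline pipe("TinyLlama", "NPU",
                                    ov::AnyMap{{"STATIC_PIPELINE", "STATEFUL"}});
        std::cout << pipe.generate("Hello") << std::endl;
        return 0;
    }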
Following my previous review: can we just get rid of STATIC_PIPELINE, since our only option is STATEFUL?
Most likely it's not the case, but what if we have another static pipeline implementation? I plan to remove it anyway next week, preferably once switching to the generic pipeline doesn't bring any perf regressions.
My only concern here is the OPTIMIZE_SPEED and OPTIMIZE_SIZE behavior.
@TolyaTalamanov please wait for merge until I test Ilya's changes. Likely we could remove …
Tested - doesn't work out-of-the-box. I suggest we implement it separately.